Homework 2
Deadline: 7 Nov, 2023 3pm
(Please hand in during lecture.)
Full Mark: 100 Marks
Coupon Instructions:
1. You can use a coupon to waive any question you want and obtain full marks for this question.
2. You can waive at most one question in each assignment.
3. You can still answer the question that you waive. We will mark it, but we will give full marks for this
question regardless.
4. The coupon is non-transferrable. That is, the coupon with a unique ID can be used only by the student
who obtained it in class.
5. Please staple the coupon to the submitted assignment.
6. Please write down the question no. you want to waive on the coupon.
Q1 [20 Marks]
We are given l attributes, namely A1, A2, …, Al, and n data points. Suppose that n = 2^k where k is a positive
integer with k ≥ 2. Given a data point x, we denote the value of attribute A for x to be x.A. Assume that two
different data points have different attribute values for each attribute.
In the density-based subspace clustering, we learnt that each (grid) unit can be represented by p intervals if
we consider p attributes only where p ≤ l. “A2=[1, 10], A6=[21, 30]” is an example of the representation of a
unit if we consider two attributes, A2 and A6. The number of attributes involved for this unit is 2. Besides, the
length of the interval “Ai=[a, b]” is defined to be b-a where a and b are two real numbers. The volume of a
unit is defined to be the product of the lengths of all intervals involved for this unit. Recall that in this density-
based subspace clustering, we want to find all subspaces which contain dense units. Formally, given a
subspace S in the result, there exists a unit such that the attributes involved for this unit are the attributes
involved for S and this unit is dense.
However, in this density-based subspace clustering, the length of each interval along an attribute is fixed.
Motivated by this observation, we want to define some intervals with varying lengths along an attribute
according to the data distribution. In this way, we can define a unit based on these intervals.
According to the data, we can generate a set of intervals according to a function Generate (to be described).
According to these intervals, we can define units. Before we describe function Generate, we give some
concepts as follows.
We split a set G of points into two parts according to an attribute A such that there exist four numbers y1, y2,
y3 and y4 on attribute A where
- y1 < y2 < y3 < y4,
- y1 = min_{x∈G} x.A,
- y2 is equal to the attribute value on A of a data point in G,
- y3 is also equal to the attribute value on A of a data point in G,
- y4 = max_{x∈G} x.A,
- the total number of data points in G whose attribute values on A are at most y2 is equal to the total
  number of data points in G whose attribute values on A are at least y3, and
- there are no data points in G whose attribute values on A are greater than y2 and smaller than y3.
Let Left(G, A) be the set of data points in G whose attribute values on A are at most y2 (described above).
Let Right(G, A) be the set of data points in G whose attribute values on A are at least y3 (described above).
Let SplitValue(G, A) be [(y2-y1)+(y4-y3)]/2 (where y1, y2, y3 and y4 were described above).
Now, we define a function Generate which takes a set G of data points as an input and outputs a set of
intervals.
Function Generate(G)
    if |G| ≤ 4
        X ← ∅
        for each i ∈ [1, l] do
            y1 ← min_{x∈G} x.Ai
            y4 ← max_{x∈G} x.Ai
            intervali ← [y1, y4]
            X ← X ∪ {“Ai = intervali”}
    else
        Ai ← the attribute A which has the greatest value of SplitValue(G, A) (among all attributes)
        Gleft ← Left(G, Ai)
        Gright ← Right(G, Ai)
        Xleft ← Generate(Gleft)
        Xright ← Generate(Gright)
        X ← Xleft ∪ Xright
    return X
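For concreteness, the following is a minimal Python sketch of the Generate procedure above (it is not part of the question). It assumes that each data point is a dict mapping attribute names such as "A1" to numeric values, that all points have distinct values on every attribute (as stated), and that every recursive call splits its input into two equal halves (which holds since n = 2^k).

def split_info(G, A):
    # Return (y1, y2, y3, y4) for attribute A: the equal-sized (median) split
    # described above; y2 and y3 are consecutive attribute values in G.
    vals = sorted(p[A] for p in G)
    m = len(vals) // 2
    return vals[0], vals[m - 1], vals[m], vals[-1]

def split_value(G, A):
    y1, y2, y3, y4 = split_info(G, A)
    return ((y2 - y1) + (y4 - y3)) / 2.0

def generate(G, attributes):
    # Return a set of interval descriptions of the form "Ai = [lo, hi]".
    if len(G) <= 4:                          # assumed stopping condition |G| <= 4
        X = set()
        for A in attributes:
            lo = min(p[A] for p in G)
            hi = max(p[A] for p in G)
            X.add(f"{A} = [{lo}, {hi}]")
        return X
    # Pick the attribute with the greatest SplitValue and recurse on both halves.
    A_best = max(attributes, key=lambda A: split_value(G, A))
    _, y2, y3, _ = split_info(G, A_best)
    G_left = [p for p in G if p[A_best] <= y2]     # Left(G, A_best)
    G_right = [p for p in G if p[A_best] >= y3]    # Right(G, A_best)
    return generate(G_left, attributes) | generate(G_right, attributes)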
(a) Is it possible that the intervals along a single attribute which are generated by function Generate are
overlapping? Why?
(b) Under this new definition of intervals, we have a new definition of “dense” as follows.
Suppose that we define that a unit is dense if this unit contains at least 4 data points and there exists
an interval of this unit where the length of this interval is smaller than or equal to a support threshold
t where t is a non-negative real number and a user parameter. Can we still adopt the Apriori-like
algorithm for finding all subspaces containing dense units? If yes, please describe how to adopt the
algorithm. Otherwise, please give reasons why it cannot be adopted.
(c) This part is independent of part (b).
Under this new definition of intervals, we have a new definition of “dense” as follows.
Suppose that we define that a unit is dense if this unit contains at least 4 data points and the volume of
this unit is smaller than or equal to t^s where t is a non-negative real number and a user parameter, and
s is the number of attributes involved for this unit. Can we still adopt the Apriori-like algorithm for
finding all subspaces containing dense units? If yes, please describe how to adopt the algorithm.
Otherwise, please give reasons why it cannot be adopted.
(d) What are the advantages of using this approach for subspace clustering compared with the density-
based subspace clustering you learnt in class?
Q2 [20 Marks]
(a) Consider a set P containing the following four 2-dimensional data points.
We can make use of the KL-Transform to find a transformed subspace containing a cluster. Let L be the
total number of dimensions in the original space and K be the total number of dimensions in the projected
subspace. Please illustrate the KL-transform technique with the above example when L=2 and K=1.
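As a reference for the steps involved (mean-centre the points, compute the covariance matrix, keep the top-K eigenvectors, and project), here is a minimal Python sketch of the KL-Transform for L = 2 and K = 1. The four points below are hypothetical placeholders only, since the actual points of P appear in a figure that is not reproduced here.

import numpy as np

# Hypothetical 2-dimensional points standing in for the set P.
P = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0],
              [4.0, 5.0]])

L, K = 2, 1
mean = P.mean(axis=0)
centred = P - mean                               # mean-centre the data
cov = np.cov(centred, rowvar=False)              # L x L covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
top = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  # top-K eigenvectors (L x K)

projected = centred @ top                        # n x K coordinates in the subspace
print(projected)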
(b) Consider a set Q containing the following four 2-dimensional data points.
(i) Let p = (xp, yp) be a point in P and q = (xq, yq) be a point in Q. In fact, we could express xq in a linear
form involving xp such that xq = α·xp + β where α and β are two real numbers. Similarly, we could
express yq in the same linear form involving yp. Please write down the values of α and β.
(ii) Similar to Part (a), we want to make use of the KL-Transform to find a transformed subspace
containing a cluster for the set Q where L = 2 and K = 1. One “straightforward” or “naïve” method
is to use the same method in Part (a) to obtain the answer. Is it possible to make use of the result in
Part (a) and the result in Part (b)(i) to obtain the answer very quickly? If yes, please explain briefly
and give the answer. There is no need to give a formal proof. A brief description is accepted. If no,
please give an explanation briefly. In this case, derive the answer by using the method in Part (a).
(c) Consider Part (a). It is independent of Part (b). In Part (a), we know that there are 4 points.
Suppose that we have 4 additional points which are identical to the original 4 points. That is, we have
the following 4 additional points. In total, we have 8 data points.
One “straightforward” or “naïve” method is to use the same method in Part (a) to obtain the answer. Is it
possible to make use of the result in Part (a) to obtain the answer very quickly? If yes, please explain
briefly and give the answer. There is no need to give a formal proof. A brief description is accepted. If no,
please give an explanation briefly. In this case, derive the answer by using the method in Part (a).
(d) Consider two random variables X and Y with the following probabilistic table.
X\Y 1 2 3
1 0 1/8 1/8
2 1/2 0 1/8
3 1/8 0 0
(i) Calculate the conditional entropy H(X|Y) by using the original definition of conditional entropy.
(ii) Calculate H(X|Y) as
−Σ_{x∈A} Σ_{y∈B} p(x, y) log p(x|y)
where A = {1, 2, 3} and B = {1, 2, 3}.
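For reference, the following Python sketch evaluates this double summation on a small hypothetical joint table (not the table given in this part), just to illustrate the mechanics of the formula; terms with p(x, y) = 0 contribute 0 and are skipped.

import math

# Hypothetical joint distribution p(x, y) over two values of X and two values of Y.
joint = {(1, 1): 0.25, (1, 2): 0.25,
         (2, 1): 0.40, (2, 2): 0.10}

# Marginal p(y).
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# H(X|Y) = - sum over x, y of p(x, y) * log2 p(x|y), with p(x|y) = p(x, y) / p(y).
H = 0.0
for (x, y), p in joint.items():
    if p > 0:
        H -= p * math.log2(p / p_y[y])
print(H)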
Q3 [20 Marks]
The following shows a history of users with attributes “Study_CS” (i.e., Studying Computer Science (CS)),
“Age” and “Income”. We also indicate whether they will buy Bitcoin or not in the last column.
No. Study_CS Age Income Buy_Bitcoin
1 yes old fair yes
2 yes middle fair yes
3 no young fair yes
4 no young high yes
5 yes old low no
6 yes young low no
7 no young fair no
8 no middle low no
(a) We want to train a C4.5 decision tree classifier to predict whether a new user will buy Bitcoin or not.
We define the value of attribute Buy_Bitcoin to be the label of a record.
(i) Please find a C4.5 decision tree according to the above example. In the decision tree, whenever
we process (1) a node containing at least 80% of its records with the same label or (2) a node containing
at most 2 records, we stop splitting this node.
(ii) Consider a new young user studying CS whose income is fair. Please predict whether this new
user will buy Bitcoin or not.
(b) What is the difference between the C4.5 decision tree and the ID3 decision tree? Why is there a difference?
(c) We know how to compute the impurity measurement of an attribute A under the ID3 decision tree, denoted
by Imp-ID3(A). We also know how to compute the impurity measurement of an attribute A under the
CART decision tree, denoted by Imp-CART(A). Consider two attributes A and B. Is it always true that if
Imp-CART(A) > Imp-CART(B), then Imp-ID3(A) > Imp-ID3(B)? If yes, please show that it is true.
Otherwise, please give a counter example showing that this is not true and then explain it.
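As a reference for parts (b) and (c), here is a minimal Python sketch of the three impurity criteria involved: information gain (ID3), gain ratio (C4.5) and the weighted Gini index (CART). The parent and child label counts below are hypothetical and are not taken from the table above.

import math

def entropy(counts):
    # Entropy of a label distribution given as a list of counts.
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    # Gini index of a label distribution given as a list of counts.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [4, 4]                 # e.g. 4 "yes" and 4 "no" records at the node
children = [[3, 1], [1, 3]]     # label counts in the branches of a candidate split
n = sum(parent)

weighted_entropy = sum(sum(c) / n * entropy(c) for c in children)
info_gain = entropy(parent) - weighted_entropy                # ID3
split_info = entropy([sum(c) for c in children])              # penalises many branches
gain_ratio = info_gain / split_info                           # C4.5
weighted_gini = sum(sum(c) / n * gini(c) for c in children)   # CART

print(info_gain, gain_ratio, weighted_gini)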
Q4 [20 Marks]
The following Bayesian network has four nodes: AcutePancreatitis (AP), Pneumonia (P),
SystemicInflammationReaction (SIR) and WhiteBloodCell (WBC), where AP and P are the parents of SIR
and SIR is the parent of WBC. Its probability tables are:

P(AP = Yes) = 0.3
P(P = Yes) = 0.6

P(SIR = Yes | AP = Yes, P = Yes) = 0.7
P(SIR = Yes | AP = Yes, P = No) = 0.45
P(SIR = Yes | AP = No, P = Yes) = 0.55
P(SIR = Yes | AP = No, P = No) = 0.2

P(WBC = High | SIR = Yes) = 0.6
P(WBC = High | SIR = No) = 0.3
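For reference, the following Python sketch shows how the probability tables above can be combined by enumeration, for example to compute P(WBC = High). The network structure assumed here (AP and P are the parents of SIR, and SIR is the parent of WBC) is the one recovered from the tables.

# Prior and conditional probability tables from the network above.
p_ap = {True: 0.3, False: 0.7}                  # P(AP = Yes) = 0.3
p_p = {True: 0.6, False: 0.4}                   # P(P = Yes) = 0.6
p_sir_yes = {(True, True): 0.7, (True, False): 0.45,
             (False, True): 0.55, (False, False): 0.2}   # P(SIR = Yes | AP, P)
p_wbc_high = {True: 0.6, False: 0.3}            # P(WBC = High | SIR)

# Enumerate all configurations of AP, P and SIR and sum out the hidden variables.
total = 0.0
for ap, pr_ap in p_ap.items():
    for pn, pr_pn in p_p.items():
        for sir in (True, False):
            pr_sir = p_sir_yes[(ap, pn)] if sir else 1 - p_sir_yes[(ap, pn)]
            total += pr_ap * pr_pn * pr_sir * p_wbc_high[sir]
print(total)                                    # P(WBC = High)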
Q5 [20 Marks]
(a) Consider the traditional LSTM model. Initially, the internal weight vectors and bias variables are as
follows.
W_f = (0.8, 0.4, 0.1)^T,  b_f = 0.2
W_i = (0.9, 0.8, 0.7)^T,  b_i = 0.5
W_a = (0.4, 0.2, 0.1)^T,  b_a = 0.3
W_o = (0.6, 0.4, 0.1)^T,  b_o = 0.2
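As a reference for the computation, here is a minimal Python sketch of one traditional LSTM cell update using the weight vectors above. It assumes the standard formulation with sigmoid gates and a tanh candidate, that each weight vector multiplies the 3-vector formed by a 2-dimensional input x_t followed by the previous output y_{t-1}, that y_0 and the initial cell state s_0 are both 0, and that the input x_1 used below is a made-up example (the actual inputs of part (a) are not reproduced here).

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

W_f, b_f = [0.8, 0.4, 0.1], 0.2     # forget gate
W_i, b_i = [0.9, 0.8, 0.7], 0.5     # input gate
W_a, b_a = [0.4, 0.2, 0.1], 0.3     # candidate cell state
W_o, b_o = [0.6, 0.4, 0.1], 0.2     # output gate

def lstm_step(x, y_prev, s_prev):
    v = list(x) + [y_prev]          # [x1_t, x2_t, y_{t-1}]
    f = sigmoid(dot(W_f, v) + b_f)
    i = sigmoid(dot(W_i, v) + b_i)
    a = math.tanh(dot(W_a, v) + b_a)
    o = sigmoid(dot(W_o, v) + b_o)
    s = f * s_prev + i * a          # new cell state
    y = o * math.tanh(s)            # new output
    return y, s

y1, s1 = lstm_step((1.0, 2.0), 0.0, 0.0)   # hypothetical input x_1 = (1, 2)
print(y1, s1)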
(b) Consider the GRU model. Initially, the internal weight vectors and bias variables are as follows.
W_r = (0.3, 0.2, 0.1)^T,  b_r = 0.5
W_a = (0.4, 0.3, 0.1)^T,  b_a = 0.1
W_u = (0.4, 0.2, 0.1)^T,  b_u = 0.1
Suppose that y0 = 0.
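Similarly, here is a minimal Python sketch of one GRU update using the weight vectors above, under the same assumption that each weight vector multiplies the 3-vector [x1_t, x2_t, y_{t-1}] and with y0 = 0 as stated. The gating convention shown (y_t = (1 − u_t)·y_{t−1} + u_t·a_t, with the candidate computed from the reset-scaled previous output) is one standard form; the input x_1 is a made-up example.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

W_r, b_r = [0.3, 0.2, 0.1], 0.5     # reset gate
W_a, b_a = [0.4, 0.3, 0.1], 0.1     # candidate output
W_u, b_u = [0.4, 0.2, 0.1], 0.1     # update gate

def gru_step(x, y_prev):
    r = sigmoid(dot(W_r, list(x) + [y_prev]) + b_r)        # reset gate
    a = math.tanh(dot(W_a, list(x) + [r * y_prev]) + b_a)  # candidate output
    u = sigmoid(dot(W_u, list(x) + [y_prev]) + b_u)        # update gate
    return (1 - u) * y_prev + u * a                        # new output

print(gru_step((1.0, 2.0), 0.0))    # hypothetical input x_1 = (1, 2)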
(c) What is the major disadvantage of the traditional neural network model compared with the recurrent
neural network model?