
CSIT5210 Data Mining and Knowledge Discovery (Fall Semester 2023)

Homework 2
Deadline: 7 Nov, 2023 3pm
(Please hand in during lecture.)
Full Mark: 100 Marks

Coupon Instructions:
1. You can use a coupon to waive any question you want and obtain full marks for that question.
2. You can waive at most one question in each assignment.
3. You may still answer the question you waive. We will mark it, but you will receive full marks for that question regardless.
4. The coupon is non-transferable. That is, a coupon with a unique ID can be used only by the student who obtained it in class.
5. Please staple the coupon to the submitted assignment.
6. Please write down the question no. you want to waive on the coupon.

Q1 [20 Marks]
We are given l attributes, namely A1, A2, …, Al, and n data points. Suppose that n = 2^k where k is a positive integer ≥ 2. Given a data point x, we denote the value of attribute A for x by x.A. Assume that two different data points have different attribute values for each attribute.

In the density-based subspace clustering, we learnt that each (grid) unit can be represented by p intervals if we consider p attributes only, where p ≤ l. “A2=[1, 10], A6=[21, 30]” is an example of the representation of a
unit if we consider two attributes, A2 and A6. The number of attributes involved for this unit is 2. Besides, the
length of the interval “Ai=[a, b]” is defined to be b-a where a and b are two real numbers. The volume of a
unit is defined to be the product of the lengths of all intervals involved for this unit. Recall that in this density-
based subspace clustering, we want to find all subspaces which contain dense units. Formally, given a
subspace S in the result, there exists a unit such that the attributes involved for this unit are the attributes
involved for S and this unit is dense.

However, in this density-based subspace clustering, the length of each interval along an attribute is fixed.
Motivated by this observation, we want to define some intervals with varying lengths along an attribute
according to the data distribution. In this way, we can define a unit based on these intervals.

According to the data, we can generate a set of intervals according to a function Generate (to be described).
According to these intervals, we can define units. Before we describe function Generate, we give some
concepts as follows.

We split a set G of points into two parts according to an attribute A such that there exist four numbers y1, y2, y3, y4 on attribute A where
- y1 < y2 < y3 < y4
- y1 = min_{x∈G} x.A
- y2 is equal to the attribute value of a data point in G
- y3 is also equal to the attribute value of a data point in G
- y4 = max_{x∈G} x.A
- the total number of data points in G whose attribute values on A are at most y2 is equal to the total number of data points in G whose attribute values on A are at least y3
- there are no data points in G whose attribute values on A are greater than y2 and smaller than y3

Let Left(G, A) be the set of data points in G whose attribute values on A are at most y2 (described above).
Let Right(G, A) be the set of data points in G whose attribute values on A are at least y3 (described above).
Let SplitValue(G, A) be [(y2-y1)+(y4-y3)]/2 (where y1, y2, y3 and y4 were described above).

Now, we define a function Generate which takes a set G of data points as an input and outputs a set of
intervals.

Function Generate(G)

if |G| ≤ 4
    X ← ∅
    for each i ∈ [1, l] do
        y1 ← min_{x∈G} x.Ai
        y4 ← max_{x∈G} x.Ai
        intervali ← [y1, y4]
        X ← X ∪ {“Ai = intervali”}
else
    Ai ← the attribute A which has the greatest value of SplitValue(G, A) (among all attributes)
    Gleft ← Left(G, Ai)
    Gright ← Right(G, Ai)
    Xleft ← Generate(Gleft)
    Xright ← Generate(Gright)
    X ← Xleft ∪ Xright
return X
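For concreteness, the recursion above can be sketched in Python. Two details are assumptions of this sketch, not statements from the assignment: points are represented as tuples of their l attribute values, and the split pair (y2, y3) is read as the median pair of the sorted attribute values, which follows from the two counting conditions above when all attribute values are distinct and |G| is even.

```python
def split_stats(G, a):
    """Return (y1, y2, y3, y4) for attribute index a, per the definitions:
    y2/y3 are the median pair, so both halves hold |G|/2 points."""
    vals = sorted(p[a] for p in G)
    n = len(vals)
    return vals[0], vals[n // 2 - 1], vals[n // 2], vals[-1]

def split_value(G, a):
    y1, y2, y3, y4 = split_stats(G, a)
    return ((y2 - y1) + (y4 - y3)) / 2

def generate(G, l):
    """Return a set of 'A_i = [lo, hi]' interval strings, as in Generate."""
    if len(G) <= 4:
        return {f"A{a+1} = [{min(p[a] for p in G)}, {max(p[a] for p in G)}]"
                for a in range(l)}
    # split along the attribute with the greatest SplitValue
    a = max(range(l), key=lambda a: split_value(G, a))
    _, y2, y3, _ = split_stats(G, a)
    left = [p for p in G if p[a] <= y2]    # Left(G, A_i)
    right = [p for p in G if p[a] >= y3]   # Right(G, A_i)
    return generate(left, l) | generate(right, l)
```

For example, eight 1-dimensional points split at the widest gap, so {1, 2, 3, 4, 10, 11, 12, 13} yields the two intervals [1, 4] and [10, 13].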

(a) Is it possible that the intervals along a single attribute generated by function Generate overlap? Why?
(b) Under this new definition of intervals, we have a new definition of “dense” as follows.
Suppose that we define that a unit is dense if this unit contains at least 4 data points and there exists an interval of this unit whose length is smaller than or equal to a support threshold t, where t is a non-negative real number and a user parameter. Can we still adopt the Apriori-like
algorithm for finding all subspaces containing dense units? If yes, please describe how to adopt the
algorithm. Otherwise, please give reasons why it cannot be adopted.
(c) This part is independent of part (b).
Under this new definition of intervals, we have a new definition of “dense” as follows.
Suppose that we define that a unit is dense if this unit contains at least 4 data points and the volume of
this unit is smaller than or equal to t^s, where t is a non-negative real number and a user parameter, and
s is the number of attributes involved for this unit. Can we still adopt the Apriori-like algorithm for
finding all subspaces containing dense units? If yes, please describe how to adopt the algorithm.
Otherwise, please give reasons why it cannot be adopted.
(d) What are the advantages of using this approach for subspace clustering compared with the density-
based subspace clustering you learnt in class?

Q2 [20 Marks]
(a) Consider a set P containing the following four 2-dimensional data points.

a:(6, 6), b:(8, 8), c:(5, 9), d:(9, 5)

We can make use of the KL-Transform to find a transformed subspace containing a cluster. Let L be the
total number of dimensions in the original space and K be the total number of dimensions in the projected
subspace. Please illustrate the KL-transform technique with the above example when L=2 and K=1.
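As an illustration of the steps involved (not the required derivation), the KL-transform for L=2 and K=1 centers the data, forms the covariance matrix, takes the unit eigenvector of the largest eigenvalue, and projects each point onto it. The sketch below does this in plain Python for the four points of part (a); the 1/n covariance convention and the choice of the largest-eigenvalue component are assumptions that may differ from the course's formulation.

```python
import math

points = [(6, 6), (8, 8), (5, 9), (9, 5)]
n = len(points)

# center the data at the mean
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# 2x2 covariance matrix [[a, b], [b, c]] (1/n convention assumed)
a = sum(x * x for x, _ in centered) / n
c = sum(y * y for _, y in centered) / n
b = sum(x * y for x, y in centered) / n

# eigenvalues of a symmetric 2x2 matrix
m = (a + c) / 2
d = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = m + d, m - d          # lam1 is the larger eigenvalue

# unit eigenvector for lam1
vx, vy = b, lam1 - a
norm = math.hypot(vx, vy)
vx, vy = vx / norm, vy / norm

# K=1: project each centered point onto the top eigenvector
projected = [x * vx + y * vy for x, y in centered]
```

For these points the covariance matrix is [[2.5, -1.5], [-1.5, 2.5]] with eigenvalues 4 and 1, so the four points collapse to a single transformed coordinate each.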

(b) Consider a set Q containing the following four 2-dimensional data points.

e:(5, 5), f:(9, 9), g:(3, 11), h:(11, 3)

(i) Let p = (xp, yp) be a point in P and q = (xq, yq) be a point in Q. In fact, we could express xq in a linear
form involving xp such that xq = α·xp + β, where α and β are two real numbers. Similarly, we could
express yq in the same linear form involving yp. Please write down the values of α and β.
(ii) Similar to Part (a), we want to make use of the KL-Transform to find a transformed subspace
containing a cluster for the set Q where L = 2 and K = 1. One “straightforward” or “naïve” method
is to use the same method in Part (a) to obtain the answer. Is it possible to make use of the result in
Part (a) and the result in Part (b)(i) to obtain the answer very quickly? If yes, please explain briefly
and give the answer. There is no need to give a formal proof; a brief description is accepted. If no,
please give a brief explanation and, in this case, derive the answer by using the method in Part (a).

(c) Consider Part (a). It is independent of Part (b). In Part (a), we know that there are 4 points.
Suppose that we have 4 additional points which are identical to the original 4 points. That is, we have
the following 4 additional points. In total, we have 8 data points.

(6, 6), (8, 8), (5, 9), (9, 5)

One “straightforward” or “naïve” method is to use the same method in Part (a) to obtain the answer. Is it
possible to make use of the result in Part (a) to obtain the answer very quickly? If yes, please explain
briefly and give the answer. There is no need to give a formal proof; a brief description is accepted. If no,
please give a brief explanation and, in this case, derive the answer by using the method in Part (a).

(d) Consider two random variables X and Y with the following joint probability table.

X\Y    1      2      3
 1     0     1/8    1/8
 2    1/2     0     1/8
 3    1/8     0      0

(i) Calculate the conditional entropy H(X|Y) by using the original definition of the conditional
entropy.
(ii) Calculate H(X|Y) as
-Σ_{x∈A} Σ_{y∈B} p(x, y) log p(x|y)
where A = {1, 2, 3} and B = {1, 2, 3}.
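As a sketch of how the formula in (ii) is evaluated (using log base 2 and two small hypothetical joint tables, deliberately not the table of part (d)):

```python
import math

def cond_entropy(joint):
    """H(X|Y) = -sum over (x,y) of p(x,y) * log2 p(x|y),
    for a joint distribution given as a dict {(x, y): p}."""
    # marginal p(y)
    py = {}
    for (x, y), p in joint.items():
        py[y] = py.get(y, 0.0) + p
    # p(x|y) = p(x,y) / p(y); zero-probability cells contribute nothing
    h = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            h -= p * math.log2(p / py[y])
    return h

# hypothetical examples: X fully determined by Y gives H(X|Y) = 0;
# X independent of Y and uniform gives H(X|Y) = 1 bit
print(cond_entropy({(0, 0): 0.5, (1, 1): 0.5}))                       # 0.0
print(cond_entropy({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))   # 1.0
```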

Q3 [20 Marks]
The following shows a history of users with attributes “Study_CS” (i.e., Studying Computer Science (CS)),
“Age” and “Income”. We also indicate whether they will buy Bitcoin or not in the last column.
No.  Study_CS  Age     Income  Buy_Bitcoin
1    yes       old     fair    yes
2    yes       middle  fair    yes
3    no        young   fair    yes
4    no        young   high    yes
5    yes       old     low     no
6    yes       young   low     no
7    no        young   fair    no
8    no        middle  low     no

(a) We want to train a C4.5 decision tree classifier to predict whether a new user will buy Bitcoin or not.
We define the value of attribute Buy_Bitcoin to be the label of a record.
(i) Please find a C4.5 decision tree according to the above example. In the decision tree, whenever
we process (1) a node containing at least 80% of records with the same label or (2) a node containing
at most 2 records, we stop splitting this node.
(ii) Consider a new young user studying CS whose income is fair. Please predict whether this new
user will buy Bitcoin or not.
(b) What is the difference between the C4.5 decision tree and the ID3 decision tree? Why is there a difference?
(c) We know how to compute the impurity measurement of an attribute A under the ID3 decision tree, denoted
by Imp-ID3(A). We also know how to compute the impurity measurement of an attribute A under the
CART decision tree, denoted by Imp-CART(A). Consider two attributes A and B. Is it always true that if
Imp-CART(A) > Imp-CART(B), then Imp-ID3(A) > Imp-ID3(B)? If yes, please show that it is true.
Otherwise, please give a counter example showing that this is not true and then explain it.
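For reference when reasoning about part (c), the node-level impurity functions underlying the two measures can be sketched as below; this assumes Imp-ID3 and Imp-CART denote the usual entropy-based and Gini-based measurements (the split-level values are weighted sums of these over child nodes).

```python
import math

def entropy(counts):
    """ID3-style impurity: -sum p * log2 p over class proportions."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gini(counts):
    """CART-style impurity: 1 - sum p^2 over class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# a 50/50 node is maximally impure under both measures;
# a pure node has zero impurity under both
print(entropy([1, 1]), gini([1, 1]))   # 1.0 0.5
print(entropy([4, 0]), gini([4, 0]))   # 0.0 0.0
```

Both functions are maximized by the uniform class distribution and zero on pure nodes, but they curve differently in between, which is the place to look for a counterexample.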

Q4 [20 Marks]

We have the following Bayesian Belief Network.

The network structure is: AcutePancreatitis (AP) and Pneumonia (P) are the parents of SystemicInflammationReaction (SIR), and SIR is the parent of WhiteBloodCell (WBC).

Prior probabilities:
P(AP = Yes) = 0.3
P(P = Yes) = 0.6

Conditional probability table for SIR:
P(SIR = Yes | AP = Yes, P = Yes) = 0.7
P(SIR = Yes | AP = Yes, P = No) = 0.45
P(SIR = Yes | AP = No, P = Yes) = 0.55
P(SIR = Yes | AP = No, P = No) = 0.2

Conditional probability table for WBC:
P(WBC = High | SIR = Yes) = 0.6
P(WBC = High | SIR = No) = 0.3

Suppose that there is a new patient. We know that


(1) he has acute pancreatitis
(2) he has pneumonia
(3) his white blood cell result is low
We would like to know whether he is likely to have a systemic inflammation reaction.

Acute Pancreatitis  Pneumonia  White Blood Cell  Systemic Inflammation Reaction
Yes                 Yes        Low               ?
(a) Please use the Bayesian Belief Network classifier, with the above Bayesian Belief Network, to predict whether
he is likely to have a systemic inflammation reaction.
(b) Although the Bayesian Belief Network classifier does not make an independence assumption among all
attributes (unlike the naïve Bayesian classifier), what are the disadvantages of this classifier?
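As a sketch of the inference pattern in this network (using an illustrative evidence combination, deliberately not the one asked in part (a)): when AP, P and WBC are all observed, the priors on AP and P are common to both values of SIR and cancel, so P(SIR | ap, p, wbc) ∝ P(SIR | ap, p) · P(wbc | SIR). The CPT values below are those given in the question.

```python
# P(SIR = Yes | AP, P) and P(WBC = High | SIR), from the question
p_sir_yes = {("Yes", "Yes"): 0.7, ("Yes", "No"): 0.45,
             ("No", "Yes"): 0.55, ("No", "No"): 0.2}
p_wbc_high = {"Yes": 0.6, "No": 0.3}

def posterior_sir_yes(ap, p, wbc):
    """P(SIR = Yes | ap, p, wbc) by normalising over both SIR values."""
    scores = {}
    for sir in ("Yes", "No"):
        prior = p_sir_yes[(ap, p)] if sir == "Yes" else 1 - p_sir_yes[(ap, p)]
        like = p_wbc_high[sir] if wbc == "High" else 1 - p_wbc_high[sir]
        scores[sir] = prior * like
    return scores["Yes"] / (scores["Yes"] + scores["No"])

# illustrative evidence (not the patient in the question):
# AP = No, P = No, WBC = High gives 0.12 / (0.12 + 0.24) = 1/3
print(posterior_sir_yes("No", "No", "High"))
```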

Q5 [20 Marks]

We are given two data points at two different timestamps.


At the timestamp t = 1, we have a data point (x1, x2, y) where (x1, x2) = (0.3, 0.6) and y = 0.2.
At the timestamp t = 2, we have a data point (x1, x2, y) where (x1, x2) = (0.1, 1.0) and y = 0.4.
Here, x1 and x2 are 2 input variables. y is the output variable.

(a) Consider the traditional LSTM model. Initially, we have the following internal weight vectors and bias
variables as follows.

Wf = (0.8, 0.4, 0.1)^T, bf = 0.2
Wi = (0.9, 0.8, 0.7)^T, bi = 0.5
Wa = (0.4, 0.2, 0.1)^T, ba = 0.3
Wo = (0.6, 0.4, 0.1)^T, bo = 0.2

In the model, we have the following status variables. For each t = 1, 2, ….


1. forget gate variable ft
2. input gate variable it
3. input activation variable at
4. internal state variable st
5. output gate variable ot
6. final output variable yt

Suppose that y0 = 0 and s0 = 0.

Consider the input forward propagation step only.


(i) What are the values of the above status variables when t = 1 and when t = 2? Please show each
answer up to 4 decimal places.
(ii) What are the errors of the final output variables when t = 1 and when t = 2? Please show each
answer up to 4 decimal places.
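The forward pass for a single-unit LSTM with the given weights can be sketched as below. The input layout z_t = (x1, x2, y_{t-1}) and the standard gate equations (s_t = f_t·s_{t-1} + i_t·a_t, y_t = o_t·tanh(s_t)) are assumptions; the course's formulation may differ in detail, so treat this as the computation pattern rather than the authoritative answer.

```python
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

# weight vectors and biases from the question; the pairing of each W with
# its gate follows the bias subscripts (f, i, a, o)
Wf, bf = (0.8, 0.4, 0.1), 0.2
Wi, bi = (0.9, 0.8, 0.7), 0.5
Wa, ba = (0.4, 0.2, 0.1), 0.3
Wo, bo = (0.6, 0.4, 0.1), 0.2

def lstm_step(x1, x2, y_prev, s_prev):
    z = (x1, x2, y_prev)               # assumed input layout
    f = sigmoid(dot(Wf, z) + bf)       # forget gate f_t
    i = sigmoid(dot(Wi, z) + bi)       # input gate i_t
    a = math.tanh(dot(Wa, z) + ba)     # input activation a_t
    s = f * s_prev + i * a             # internal state s_t
    o = sigmoid(dot(Wo, z) + bo)       # output gate o_t
    y = o * math.tanh(s)               # final output y_t
    return f, i, a, s, o, y
```

With y0 = 0 and s0 = 0, calling `lstm_step(0.3, 0.6, 0.0, 0.0)` gives the t = 1 status variables, and feeding its y and s back in with the t = 2 inputs gives the next step.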

(b) Consider the GRU model. Initially, we have the following internal weight vectors and bias variables as
follows.

Wr = (0.3, 0.2, 0.1)^T, br = 0.5
Wa = (0.4, 0.3, 0.1)^T, ba = 0.1
Wu = (0.4, 0.2, 0.1)^T, bu = 0.1

In the model, we have the following status variables. For each t = 1, 2, ….


1. reset gate variable rt
2. input activation variable at
3. update gate variable ut
4. final output variable yt

Suppose that y0 = 0.

Consider the input forward propagation step only.


(i) What are the values of the above status variables when t = 1 and when t = 2? Please show each
answer up to 4 decimal places.
(ii) What are the errors of the final output variables when t = 1 and when t = 2? Please show each
answer up to 4 decimal places.
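A matching sketch for the GRU forward pass is below. It follows one common convention (the reset gate scales y_{t-1} inside the input activation, and y_t = u_t·a_t + (1-u_t)·y_{t-1}); GRU formulations vary, for example by swapping the roles of u_t and 1-u_t, so check the course notes before relying on the exact numbers.

```python
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def dot(w, z):
    return sum(wi * zi for wi, zi in zip(w, z))

# weight vectors and biases from the question, paired by bias subscript
Wr, br = (0.3, 0.2, 0.1), 0.5
Wa, ba = (0.4, 0.3, 0.1), 0.1
Wu, bu = (0.4, 0.2, 0.1), 0.1

def gru_step(x1, x2, y_prev):
    r = sigmoid(dot(Wr, (x1, x2, y_prev)) + br)        # reset gate r_t
    a = math.tanh(dot(Wa, (x1, x2, r * y_prev)) + ba)  # input activation a_t
    u = sigmoid(dot(Wu, (x1, x2, y_prev)) + bu)        # update gate u_t
    y = u * a + (1 - u) * y_prev                       # final output y_t
    return r, a, u, y
```

With y0 = 0, `gru_step(0.3, 0.6, 0.0)` gives the t = 1 status variables, and its y feeds the t = 2 step.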

(c) What is the major disadvantage of the traditional neural network model compared with the recurrent
neural network model?
