Homework 2
Deadline: 7 Nov, 2023 3pm
(Please hand in during lecture.)
Full Mark: 100 Marks
Coupon Instructions:
1. You can use a coupon to waive any question you want and obtain full marks for this question.
2. You can waive at most one question in each assignment.
3. You can still answer the question that you waive. We will mark it, but we will give full marks for this
question regardless.
4. The coupon is non-transferrable. That is, the coupon with a unique ID can be used only by the student
who obtained it in class.
5. Please staple the coupon to the submitted assignment.
6. Please write down the question no. you want to waive on the coupon.
Q1 [20 Marks]
We are given l attributes, namely A1, A2, …, Al, and n data points. Suppose that n = 2^k where k is a positive
integer with k ≥ 2. Given a data point x, we denote the value of attribute A for x to be x.A. Assume that two
different data points have different attribute values for each attribute.
In the density-based subspace clustering, we learnt that each (grid) unit can be represented by p intervals if
we consider p attributes only where p ≤ l. “A2=[1, 10], A6=[21, 30]” is an example of the representation of a
unit if we consider two attributes, A2 and A6. The number of attributes involved for this unit is 2. Besides, the
length of the interval “Ai=[a, b]” is defined to be b-a where a and b are two real numbers. The volume of a
unit is defined to be the product of the lengths of all intervals involved for this unit. Recall that in this density-
based subspace clustering, we want to find all subspaces which contain dense units. Formally, given a
subspace S in the result, there exists a unit such that the attributes involved for this unit are the attributes
involved for S and this unit is dense.
However, in this density-based subspace clustering, the length of each interval along an attribute is fixed.
Motivated by this observation, we want to define some intervals with varying lengths along an attribute
according to the data distribution. In this way, we can define a unit based on these intervals.
According to the data, we can generate a set of intervals according to a function Generate (to be described).
According to these intervals, we can define units. Before we describe function Generate, we give some
concepts as follows.
We split a set G of points into two parts according to an attribute A such that there exist four numbers y1, y2,
y3 and y4 on attribute A where
- y1 < y2 < y3 < y4,
- y1 = min_{x∈G} x.A,
- y2 is equal to the attribute value on A of a data point in G,
- y3 is also equal to the attribute value on A of a data point in G,
- y4 = max_{x∈G} x.A,
- the total number of data points in G whose attribute values on A are at most y2 is equal to the total
  number of data points in G whose attribute values on A are at least y3, and
- there are no data points in G whose attribute values on A are greater than y2 and smaller than y3.
Let Left(G, A) be the set of data points in G whose attribute values on A are at most y2 (described above).
Let Right(G, A) be the set of data points in G whose attribute values on A are at least y3 (described above).
Let SplitValue(G, A) be [(y2-y1)+(y4-y3)]/2 (where y1, y2, y3 and y4 were described above).
Now, we define a function Generate which takes a set G of data points as an input and outputs a set of
intervals.
Function Generate(G)
    if |G| ≤ 4
        X ← ∅
        for each i ∈ [1, l] do
            y1 ← min_{x∈G} x.Ai
            y4 ← max_{x∈G} x.Ai
            intervali ← [y1, y4]
            X ← X ∪ {“Ai = intervali”}
    else
        Ai ← the attribute A which has the greatest value of SplitValue(G, A) (among all attributes)
        Gleft ← Left(G, Ai)
        Gright ← Right(G, Ai)
        Xleft ← Generate(Gleft)
        Xright ← Generate(Gright)
        X ← Xleft ∪ Xright
    return X
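For concreteness, the following is a minimal Python sketch of the Generate procedure above (it is not part of the question). It assumes that each data point is a dict mapping attribute names such as "A1" to numeric values, that all points have distinct values on every attribute (as stated), and that every recursive call splits its input into two equal halves (which holds since n = 2^k).

def split_info(G, A):
    # Return (y1, y2, y3, y4) for attribute A: the equal-sized (median) split
    # described above; y2 and y3 are consecutive attribute values in G.
    vals = sorted(p[A] for p in G)
    m = len(vals) // 2
    return vals[0], vals[m - 1], vals[m], vals[-1]

def split_value(G, A):
    y1, y2, y3, y4 = split_info(G, A)
    return ((y2 - y1) + (y4 - y3)) / 2.0

def generate(G, attributes):
    # Return a set of interval descriptions of the form "Ai = [lo, hi]".
    if len(G) <= 4:                          # assumed stopping condition |G| <= 4
        X = set()
        for A in attributes:
            lo = min(p[A] for p in G)
            hi = max(p[A] for p in G)
            X.add(f"{A} = [{lo}, {hi}]")
        return X
    # Pick the attribute with the greatest SplitValue and recurse on both halves.
    A_best = max(attributes, key=lambda A: split_value(G, A))
    _, y2, y3, _ = split_info(G, A_best)
    G_left = [p for p in G if p[A_best] <= y2]     # Left(G, A_best)
    G_right = [p for p in G if p[A_best] >= y3]    # Right(G, A_best)
    return generate(G_left, attributes) | generate(G_right, attributes)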
(a) Is it possible that the intervals along a single attribute which are generated by function Generate are
overlapping? Why?
(b) Under this new definition of intervals, we have a new definition of “dense” as follows.
Suppose that we define that a unit is dense if this unit contains at least 4 data points and there exists
an interval of this unit where the length of this interval is smaller than or equal to a support threshold
t where t is a non-negative real number and a user parameter. Can we still adopt the Apriori-like
algorithm for finding all subspaces containing dense units? If yes, please describe how to adopt the
algorithm. Otherwise, please give reasons why it cannot be adopted.
(c) This part is independent of part (b).
Under this new definition of intervals, we have a new definition of “dense” as follows.
Suppose that we define that a unit is dense if this unit contains at least 4 data points and the volume of
this unit is smaller than or equal to t^s where t is a non-negative real number and a user parameter, and
s is the number of attributes involved for this unit. Can we still adopt the Apriori-like algorithm for
finding all subspaces containing dense units? If yes, please describe how to adopt the algorithm.
Otherwise, please give reasons why it cannot be adopted.
(d) What are the advantages of using this approach for subspace clustering compared with the density-
based subspace clustering you learnt in class?
Q2 [20 Marks]
(a) Consider a set P containing the following four 2-dimensional data points.
We can make use of the KL-Transform to find a transformed subspace containing a cluster. Let L be the
total number of dimensions in the original space and K be the total number of dimensions in the projected
subspace. Please illustrate the KL-transform technique with the above example when L=2 and K=1.
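As a reference for the steps involved (mean-centre the points, compute the covariance matrix, keep the top-K eigenvectors, and project), here is a minimal Python sketch of the KL-Transform for L = 2 and K = 1. The four points below are hypothetical placeholders only, since the actual points of P appear in a figure that is not reproduced here.

import numpy as np

# Hypothetical 2-dimensional points standing in for the set P.
P = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0],
              [4.0, 5.0]])

L, K = 2, 1
mean = P.mean(axis=0)
centred = P - mean                               # mean-centre the data
cov = np.cov(centred, rowvar=False)              # L x L covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
top = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  # top-K eigenvectors (L x K)

projected = centred @ top                        # n x K coordinates in the subspace
print(projected)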
(b) Consider a set Q containing the following four 2-dimensional data points.
(i) Let p = (xp, yp) be a point in P and q = (xq, yq) be a point in Q. In fact, we could express xq in a linear
form involving xp such that xq = α·xp + β where α and β are two real numbers. Similarly, we could
express yq in the same linear form involving yp. Please write down the values of α and β.
(ii) Similar to Part (a), we want to make use of the KL-Transform to find a transformed subspace
containing a cluster for the set Q where L = 2 and K = 1. One “straightforward” or “naïve” method
is to use the same method in Part (a) to obtain the answer. Is it possible to make use of the result in
Part (a) and the result in Part (b)(i) to obtain the answer very quickly? If yes, please explain briefly
and give the answer. There is no need to give a formal proof. A brief description is accepted. If no,
please give an explanation briefly. In this case, derive the answer by using the method in Part (a).
(c) Consider Part (a). It is independent of Part (b). In Part (a), we know that there are 4 points.
Suppose that we have 4 additional points which are identical to the original 4 points. That is, we have
the following 4 additional points. In total, we have 8 data points.
One “straightforward” or “naïve” method is to use the same method in Part (a) to obtain the answer. Is it
possible to make use of the result in Part (a) to obtain the answer very quickly? If yes, please explain
briefly and give the answer. There is no need to give a formal proof. A brief description is accepted. If no,
please give an explanation briefly. In this case, derive the answer by using the method in Part (a).
(d) Consider two random variables X and Y with the following probabilistic table.
X\Y 1 2 3
1 0 1/8 1/8
2 1/2 0 1/8
3 1/8 0 0
(i) Calculate the conditional entropy H(X|Y) by using the original definition of conditional entropy.
(ii) Calculate H(X|Y) as
−Σ_{x∈A} Σ_{y∈B} p(x, y) log p(x|y)
where A = {1, 2, 3} and B = {1, 2, 3}.
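For reference, the following Python sketch evaluates this double summation on a small hypothetical joint table (not the table given in this part), just to illustrate the mechanics of the formula; terms with p(x, y) = 0 contribute 0 and are skipped.

import math

# Hypothetical joint distribution p(x, y) over two values of X and two values of Y.
joint = {(1, 1): 0.25, (1, 2): 0.25,
         (2, 1): 0.40, (2, 2): 0.10}

# Marginal p(y).
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

# H(X|Y) = - sum over x, y of p(x, y) * log2 p(x|y), with p(x|y) = p(x, y) / p(y).
H = 0.0
for (x, y), p in joint.items():
    if p > 0:
        H -= p * math.log2(p / p_y[y])
print(H)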
Q3 [20 Marks]
The following shows a history of users with attributes “Study_CS” (i.e., Studying Computer Science (CS)),
“Age” and “Income”. We also indicate whether they will buy Bitcoin or not in the last column.
No. Study_CS Age Income Buy_Bitcoin
1 yes old fair yes
2 yes middle fair yes
3 no young fair yes
4 no young high yes
5 yes old low no
6 yes young low no
7 no young fair no
8 no middle low no
(a) We want to train a C4.5 decision tree classifier to predict whether a new user will buy Bitcoin or not.
We define the value of attribute Buy_Bitcoin to be the label of a record.
(i) Please find a C4.5 decision tree according to the above example. In the decision tree, whenever
we process (1) a node containing at least 80% of its records with the same label or (2) a node containing
at most 2 records, we stop splitting this node.
(ii) Consider a new young user studying CS whose income is fair. Please predict whether this new
user will buy Bitcoin or not.
(b) What is the difference between the C4.5 decision tree and the ID3 decision tree? Why is there a difference?
(c) We know how to compute the impurity measurement of an attribute A under the ID3 decision tree, denoted
by Imp-ID3(A). We also know how to compute the impurity measurement of an attribute A under the
CART decision tree, denoted by Imp-CART(A). Consider two attributes A and B. Is it always true that if
Imp-CART(A) > Imp-CART(B), then Imp-ID3(A) > Imp-ID3(B)? If yes, please show that it is true.
Otherwise, please give a counter example showing that this is not true and then explain it.
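As a reference for parts (b) and (c), here is a minimal Python sketch of the three impurity criteria involved: information gain (ID3), gain ratio (C4.5) and the weighted Gini index (CART). The parent and child label counts below are hypothetical and are not taken from the table above.

import math

def entropy(counts):
    # Entropy of a label distribution given as a list of counts.
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    # Gini index of a label distribution given as a list of counts.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [4, 4]                 # e.g. 4 "yes" and 4 "no" records at the node
children = [[3, 1], [1, 3]]     # label counts in the branches of a candidate split
n = sum(parent)

weighted_entropy = sum(sum(c) / n * entropy(c) for c in children)
info_gain = entropy(parent) - weighted_entropy                # ID3
split_info = entropy([sum(c) for c in children])              # penalises many branches
gain_ratio = info_gain / split_info                           # C4.5
weighted_gini = sum(sum(c) / n * gini(c) for c in children)   # CART

print(info_gain, gain_ratio, weighted_gini)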
Q4 [20 Marks]
The following Bayesian network has four nodes: AcutePancreatitis (AP), Pneumonia (P),
SystemicInflammationReaction (SIR) and WhiteBloodCell (WBC), where AP and P are the parents of SIR
and SIR is the parent of WBC. Its probability tables are:

P(AP = Yes) = 0.3
P(P = Yes) = 0.6

P(SIR = Yes | AP = Yes, P = Yes) = 0.7
P(SIR = Yes | AP = Yes, P = No) = 0.45
P(SIR = Yes | AP = No, P = Yes) = 0.55
P(SIR = Yes | AP = No, P = No) = 0.2

P(WBC = High | SIR = Yes) = 0.6
P(WBC = High | SIR = No) = 0.3
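For reference, the following Python sketch shows how the probability tables above can be combined by enumeration, for example to compute P(WBC = High). The network structure assumed here (AP and P are the parents of SIR, and SIR is the parent of WBC) is the one recovered from the tables.

# Prior and conditional probability tables from the network above.
p_ap = {True: 0.3, False: 0.7}                  # P(AP = Yes) = 0.3
p_p = {True: 0.6, False: 0.4}                   # P(P = Yes) = 0.6
p_sir_yes = {(True, True): 0.7, (True, False): 0.45,
             (False, True): 0.55, (False, False): 0.2}   # P(SIR = Yes | AP, P)
p_wbc_high = {True: 0.6, False: 0.3}            # P(WBC = High | SIR)

# Enumerate all configurations of AP, P and SIR and sum out the hidden variables.
total = 0.0
for ap, pr_ap in p_ap.items():
    for pn, pr_pn in p_p.items():
        for sir in (True, False):
            pr_sir = p_sir_yes[(ap, pn)] if sir else 1 - p_sir_yes[(ap, pn)]
            total += pr_ap * pr_pn * pr_sir * p_wbc_high[sir]
print(total)                                    # P(WBC = High)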
Q5 [20 Marks]
(a) Consider the traditional LSTM model. Initially, the internal weight vectors and bias variables are as
follows.
W_f = (0.8, 0.4, 0.1)^T,  b_f = 0.2
W_i = (0.9, 0.8, 0.7)^T,  b_i = 0.5
W_a = (0.4, 0.2, 0.1)^T,  b_a = 0.3
W_o = (0.6, 0.4, 0.1)^T,  b_o = 0.2
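As a reference for the computation, here is a minimal Python sketch of one traditional LSTM cell update using the weight vectors above. It assumes the standard formulation with sigmoid gates and a tanh candidate, that each weight vector multiplies the 3-vector formed by a 2-dimensional input x_t followed by the previous output y_{t-1}, that y_0 and the initial cell state s_0 are both 0, and that the input x_1 used below is a made-up example (the actual inputs of part (a) are not reproduced here).

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

W_f, b_f = [0.8, 0.4, 0.1], 0.2     # forget gate
W_i, b_i = [0.9, 0.8, 0.7], 0.5     # input gate
W_a, b_a = [0.4, 0.2, 0.1], 0.3     # candidate cell state
W_o, b_o = [0.6, 0.4, 0.1], 0.2     # output gate

def lstm_step(x, y_prev, s_prev):
    v = list(x) + [y_prev]          # [x1_t, x2_t, y_{t-1}]
    f = sigmoid(dot(W_f, v) + b_f)
    i = sigmoid(dot(W_i, v) + b_i)
    a = math.tanh(dot(W_a, v) + b_a)
    o = sigmoid(dot(W_o, v) + b_o)
    s = f * s_prev + i * a          # new cell state
    y = o * math.tanh(s)            # new output
    return y, s

y1, s1 = lstm_step((1.0, 2.0), 0.0, 0.0)   # hypothetical input x_1 = (1, 2)
print(y1, s1)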
(b) Consider the GRU model. Initially, the internal weight vectors and bias variables are as follows.
W_r = (0.3, 0.2, 0.1)^T,  b_r = 0.5
W_a = (0.4, 0.3, 0.1)^T,  b_a = 0.1
W_u = (0.4, 0.2, 0.1)^T,  b_u = 0.1
Suppose that y0 = 0.
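Similarly, here is a minimal Python sketch of one GRU update using the weight vectors above, under the same assumption that each weight vector multiplies the 3-vector [x1_t, x2_t, y_{t-1}] and with y0 = 0 as stated. The gating convention shown (y_t = (1 − u_t)·y_{t−1} + u_t·a_t, with the candidate computed from the reset-scaled previous output) is one standard form; the input x_1 is a made-up example.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

W_r, b_r = [0.3, 0.2, 0.1], 0.5     # reset gate
W_a, b_a = [0.4, 0.3, 0.1], 0.1     # candidate output
W_u, b_u = [0.4, 0.2, 0.1], 0.1     # update gate

def gru_step(x, y_prev):
    r = sigmoid(dot(W_r, list(x) + [y_prev]) + b_r)        # reset gate
    a = math.tanh(dot(W_a, list(x) + [r * y_prev]) + b_a)  # candidate output
    u = sigmoid(dot(W_u, list(x) + [y_prev]) + b_u)        # update gate
    return (1 - u) * y_prev + u * a                        # new output

print(gru_step((1.0, 2.0), 0.0))    # hypothetical input x_1 = (1, 2)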
(c) What is the major disadvantage of the traditional neural network model compared with the recurrent
neural network model?