COMP1942 Final Exam Questions (2019)
Instructions:
(1) Please answer all questions in the answer sheet.
(2) You can use a calculator.
COMP1942 Question Paper
Q1 (20 Marks)
(a) Given a dataset with the following transactions in binary format, where the support threshold is 2.
R A Y M N
0 1 1 0 0
0 1 1 0 1
1 0 0 0 0
1 1 1 0 1
1 0 1 0 1
After we perform the join step and the prune step in the Apriori algorithm, we obtain a set C of itemsets.
Then, we need to do the counting step for C (i.e., we need to find the frequency of each itemset in C).
Finally, we output all itemsets in C whose frequency is at least the support threshold as part of the final
output. Why do we need to do the counting step? That is, why can we not simply output C as part of the
final output? You can use the above dataset for illustration.
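As an illustration of why the counting step matters, the sketch below runs the join step on the dataset above (the prune step removes nothing here, since every 1-subset of each candidate is frequent) and then checks each candidate's actual frequency; one candidate survives join and prune but is not frequent.

```python
from itertools import combinations

# Transactions from the table above (the items with value 1 in each row).
transactions = [
    {"A", "Y"},
    {"A", "Y", "N"},
    {"R"},
    {"R", "A", "Y", "N"},
    {"R", "Y", "N"},
]
min_support = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets (M is dropped: it never appears).
L1 = [frozenset([i]) for i in sorted({i for t in transactions for i in t})
      if support(frozenset([i])) >= min_support]

# Join step: combine frequent 1-itemsets into candidate 2-itemsets.
C2 = [a | b for a, b in combinations(L1, 2)]

# Counting step: without it, every candidate in C2 would wrongly be output.
L2 = [c for c in C2 if support(c) >= min_support]
infrequent = [set(c) for c in C2 if support(c) < min_support]
print(infrequent)  # {A, R} survives join and prune but has support 1 < 2
```

So C (here C2) can contain itemsets, such as {A, R}, whose every proper subset is frequent but which are not themselves frequent; only counting can filter them out.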
(b) [Figure: a tree with node counts d:8, k:5, a:3, f:2 and item labels k, a, f]
Q2 (20 Marks)
The Bayesian Belief Network has four nodes: Smoke (S), Asthma (A), Lung Cancer (LC) and P, where S is the parent of A, and A is the parent of both LC and P. Its conditional probability tables are:
P(S = Yes) = 0.4
P(A = Yes | S = Yes) = 0.35, P(A = Yes | S = No) = 0.25
P(LC = Yes | A = Yes) = 0.85, P(LC = Yes | A = No) = 0.2
P(P = Yes | A = Yes) = 0.75, P(P = Yes | A = No) = 0.3
Please use the Bayesian classifier with the use of this Bayesian Belief Network to predict whether a given
person is likely to smoke.
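The person's observed evidence did not survive in this copy of the paper, so the sketch below assumes, purely for illustration, that LC = Yes and P = Yes are observed. It computes the posterior P(S = Yes | evidence) by enumerating over the hidden variable A, which is the mechanics the question asks for.

```python
# CPTs from the network above: S -> A -> {LC, P}.
p_s = {True: 0.4, False: 0.6}
p_a_given_s = {True: 0.35, False: 0.25}   # P(A = Yes | S)
p_lc_given_a = {True: 0.85, False: 0.2}   # P(LC = Yes | A)
p_p_given_a = {True: 0.75, False: 0.3}    # P(P = Yes | A)

def joint(s, a, lc, p):
    """P(S=s, A=a, LC=lc, P=p) by the chain rule over the network."""
    pr = p_s[s]
    pr *= p_a_given_s[s] if a else 1 - p_a_given_s[s]
    pr *= p_lc_given_a[a] if lc else 1 - p_lc_given_a[a]
    pr *= p_p_given_a[a] if p else 1 - p_p_given_a[a]
    return pr

# Assumed evidence (hypothetical): LC = Yes and P = Yes.
num = sum(joint(True, a, True, True) for a in (True, False))
den = sum(joint(s, a, True, True) for s in (True, False) for a in (True, False))
posterior = num / den   # P(S = Yes | LC = Yes, P = Yes)
print(round(posterior, 4))
```

Under this assumed evidence the posterior is below 0.5, so the prediction would be S = No; with the question's actual evidence the same enumeration applies.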
Q3 (20 Marks)
(a) We know how to compute the impurity measurement of an attribute A under the ID3 decision tree,
denoted by Imp-ID3(A). We also know how to compute the impurity measurement of an attribute A
under the CART decision tree, denoted by Imp-CART(A). Consider two attributes A and B. Is it always
true that if Imp-CART(A) > Imp-CART(B), then Imp-ID3(A) > Imp-ID3(B)? If yes, please show that it
is true. Otherwise, please give a counter example showing that this is not true and then explain it.
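For reference, ID3 uses entropy and CART uses the Gini index, each weighted by branch size. The sketch below only illustrates how the two measurements are computed, on hypothetical class counts that are not taken from any table in this paper; it is not the answer to the question.

```python
from math import log2

def entropy(counts):
    """Entropy impurity used by ID3 (in bits)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini impurity used by CART."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted(impurity, branches):
    """Impurity of an attribute: branch impurities weighted by branch size."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * impurity(b) for b in branches)

# Hypothetical splits: attribute A sends the class counts to branches
# [3, 1] and [1, 3]; attribute B sends them to [2, 2] and [4, 0].
split_A = [[3, 1], [1, 3]]
split_B = [[2, 2], [4, 0]]
print(weighted(entropy, split_A), weighted(entropy, split_B))
print(weighted(gini, split_A), weighted(gini, split_B))
```

Whether the two measurements always rank attributes in the same order is exactly what the question asks you to decide.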
(b) In XLMiner, suppose that we want to perform k-means clustering. We have to specify some parameters
in the following input dialog box shown here (and other input dialog boxes not shown here). Note that
there is an unknown number “A” and another unknown number “B” in the following input dialog box.
Both number “A” and number “B” are inputted by a user.
After that, we execute XLMiner on Raymond's machine and obtain the following output O1 from
XLMiner.
Q4 (20 Marks)
(a) In class, we learnt "Sequential K-means Clustering" and "Forgetful Sequential K-means Clustering".
In what scenario or application is "Forgetful Sequential K-means Clustering" better used
compared with "Sequential K-means Clustering"?
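As a reminder of the difference, the sketch below contrasts the two update rules on a 1-dimensional stream with a single centre. The sequential rule absorbs a new point x with step 1/n (all past points weigh equally), while the forgetful rule uses a fixed rate a, so old points decay geometrically; the constant a = 0.3 below is an arbitrary choice for illustration.

```python
# Sequential k-means: centre m keeps a count n; step size shrinks as 1/n.
def sequential_update(m, n, x):
    n += 1
    m += (x - m) / n
    return m, n

# Forgetful variant: a fixed rate a in (0, 1); old points fade away.
def forgetful_update(m, x, a=0.3):
    return m + a * (x - m)

stream = [1.0, 1.2, 0.8, 1.1, 9.0, 9.2, 8.8]   # the distribution jumps mid-stream

m_seq, n = stream[0], 1
m_for = stream[0]
for x in stream[1:]:
    m_seq, n = sequential_update(m_seq, n, x)
    m_for = forgetful_update(m_for, x)

print(m_seq, m_for)  # the forgetful centre follows the jump more closely
```

The contrast in the final centres hints at the kind of scenario the question is after.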
(b) Consider eight data points.
The following matrix shows the pairwise distances between any two points.
     1   2   3   4   5   6   7   8
1    0
2   11   0
3    5  13   0
4   12   2  14   0
5    7  17   1  18   0
6   13   4  15   5  20   0
7    9  15  12  16  15  19   0
8   11  20  12  21  17  22  30   0
Please use the agglomerative approach with the group average linkage to group these points.
Draw the corresponding dendrogram for the clustering. You are required to specify the distance metric
in the dendrogram.
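The procedure can be checked mechanically: repeatedly merge the two clusters with the smallest group-average distance, recording the merge distance at each step. The sketch below runs this on the matrix above; each printed line is one level of the dendrogram.

```python
from itertools import combinations

# Distances from the matrix above, keyed by (smaller id, larger id).
d = {
    (1, 2): 11, (1, 3): 5, (1, 4): 12, (1, 5): 7, (1, 6): 13, (1, 7): 9, (1, 8): 11,
    (2, 3): 13, (2, 4): 2, (2, 5): 17, (2, 6): 4, (2, 7): 15, (2, 8): 20,
    (3, 4): 14, (3, 5): 1, (3, 6): 15, (3, 7): 12, (3, 8): 12,
    (4, 5): 18, (4, 6): 5, (4, 7): 16, (4, 8): 21,
    (5, 6): 20, (5, 7): 15, (5, 8): 17,
    (6, 7): 19, (6, 8): 22,
    (7, 8): 30,
}

def avg_link(c1, c2):
    """Group average linkage: mean pairwise distance between two clusters."""
    pairs = [(min(a, b), max(a, b)) for a in c1 for b in c2]
    return sum(d[p] for p in pairs) / len(pairs)

clusters = [frozenset([i]) for i in range(1, 9)]
merges = []                        # (cluster 1, cluster 2, merge distance)
while len(clusters) > 1:
    c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_link(*p))
    merges.append((set(c1), set(c2), avg_link(c1, c2)))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]

for c1, c2, h in merges:           # one line per dendrogram level
    print(sorted(c1), sorted(c2), h)
```

The recorded merge distances are exactly the heights you would label on the dendrogram.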
Q5 (20 Marks)
(a) An insurance company is given a table with five input attributes, namely Race, Gender, Married, Income
and Child, and one target attribute, namely Insurance. Based on this table, the insurance company
constructed three classifiers based on different criteria, namely Classifier 1, Classifier 2 and Classifier 3.
Classifier 1
  root (split on Income)
    income = high → split on Child
      child = yes → Prediction: 100% Yes, 0% No
      child = no → Prediction: 0% Yes, 100% No
    income = low → Prediction: 0% Yes, 100% No

Classifier 2
  root (split on Gender)
    gender = male → split on Race
      race = white → Prediction: 0% Yes, 100% No
      race = black → Prediction: 100% Yes, 0% No
    gender = female → Prediction: 100% Yes, 0% No

Classifier 3
  root (split on Income)
    income = high → split on Married
      married = yes → Prediction: 100% Yes, 0% No
      married = no → Prediction: 0% Yes, 100% No
    income = low → Prediction: 100% Yes, 0% No
Consider a group of these 3 classifiers, called an "ensemble", as studied in class. Consider a new customer.
All input attribute values of this new customer are known to the insurance company. The company uses
this ensemble to do the prediction and predicts that this new customer will not buy an insurance policy.
Suppose that we are very curious about the input attribute values of this new customer. All we know about
the new customer is that he or she has a low income and that the predicted result is that this new customer
will not buy an insurance policy. We also know the 3 exact classifiers used by the insurance company. Is it
possible for us to find some of the input attribute values of this customer? If yes, please state (1) all the
input attribute values that can be found and (2) all input attribute(s) whose values cannot be found.
Otherwise, please write down the reason why we could not find those values.
(b) Consider that we want to conduct an experiment on a particular chemical. We want to test whether this
chemical has any reaction with a fixed amount of another chemical when the temperature is kept at a
certain value and the weight of this chemical is adjusted to another certain value. The following table
shows the experimental results. This table contains 2 numeric attributes, namely temperature and weight,
and one binary attribute, namely react. Each record in the following table corresponds to a chemical test.
We want to predict whether the chemical will have any reaction when the temperature is equal to 10 and
the weight of this chemical is equal to 4. Suppose that we use a 3-nearest-neighbor classifier and adopt
the Euclidean distance as the distance measure between two given points/records. What is the prediction?
Please write down the prediction (i.e., Yes or No) and the record IDs of the corresponding 3 nearest
neighbors.
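The table of experimental results is not reproduced in this copy, so the records below are hypothetical placeholders; what matters is the procedure: rank the records by Euclidean distance to the query point (temperature 10, weight 4), take the 3 nearest, and vote.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical records: (record ID, temperature, weight, react).
records = [
    (1, 8.0, 3.0, "Yes"),
    (2, 12.0, 5.0, "Yes"),
    (3, 30.0, 4.0, "No"),
    (4, 9.0, 10.0, "No"),
    (5, 11.0, 3.5, "Yes"),
]
query = (10.0, 4.0)

# Rank records by Euclidean distance to the query point.
ranked = sorted(records, key=lambda r: dist((r[1], r[2]), query))
neighbors = ranked[:3]

# Majority vote among the 3 nearest neighbors.
votes = [r[3] for r in neighbors]
prediction = max(set(votes), key=votes.count)
print([r[0] for r in neighbors], prediction)
```

With the question's actual table, the same two steps give the required record IDs and the Yes/No prediction.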
Q6 (20 Marks)
Consider the following table with three attributes, where "No. of Phones" and "No. of Laptops" are input
attributes and "Buy_NintendoSwitch" is the target attribute. Each tuple corresponds to a customer. The
attribute "Record ID" denotes the ID of each record.
Record ID  No. of Phones  No. of Laptops  Buy_NintendoSwitch
1 0 0 No
2 0 1 No
3 1 0 No
4 1 1 Yes
(a) Rewrite the above table such that the values "Yes" and "No" in attribute "Buy_NintendoSwitch" are
mapped to values 1 and 0, respectively.
(b) Consider a neural network containing a single neuron where x1 = "No. of Phones", x2 = "No. of
Laptops" and y = "Buy_NintendoSwitch".
[Figure: a single neuron with inputs x1 and x2, weights w1 and w2, and output y]
Initially, we set the values of w1, w2 and b to be 0.1 where b is a bias value in the neuron.
What are the final values of w1, w2 and b after training on the instances above?
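The copy of the paper does not state the activation function or learning rate, so the sketch below assumes a step activation (fire iff the weighted sum exceeds 0) and a learning rate of 0.1, and applies the perceptron learning rule to the four records (which encode the AND function) until the weights stabilize.

```python
# Training data from the table, with Yes/No mapped to 1/0 as in part (a).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w1, w2, b = 0.1, 0.1, 0.1       # initial values given in the question
lr = 0.1                         # assumed learning rate (not stated here)

def predict(x1, x2):
    # Assumed step activation: output 1 iff the weighted sum exceeds 0.
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# Perceptron learning rule; AND is linearly separable, so this converges.
for _ in range(20):
    for (x1, x2), y in data:
        err = y - predict(x1, x2)
        w1 += lr * err * x1
        w2 += lr * err * x2
        b += lr * err

print(w1, w2, b)
assert all(predict(x1, x2) == y for (x1, x2), y in data)
```

The specific final values depend on the assumed activation and learning rate; the question's own conventions from class should be substituted in.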
Q7 (20 Marks)
Suppose that c is a positive real number whose exact value we do not know.
(a) Consider the four 2-dimensional data points:
We can make use of PCA for dimensionality reduction. In dimensionality reduction, given an
L-dimensional data point, we want to transform this point to a K-dimensional data point, where K < L,
such that the information loss during the transformation is minimized. Suppose that L = 2 and K = 1.
Please illustrate with the above example.
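The four data points (which involve the unknown constant c) are not reproduced in this copy, so the sketch below uses hypothetical points; it walks through the PCA steps for L = 2 and K = 1: centre the data, form the 2x2 covariance matrix, take the leading eigenvector, and project each point onto it.

```python
# Hypothetical 2-D points (not the question's actual points).
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
n = len(points)

# Step 1: centre the data at the mean.
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centred = [(x - mx, y - my) for x, y in points]

# Step 2: 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centred) / n
syy = sum(y * y for _, y in centred) / n
sxy = sum(x * y for x, y in centred) / n

# Step 3: leading eigenvalue of a symmetric 2x2 matrix (quadratic formula)
# and its eigenvector, i.e. the first principal component.
lam = (sxx + syy + ((sxx - syy) ** 2 + 4 * sxy ** 2) ** 0.5) / 2
vx, vy = sxy, lam - sxx          # eigen-direction; valid when sxy != 0
norm = (vx * vx + vy * vy) ** 0.5
vx, vy = vx / norm, vy / norm

# Step 4: each 2-D point becomes one coordinate along that direction.
transformed = [x * vx + y * vy for x, y in centred]
print(transformed)
```

The variance of the transformed coordinates equals the leading eigenvalue, which is why this direction minimizes the information loss.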
(b) Consider the four 2-dimensional data points:
We can make use of PCA for dimensionality reduction. In dimensionality reduction, given an
L-dimensional data point, we want to transform this point to a K-dimensional data point, where K < L,
such that the information loss during the transformation is minimized. Suppose that L = 2 and K = 1.
Can we make use of the answers in part (a) to perform the dimensionality reduction? If yes, please write
down each transformed data point. If no, please write down the reasons why we cannot make use of the
answers of part (a).
(c) Consider the four 2-dimensional data points:
We can make use of PCA for dimensionality reduction. In dimensionality reduction, given an
L-dimensional data point, we want to transform this point to a K-dimensional data point, where K < L,
such that the information loss during the transformation is minimized. Suppose that L = 2 and K = 1.
Can we make use of the answers in part (a) to perform the dimensionality reduction? If yes, please write
down each transformed data point. If no, please write down the reasons why we cannot make use of the
answers of part (a).
Q8 (20 Marks)
We are given the following adjacency matrix according to four sites, namely p, q, r and s.
    p  q  r  s
p   1  1  1  0
q   1  0  0  1
r   0  1  1  0
s   0  1  0  0
(a) Is it possible to find the corresponding stochastic matrix? If yes, write down the stochastic matrix.
Otherwise, please explain why not.
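One common convention (the one used in PageRank; whether your course normalizes rows or columns may differ, so treat this as an assumption) is to set entry (i, j) of the stochastic matrix to 1/outdeg(j) when site j links to site i, so that every column sums to 1. A sketch on the matrix above, reading each row as the out-links of that site:

```python
sites = ["p", "q", "r", "s"]
A = [
    [1, 1, 1, 0],   # links out of p
    [1, 0, 0, 1],   # links out of q
    [0, 1, 1, 0],   # links out of r
    [0, 1, 0, 0],   # links out of s
]

n = len(A)
outdeg = [sum(row) for row in A]

# M[i][j] = 1/outdeg(j) if j links to i, else 0; columns then sum to 1.
M = [[(A[j][i] / outdeg[j]) if outdeg[j] else 0.0 for j in range(n)]
     for i in range(n)]

for name, row in zip(sites, M):
    print(name, [round(v, 3) for v in row])
```

Note that the construction only fails when some site has no out-link at all (a zero row), which is the case part (a) asks you to check for.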
(b) We are given the following 12 webpages, namely w1, w2, …, w12.
[Figure: a web graph over the 12 pages; the visible labels are the names Raymond Chan, Linfei Chen and
Hao Liu and the pages w1, w2, ..., w9]
The query terms typed by the user are "Raymond" and "Wong".
(i) What is the root set in this query? Please list the webpages in this set.
(ii) What is the base set in this query? Please list the webpages in this set.
Q9 (20 Marks)
In class, we were given the following table describing the scenario that "parts are bought from
suppliers and then sold to customers at a sale price SP".
Then, we could answer a query like "for each customer, find the sum of the sale prices (SP) (i.e., find
SUM(SP))". Suppose the total number of records in the output of this query is 0.1M. In class, we learnt
that we represent this query and its output by "c 0.1M". We also learnt how to derive (or obtain) the output
of query "c" from the output of query "sc", and how to construct the following graph (or figure) based on
this derivation. Suppose that we materialize the outputs of all queries.
psc 6M
pc 4M ps 0.8M sc 2M
none 1
(a) In this question, consider another scenario that we are given the following transactions. Suppose that the
support threshold is equal to 1.
A B C D
1 1 0 1
1 1 0 1
0 0 1 0
1 1 1 1
1 0 1 0
Now, we would like to answer a query like "find all frequent itemsets such that each frequent itemset
contains at least one item from the set {A, B, D}". Two examples in this output are {A, B} and {B, C}, but {C}
is not in this output. Note that the frequency of each frequent itemset is not needed in the output of this
query. Suppose that the total number of frequent itemsets in this output is a number x (to be found by you in
this question). Then, similar to what we learnt in class, we represent this query and its output by "{A, B,
D} x". Based on the concept we learnt in class, we could derive (or obtain) the output of query {A, D} from
the output of query {A, B, D}. We could construct the following graph due to this derivation. Note that each
variable x (with a subscript) in the following corresponds to a number to be found by you in this question.
Suppose that we materialize the outputs of all queries.
{A, B, C, D} xABCD
{A, B, C} xABC   {A, B, D} xABD   {A, C, D} xACD   {B, C, D} xBCD
{A, B} xAB   {A, C} xAC   {A, D} xAD   {B, C} xBC   {B, D} xBD   {C, D} xCD
{A} xA   {B} xB   {C} xC   {D} xD
none xnone
(i) Please state all frequent itemsets (i.e., query “{A, B, C, D}”).
You are not required to give the frequency of each itemset.
(ii) Please find the value of each variable x (e.g., xABCD, xABC and xnone).
(iii) Assume that we do not consider “none xnone”. Now, suppose that 4 views (instead of all views) are to
be materialized (other than the top view). Apply the greedy algorithm and find the resulting views.
(Note: For each iteration/selection in the greedy algorithm, if there are ties, just pick the query in the
lexicographical order (or alphabetical order) (e.g., {A, B} is ordered before {A, C} in this ordering)).
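As a reminder of the mechanics, the greedy algorithm repeatedly materializes the view with the largest benefit, where the benefit of a view v sums, over every view w derivable from v, the saving max(0, cost of w's current cheapest materialized ancestor minus the size of v). The sketch below runs it on the p/s/c lattice from the introduction; the sizes of views p and s are not given in this copy, so the values 0.2M and 0.01M are assumptions for illustration only.

```python
# Sizes in millions; psc, pc, ps, sc, c are from the text, p and s assumed.
size = {"psc": 6.0, "pc": 4.0, "ps": 0.8, "sc": 2.0,
        "p": 0.2, "s": 0.01, "c": 0.1}
# For each view, the set of views derivable from it (its descendants + itself).
below = {
    "psc": {"psc", "pc", "ps", "sc", "p", "s", "c"},
    "pc": {"pc", "p", "c"}, "ps": {"ps", "p", "s"}, "sc": {"sc", "s", "c"},
    "p": {"p"}, "s": {"s"}, "c": {"c"},
}

materialized = {"psc"}            # the top view is always materialized

def cost(view):
    """Size of the cheapest materialized view that can answer `view`."""
    return min(size[m] for m in materialized if view in below[m])

def benefit(v):
    return sum(max(0.0, cost(w) - size[v]) for w in below[v])

picked = []
for _ in range(3):                # materialize 3 extra views greedily
    best = None
    for v in sorted(size):        # sorted + strict ">" breaks ties
        if v in materialized:     # lexicographically, as the note requires
            continue
        if best is None or benefit(v) > benefit(best):
            best = v
    materialized.add(best)
    picked.append(best)
print(picked)
```

The same loop, run for 4 iterations over the itemset lattice with the x values you computed, answers part (iii).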
(b) In Part (a), we did not consider any Apriori property to construct the graph. Now, we want to use the
Apriori property learnt in class to “reduce” the total size of the storage.
(i) We know that the Apriori property is in the form of “if <itemset 1> is frequent, then <itemset 2> is
frequent.” What is the relationship between <itemset 1> and <itemset 2>?
(ii) Consider this transactional dataset only. Due to this Apriori property, we do not need to store all
frequent itemsets for each query. We could store fewer frequent itemsets for each query. For example,
before we use the Apriori property, for a particular query, we store the set S of all frequent itemsets
for this query, and when we answer this query, we obtain set S and return it as the output. However,
after we use this Apriori property, for this query, we store a subset S’ of S for this query, and when
we answer this query, we obtain set S’, derive S’’ based on S’ and return this derived set S’’ as the
output (where S’’ is equal to S). In the above, x (with a subscript) denotes the total number of all
frequent itemsets for each query. Let y (with a subscript) denote the smallest possible total number of
all frequent itemsets “stored” for each query. Please give the value of each variable y (e.g., yABCD,
yABC and ynone).
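One natural choice for S' (an assumption here, not something the question states) is the set of maximal frequent itemsets: since the Apriori property guarantees every nonempty subset of a frequent itemset is frequent, S can be regenerated from its maximal members. A sketch on a hypothetical downward-closed set S:

```python
from itertools import combinations

def subsets(itemset):
    """All nonempty subsets of an itemset."""
    return {frozenset(c) for r in range(1, len(itemset) + 1)
            for c in combinations(itemset, r)}

# A hypothetical set S of all frequent itemsets (downward closed, as the
# Apriori property guarantees; not this question's actual answer set).
S = {frozenset(x) for x in
     [("A",), ("B",), ("D",), ("A", "B"), ("A", "D"), ("B", "D"),
      ("A", "B", "D"), ("C",)]}

# Store only S': the maximal frequent itemsets (no frequent proper superset).
S_prime = {s for s in S if not any(s < t for t in S)}

# Answering the query: derive S'' by expanding every stored maximal itemset.
S_double_prime = set().union(*(subsets(s) for s in S_prime))

print(sorted(tuple(sorted(s)) for s in S_prime))
assert S_double_prime == S    # nothing is lost by the compression
```

For the constrained queries in part (a), the derived set would additionally be filtered by the query's item constraint before being returned.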
Q10 (20 Marks)
Consider a classification problem where the target attribute contains two possible values, "Yes" and "No".
We are given a training dataset and generate a classifier based on this dataset. We find that there are exactly
30 tuples whose target attribute values are predicted as "Yes", and exactly 20 tuples whose target attribute
values are predicted as "No". We also know that the f-measure of this classifier is 16/22 and the accuracy of
this classifier is 0.70 (out of 1.0) (i.e., 70%).
(a) Is it a must that we can find the number of false positives? If yes, please write down the number of
false positives. Otherwise, please elaborate why we cannot find it.
(b) Is it a must that we can find the precision of this classifier? If yes, please write down the precision of
this classifier. Otherwise, please elaborate why we cannot find it.
(c) Is it a must that we can find the recall of this classifier? If yes, please write down the recall of this
classifier. Otherwise, please elaborate why we cannot find it.
(d) Is it a must that we can find the specificity of this classifier? If yes, please write down the specificity
of this classifier. Otherwise, please elaborate why we cannot find it.
(e) Is it a must that we can find the decile-wise lift chart of this classifier? If yes, please write down the
decile-wise lift chart of this classifier. Otherwise, please elaborate why we cannot find it.
End of Paper