COMP1942 Final Exam Questions 2016
Instructions:
(1) Please answer all questions in Part A in the answer sheet.
(2) You can optionally answer the bonus question in Part B in the answer sheet. You can obtain additional
marks for the bonus question if you answer it correctly.
(3) You can use a calculator.
Q1 (20 Marks)
(a) The following shows an FP-tree which is constructed from a set of transactions. Let the support
threshold be 1. Please write down the corresponding transactions which are used to generate the FP-tree.
[Figure: an FP-tree with a header table (item, head of node-link) listing items b, a and c. Besides the root, the tree contains nodes labelled b:9, a:2, a:5, c:1, c:1 and c:2; the exact parent-child links are not recoverable from the extracted text.]
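For reference, a minimal sketch of how an FP-tree is typically constructed from transactions; the transactions below are hypothetical and are not the ones behind the figure above.

# Minimal FP-tree construction sketch (hypothetical transactions, not the ones in the figure).
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                     # item -> FPNode

def build_fp_tree(transactions, min_support=1):
    # 1. Count item frequencies and keep items meeting the support threshold.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_support}

    # 2. Insert each transaction with its items ordered by descending frequency.
    root = FPNode(None)
    header = defaultdict(list)                 # item -> list of nodes (node-links)
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

if __name__ == "__main__":
    # Hypothetical transactions for illustration only.
    tree, header = build_fp_tree([["b", "a"], ["b", "a", "c"], ["b", "c"], ["a"]])
    print({item: [n.count for n in nodes] for item, nodes in header.items()})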
(b) In the Apriori algorithm, we know how to find some sets L1, C2, L2, ….
(i) Is it always true that the number of itemsets in L2 is smaller than or equal to the number of itemsets
in C2? If yes, please explain it. Otherwise, please give a counter example.
(ii) Is it always true that the number of itemsets in C2 is larger than or equal to the number of itemsets in
L1? If yes, please explain it. Otherwise, please give a counter example.
Q2 (20 Marks)
Q3 (20 Marks)
Q4 (20 Marks)
[Figure: a Bayesian network with nodes FamilyHistory (FH), Smoker (S), LungCancer (LC) and PositiveXRay (PR). FH and S are the parents of LC, and LC is the parent of PR. The conditional probability tables read from the figure are:
P(LC = Yes | FH = Yes, S = Yes) = 0.7
P(LC = Yes | FH = Yes, S = No) = 0.45
P(LC = Yes | FH = No, S = Yes) = 0.55
P(LC = Yes | FH = No, S = No) = 0.2
P(PR = Yes | LC = Yes) = 0.85
P(PR = Yes | LC = No) = 0.45]
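For reference, a minimal sketch showing how the conditional probabilities recovered above can be combined; the query computed below (the chance of a positive X-ray for a smoker with no family history) is purely illustrative and is not a sub-question of this paper.

# Minimal sketch: combining the CPT values recovered from the figure above.
# P(LC = Yes | FH, S), as read from the figure.
p_lc_yes = {("Yes", "Yes"): 0.7, ("Yes", "No"): 0.45,
            ("No", "Yes"): 0.55, ("No", "No"): 0.2}
# P(PR = Yes | LC), as read from the figure.
p_pr_yes = {"Yes": 0.85, "No": 0.45}

def prob_pr_yes(fh, s):
    """P(PR = Yes | FH = fh, S = s), summing over the two values of LC."""
    p_lc = p_lc_yes[(fh, s)]
    return p_lc * p_pr_yes["Yes"] + (1 - p_lc) * p_pr_yes["No"]

# Illustrative query: smoker with no family history.
print(prob_pr_yes("No", "Yes"))   # 0.55 * 0.85 + 0.45 * 0.45 = 0.67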
Q5 (20 Marks)
(a) We are given a table with six input attributes, namely Race, Gender, Education, Married, Income and
Child, and one target attribute, namely Insurance. Based on this table, we construct three classifiers
based on different criteria, namely Classifier 1, Classifier 2 and Classifier 3.
Classifier 1
[Figure: a decision tree. The root splits on income (income=high / income=low). One income branch leads directly to the prediction 0% Yes / 100% No; the other branch splits on child, where child=yes predicts 100% Yes / 0% No and child=no predicts 0% Yes / 100% No.]
Classifier 2
[Figure: a decision tree. The root splits on gender (gender=male / gender=female). One gender branch leads directly to the prediction 100% Yes / 0% No; the other branch splits on race, where race=white predicts 0% Yes / 100% No and race=black predicts 100% Yes / 0% No.]
Classifier 3
[Figure: a decision tree. The root splits on education (education=high / education=low). One education branch leads directly to the prediction 100% Yes / 0% No; the other branch splits on Married, where Married=yes predicts 100% Yes / 0% No and Married=no predicts 0% Yes / 100% No.]
4/11
COMP1942 Question Paper
Consider a group of classifiers, called an "ensemble", studied in class. Suppose that we want to predict whether a married male customer of the "black" race, who has high education, high income and a child, will buy an insurance policy. What is the overall predicted result (i.e., whether the customer will buy an insurance policy)? Please elaborate on how you obtain the overall predicted result.
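For reference, a minimal sketch of majority voting in an ensemble; the three classifier functions below are simple placeholders, not encodings of Classifiers 1 to 3 above.

# Minimal majority-voting sketch (placeholder classifiers, not Classifiers 1-3).
from collections import Counter

def classifier_1(record): return "Yes" if record["income"] == "high" else "No"     # placeholder rule
def classifier_2(record): return "Yes" if record["gender"] == "female" else "No"   # placeholder rule
def classifier_3(record): return "Yes" if record["education"] == "high" else "No"  # placeholder rule

def ensemble_predict(record, classifiers):
    votes = Counter(clf(record) for clf in classifiers)
    return votes.most_common(1)[0][0]          # the label with the most votes

record = {"race": "black", "gender": "male", "education": "high",
          "married": "yes", "income": "high", "child": "yes"}
print(ensemble_predict(record, [classifier_1, classifier_2, classifier_3]))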
(b) We want to predict the target attribute of a new record with input attribute 1 equal to 5 and input attribute 2 equal to 2. Suppose that we want to use a 3-nearest neighbor classifier and we adopt the Euclidean distance as the distance measure between two given points. What is the target attribute of this record? Please write down the target attribute of this record and the record IDs of the corresponding 3 nearest neighbors.
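For reference, a minimal sketch of a 3-nearest-neighbor prediction under the Euclidean distance; the training records below are hypothetical placeholders, while the query point (5, 2) is the one named in the question.

# Minimal 3-nearest-neighbor sketch with Euclidean distance (hypothetical training records).
import math
from collections import Counter

# (record ID, attribute 1, attribute 2, target) -- hypothetical values.
records = [(1, 1.0, 1.0, "Yes"), (2, 4.0, 3.0, "No"), (3, 6.0, 2.0, "Yes"),
           (4, 5.0, 5.0, "No"),  (5, 7.0, 1.0, "Yes")]
query = (5.0, 2.0)   # input attribute 1 = 5, input attribute 2 = 2

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

nearest = sorted(records, key=lambda r: euclidean((r[1], r[2]), query))[:3]
prediction = Counter(r[3] for r in nearest).most_common(1)[0][0]
print([r[0] for r in nearest], prediction)   # IDs of the 3 nearest neighbors and the majority label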
(c) We know how to compute the impurity measurement of an attribute A under the ID3 decision tree,
denoted by Imp-ID3(A). We also know how to compute the impurity measurement of an attribute A
under the CART decision tree, denoted by Imp-CART(A). Consider two attributes A and B. Is it always
true that if Imp-CART(A) > Imp-CART(B), then Imp-ID3(A) > Imp-ID3(B)? If yes, please show that it
is true. Otherwise, please give a counter example showing that this is not true and then explain it.
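For reference, a minimal sketch of an entropy-based impurity (as used in ID3) and a Gini-based impurity (as used in CART), each computed as the weighted average impurity over the branches of a split; the labels below are hypothetical.

# Minimal sketch: entropy-based (ID3-style) vs Gini-based (CART-style) impurity of a split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(branches, measure):
    """Weighted average impurity of the branches produced by splitting on an attribute."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * measure(b) for b in branches)

# Hypothetical target labels in the two branches of a split on attribute A.
branches_A = [["Yes", "Yes", "No"], ["No", "No"]]
print(split_impurity(branches_A, entropy))   # Imp-ID3(A)
print(split_impurity(branches_A, gini))      # Imp-CART(A)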
Q6 (20 Marks)
(a) In lecture notes, we know that a neural network containing only one neuron can only solve linearly
separable problems.
Suppose that we want to solve a non-linearly separable problem.
In lecture notes, we know that we can solve this problem by using a multi-layer perceptron (i.e., a neural
network which contains multiple layers and each layer contains some neurons).
Suppose that we still want to use the neural network containing only one neuron.
Is it possible to solve this non-linearly separable problem by some "additional" steps (e.g., data
preprocessing)? If yes, please give a method to solve this problem using this neural network. If no,
please give some reasons why this neural network cannot solve this problem by any additional steps.
(b) There are 10 data points in the dataset, namely data points 1, 2, …, 10. When we use the XLMiner
software to perform “Hierarchical Clustering”, we obtain the following result. In class, we learnt how to
analyze the table in the result. Suppose that we want to find two clusters. Please give all data points in
each of these two clusters.
Q7 (20 Marks)
Consider a classification problem where the target attribute contains two possible values, “Yes” and “No”.
We are given a training dataset. We generate a classifier based on this dataset. We find that there are exactly ten tuples whose target attribute values are predicted as "Yes", and there are exactly eighteen tuples whose target attribute values are predicted as "No". We also know that the specificity of this classifier is 0.85 (or 85%) and the precision of this classifier is 0.70 (or 70%).
(a) Is it a must that we can find the accuracy of this classifier? If yes, please write down the accuracy of
this classifier. Otherwise, please elaborate why we cannot find it.
(b) Is it a must that we can find the recall of this classifier? If yes, please write down the recall of this
classifier. Otherwise, please elaborate why we cannot find it.
(c) Is it a must that we can find the f-measure of this classifier? If yes, please write down the f-measure
of this classifier. Otherwise, please elaborate why we cannot find it.
(d) Is it a must that we can find the number of false negatives? If yes, please write down the number of
false negatives. Otherwise, please elaborate why we cannot find it.
(e) Is it a must that we can find the decile-wise lift chart of this classifier? If yes, please write down the
decile-wise lift chart of this classifier. Otherwise, please elaborate why we cannot find it.
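For reference, a minimal sketch of how accuracy, precision, recall, specificity and the f-measure are computed from confusion-matrix counts; the counts below are hypothetical.

# Minimal sketch of the metrics named in Q7, from hypothetical confusion-matrix counts.
TP, FP, TN, FN = 7, 3, 17, 1       # hypothetical counts

accuracy    = (TP + TN) / (TP + FP + TN + FN)
precision   = TP / (TP + FP)       # of the tuples predicted "Yes", the fraction that are truly "Yes"
recall      = TP / (TP + FN)       # of the truly "Yes" tuples, the fraction predicted "Yes"
specificity = TN / (TN + FP)       # of the truly "No" tuples, the fraction predicted "No"
f_measure   = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f_measure)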
Q8 (20 Marks)
We are given the following 4 data points: (6, 6), (8, 8), (5, 9), (9, 5). Use PCA to reduce from two
dimensions to one dimension for each of these 4 data points. In this part, please show your steps.
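For reference, a minimal sketch of the usual PCA steps (centre the data, compute the covariance matrix, take the leading eigenvector, project); the points below are hypothetical and are deliberately not the four points given in this question.

# Minimal PCA sketch: project 2-D points onto the first principal component.
import numpy as np

X = np.array([[2.0, 1.0], [3.0, 4.0], [5.0, 3.0], [6.0, 6.0]])   # hypothetical points

X_centered = X - X.mean(axis=0)                  # 1. centre the data
cov = np.cov(X_centered, rowvar=False)           # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)           # 3. eigen-decomposition (eigenvalues ascending)
pc1 = eigvecs[:, -1]                             #    eigenvector with the largest eigenvalue
reduced = X_centered @ pc1                       # 4. 1-D coordinate of each point
print(pc1, reduced)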
Q9 (20 Marks)
Consider a table T: (part, supplier, customer, price) where "part" is an attribute for parts, "supplier" is an
attribute for suppliers, "customer" is an attribute for customers and "price" is an attribute for prices. A
record (p, s, c, x) means that the part p, supplied by supplier s and bought by customer c, has its price x.
Suppose that the total size of this table is 10GB. We materialize this table.
(a) Consider the following six queries, namely Q1, Q2, Q3, Q4, Q5 and Q6.
Q1: We want to find the total price (or the sum of the prices) for each combination of part and customer.
Q2: We want to find the total price (or the sum of the prices) for each part.
Q3: We want to find the total number of records in T for each combination of part and customer.
Q4: We want to find the total number of records in T for each part.
Q5: We want to find the average price for each combination of part and customer.
Q6: We want to find the average price for each part.
Suppose that we materialize the answers of Q1, Q3 and Q5. Each of these answers occupies 1GB
storage.
In class, we learnt that we can find the answer of Q2 from the answer of Q1 only.
(i) Is it a must that we can find the answer of Q4 from the answer of Q3 only? If yes, please explain it. Otherwise, please state what additional information (in addition to the answer of Q3) we can use with the minimum overall access cost (in terms of the total size of all materialized views accessed) and explain it.
(ii) Is it a must that we can find the answer of Q6 from the answer of Q5 only? If yes, please explain it. Otherwise, please state what additional information (in addition to the answer of Q5) we can use with the minimum overall access cost (in terms of the total size of all materialized views accessed) and explain it.
(b) Consider a classification problem for the table with two input attributes, namely A1 and A2, and one
target attribute Y, containing 200 records.
(i) In the support vector machine, we learnt that we want to maximize the margin in a classification
problem. We learnt that the margin is equal to 2/√(w1² + w2²), where w1 and w2 are two variables to be found. In class, we learnt that we need to re-write the objective function as w1² + w2² and then we want to minimize this objective function. Why do we need to re-write this objective function?
(ii) In the support vector machine, how many constraints are there in the form Y(w1A1 + w2A2 + b) ≥ 1, where w1, w2 and b are three variables to be found?
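For reference, a minimal sketch that fits a linear SVM with scikit-learn (an assumed external library, not part of the course software) on hypothetical data and evaluates the margin 2/√(w1² + w2²).

# Minimal sketch: fit a linear SVM and evaluate the margin 2 / sqrt(w1^2 + w2^2).
import numpy as np
from sklearn.svm import SVC        # assumes scikit-learn is available

# Hypothetical linearly separable data: (A1, A2) with labels Y in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 1.5], [4.0, 4.0], [5.0, 4.5]])
Y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, Y)   # a large C approximates a hard margin
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
margin = 2 / np.sqrt(w1 ** 2 + w2 ** 2)
print(w1, w2, b, margin)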
[Figure: a web graph over nine webpages w1, w2, ..., w9; the hyperlink structure is not recoverable from the extracted text.]
The query terms typed by the user are "Raymond" and "Wong".
(i) What is the root set in this query? Please list the webpages in this set.
(ii) What is the base set in this query? Please list the webpages in this set.
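For reference, a minimal sketch of how a root set and a base set are typically formed in HITS-style link analysis; the pages containing the query terms and the link structure below are hypothetical assumptions, since the figure's links are not recoverable.

# Minimal sketch: root set and base set construction (hypothetical pages and links).
pages_with_terms = {"w2", "w5"}                      # pages containing both "Raymond" and "Wong" (hypothetical)
links = {("w1", "w2"), ("w2", "w3"), ("w4", "w5"),   # directed edges (from, to) -- hypothetical
         ("w5", "w6"), ("w7", "w2")}

root_set = set(pages_with_terms)

# Base set: the root set plus every page that links to, or is linked from, a root-set page.
base_set = set(root_set)
for src, dst in links:
    if src in root_set:
        base_set.add(dst)
    if dst in root_set:
        base_set.add(src)

print(root_set, base_set)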
We are given a sequence of data points (where each data point is a single item) which come in real time.
Let s be the support threshold (as a fraction) for finding frequent items (e.g., 20%). That is, given a sequence of n data points, if the frequency of an item over this sequence is at least sn, then this item is a frequent item.
Consider the first problem of finding frequent items on the sequence from the first data point to the k-th data
point where k is a positive integer.
We are given an algorithm called SpaceSaving. Given a sequence S of data points, the summary stored in the Space-Saving algorithm for this sequence is denoted by SpaceSaving(S). Let X = SpaceSaving(S). X contains two components. The first component, denoted by X.E, contains a list of entries, each in the form of (e, f, ε) where e is the item number, f is the frequency of the item (recorded after this entry is created in the summary) and ε is the maximum possible error in f. The second component, denoted by X.p, is equal to the variable pe used in algorithm Space-Saving (which is the greatest possible frequency error (i.e., the greatest possible difference between f and fo where f is the frequency stored in the entry of an item and fo is the (actual) frequency of the item over the sequence S)).
For each item stored in the summary X in the form of (e, f, ε), the estimated frequency of this item is defined to be f + ε. For each item not stored in the summary X, the estimated frequency of this item is defined to be X.p. For each item, the relative error in the estimated frequency of this item for algorithm SpaceSaving is defined to be (f' – fo)/fo where f' is the estimated frequency of this item and fo is the (actual) frequency of this item over the sequence S.
Let Y be the greatest possible number of entries stored in the memory used by algorithm SpaceSaving. It is
shown that the greatest relative error in the estimated frequency of an item for algorithm SpaceSaving is
equal to 1/Y.
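For reference, a minimal sketch of the standard Space-Saving update rule (keep at most Y counters; when an unseen item arrives and the summary is full, replace the entry with the smallest count); the exact bookkeeping may differ from the variant described above.

# Minimal sketch of the standard Space-Saving update (at most Y counters).
def space_saving(stream, Y):
    counters = {}   # item -> [f, eps]: estimated count and maximum possible error
    for e in stream:
        if e in counters:
            counters[e][0] += 1
        elif len(counters) < Y:
            counters[e] = [1, 0]
        else:
            # Evict the entry with the smallest count and use that count as the error bound.
            victim = min(counters, key=lambda k: counters[k][0])
            f_min = counters.pop(victim)[0]
            counters[e] = [f_min + 1, f_min]
    return counters

print(space_saving(["a", "b", "a", "c", "a", "d", "b"], Y=2))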
We want to make use of the result from the first problem to address the second problem to be described next.
Consider the second problem of finding frequent items over a sliding window (i.e., finding frequent items
on the sequence from the k-th data point to the k’-th data point where k and k’ are two positive integers and
k < k’). Assume that we use the batch-based approach (which will be elaborated next) for this purpose. Let
B be the batch size. The first B data points form the first batch. The next B data points form the second
batch. We can also form other batches for the remaining data points. Let Bi be the i-th batch. Whenever we
reach the boundary of the batch (i.e., whenever we finish reading the last data point in the batch and are
ready to read the first data point in the next batch), we want to return all frequent items over the 4 recent
batches. Define N = 4B.
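For reference, a minimal sketch of one plausible batch-based scheme, assuming one summary is kept per batch and per-batch estimates of the form described below (f + ε if the item is stored, X.p otherwise) are summed over the 4 recent batches; this is only an illustrative assumption, not necessarily the scheme intended by the paper.

# Minimal sketch: combining per-batch summaries over the 4 recent batches.
# Each summary is represented as (entries, p): entries maps item -> (f, eps), and p plays the role of X.p.
def estimate(e, entries, p):
    """Estimated frequency of e from one batch summary: f + eps if stored, otherwise p."""
    if e in entries:
        f, eps = entries[e]
        return f + eps
    return p

def frequent_over_recent_batches(summaries, s, B):
    """summaries: list of (entries, p) for the 4 recent batches; returns items with estimate >= s * 4B."""
    N = 4 * B
    candidates = {e for entries, _ in summaries for e in entries}
    return {e for e in candidates
            if sum(estimate(e, entries, p) for entries, p in summaries) >= s * N}

# Hypothetical summaries for the 4 recent batches (batch size B = 4).
summaries = [({"a": (3, 0), "b": (2, 1)}, 1),
             ({"a": (2, 0), "c": (2, 1)}, 1),
             ({"b": (3, 0), "a": (1, 1)}, 1),
             ({"a": (2, 0), "b": (2, 0)}, 0)]
print(frequent_over_recent_batches(summaries, s=0.2, B=4))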
Given an item e and a summary X, h(e, X) is equal to f + ε if there exists an entry (e, f, ε) for e in X.E. It is equal to X.p otherwise.
Note that the above symbol "←" means that the content on the right-hand side of this symbol is assigned to the content on the left-hand side of this symbol.
Suppose that the memory size for M2 is 4096 bytes. (Note: you can regard a "byte" as a storage unit in the memory.) Consider a summary X. Each entry in the form of (e, f, ε) which is stored in the first component of X (i.e., X.E) occupies 12 bytes. The second component of X (i.e., X.p) occupies 4 bytes.
For each item, what is the greatest relative error in the estimated frequency of this item for the above
algorithm? Please show your steps. Given an item e in an entry, the relative error in the estimated frequency
of this item is defined to be (g(e) – fo)/fo where fo is the (actual) frequency of this item over the sequence of
the 4 recent batches.
End of Paper