
Technical University of Denmark

Written examination: December 18th 2018, 9 AM - 1 PM.

Course name: Introduction to Machine Learning and Data Mining.

Course number: 02450.

Aids allowed: All aids permitted.

Exam duration: 4 hours.

Weighting: The individual questions are weighted equally.

Please hand in your answers using the electronic file. Only use this page in the case where digital hand-in is unavailable. If you have to hand in the answers using the form on this sheet, please follow these instructions:
Print your name and study number clearly. The exam is multiple choice. Each question has four possible answers marked by the letters A, B, C, and D, as well as the answer "Don't know" marked by the letter E. A correct answer gives 3 points, a wrong answer gives -1 point, and "Don't know" (E) gives 0 points.
The individual questions are answered by filling in the answer fields with one of the letters A, B, C, D, or E.

Answers:

1 2 3 4 5 6 7 8 9 10

11 12 13 14 15 16 17 18 19 20

21 22 23 24 25 26 27

Name:

Student number:

PLEASE HAND IN YOUR ANSWERS DIGITALLY.


USE ONLY THIS PAGE FOR HAND IN IF YOU ARE
UNABLE TO HAND IN DIGITALLY.

No. Attribute description Abbrev.
x1 intercolumnar distance interdist
x2 upper margin upperm
x3 lower margin lowerm
x4 exploitation exploit
x5 row number row nr.
x6 modular ratio modular
x7 interlinear spacing interlin
x8 weight weight
x9 peak number peak nr.
x10 modular ratio/ interlinear spacing mr/is
y Who copied the text? Copyist

Table 1: Description of the features of the Avila Bible dataset used in this exam. The dataset has been extracted from images of the 'Avila Bible', a XII-century giant Latin copy of the Bible. The prediction task consists of associating each pattern with one of three copyists (a copyist is the monk who copied the text in the Bible), indicated by the y-value. Note that only a subset of the dataset is used. The dataset used here consists of N = 525 observations, and the attribute y is discrete, taking values y = 1, 2, 3 corresponding to the three different copyists.

Figure 1: Plot of observations x2, x3, x9, x10 of the Avila Bible dataset of Table 1 as percentile plots.
Question 1.
The main dataset used in this exam is the Avila Bible dataset¹ shown in Table 1.
In Figure 1 and Figure 2 are shown, respectively, percentile plots and boxplots of the Avila Bible dataset based on the attributes x2, x3, x9, x10 found in Table 1. Which percentile plots match which boxplots?

A. Boxplot 1 is mr/is, Boxplot 2 is lowerm, Boxplot 3 is upperm and Boxplot 4 is peak nr.

B. Boxplot 1 is upperm, Boxplot 2 is lowerm, Boxplot 3 is peak nr. and Boxplot 4 is mr/is

C. Boxplot 1 is upperm, Boxplot 2 is peak nr., Boxplot 3 is mr/is and Boxplot 4 is lowerm

D. Boxplot 1 is mr/is, Boxplot 2 is lowerm, Boxplot 3 is peak nr. and Boxplot 4 is upperm

E. Don't know.

Figure 2: Boxplots corresponding to the variables plotted in Figure 1, but not necessarily in that order.

¹ Dataset obtained from https://archive.ics.uci.edu/ml/datasets/Avila

Question 2.
A Principal Component Analysis (PCA) is carried out on the Avila Bible dataset in Table 1 based on the attributes x1, x3, x5, x6, x7.
The data is standardized by (i) subtracting the mean and (ii) dividing each column by its standard deviation to obtain the standardized matrix X̃. A singular value decomposition is then carried out on the standardized matrix to obtain the decomposition U S V^T = X̃, where

V = \begin{bmatrix}
 0.04 & -0.12 & -0.14 &  0.35 &  0.92 \\
 0.06 &  0.13 &  0.05 & -0.92 &  0.37 \\
-0.03 & -0.98 &  0.08 & -0.16 & -0.05 \\
-0.99 &  0.03 &  0.06 & -0.02 &  0.07 \\
-0.07 & -0.05 & -0.98 & -0.11 & -0.11
\end{bmatrix}   (1)

S = \begin{bmatrix}
14.4 & 0.0  & 0.0  & 0.0  & 0.0 \\
 0.0 & 8.19 & 0.0  & 0.0  & 0.0 \\
 0.0 & 0.0  & 7.83 & 0.0  & 0.0 \\
 0.0 & 0.0  & 0.0  & 6.91 & 0.0 \\
 0.0 & 0.0  & 0.0  & 0.0  & 6.01
\end{bmatrix}

Figure 3: Black dots show attributes x5 and x7 of the Avila Bible dataset from Table 1. The two points corresponding to the colored markers indicate two specific observations A, B.

Which one of the following statements is true?

A. The variance explained by the first principal component is greater than 0.45

B. The variance explained by the first four principal components is less than 0.85

C. The variance explained by the last four principal components is greater than 0.56

D. The variance explained by the first three principal components is less than 0.75

E. Don't know.
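For questions of this type, the variance explained by a set of principal components follows from the squared singular values on the diagonal of S; a minimal Python sketch using the values above:

import numpy as np

# Singular values from the diagonal of S in Question 2.
s = np.array([14.4, 8.19, 7.83, 6.91, 6.01])

# The variance captured by each component is proportional to its squared singular value.
var_explained = s**2 / np.sum(s**2)

print(var_explained)            # fraction of variance per principal component
print(var_explained[:1].sum())  # first component
print(var_explained[:3].sum())  # first three components
print(var_explained[:4].sum())  # first four components
print(var_explained[1:].sum())  # last four components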

Question 3.
Consider again the PCA analysis of the Avila Bible dataset. In Figure 3 the features x5 and x7 from Table 1 are plotted as black dots. We have indicated two special observations as colored markers (Point A and Point B). We can imagine that the dataset, along with the two special observations, is projected onto the first two principal component directions given in V as computed earlier (see Equation (1)). Which one of the four plots in Figure 4 shows the correct PCA projection?

A. Plot A

B. Plot B

C. Plot C

D. Plot D

E. Don't know.

Figure 4: Candidate plots of the observations and path shown in Figure 3 projected onto the first two principal components considered in Equation (1). The colored markers still refer to points A and B, now in the coordinate system corresponding to the PCA projection.
        o1    o2    o3    o4    o5    o6    o7    o8    o9    o10
o1      0.0   2.91  0.63  1.88  1.02  1.82  1.92  1.58  1.08  1.43
o2      2.91  0.0   3.23  3.9   2.88  3.27  3.48  4.02  3.08  3.47
o3      0.63  3.23  0.0   2.03  1.06  2.15  2.11  1.15  1.09  1.65
o4      1.88  3.9   2.03  0.0   2.52  1.04  2.25  2.42  2.18  2.17
o5      1.02  2.88  1.06  2.52  0.0   2.44  2.38  1.53  1.71  1.94
o6      1.82  3.27  2.15  1.04  2.44  0.0   1.93  2.72  1.98  1.8
o7      1.92  3.48  2.11  2.25  2.38  1.93  0.0   2.53  2.09  1.66
o8      1.58  4.02  1.15  2.42  1.53  2.72  2.53  0.0   1.68  2.06
o9      1.08  3.08  1.09  2.18  1.71  1.98  2.09  1.68  0.0   1.48
o10     1.43  3.47  1.65  2.17  1.94  1.8   1.66  2.06  1.48  0.0

Table 2: The pairwise Euclidean distances, d(o_i, o_j) = \|x_i - x_j\|_2 = \sqrt{\sum_{k=1}^{M} (x_{ik} - x_{jk})^2}, between 10 observations from the Avila Bible dataset (recall M = 10). Each observation o_i corresponds to a row of the data matrix X of Table 1 (the data has been standardized). The colors indicate classes such that the black observations {o1, o2, o3} belong to class C1 (corresponding to copyist one), the red observations {o4, o5, o6, o7, o8} belong to class C2 (corresponding to copyist two), and the blue observations {o9, o10} belong to class C3 (corresponding to copyist three).

Question 4. To examine whether observation o4 may be an outlier, we will calculate the average relative density based on Euclidean distance and the observations given in Table 2 only. We recall that the KNN density and average relative density (ard) for the observation x_i are given by:

\mathrm{density}_{X\setminus i}(x_i, K) = \frac{1}{\frac{1}{K} \sum_{x' \in N_{X\setminus i}(x_i, K)} d(x_i, x')},

\mathrm{ard}_{X}(x_i, K) = \frac{\mathrm{density}_{X\setminus i}(x_i, K)}{\frac{1}{K} \sum_{x_j \in N_{X\setminus i}(x_i, K)} \mathrm{density}_{X\setminus j}(x_j, K)},

where N_{X\setminus i}(x_i, K) is the set of K nearest neighbors of observation x_i excluding the i'th observation, and ard_X(x_i, K) is the average relative density of x_i using K nearest neighbors. What is the average relative density for observation o4 for K = 2 nearest neighbors?

A. 1.0

B. 0.71

C. 0.68

D. 0.36

E. Don't know.

Question 5.
Suppose a GMM model is applied to the Avila Bible dataset in the processed version shown in Table 2. The GMM is constructed as having K = 3 components, and each component k of the GMM is fitted by letting its mean vector µ_k be equal to the location of one of the observations o7, o8, o9 (i.e. each observation corresponds to exactly one mean vector) and setting the covariance matrix equal to Σ_k = σ²I, where I is the identity matrix:

N(o_i; \mu_k, \Sigma_k) = \frac{1}{\sqrt{|2\pi\Sigma_k|}} e^{-\frac{d(o_i, \mu_k)^2}{2\sigma^2}}

where |·| is the determinant. The components of the GMM are weighted evenly. If σ = 0.5, and denoting the density of the GMM as p(x), what is the density evaluated at observation o3?

A. p(o3) = 0.048402

B. p(o3) = 0.076

C. p(o3) = 0.005718

D. p(o3) = 0.114084

E. Don't know.
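The definitions in Question 4 translate directly into code operating on a pairwise distance matrix; a minimal sketch (D is assumed to be a NumPy array holding distances such as those in Table 2, and indices are 0-based):

import numpy as np

def knn_density(D, i, K):
    """KNN density of observation i, computed from the pairwise distance matrix D."""
    d = np.delete(D[i], i)         # distances from i to all other observations
    nearest = np.sort(d)[:K]       # K smallest distances
    return 1.0 / nearest.mean()

def avg_rel_density(D, i, K):
    """Average relative density (ard) of observation i, as defined in Question 4."""
    d = D[i].copy()
    d[i] = np.inf                  # exclude the observation itself
    neighbors = np.argsort(d)[:K]  # indices of the K nearest neighbors
    dens_i = knn_density(D, i, K)
    dens_neighbors = np.mean([knn_density(D, j, K) for j in neighbors])
    return dens_i / dens_neighbors

# Example usage, once D contains the 10 x 10 distances of Table 2:
# print(avg_rel_density(D, i=3, K=2))   # o4 is row index 3 when counting from zero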
Figure 5: Proposed hierarchical clustering of the 10 observations in Table 2.

Figure 6: Dendrogram 1 from Figure 5 with a cutoff indicated by the dotted line, thereby generating 3 clusters.

Question 6. A hierarchical clustering is applied to the 10 observations in Table 2 using minimum linkage. Which of the dendrograms shown in Figure 5 corresponds to the clustering?

A. Dendrogram 1

B. Dendrogram 2

C. Dendrogram 3

D. Dendrogram 4

E. Don't know.

Question 7.
Consider dendrogram 1 from Figure 5. Suppose we apply a cutoff (indicated by the black line), thereby generating three clusters. We wish to compare the quality of this clustering, Q, to the ground-truth clustering, Z, indicated by the colors in Table 2. Recall the normalized mutual information of the two clusterings Z and Q is defined as

\mathrm{NMI}[Z, Q] = \frac{\mathrm{MI}[Z, Q]}{\sqrt{H[Z]}\,\sqrt{H[Q]}}

where MI is the mutual information and H is the entropy. Assuming we always use an entropy based on the natural logarithm,

H = -\sum_{i=1}^{n} p_i \log p_i, \qquad \log(e) = 1,

what is the normalized mutual information of the two clusterings?

A. NMI[Z, Q] ≈ 0.313

B. NMI[Z, Q] ≈ 0.302

C. NMI[Z, Q] ≈ 0.32

D. NMI[Z, Q] ≈ 0.274

E. Don't know.
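The NMI in Question 7 can be computed from two label vectors; a minimal sketch using natural-logarithm entropies as stated above (Z below is the ground-truth labelling from Table 2's colors, while Q is only a placeholder, since the actual three-cluster assignment must be read off Figure 6):

import numpy as np

def entropy(labels):
    """Entropy (natural logarithm) of a discrete labelling."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(Z, Q):
    """Mutual information between two labellings of the same observations."""
    mi = 0.0
    for z in np.unique(Z):
        for q in np.unique(Q):
            p_zq = np.mean((Z == z) & (Q == q))
            if p_zq > 0:
                mi += p_zq * np.log(p_zq / (np.mean(Z == z) * np.mean(Q == q)))
    return mi

def nmi(Z, Q):
    return mutual_information(Z, Q) / (np.sqrt(entropy(Z)) * np.sqrt(entropy(Q)))

Z = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3])   # classes of o1..o10 from Table 2
Q = np.array([1, 1, 1, 2, 1, 2, 2, 1, 3, 3])   # placeholder cutoff clustering
print(nmi(Z, Q))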

x9-interval     y = 1   y = 2   y = 3
x9 ≤ 0.13        108     112      56
0.13 < x9         58      75     116

Table 3: Proposed split of the Avila Bible dataset based on the attribute x9. We consider a 2-way split where for each interval we count how many observations belonging to that interval have the given class label.

Question 8. Consider the distances in Table 2 based on 10 observations from the Avila Bible dataset. The class labels C1, C2, C3 (see the table caption for details) will be predicted using a k-nearest neighbour classifier based on the distances given in Table 2. Suppose we use leave-one-out cross-validation (i.e. the observation that is being predicted is left out) and a 1-nearest neighbour classifier (i.e. k = 1). What is the error rate computed for all N = 10 observations?

A. error rate = 4/10

B. error rate = 9/10

C. error rate = 2/10

D. error rate = 6/10

E. Don't know.

Question 9.
Suppose we wish to build a classification tree based on Hunt's algorithm where the goal is to predict Copyist, which can belong to three classes, y = 1, y = 2, y = 3. The first split we consider is a two-way split based on the value of x9 into the intervals indicated in Table 3. For each interval, we count how many observations belong to each of the three classes; the result is indicated in Table 3. Suppose we use the classification error impurity measure; what is then the purity gain ∆?

A. ∆ ≈ 0.485

B. ∆ ≈ 0.078

C. ∆ ≈ 0.566

D. ∆ ≈ 1.128

E. Don't know.

Question 10. Consider the split in Table 3. Suppose we build a classification tree with only this split and evaluate it on the same data it was trained on. What is the accuracy?

A. Accuracy is 0.64

B. Accuracy is 0.29

C. Accuracy is 0.35

D. Accuracy is 0.43

E. Don't know.

Question 11. Suppose s1 and s2 are two text documents containing the text:

s1 = "the bag of words representation should not give you a hard time"
s2 = "remember the representation should be a vector"

The documents are encoded using a bag-of-words encoding assuming a total vocabulary size of M = 10000. No stopword lists or stemming are applied to the dataset. What is the cosine similarity between documents s1 and s2?

A. cosine similarity of s1 and s2 is 0.047619

B. cosine similarity of s1 and s2 is 0.000044

C. cosine similarity of s1 and s2 is 0.000400

D. cosine similarity of s1 and s2 is 0.436436

E. Don't know.
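The cosine similarity in Question 11 depends only on the words the two documents share, since coordinates that are zero in both bag-of-words vectors contribute nothing to the dot product or to the norms (so the vocabulary size M = 10000 does not affect the result); a minimal Python sketch:

import numpy as np
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Cosine similarity of two documents under a bag-of-words encoding."""
    c1, c2 = Counter(doc1.split()), Counter(doc2.split())
    vocab = sorted(set(c1) | set(c2))
    v1 = np.array([c1[w] for w in vocab], dtype=float)
    v2 = np.array([c2[w] for w in vocab], dtype=float)
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

s1 = "the bag of words representation should not give you a hard time"
s2 = "remember the representation should be a vector"
print(cosine_similarity(s1, s2))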

Figure 7: Output of a logistic regression classifier trained on 7 observations from the dataset.

Figure 8: Proposed ROC curves for the logistic regression classifier in Figure 7.

Question 12. Consider again the Avila Bible dataset. We are particularly interested in predicting whether a bible copy was written by copyist 1, and we therefore wish to train a logistic regression classifier to distinguish between copyist one vs. copyists two and three. To simplify the setup further, we select just 7 observations and train a logistic regression classifier using only the feature x8 as input (as usual, we apply a simple feature transformation to the inputs to add a constant feature in the first coordinate to handle the intercept term). To be consistent with the lecture notes, we label the output as y = 0 (corresponding to copyist one) and y = 1 (corresponding to copyists two and three). In Figure 7 is shown the predicted probability that an observation belongs to the positive class, p(y = 1|x8). What are the weights?

A. w = [-0.93, 1.72]^T

B. w = [-2.82, 0.0]^T

C. w = [1.36, 0.4]^T

D. w = [-0.65, 0.0]^T

E. Don't know.

Question 13.
To evaluate the classifier in Figure 7, we will use the area under curve (AUC) of the receiver operating characteristic (ROC) curve as computed on the 7 observations in Figure 7. In Figure 8 four proposed ROC curves are given; which one of the curves corresponds to the classifier?

A. ROC curve 1

B. ROC curve 2

C. ROC curve 3

D. ROC curve 4

E. Don't know.
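For a small sample such as the 7 observations in Question 13, the AUC equals the fraction of (positive, negative) pairs that the classifier ranks correctly; a minimal sketch with hypothetical labels and predicted probabilities (not the ones shown in Figure 7):

import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive is scored above a random negative."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count one half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical example: 7 class labels and predicted probabilities p(y = 1 | x8).
y = np.array([0, 0, 1, 0, 1, 1, 1])
p = np.array([0.10, 0.30, 0.35, 0.60, 0.70, 0.80, 0.90])
print(auc_from_scores(y, p))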

      f1  f2  f3  f4  f5  f6  f7  f8  f9  f10
o1     1   1   0   0   0   1   0   0   0   1
o2     1   0   0   0   0   0   0   0   0   0
o3     1   1   0   0   0   1   0   0   0   1
o4     0   1   1   1   0   0   0   1   1   0
o5     1   1   0   0   0   1   0   0   0   1
o6     0   1   1   1   0   0   1   1   1   0
o7     1   1   1   0   0   1   1   1   1   0
o8     0   1   1   1   0   1   1   0   0   1
o9     0   0   0   0   1   1   1   0   1   1
o10    1   0   0   0   0   1   1   1   1   0

Table 4: Binarized version of the Avila Bible dataset. Each of the features fi is obtained by taking a feature xi and letting fi = 1 correspond to a value of xi greater than the median (otherwise fi = 0). The colors indicate classes such that the black observations {o1, o2, o3} belong to class C1 (corresponding to copyist one), the red observations {o4, o5, o6, o7, o8} belong to class C2 (corresponding to copyist two), and the blue observations {o9, o10} belong to class C3 (corresponding to copyist three).

Question 14. We again consider the Avila Bible dataset from Table 1 and the N = 10 observations we already encountered in Table 2. The data is processed to produce 10 new, binary features such that fi = 1 corresponds to a value of xi greater than the median², and we thereby arrive at the N × M = 10 × 10 binary matrix in Table 4. Suppose we train a naïve-Bayes classifier to predict the class label y from only the features f1, f2, f6. If for an observation we observe

f1 = 1, f2 = 1, f6 = 0,

what is then the probability that y = 1 according to the naïve-Bayes classifier?

A. pNB(y = 1|f1 = 1, f2 = 1, f6 = 0) = 50/77

B. pNB(y = 1|f1 = 1, f2 = 1, f6 = 0) = 25/43

C. pNB(y = 1|f1 = 1, f2 = 1, f6 = 0) = 5/11

D. pNB(y = 1|f1 = 1, f2 = 1, f6 = 0) = 10/19

E. Don't know.

Question 15.
Consider the binarized version of the Avila Bible dataset shown in Table 4. The matrix can be considered as representing N = 10 transactions o1, o2, ..., o10 and M = 10 items f1, f2, ..., f10. Which of the following options represents all (non-empty) itemsets with support greater than 0.55 (and only itemsets with support greater than 0.55)?

A. {f1}, {f2}, {f6}, {f7}, {f9}, {f10}, {f1, f6}, {f2, f6}, {f6, f10}

B. {f1}, {f2}, {f6}

C. {f1}, {f2}, {f3}, {f4}, {f6}, {f7}, {f8}, {f9}, {f10}, {f1, f2}, {f2, f3}, {f2, f4}, {f3, f4}, {f1, f6}, {f2, f6}, {f2, f7}, {f3, f7}, {f6, f7}, {f2, f8}, {f3, f8}, {f7, f8}, {f2, f9}, {f3, f9}, {f6, f9}, {f7, f9}, {f8, f9}, {f1, f10}, {f2, f10}, {f6, f10}, {f2, f3, f4}, {f1, f2, f6}, {f2, f3, f7}, {f2, f3, f8}, {f2, f3, f9}, {f6, f7, f9}, {f2, f8, f9}, {f3, f8, f9}, {f7, f8, f9}, {f1, f2, f10}, {f1, f6, f10}, {f2, f6, f10}, {f2, f3, f8, f9}, {f1, f2, f6, f10}

D. {f1}, {f2}, {f3}, {f6}, {f7}, {f8}, {f9}, {f10}, {f1, f2}, {f2, f3}, {f1, f6}, {f2, f6}, {f6, f7}, {f7, f9}, {f8, f9}, {f2, f10}, {f6, f10}, {f1, f2, f6}, {f2, f6, f10}

E. Don't know.

Question 16. We again consider the binary matrix from Table 4 as a market basket problem consisting of N = 10 transactions o1, ..., o10 and M = 10 items f1, ..., f10. What is the confidence of the rule {f1, f3, f8, f9} → {f2, f6, f7}?

A. Confidence is 1/10

B. Confidence is 1

C. Confidence is 1/2

D. Confidence is 3/20

E. Don't know.

² Note that in association mining, we would normally also include features fi such that fi = 1 if the corresponding feature is less than the median; for brevity we will not consider features of this kind in this problem.
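For Questions 15 and 16, support and confidence follow directly from the binary transaction matrix; a minimal sketch on a small hypothetical 4-transaction, 3-item matrix (the same functions apply unchanged to Table 4):

import numpy as np

def support(X, itemset):
    """Fraction of transactions (rows of X) that contain every item in the itemset."""
    return np.mean(np.all(X[:, sorted(itemset)] == 1, axis=1))

def confidence(X, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent: supp(A u B) / supp(A)."""
    return support(X, antecedent | consequent) / support(X, antecedent)

# Hypothetical matrix with items indexed 0, 1, 2 (columns) over 4 transactions (rows).
X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [1, 0, 1]])
print(support(X, {0, 1}))        # support of the itemset {item 0, item 1}
print(confidence(X, {0}, {1}))   # confidence of the rule {item 0} -> {item 1}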

Figure 9: Example classification tree.

Figure 10: Classification boundary.

Question 17.
Consider again the Avila Bible dataset. Suppose we train a decision tree to classify which of the 3 classes, Copyist 1, Copyist 2, Copyist 3, an observation belongs to. Since the attributes of the dataset are continuous, we will consider binary splits of the form xi ≥ z for different values of i and z, and for simplicity we limit ourselves to the attributes x7 and x9. Suppose the trained decision tree has the form shown in Figure 9, and that according to the tree the predicted label assignment for the N = 525 observations is as given in Figure 10. What is then the correct rule assignment to the nodes in the decision tree?

A. A: x7 ≥ 0.5, B: x9 ≥ 0.54, C: x9 ≥ 0.35, D: x9 ≥ 0.26

B. A: x7 ≥ 0.5, B: x9 ≥ 0.26, C: x9 ≥ 0.54, D: x9 ≥ 0.35

C. A: x9 ≥ 0.54, B: x7 ≥ 0.5, C: x9 ≥ 0.35, D: x9 ≥ 0.26

D. A: x9 ≥ 0.26, B: x7 ≥ 0.5, C: x9 ≥ 0.35, D: x9 ≥ 0.54

E. Don't know.

Question 18. We will again consider the binarized version of the Avila Bible dataset already encountered in Table 4; however, we will now only consider the first M = 6 features f1, f2, f3, f4, f5, f6.
We wish to apply the Apriori algorithm (the specific variant encountered in chapter 19 of the lecture notes) to find all itemsets with support greater than ε = 0.15. Suppose at iteration k = 3 we know that:

L2 = \begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1
\end{bmatrix}

Recall the key step in the Apriori algorithm is to construct L3 by first considering a large number of candidate itemsets C3′, and then rule out some of them using the downwards-closure principle, thereby saving many (potentially costly) evaluations of support. Suppose L2 is given as above; which of the following itemsets does the Apriori algorithm not have to evaluate the support of?

A. {f2, f3, f4}

B. {f1, f2, f6}

C. {f2, f3, f6}

D. {f1, f3, f4}

E. Don't know.
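The downwards-closure check in Question 18 is easy to automate: a candidate k-itemset only needs its support evaluated if every one of its (k-1)-subsets is frequent; a minimal sketch with L2 written as a list of 2-itemsets of feature indices (one per row of the matrix above):

from itertools import combinations

# Frequent 2-itemsets corresponding to the rows of L2 (1-based feature indices).
L2 = [{1, 2}, {1, 6}, {2, 3}, {2, 4}, {2, 6}, {3, 4}, {3, 6}]

def must_evaluate(candidate, frequent_smaller):
    """A candidate survives pruning only if all of its (k-1)-subsets are frequent."""
    k = len(candidate)
    return all(set(sub) in frequent_smaller for sub in combinations(candidate, k - 1))

for cand in [{2, 3, 4}, {1, 2, 6}, {2, 3, 6}, {1, 3, 4}]:
    verdict = "evaluate support" if must_evaluate(cand, L2) else "pruned by downwards closure"
    print(sorted(cand), verdict)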

Question 19.
Consider again the Avila Bible dataset in Table 1. We would like to predict the copyist using a linear regression, and since we would like the model to be as interpretable as possible we will use variable selection to obtain a parsimonious model. We limit ourselves to the 5 features x1, x5, x6, x8, x9, and in Table 5 we have pre-computed the estimated training and test error for different variable combinations of the dataset. Which of the following statements is correct?

A. Backward selection will select attributes x1

B. Backward selection will select attributes x1, x5, x6, x8

C. Forward selection will select attributes x1, x8

D. Forward selection will select attributes x1, x5, x6, x8

E. Don't know.

Feature(s)              Training RMSE   Test RMSE
none                    3.429           4.163
x1                      3.043           3.252
x5                      3.303           4.52
x6                      3.424           4.274
x8                      3.399           4.429
x9                      2.866           5.016
x1, x5                  3.001           3.44
x1, x6                  3.031           3.423
x5, x6                  3.297           4.641
x1, x8                  3.017           3.42
x5, x8                  3.299           4.485
x6, x8                  3.396           4.519
x1, x9                  2.644           4.267
x5, x9                  2.645           5.495
x6, x9                  2.787           5.956
x8, x9                  2.71            5.536
x1, x5, x6              2.988           3.607
x1, x5, x8              3.0             3.453
x1, x6, x8              3.007           3.574
x5, x6, x8              3.292           4.61
x1, x5, x9              2.523           4.704
x1, x6, x9              2.562           5.184
x5, x6, x9              2.544           6.552
x1, x8, x9              2.517           4.686
x5, x8, x9              2.628           5.532
x6, x8, x9              2.629           6.569
x1, x5, x6, x8          2.988           3.614
x1, x5, x6, x9          2.425           5.725
x1, x5, x8, x9          2.491           4.734
x1, x6, x8, x9          2.433           5.687
x5, x6, x8, x9          2.53            6.597
x1, x5, x6, x8, x9      2.398           5.766

Table 5: Root-mean-square error (RMSE) for the training and test set when using least squares regression to predict y in the Avila dataset using different combinations of the features x1, x5, x6, x8, x9.

Question 20.
Consider the Avila Bible dataset from Table 1. We wish to predict the copyist based on the attributes upperm and mr/is.
Therefore, suppose the attributes have been binarized such that x̃2 = 0 corresponds to x2 ≤ -0.056 (and otherwise x̃2 = 1) and x̃10 = 0 corresponds to x10 ≤ -0.002 (and otherwise x̃10 = 1). Suppose the probabilities for each of the configurations of x̃2 and x̃10 conditional on the copyist y are as given in Table 6, and the prior probabilities of the copyists are

p(y = 1) = 0.316, p(y = 2) = 0.356, p(y = 3) = 0.328.

Using this, what is then the probability an observation was authored by copyist 1 given that x̃2 = 1 and x̃10 = 0?

A. p(y = 1|x̃2 = 1, x̃10 = 0) = 0.25

B. p(y = 1|x̃2 = 1, x̃10 = 0) = 0.313

C. p(y = 1|x̃2 = 1, x̃10 = 0) = 0.262

D. p(y = 1|x̃2 = 1, x̃10 = 0) = 0.298

E. Don't know.

p(x̃2, x̃10|y)          y = 1   y = 2   y = 3
x̃2 = 0, x̃10 = 0        0.19    0.3     0.19
x̃2 = 0, x̃10 = 1        0.22    0.3     0.26
x̃2 = 1, x̃10 = 0        0.25    0.2     0.35
x̃2 = 1, x̃10 = 1        0.34    0.2     0.2

Table 6: Probability of observing particular values of x̃2 and x̃10 conditional on y.
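Question 20 is a direct application of Bayes' theorem, p(y | x̃2, x̃10) ∝ p(x̃2, x̃10 | y) p(y); a minimal sketch using the x̃2 = 1, x̃10 = 0 row of Table 6 and the stated priors:

import numpy as np

# Class-conditional probabilities p(x2~ = 1, x10~ = 0 | y) for y = 1, 2, 3 (row of Table 6).
likelihood = np.array([0.25, 0.2, 0.35])
# Prior probabilities p(y) for y = 1, 2, 3.
prior = np.array([0.316, 0.356, 0.328])

# Bayes' theorem: posterior is proportional to likelihood times prior; normalize to sum to 1.
joint = likelihood * prior
posterior = joint / joint.sum()
print(posterior[0])   # p(y = 1 | x2~ = 1, x10~ = 0)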

Variable   t = 1   t = 2   t = 3   t = 4
y1         1       2       2       2
y2         1       2       2       1
y3         2       2       2       1
y4         1       1       1       2
y5         1       1       1       1
y6         2       2       2       1
y7         1       2       2       1
y8         2       1       1       2
y9         2       2       2       2
y10        1       1       2       2
y11        2       2       1       2
y12        2       1       1       2
y1test     2       1       1       2
y2test     2       2       1       2
εt         0.583   0.657   0.591   0.398
αt         -0.168  -0.325  -0.185  0.207

Table 7: Tabulation of the predicted outputs of the AdaBoost classifiers, as well as the intermediate values αt and εt, when the AdaBoost algorithm is evaluated for T = 4 steps. Note the table includes the predictions for the two test points in Figure 11.

Figure 11: Decision boundaries for a KNN classifier for the first T = 4 rounds of boosting. Notice that in addition to the training data, the plot also indicates the location of two test points.

Question 21.
Consider again the Avila Bible dataset of Table 1. Suppose we limit ourselves to N = 12 observations from the original dataset, and furthermore suppose we limit ourselves to class y = 1 or y = 2 and only consider the features x6 and x8. We wish to apply a KNN classification model (K = 2) to this dataset and apply AdaBoost to improve the performance. During the first T = 4 rounds of boosting, we obtain the decision boundaries shown in Figure 11. The figure also contains two test observations (marked by a cross and a square).
The predictions of the intermediate AdaBoost classifiers, as well as the values of αt and εt, are given in Table 7. Given this information, how will the AdaBoost classifier, as obtained by combining the T = 4 weak classifiers, classify the two test observations?

A. [ỹ1test, ỹ2test] = [1, 1]

B. [ỹ1test, ỹ2test] = [2, 1]

C. [ỹ1test, ỹ2test] = [1, 2]

D. [ỹ1test, ỹ2test] = [2, 2]

E. Don't know.
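For Question 21, the combined AdaBoost prediction is a weighted vote: each round's weak classifier votes for its predicted class with weight αt, and the class with the largest total weight wins; a minimal sketch using the test-point rows and the αt values from Table 7:

import numpy as np

alphas = np.array([-0.168, -0.325, -0.185, 0.207])   # importance weights alpha_t from Table 7

def adaboost_vote(predictions, alphas, classes=(1, 2)):
    """Weighted majority vote over the T weak-classifier predictions for one observation."""
    predictions = np.array(predictions)
    totals = {c: alphas[predictions == c].sum() for c in classes}
    return max(totals, key=totals.get)

y1_test_preds = [2, 1, 1, 2]   # round-by-round predictions for y1test (Table 7)
y2_test_preds = [2, 2, 1, 2]   # round-by-round predictions for y2test (Table 7)
print(adaboost_vote(y1_test_preds, alphas), adaboost_vote(y2_test_preds, alphas))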

Figure 12: Suggested activation curves for an ANN applied to the feature x7 from the Avila Bible dataset.

Question 22.
We will consider an artificial neural network (ANN) applied to the Avila Bible dataset described in Table 1 and trained to predict based on just the feature x7; that is, the neural network is a function that maps from a single real number to a single real number: f(x7) = y. Suppose the neural network takes the form:

f(x, w) = w_0^{(2)} + \sum_{j=1}^{2} w_j^{(2)} h^{(1)}\big( [1\ x]\, w_j^{(1)} \big),

where h^{(1)}(x) = max(x, 0) is the rectified linear function used as activation function in the hidden layer, and the weights are given as:

w_1^{(1)} = \begin{bmatrix} -1.8 \\ -1.1 \end{bmatrix}, \quad
w_2^{(1)} = \begin{bmatrix} -0.6 \\ 3.8 \end{bmatrix}, \quad
w^{(2)} = \begin{bmatrix} -0.1 \\ 2.1 \end{bmatrix}, \quad
w_0^{(2)} = -0.8.

Which of the curves in Figure 12 will then correspond to the function f?

A. ANN output 4

B. ANN output 1

C. ANN output 3

D. ANN output 2

E. Don't know.

Question 23. Suppose a neural network is trained to translate documents. As part of training the network, we wish to select between four different ways to encode the documents (i.e., S = 4 models) and estimate the generalization error of the optimal choice. In the outer loop we opt for K1 = 3-fold cross-validation, and in the inner loop K2 = 4-fold cross-validation. The time taken to train a single model is 20 minutes, and this can be assumed constant for each fold. If the time taken to test a model is negligible, what is the total time required for the 2-level cross-validation procedure?

A. 1020 minutes

B. 2040 minutes

C. 300 minutes

D. 960 minutes

E. Don't know.
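One way to see which curve in Figure 12 the network of Question 22 traces is to evaluate its forward pass on a grid of x7 values; a minimal sketch with the weights stated above:

import numpy as np

w1_1 = np.array([-1.8, -1.1])   # hidden unit 1: [bias term, weight on x7]
w1_2 = np.array([-0.6, 3.8])    # hidden unit 2: [bias term, weight on x7]
w2 = np.array([-0.1, 2.1])      # output-layer weights for the two hidden units
w2_0 = -0.8                     # output-layer bias

def relu(z):
    return np.maximum(z, 0.0)

def f(x):
    """Forward pass: f(x, w) = w0(2) + sum_j wj(2) * relu([1, x] @ wj(1))."""
    h1 = relu(np.array([1.0, x]) @ w1_1)
    h2 = relu(np.array([1.0, x]) @ w1_2)
    return w2_0 + w2[0] * h1 + w2[1] * h2

xs = np.linspace(-2.0, 2.0, 9)               # grid over the x7 range of interest
print([round(f(x), 3) for x in xs])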

Figure 13: Mixture components in a GMM with K = 3.

Figure 14: Scatter plots of each pair of attributes of vectors x drawn from a multivariate normal distribution of 3 dimensions.

Question 24.
We wish to apply the EM algorithm to fit a 1D Gaussian mixture model (GMM) to the single feature x3 from the Avila Bible dataset. At the first step of the EM algorithm, the K = 3 mixture components have densities as indicated by each of the curves in Figure 13 (i.e. each curve is a normalized Gaussian density N(x; µk, σk)). In the figure, we have indicated the x3-value of a single observation i from the dataset as a black cross.
Suppose we wish to apply the EM algorithm to this mixture model beginning with the E-step. We assume the weights of the components are

π = [0.15, 0.53, 0.32]

and the means/variances of the components are those indicated in the figure. According to the EM algorithm, what is the (approximate) probability the black cross is assigned to mixture component 3 (γik)?

A. 0.4

B. 0.86

C. 0.28

D. 0.58

E. Don't know.

Question 25. Consider a multivariate normal distribution with covariance matrix Σ and mean µ, and suppose we generate 1000 random samples from it:

x = [x1, x2, x3]^T ∼ N(µ, Σ).

Plots of each pair of coordinates of the draws x are shown in Figure 14. What is the most plausible covariance matrix?

A. Σ = \begin{bmatrix} 1.0 & 0.65 & -0.65 \\ 0.65 & 1.0 & 0.0 \\ -0.65 & 0.0 & 1.0 \end{bmatrix}

B. Σ = \begin{bmatrix} 1.0 & 0.0 & 0.65 \\ 0.0 & 1.0 & -0.65 \\ 0.65 & -0.65 & 1.0 \end{bmatrix}

C. Σ = \begin{bmatrix} 1.0 & -0.65 & 0.0 \\ -0.65 & 1.0 & 0.65 \\ 0.0 & 0.65 & 1.0 \end{bmatrix}

D. Σ = \begin{bmatrix} 1.0 & 0.0 & -0.65 \\ 0.0 & 1.0 & 0.65 \\ -0.65 & 0.65 & 1.0 \end{bmatrix}

E. Don't know.
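In the E-step of Question 24, the responsibility of component k for observation xi is its weighted density normalized over all components, γik = πk N(xi; µk, σk) / Σj πj N(xi; µj, σj); a minimal sketch in which the component means, standard deviations and the x3-value of the black cross are placeholders (the actual values must be read off Figure 13):

import numpy as np
from scipy.stats import norm

pi = np.array([0.15, 0.53, 0.32])    # mixture weights from Question 24

# Placeholder 1D component parameters standing in for the curves of Figure 13.
mu = np.array([-1.0, 0.0, 1.5])
sigma = np.array([0.5, 1.0, 0.8])

x_i = 1.0                            # placeholder x3-value of the black cross

# E-step responsibilities.
weighted = pi * norm.pdf(x_i, loc=mu, scale=sigma)
gamma = weighted / weighted.sum()
print(gamma)                         # gamma[2] is the responsibility of component 3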

Figure 15: Decision boundaries for a KNN classifier, K = 1, computed for the two observations marked by circles (the colors indicate class labels), but using four different p-distances dp(·, ·) to compute the nearest neighbors.

Question 26.
We consider a K-nearest neighbor (KNN) classifier with K = 1. Recall that in a KNN classifier, we find the nearest neighbors by computing the distances using a distance measure d(x, y). For this problem, we will consider KNN classifiers based on distance measures given by p-norms,

d_p(x, y) = \left( \sum_{j=1}^{M} |x_j - y_j|^p \right)^{1/p}, \quad p \ge 1,

and the decision surfaces they induce. In Figure 15 are shown four different decision boundaries obtained by training the KNN (K = 1) classifiers using the training observations (marked by the two circles in the figure):

x1 = [0.301, 0.514]^T, x2 = [0.34, 0.672]^T,

with corresponding class labels y1 = 0 and y2 = 1, but with distance measures based on p = 1, 2, 4, ∞ (not necessarily plotted in that order). Which norms were used in the four KNN classifiers?

A. KNN classifier 1 corresponds to p = ∞, KNN classifier 2 corresponds to p = 2, KNN classifier 3 corresponds to p = 4, KNN classifier 4 corresponds to p = 1

B. KNN classifier 1 corresponds to p = 4, KNN classifier 2 corresponds to p = 2, KNN classifier 3 corresponds to p = 1, KNN classifier 4 corresponds to p = ∞

C. KNN classifier 1 corresponds to p = 4, KNN classifier 2 corresponds to p = 1, KNN classifier 3 corresponds to p = 2, KNN classifier 4 corresponds to p = ∞

D. KNN classifier 1 corresponds to p = ∞, KNN classifier 2 corresponds to p = 1, KNN classifier 3 corresponds to p = 2, KNN classifier 4 corresponds to p = 4

E. Don't know.

Question 27. Consider a small dataset comprised of N = 9 observations

x = [0.1, 0.3, 0.5, 1.0, 2.2, 3.0, 4.1, 4.4, 4.7].

Suppose a k-means algorithm is applied to the dataset with K = 4 and using Euclidean distances. At a given stage of the algorithm the data is partitioned into the blocks:

{0.1, 0.3}, {0.5, 1}, {2.2, 3, 4.1}, {4.4, 4.7}

What clustering will the k-means algorithm eventually converge to?

A. {0.1, 0.3, 0.5, 1}, {2.2}, {}, {3, 4.1, 4.4, 4.7}

B. {0.1, 0.3}, {0.5, 1}, {2.2, 3}, {4.1, 4.4, 4.7}

C. {0.1, 0.3}, {0.5}, {1, 2.2}, {3, 4.1, 4.4, 4.7}

D. {0.1, 0.3}, {0.5, 1, 2.2, 3}, {4.1, 4.4}, {4.7}

E. Don't know.
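Question 27 can be checked by iterating the two k-means steps (assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster) starting from the given partition; a minimal 1D sketch:

import numpy as np

x = np.array([0.1, 0.3, 0.5, 1.0, 2.2, 3.0, 4.1, 4.4, 4.7])

# Centroids of the partition given in Question 27.
clusters = [[0.1, 0.3], [0.5, 1.0], [2.2, 3.0, 4.1], [4.4, 4.7]]
centroids = np.array([np.mean(c) for c in clusters])

for _ in range(100):                 # iterate until the centroids stop moving
    assign = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    new_centroids = np.array([x[assign == k].mean() if np.any(assign == k) else centroids[k]
                              for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

for k in range(len(centroids)):
    print(k, x[assign == k])         # final clusters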
