02450ex Fall2018
Please hand in your answers using the electronic file. Only use this page in the case where digital hand-in is unavailable. In case you have to hand in the answers using the form on this sheet, please follow these instructions:
Print your name and study number clearly. The exam is multiple choice. All questions have four possible answers marked by the letters A, B, C, and D, as well as the answer “Don't know” marked by the letter E. A correct answer gives 3 points, a wrong answer gives -1 point, and “Don't know” (E) gives 0 points.
The individual questions are answered by filling in the answer fields with one of the letters A, B, C, D, or E.
Answers:
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27
Name:
Student number:
No.   Attribute description                 Abbrev.
x1    intercolumnar distance                interdist
x2    upper margin                          upperm
x3    lower margin                          lowerm
x4    exploitation                          exploit
x5    row number                            row nr.
x6    modular ratio                         modular
x7    interlinear spacing                   interlin
x8    weight                                weight
x9    peak number                           peak nr.
x10   modular ratio / interlinear spacing   mr/is
y     Who copied the text?                  Copyist

Table 1: The attributes of the Avila Bible dataset.
A. Boxplot 1 is mr/is, Boxplot 2 is lowerm, Boxplot 3 is upperm and Boxplot 4 is peak nr.

D. Boxplot 1 is mr/is, Boxplot 2 is lowerm, Boxplot 3 is peak nr. and Boxplot 4 is upperm

E. Don't know.

Figure 2: Boxplots corresponding to the variables plotted in Figure 1 but not necessarily in that order.

¹ Dataset obtained from https://archive.ics.uci.edu/ml/datasets/Avila
Question 2.
A Principal Component Analysis (PCA) is carried out on the Avila Bible dataset in Table 1 based on the attributes x1, x3, x5, x6, x7. The data is standardized by (i) subtracting the mean and (ii) dividing each column by its standard deviation to obtain the standardized matrix $\tilde{X}$. A singular value decomposition is then carried out on the standardized matrix to obtain the decomposition $U S V^\top = \tilde{X}$ with
$$V = \begin{bmatrix} 0.04 & -0.12 & -0.14 & 0.35 & 0.92 \\ 0.06 & 0.13 & 0.05 & -0.92 & 0.37 \\ -0.03 & -0.98 & 0.08 & -0.16 & -0.05 \\ -0.99 & 0.03 & 0.06 & -0.02 & 0.07 \\ -0.07 & -0.05 & -0.98 & -0.11 & -0.11 \end{bmatrix} \quad (1)$$

$$S = \begin{bmatrix} 14.4 & 0.0 & 0.0 & 0.0 & 0.0 \\ 0.0 & 8.19 & 0.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 7.83 & 0.0 & 0.0 \\ 0.0 & 0.0 & 0.0 & 6.91 & 0.0 \\ 0.0 & 0.0 & 0.0 & 0.0 & 6.01 \end{bmatrix}$$

Figure 3: Black dots show attributes x5 and x7 of the Avila Bible dataset from Table 1. The two points corresponding to the colored markers indicate two specific observations A, B.
E. Don’t know.
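As a small illustration of the standardization and SVD steps described above, here is a minimal numpy sketch; the data matrix `X` is a random placeholder rather than the actual Avila observations, and the last line uses the standard relation between singular values and variance explained:

```python
import numpy as np

# Placeholder data: N observations of the 5 attributes x1, x3, x5, x6, x7.
X = np.random.default_rng(0).normal(size=(100, 5))

# (i) subtract the mean, (ii) divide each column by its standard deviation.
X_tilde = (X - X.mean(axis=0)) / X.std(axis=0)

# Singular value decomposition X_tilde = U S V^T.
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)

# Fraction of variance explained by each principal component.
var_explained = s**2 / np.sum(s**2)
print(var_explained)
```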
Question 3.
Consider again the PCA analysis of the Avila Bible dataset. In Figure 3 the features x5 and x7 from Table 1 are plotted as black dots. We have indicated two special observations as colored markers (Point A and Point B). We can imagine that the dataset, along with the two special observations, is projected onto the first two principal component directions given in V as computed earlier (see Equation (1)). Which one of the four plots in Figure 4 shows the correct PCA projection?

A. Plot A

B. Plot B

Figure 4: Candidate plots of the observations and path shown in Figure 3 projected onto the first two principal components considered in Equation (1). The colored markers still refer to points A and B, now in the coordinate system corresponding to the PCA projection.
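A sketch of how such a projection can be computed with the V from Equation (1); the standardized observations below are placeholders, only the projection step itself is the point:

```python
import numpy as np

# V from Equation (1); columns are the principal component directions.
V = np.array([[ 0.04, -0.12, -0.14,  0.35,  0.92],
              [ 0.06,  0.13,  0.05, -0.92,  0.37],
              [-0.03, -0.98,  0.08, -0.16, -0.05],
              [-0.99,  0.03,  0.06, -0.02,  0.07],
              [-0.07, -0.05, -0.98, -0.11, -0.11]])

# Placeholder standardized observations (rows) with 5 attributes.
X_tilde = np.random.default_rng(1).normal(size=(10, 5))

# Coordinates of each observation on the first two principal components.
B = X_tilde @ V[:, :2]
print(B)
```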
       o1     o2     o3     o4     o5     o6     o7     o8     o9     o10
o1     0.0    2.91   0.63   1.88   1.02   1.82   1.92   1.58   1.08   1.43
o2     2.91   0.0    3.23   3.9    2.88   3.27   3.48   4.02   3.08   3.47
o3     0.63   3.23   0.0    2.03   1.06   2.15   2.11   1.15   1.09   1.65
o4     1.88   3.9    2.03   0.0    2.52   1.04   2.25   2.42   2.18   2.17
o5     1.02   2.88   1.06   2.52   0.0    2.44   2.38   1.53   1.71   1.94
o6     1.82   3.27   2.15   1.04   2.44   0.0    1.93   2.72   1.98   1.8
o7     1.92   3.48   2.11   2.25   2.38   1.93   0.0    2.53   2.09   1.66
o8     1.58   4.02   1.15   2.42   1.53   2.72   2.53   0.0    1.68   2.06
o9     1.08   3.08   1.09   2.18   1.71   1.98   2.09   1.68   0.0    1.48
o10    1.43   3.47   1.65   2.17   1.94   1.8    1.66   2.06   1.48   0.0

Table 2: The pairwise Euclidean distances, $d(o_i, o_j) = \|x_i - x_j\|_2 = \sqrt{\sum_{k=1}^{M} (x_{ik} - x_{jk})^2}$, between 10 observations from the Avila Bible dataset (recall M = 10). Each observation o_i corresponds to a row of the data matrix X of Table 1 (the data has been standardized). The colors indicate classes such that the black observations {o1, o2, o3} belong to class C1 (corresponding to copyist one), the red observations {o4, o5, o6, o7, o8} belong to class C2 (corresponding to copyist two), and the blue observations {o9, o10} belong to class C3 (corresponding to copyist three).

Question 4. To examine if observation o4 may be an outlier, we will calculate the average relative density based on Euclidean distance and the observations given in Table 2 only. We recall that the KNN density and average relative density (ard) for the observation $x_i$ are given by:

$$\mathrm{density}_{X\setminus i}(x_i, K) = \frac{1}{\frac{1}{K}\sum_{x' \in N_{X\setminus i}(x_i, K)} d(x_i, x')}, \qquad \mathrm{ard}_{X}(x_i, K) = \frac{\mathrm{density}_{X\setminus i}(x_i, K)}{\frac{1}{K}\sum_{x_j \in N_{X\setminus i}(x_i, K)} \mathrm{density}_{X\setminus j}(x_j, K)},$$

where $N_{X\setminus i}(x_i, K)$ denotes the K nearest neighbors of $x_i$. What is the average relative density for observation o4 for K = 2 nearest neighbors?

A. 1.0

B. 0.71

C. 0.68

D. 0.36

E. Don't know.
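A minimal sketch of how the average relative density of o4 can be computed directly from the distance matrix of Table 2 (indices are 0-based, so o4 is index 3):

```python
import numpy as np

# Pairwise distances from Table 2 (row i, column j = d(o_{i+1}, o_{j+1})).
D = np.array([
    [0.00, 2.91, 0.63, 1.88, 1.02, 1.82, 1.92, 1.58, 1.08, 1.43],
    [2.91, 0.00, 3.23, 3.90, 2.88, 3.27, 3.48, 4.02, 3.08, 3.47],
    [0.63, 3.23, 0.00, 2.03, 1.06, 2.15, 2.11, 1.15, 1.09, 1.65],
    [1.88, 3.90, 2.03, 0.00, 2.52, 1.04, 2.25, 2.42, 2.18, 2.17],
    [1.02, 2.88, 1.06, 2.52, 0.00, 2.44, 2.38, 1.53, 1.71, 1.94],
    [1.82, 3.27, 2.15, 1.04, 2.44, 0.00, 1.93, 2.72, 1.98, 1.80],
    [1.92, 3.48, 2.11, 2.25, 2.38, 1.93, 0.00, 2.53, 2.09, 1.66],
    [1.58, 4.02, 1.15, 2.42, 1.53, 2.72, 2.53, 0.00, 1.68, 2.06],
    [1.08, 3.08, 1.09, 2.18, 1.71, 1.98, 2.09, 1.68, 0.00, 1.48],
    [1.43, 3.47, 1.65, 2.17, 1.94, 1.80, 1.66, 2.06, 1.48, 0.00]])

def knn_density(i, K):
    # Distances from o_i to all other observations (excluding itself).
    d = np.delete(D[i], i)
    idx = np.delete(np.arange(len(D)), i)
    nn = idx[np.argsort(d)[:K]]            # indices of the K nearest neighbours
    return 1.0 / np.mean(D[i, nn]), nn

def ard(i, K):
    dens_i, nn = knn_density(i, K)
    dens_nn = np.mean([knn_density(j, K)[0] for j in nn])
    return dens_i / dens_nn

print(ard(3, K=2))   # average relative density of o4 for K = 2
```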
Question 5.
Suppose a GMM model is applied to the Avila Bible dataset in the processed version shown in Table 2. The GMM is constructed as having K = 3 components, and each component k of the GMM is fitted by letting its mean vector $\mu_k$ be equal to the location of the observations o7, o8, o9 (i.e. each observation corresponds to exactly one mean vector) and setting the covariance matrix equal to $\Sigma_k = \sigma^2 I$ where I is the identity matrix:

$$\mathcal{N}(o_i; \mu_k, \Sigma_k) = \frac{1}{\sqrt{|2\pi\Sigma_k|}}\, e^{-\frac{d(o_i, \mu_k)^2}{2\sigma^2}}$$

A. p(o3) = 0.048402

B. p(o3) = 0.076

C. p(o3) = 0.005718

D. p(o3) = 0.114084

E. Don't know.
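For illustration, a sketch of how the density of o3 under such a mixture could be evaluated from the distances in Table 2; note that σ and the component weights are not shown in this extract, so the values below are assumptions:

```python
import numpy as np

# Distances from o3 to the component means o7, o8, o9 (read off Table 2).
d = np.array([2.11, 1.15, 1.09])

M = 10          # dimensionality of the observations (Table 2 states M = 10)
sigma = 1.0     # assumed value; the actual sigma is not shown in this extract

# Isotropic Gaussian density based only on the distance to the mean:
# N(o_i; mu_k, sigma^2 I) = (2 pi sigma^2)^(-M/2) exp(-d^2 / (2 sigma^2)).
dens = (2 * np.pi * sigma**2) ** (-M / 2) * np.exp(-d**2 / (2 * sigma**2))

# Mixture density assuming equal weights 1/3 per component (also an assumption).
print(np.mean(dens))
```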
Figure 5: Proposed hierarchical clustering of the 10 observations in Table 2.

Question 6. A hierarchical clustering is applied to the 10 observations in Table 2 using minimum linkage. Which of the dendrograms shown in Figure 5 corresponds to the clustering?

A. Dendrogram 1

B. Dendrogram 2

C. Dendrogram 3

D. Dendrogram 4
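A sketch of how the minimum-linkage (single-linkage) clustering can be reproduced with scipy from the distance matrix of Table 2:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Pairwise distances from Table 2 (same matrix as in the sketch for Question 4).
D = np.array([
    [0.00, 2.91, 0.63, 1.88, 1.02, 1.82, 1.92, 1.58, 1.08, 1.43],
    [2.91, 0.00, 3.23, 3.90, 2.88, 3.27, 3.48, 4.02, 3.08, 3.47],
    [0.63, 3.23, 0.00, 2.03, 1.06, 2.15, 2.11, 1.15, 1.09, 1.65],
    [1.88, 3.90, 2.03, 0.00, 2.52, 1.04, 2.25, 2.42, 2.18, 2.17],
    [1.02, 2.88, 1.06, 2.52, 0.00, 2.44, 2.38, 1.53, 1.71, 1.94],
    [1.82, 3.27, 2.15, 1.04, 2.44, 0.00, 1.93, 2.72, 1.98, 1.80],
    [1.92, 3.48, 2.11, 2.25, 2.38, 1.93, 0.00, 2.53, 2.09, 1.66],
    [1.58, 4.02, 1.15, 2.42, 1.53, 2.72, 2.53, 0.00, 1.68, 2.06],
    [1.08, 3.08, 1.09, 2.18, 1.71, 1.98, 2.09, 1.68, 0.00, 1.48],
    [1.43, 3.47, 1.65, 2.17, 1.94, 1.80, 1.66, 2.06, 1.48, 0.00]])

# 'single' linkage = minimum linkage; squareform converts the square matrix
# to the condensed form expected by linkage.
Z = linkage(squareform(D), method='single')
print(Z)   # each row: the two clusters merged, the merge distance, the new cluster size
```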
Figure 6: Dendrogram 1 from Figure 5 with a cutoff indicated by the dotted line, thereby generating 3 clusters.

Question 7.
Consider dendrogram 1 from Figure 5. Suppose we apply a cutoff (indicated by the black line) thereby generating three clusters. We wish to compare the quality of this clustering, Q, to the ground-truth clustering, Z, indicated by the colors in Table 2. Recall the normalized mutual information of the two clusterings Z and Q is defined as

$$\mathrm{NMI}[Z, Q] = \frac{\mathrm{MI}[Z, Q]}{\sqrt{H[Z]}\,\sqrt{H[Q]}}$$

where MI is the mutual information and H is the entropy. Assuming we always use an entropy based on the natural logarithm,

$$H = -\sum_{i=1}^{n} p_i \log p_i, \qquad \log(e) = 1,$$

what is then the normalized mutual information NMI[Z, Q]?
A. NMI[Z, Q] ≈ 0.313
B. NMI[Z, Q] ≈ 0.302
C. NMI[Z, Q] ≈ 0.32
D. NMI[Z, Q] ≈ 0.274
E. Don’t know.
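A sketch of the NMI computation; the ground-truth labels Z follow the colors in Table 2, while the candidate labels Q below are placeholders, since the actual 3-cluster cutoff must be read off Figure 6:

```python
import numpy as np

def entropy(labels):
    # Entropy (natural log) of the empirical label distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(z, q):
    z, q = np.asarray(z), np.asarray(q)
    mi = 0.0
    for zi in np.unique(z):
        for qi in np.unique(q):
            p_zq = np.mean((z == zi) & (q == qi))
            if p_zq > 0:
                mi += p_zq * np.log(p_zq / (np.mean(z == zi) * np.mean(q == qi)))
    return mi

def nmi(z, q):
    return mutual_information(z, q) / np.sqrt(entropy(z) * entropy(q))

# Ground truth Z from the colours in Table 2: o1-o3 -> C1, o4-o8 -> C2, o9-o10 -> C3.
Z_labels = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3]
# Placeholder clustering Q; the actual assignment follows from the cutoff in Figure 6.
Q_labels = [1, 2, 1, 3, 1, 3, 3, 1, 1, 1]
print(nmi(Z_labels, Q_labels))
```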
x9-interval    y = 1   y = 2   y = 3
x9 ≤ 0.13       108     112      56
0.13 < x9        58      75     116

Table 3: Proposed split of the Avila Bible dataset based on the attribute x9. We consider a 2-way split where for each interval we count how many observations belonging to that interval have the given class label.

A. ∆ ≈ 0.485

B. ∆ ≈ 0.078

C. ∆ ≈ 0.566

D. ∆ ≈ 1.128

E. Don't know.

Question 10. Consider the split in Table 3. Suppose we build a classification tree with only this split and evaluate it on the same data it was trained on. What is the accuracy?

A. Accuracy is: 0.64

B. Accuracy is: 0.29

C. Accuracy is: 0.35
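The accuracy asked for in Question 10 follows from the counts in Table 3 alone: each branch predicts its majority class, and the accuracy is the fraction of observations that fall in the majority class of their branch. A small sketch of that arithmetic:

```python
import numpy as np

# Class counts per branch from Table 3 (rows: x9 <= 0.13 and 0.13 < x9).
counts = np.array([[108, 112, 56],
                   [ 58,  75, 116]])

# Each branch predicts its majority class; accuracy = correctly classified / total.
accuracy = counts.max(axis=1).sum() / counts.sum()
print(accuracy)   # (112 + 116) / 525
```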
Figure 7: Output of a logistic regression classifier trained on 7 observations from the dataset.

Question 12. Consider again the Avila Bible dataset. We are particularly interested in predicting whether a bible copy was written by copyist 1, and we therefore wish to train a logistic regression classifier to distinguish between copyist one vs. copyist two and three. To simplify the setup further, we select just 7 observations and train a logistic regression classifier using only the feature x8 as input (as usual, we apply a simple feature transformation to the inputs to add a constant feature in the first coordinate to handle the intercept term). To be consistent with the lecture notes, we label the output as y = 0 (corresponding to copyist one) and y = 1 (corresponding to copyist two and three). In Figure 7 is shown the predicted output probability that an observation belongs to the positive class, p(y = 1|x8). What are the weights?

A. $w = \begin{bmatrix} -0.93 \\ 1.72 \end{bmatrix}$

B. $w = \begin{bmatrix} -2.82 \\ 0.0 \end{bmatrix}$

C. $w = \begin{bmatrix} 1.36 \\ 0.4 \end{bmatrix}$

D. $w = \begin{bmatrix} -0.65 \\ 0.0 \end{bmatrix}$

E. Don't know.
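The logistic regression classifier predicts p(y = 1 | x8) as the logistic sigmoid of $w^\top [1\ x_8]$. A sketch evaluating the four candidate weight vectors on a placeholder grid of x8 values (the x8 values of the 7 observations in Figure 7 are not reproduced here):

```python
import numpy as np

def predict(w, x8):
    # p(y=1 | x8) = sigmoid(w^T [1, x8]) with an explicit intercept feature.
    z = w[0] + w[1] * x8
    return 1.0 / (1.0 + np.exp(-z))

candidates = {'A': np.array([-0.93, 1.72]),
              'B': np.array([-2.82, 0.0]),
              'C': np.array([ 1.36, 0.4]),
              'D': np.array([-0.65, 0.0])}

x8_grid = np.linspace(-3, 3, 7)   # placeholder x8 values; Figure 7 is not reproduced here
for name, w in candidates.items():
    print(name, np.round(predict(w, x8_grid), 3))
```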
Figure 8: Proposed ROC curves for the logistic regression classifier in Figure 7.

Question 13.
To evaluate the classifier in Figure 7, we will use the area under curve (AUC) of the receiver operating characteristic (ROC) curve as computed on the 7 observations in Figure 7. In Figure 8 are given four proposed ROC curves; which one of the curves corresponds to the classifier?

A. ROC curve 1

B. ROC curve 2

C. ROC curve 3

D. ROC curve 4

E. Don't know.
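A sketch of how the AUC can be computed from predicted probabilities and labels; the labels and probabilities below are placeholders, since Figure 7 is not reproduced in this extract:

```python
import numpy as np

def auc(y_true, scores):
    # AUC = probability that a randomly chosen positive is scored above
    # a randomly chosen negative (ties count one half).
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Placeholder labels and predicted p(y=1|x8) for the 7 observations in Figure 7.
y = np.array([0, 0, 0, 1, 0, 1, 1])
p = np.array([0.1, 0.2, 0.35, 0.4, 0.55, 0.7, 0.9])
print(auc(y, p))
```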
       f1  f2  f3  f4  f5  f6  f7  f8  f9  f10
o1      1   1   0   0   0   1   0   0   0   1
o2      1   0   0   0   0   0   0   0   0   0
o3      1   1   0   0   0   1   0   0   0   1
o4      0   1   1   1   0   0   0   1   1   0
o5      1   1   0   0   0   1   0   0   0   1
o6      0   1   1   1   0   0   1   1   1   0
o7      1   1   1   0   0   1   1   1   1   0
o8      0   1   1   1   0   1   1   0   0   1
o9      0   0   0   0   1   1   1   0   1   1
o10     1   0   0   0   0   1   1   1   1   0

Table 4: Binarized version of the Avila Bible dataset. Each of the features fi is obtained by taking a feature xi and letting fi = 1 correspond to a value xi greater than the median (otherwise fi = 0). The colors indicate classes such that the black observations {o1, o2, o3} belong to class C1 (corresponding to copyist one), the red observations {o4, o5, o6, o7, o8} belong to class C2 (corresponding to copyist two), and the blue observations {o9, o10} belong to class C3 (corresponding to copyist three).
Question 14. We again consider the Avila Bible dataset from Table 1 and the N = 10 observations we already encountered in Table 2. The data is processed to produce 10 new, binary features such that fi = 1 corresponds to a value xi greater than the median², and we thereby arrive at the N × M = 10 × 10 binary matrix in Table 4. Suppose we train a naïve-Bayes classifier to predict the class label y from only the features f1, f2, f6. If for an observation we observe

f1 = 1, f2 = 1, f6 = 0,

what is then the probability that y = 1 according to the naïve-Bayes classifier?

A. pNB(y = 1 | f1 = 1, f2 = 1, f6 = 0) = 50/77

B. pNB(y = 1 | f1 = 1, f2 = 1, f6 = 0) = 25/43

C. pNB(y = 1 | f1 = 1, f2 = 1, f6 = 0) = 5/11

D. pNB(y = 1 | f1 = 1, f2 = 1, f6 = 0) = 10/19

E. Don't know.
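A sketch of the naïve-Bayes computation, with the class-conditional probabilities of f1, f2, f6 estimated directly from Table 4 and the class labels given by the colors:

```python
import numpy as np

# Columns f1, f2, f6 of Table 4 (rows o1..o10) and the class labels from the colours.
F = np.array([[1, 1, 1],   # o1
              [1, 0, 0],   # o2
              [1, 1, 1],   # o3
              [0, 1, 0],   # o4
              [1, 1, 1],   # o5
              [0, 1, 0],   # o6
              [1, 1, 1],   # o7
              [0, 1, 1],   # o8
              [0, 0, 1],   # o9
              [1, 0, 1]])  # o10
y = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3])

obs = np.array([1, 1, 0])          # the observed values of f1, f2, f6

posts = []
for c in [1, 2, 3]:
    Fc = F[y == c]
    prior = np.mean(y == c)
    # Naive-Bayes factorization: product of per-feature conditionals.
    lik = np.prod([np.mean(Fc[:, j] == obs[j]) for j in range(3)])
    posts.append(prior * lik)
posts = np.array(posts) / np.sum(posts)
print(posts[0])    # p(y=1 | f1=1, f2=1, f6=0)
```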
Question 15.
Consider the binarized version of the Avila Bible dataset shown in Table 4. The matrix can be considered as representing N = 10 transactions o1, o2, ..., o10 and M = 10 items f1, f2, ..., f10. Which of the following options represents all (non-empty) itemsets with support greater than 0.55 (and only itemsets with support greater than 0.55)?

A. {f1}, {f2}, {f6}, {f7}, {f9}, {f10}, {f1, f6}, {f2, f6}, {f6, f10}

B. {f1}, {f2}, {f6}

C. {f1}, {f2}, {f3}, {f4}, {f6}, {f7}, {f8}, {f9}, {f10}, {f1, f2}, {f2, f3}, {f2, f4}, {f3, f4}, {f1, f6}, {f2, f6}, {f2, f7}, {f3, f7}, {f6, f7}, {f2, f8}, {f3, f8}, {f7, f8}, {f2, f9}, {f3, f9}, {f6, f9}, {f7, f9}, {f8, f9}, {f1, f10}, {f2, f10}, {f6, f10}, {f2, f3, f4}, {f1, f2, f6}, {f2, f3, f7}, {f2, f3, f8}, {f2, f3, f9}, {f6, f7, f9}, {f2, f8, f9}, {f3, f8, f9}, {f7, f8, f9}, {f1, f2, f10}, {f1, f6, f10}, {f2, f6, f10}, {f2, f3, f8, f9}, {f1, f2, f6, f10}

D. {f1}, {f2}, {f3}, {f6}, {f7}, {f8}, {f9}, {f10}, {f1, f2}, {f2, f3}, {f1, f6}, {f2, f6}, {f6, f7}, {f7, f9}, {f8, f9}, {f2, f10}, {f6, f10}, {f1, f2, f6}, {f2, f6, f10}

E. Don't know.

Question 16. We again consider the binary matrix from Table 4 as a market basket problem consisting of N = 10 transactions o1, ..., o10 and M = 10 items f1, ..., f10. What is the confidence of the rule {f1, f3, f8, f9} → {f2, f6, f7}?

A. Confidence is 1/10

B. Confidence is 1

C. Confidence is 1/2

D. Confidence is 3/20

E. Don't know.

² Note that in association mining, we would normally also include features fi such that fi = 1 if the corresponding feature is less than the median; for brevity we will not consider features of this kind in this problem.
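For Questions 15 and 16, support and confidence can be read off the binary matrix in Table 4 directly. A small sketch:

```python
import numpy as np

# Full binary matrix of Table 4 (rows o1..o10, columns f1..f10).
T = np.array([[1,1,0,0,0,1,0,0,0,1],
              [1,0,0,0,0,0,0,0,0,0],
              [1,1,0,0,0,1,0,0,0,1],
              [0,1,1,1,0,0,0,1,1,0],
              [1,1,0,0,0,1,0,0,0,1],
              [0,1,1,1,0,0,1,1,1,0],
              [1,1,1,0,0,1,1,1,1,0],
              [0,1,1,1,0,1,1,0,0,1],
              [0,0,0,0,1,1,1,0,1,1],
              [1,0,0,0,0,1,1,1,1,0]])

def support(itemset):
    # Fraction of transactions containing every item in the (1-based) itemset.
    cols = [i - 1 for i in itemset]
    return np.mean(T[:, cols].all(axis=1))

def confidence(antecedent, consequent):
    return support(antecedent + consequent) / support(antecedent)

# Single items with support > 0.55 (the first level of the itemset search in Question 15).
print([f'f{i}' for i in range(1, 11) if support([i]) > 0.55])

# Confidence of {f1, f3, f8, f9} -> {f2, f6, f7} (Question 16).
print(confidence([1, 3, 8, 9], [2, 6, 7]))
```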
to the nodes in the decision tree?
E. Don’t know.
Figure 9: Example classification tree.
Question 19.
Consider again the Avila Bible dataset in Table 1. We would like to predict the copyist using a linear regression, and since we would like the model to be as interpretable as possible we will use variable selection to obtain a parsimonious model. We limit ourselves to the 5 features x1, x5, x6, x8, x9 and in Table 5 we have pre-computed the estimated training and test error for different variable combinations of the dataset. Which of the following statements is correct?

A. Backward selection will select attributes x1

B. Backward selection will select attributes x1, x5, x6, x8

C. Forward selection will select attributes x1, x8

D. Forward selection will select attributes x1, x5, x6, x8

E. Don't know.

Feature(s)              Training RMSE   Test RMSE
none                    3.429           4.163
x1                      3.043           3.252
x5                      3.303           4.52
x6                      3.424           4.274
x8                      3.399           4.429
x9                      2.866           5.016
x1, x5                  3.001           3.44
x1, x6                  3.031           3.423
x5, x6                  3.297           4.641
x1, x8                  3.017           3.42
x5, x8                  3.299           4.485
x6, x8                  3.396           4.519
x1, x9                  2.644           4.267
x5, x9                  2.645           5.495
x6, x9                  2.787           5.956
x8, x9                  2.71            5.536
x1, x5, x6              2.988           3.607
x1, x5, x8              3.0             3.453
x1, x6, x8              3.007           3.574
x5, x6, x8              3.292           4.61
x1, x5, x9              2.523           4.704
x1, x6, x9              2.562           5.184
x5, x6, x9              2.544           6.552
x1, x8, x9              2.517           4.686
x5, x8, x9              2.628           5.532
x6, x8, x9              2.629           6.569
x1, x5, x6, x8          2.988           3.614
x1, x5, x6, x9          2.425           5.725
x1, x5, x8, x9          2.491           4.734
x1, x6, x8, x9          2.433           5.687
x5, x6, x8, x9          2.53            6.597
x1, x5, x6, x8, x9      2.398           5.766

Table 5: Root-mean-square error (RMSE) for the training and test set when using least squares regression to predict y in the Avila dataset using different combinations of the features x1, x5, x6, x8, x9.
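A sketch of greedy forward and backward selection driven by the test RMSE column of Table 5 (assuming, as is common, that the selection criterion is the tabulated test error):

```python
# Test RMSE from Table 5, keyed by feature subset.
rmse = {
    (): 4.163,
    (1,): 3.252, (5,): 4.52, (6,): 4.274, (8,): 4.429, (9,): 5.016,
    (1, 5): 3.44, (1, 6): 3.423, (5, 6): 4.641, (1, 8): 3.42, (5, 8): 4.485,
    (6, 8): 4.519, (1, 9): 4.267, (5, 9): 5.495, (6, 9): 5.956, (8, 9): 5.536,
    (1, 5, 6): 3.607, (1, 5, 8): 3.453, (1, 6, 8): 3.574, (5, 6, 8): 4.61,
    (1, 5, 9): 4.704, (1, 6, 9): 5.184, (5, 6, 9): 6.552, (1, 8, 9): 4.686,
    (5, 8, 9): 5.532, (6, 8, 9): 6.569,
    (1, 5, 6, 8): 3.614, (1, 5, 6, 9): 5.725, (1, 5, 8, 9): 4.734,
    (1, 6, 8, 9): 5.687, (5, 6, 8, 9): 6.597,
    (1, 5, 6, 8, 9): 5.766,
}
features = (1, 5, 6, 8, 9)

def err(subset):
    return rmse[tuple(sorted(subset))]

def forward():
    current = set()
    while True:
        # Try adding each remaining feature; keep the best addition if it improves.
        candidates = [current | {f} for f in features if f not in current]
        best = min(candidates, key=err)
        if err(best) >= err(current):
            return current
        current = best

def backward():
    current = set(features)
    while current:
        # Try removing each feature; keep the best removal if it improves.
        candidates = [current - {f} for f in current]
        best = min(candidates, key=err)
        if err(best) >= err(current):
            return current
        current = best
    return current

print('forward :', sorted(forward()))
print('backward:', sorted(backward()))
```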
Question 20.
Consider the Avila Bible dataset from Table 1. We wish to predict the copyist based on the attributes upperm and mr/is. Therefore, suppose the attributes have been binarized such that x̃2 = 0 corresponds to x2 ≤ -0.056 (and otherwise x̃2 = 1) and x̃10 = 0 corresponds to x10 ≤ -0.002 (and otherwise x̃10 = 1). Suppose the probability for each of the configurations of x̃2 and x̃10 conditional on the copyist y is as given in Table 6, and the prior probability of the copyists is

p(y = 1) = 0.316, p(y = 2) = 0.356, p(y = 3) = 0.328.

Using this, what is then the probability an observation was authored by copyist 1 given that x̃2 = 1 and x̃10 = 0?

A. p(y = 1 | x̃2 = 1, x̃10 = 0) = 0.25

B. p(y = 1 | x̃2 = 1, x̃10 = 0) = 0.313

C. p(y = 1 | x̃2 = 1, x̃10 = 0) = 0.262

D. p(y = 1 | x̃2 = 1, x̃10 = 0) = 0.298

E. Don't know.

p(x̃2, x̃10 | y)       y = 1   y = 2   y = 3
x̃2 = 0, x̃10 = 0       0.19    0.3     0.19
x̃2 = 0, x̃10 = 1       0.22    0.3     0.26
x̃2 = 1, x̃10 = 0       0.25    0.2     0.35
x̃2 = 1, x̃10 = 1       0.34    0.2     0.2

Table 6: Probability of observing particular values of x̃2 and x̃10 conditional on y.
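The posterior asked for in Question 20 follows from Bayes' rule using the third row of Table 6 and the stated priors. A small sketch:

```python
import numpy as np

# p(x2~, x10~ | y) for the configuration x2~ = 1, x10~ = 0 (third row of Table 6).
likelihood = np.array([0.25, 0.20, 0.35])      # y = 1, 2, 3
prior = np.array([0.316, 0.356, 0.328])

# Bayes' rule: p(y | x2~=1, x10~=0) is proportional to p(x2~=1, x10~=0 | y) p(y).
posterior = likelihood * prior
posterior /= posterior.sum()
print(posterior[0])    # p(y = 1 | x2~ = 1, x10~ = 0)
```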
Variable   t = 1    t = 2    t = 3    t = 4
y1          1        2        2        2
y2          1        2        2        1
y3          2        2        2        1
y4          1        1        1        2
y5          1        1        1        1
y6          2        2        2        1
y7          1        2        2        1
y8          2        1        1        2
y9          2        2        2        2
y10         1        1        2        2
y11         2        2        1        2
y12         2        1        1        2
y1test      2        1        1        2
y2test      2        2        1        2
εt          0.583    0.657    0.591    0.398
αt         -0.168   -0.325   -0.185    0.207

Table 7: Predictions of the intermediate AdaBoost classifiers for each of the first T = 4 rounds of boosting, together with the values of εt and αt.
Figure 11: Decision boundaries for a KNN classifier for the first T = 4 rounds of boosting. Notice that in addition to the training data, the plot also indicates the location of two test points.

Question 21.
Consider again the Avila Bible dataset of Table 1. Suppose we limit ourselves to N = 12 observations from the original dataset and furthermore suppose we limit ourselves to class y = 1 or y = 2 and only consider the features x6 and x8. We wish to apply a KNN classification model (K = 2) to this dataset and apply AdaBoost to improve the performance. During the first T = 4 rounds of boosting, we obtain the decision boundaries shown in Figure 11. The figure also contains two test observations (marked by a cross and a square). The prediction of the intermediate AdaBoost classifiers, as well as the values of αt and εt, are given in Table 7. Given this information, how will the AdaBoost classifier, as obtained by combining the T = 4 weak classifiers, classify the two test observations?

A. ỹ1test = 1, ỹ2test = 1

B. ỹ1test = 2, ỹ2test = 1

C. ỹ1test = 1, ỹ2test = 2

D. ỹ1test = 2, ỹ2test = 2

E. Don't know.
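The combined AdaBoost classifier takes a weighted vote of the weak classifiers, where round t votes with weight αt for the class it predicts. A sketch using the values of Table 7:

```python
import numpy as np

# alpha_t values and the weak classifiers' predictions for the two test points (Table 7).
alpha = np.array([-0.168, -0.325, -0.185, 0.207])
pred_test = {'y1test': np.array([2, 1, 1, 2]),
             'y2test': np.array([2, 2, 1, 2])}

for name, preds in pred_test.items():
    # Weighted vote: each class collects the alpha_t of the rounds that predict it.
    votes = {c: alpha[preds == c].sum() for c in (1, 2)}
    print(name, '->', max(votes, key=votes.get), votes)
```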
to the function f ?
A. ANN output 4
B. ANN output 1
C. ANN output 3
D. ANN output 2
E. Don’t know.
$$f(x, w) = w_0^{(2)} + \sum_{j=1}^{2} w_j^{(2)}\, h^{(1)}\!\left([1\ x]\, w_j^{(1)}\right),$$

$$w_1^{(1)} = \begin{bmatrix} -1.8 \\ -1.1 \end{bmatrix}, \qquad w_2^{(1)} = \begin{bmatrix} -0.6 \\ 3.8 \end{bmatrix}, \qquad w^{(2)} = \begin{bmatrix} -0.1 \\ 2.1 \end{bmatrix}, \qquad w_0^{(2)} = -0.8.$$
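A sketch evaluating the network function f; the activation h^(1) used in the original question is not shown in this extract, so tanh is assumed here purely for illustration:

```python
import numpy as np

w1 = np.array([-1.8, -1.1])    # w_1^(1)
w2 = np.array([-0.6,  3.8])    # w_2^(1)
w_out = np.array([-0.1, 2.1])  # w^(2)
w0 = -0.8                      # w_0^(2)

h = np.tanh   # assumed activation; the actual h^(1) is defined in the full exam text

def f(x):
    # f(x, w) = w0 + sum_j w_j^(2) * h([1, x] @ w_j^(1))
    z1 = h(w1[0] + w1[1] * x)
    z2 = h(w2[0] + w2[1] * x)
    return w0 + w_out[0] * z1 + w_out[1] * z2

xs = np.linspace(-2, 2, 9)
print(np.round(f(xs), 3))
```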
Figure 13: Mixture components in a GMM mixture model with K = 3.

Question 24.
We wish to apply the EM algorithm to fit a 1D GMM mixture model to the single feature x3 from the Avila Bible dataset. At the first step of the EM algorithm, the K = 3 mixture components have densities as indicated by each of the curves in Figure 13 (i.e. each curve is a normalized, Gaussian density N(x; µk, σk)). In the figure, we have indicated the x3-value of a single observation i from the dataset as a black cross. Suppose we wish to apply the EM algorithm to this mixture model beginning with the E-step. We assume the weights of the components are

π = [0.15  0.53  0.32]

and the means/variances of the components are those indicated in the figure. According to the EM algorithm, what is the (approximate) probability the black cross is assigned to mixture component 3 (γik)?

A. 0.4

B. 0.86

C. 0.28

D. 0.58

E. Don't know.
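A sketch of the E-step responsibility computation γik = πk N(xi; µk, σk) / Σj πj N(xi; µj, σj); the means, standard deviations and the x3-value of the black cross must be read off Figure 13, so the numbers below are placeholders:

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.15, 0.53, 0.32])          # component weights
mu = np.array([-1.0, 0.5, 1.5])            # placeholder means (read off Figure 13)
sigma = np.array([0.5, 1.0, 0.7])          # placeholder standard deviations
x_i = 1.0                                  # placeholder x3-value of the black cross

# E-step: responsibility of each component for observation i.
dens = norm.pdf(x_i, loc=mu, scale=sigma)
gamma = pi * dens / np.sum(pi * dens)
print(gamma[2])     # probability the observation is assigned to component 3
```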
Figure 14: Scatter plots of each pair of attributes of vectors x drawn from a multivariate normal distribution of 3 dimensions.

Question 25. Consider a multivariate normal distribution with covariance matrix Σ and mean µ and suppose we generate 1000 random samples from it:

$$x = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}^\top \sim \mathcal{N}(\mu, \Sigma)$$

Plots of each pair of coordinates of the draws x are shown in Figure 14. What is the most plausible covariance matrix?

A. $\Sigma = \begin{bmatrix} 1.0 & 0.65 & -0.65 \\ 0.65 & 1.0 & 0.0 \\ -0.65 & 0.0 & 1.0 \end{bmatrix}$

B. $\Sigma = \begin{bmatrix} 1.0 & 0.0 & 0.65 \\ 0.0 & 1.0 & -0.65 \\ 0.65 & -0.65 & 1.0 \end{bmatrix}$

C. $\Sigma = \begin{bmatrix} 1.0 & -0.65 & 0.0 \\ -0.65 & 1.0 & 0.65 \\ 0.0 & 0.65 & 1.0 \end{bmatrix}$

D. $\Sigma = \begin{bmatrix} 1.0 & 0.0 & -0.65 \\ 0.0 & 1.0 & 0.65 \\ -0.65 & 0.65 & 1.0 \end{bmatrix}$

E. Don't know.
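One way to check a candidate covariance matrix against Figure 14 is to draw samples from it and compare the pairwise correlations (in particular their signs) with the scatter plots. A sketch using candidate D, purely as an example:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[ 1.0,  0.0, -0.65],
                  [ 0.0,  1.0,  0.65],
                  [-0.65, 0.65,  1.0]])   # candidate D, used here only as an example

X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=1000)
print(np.round(np.corrcoef(X, rowvar=False), 2))   # compare signs with the scatter plots
```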
Figure 15: Decision boundaries for a KNN classifier, K = 1, computed for the two observations marked by circles (the colors indicate class labels), but using four different p-distances dp(·, ·) to compute the nearest neighbors.

Question 26.
We consider a K-nearest neighbor (KNN) classifier with K = 1. Recall that in a KNN classifier, we find the nearest neighbors by computing the distances using a distance measure d(x, y). For this problem, we will consider KNN classifiers based on different distance measures based on p-norms,

$$d_p(x, y) = \left(\sum_{j=1}^{M} |x_j - y_j|^p\right)^{1/p}, \qquad p \geq 1,$$

and what decision surfaces they induce. In Figure 15 are shown four different decision boundaries obtained by training the KNN (K = 1) classifiers using the training observations (marked by the two circles in the figure):

$$x_1 = \begin{bmatrix} 0.301 \\ 0.514 \end{bmatrix}, \qquad x_2 = \begin{bmatrix} 0.34 \\ 0.672 \end{bmatrix}$$

Which norms were used in the four KNN classifiers?

E. Don't know.
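A sketch of the p-distance and of how the K = 1 decision regions induced by x1 and x2 can be traced on a grid for different values of p:

```python
import numpy as np

def dist(a, b, p):
    # p-distance d_p(a, b) = (sum_j |a_j - b_j|^p)^(1/p); p = np.inf gives the max-norm.
    if np.isinf(p):
        return np.max(np.abs(a - b), axis=-1)
    return np.sum(np.abs(a - b) ** p, axis=-1) ** (1.0 / p)

x1 = np.array([0.301, 0.514])
x2 = np.array([0.340, 0.672])

# K = 1 decision: a grid point is assigned to whichever training point is closer,
# so the decision boundary is the set of points equidistant from x1 and x2.
xx, yy = np.meshgrid(np.linspace(0, 0.7, 200), np.linspace(0.2, 1.0, 200))
grid = np.stack([xx, yy], axis=-1)
for p in (1, 2, 4, np.inf):
    label = dist(grid, x1, p) < dist(grid, x2, p)   # True -> class of x1
    print(p, label.mean())   # fraction of the grid assigned to x1's class
```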
Question 27. Consider a small dataset comprised of N = 9 observations

x = [0.1  0.3  0.5  1.0  2.2  3.0  4.1  4.4  4.7].

Suppose a k-means algorithm is applied to the dataset with K = 4 and using Euclidean distances. At a given stage of the algorithm the data is partitioned into the blocks:

{0.1, 0.3}, {0.5, 1}, {2.2, 3, 4.1}, {4.4, 4.7}

What clustering will the k-means algorithm eventually converge to?

A. {0.1, 0.3, 0.5, 1}, {2.2}, {}, {3, 4.1, 4.4, 4.7}

B. {0.1, 0.3}, {0.5, 1}, {2.2, 3}, {4.1, 4.4, 4.7}

C. {0.1, 0.3}, {0.5}, {1, 2.2}, {3, 4.1, 4.4, 4.7}

D. {0.1, 0.3}, {0.5, 1, 2.2, 3}, {4.1, 4.4}, {4.7}

E. Don't know.
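A sketch of Lloyd's k-means algorithm in 1D started from the given partition (cluster means and nearest-centroid assignments are alternated until the assignment no longer changes):

```python
import numpy as np

x = np.array([0.1, 0.3, 0.5, 1.0, 2.2, 3.0, 4.1, 4.4, 4.7])

# Initial partition from the question, encoded as a cluster index per observation.
labels = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3])

for _ in range(100):
    # Update step: each centroid becomes the mean of its assigned points.
    centroids = np.array([x[labels == k].mean() for k in range(4)])
    # Assignment step: each point goes to the nearest centroid.
    new_labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels

print([list(x[labels == k]) for k in range(4)])
```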