DDA3020 22
Final Exam
1 Multiple-choice questions (2 points per question) (30 points)

Note that there might be one or more correct option(s). If there is more than one correct answer, you should select all the correct options in order to get full marks. If your answer is correct but incomplete, you will only get partial marks. If any incorrect option is chosen, you will get zero marks.

1.1 Which one(s) are sub-areas of artificial intelligence?

A. Computer Vision
B. Robotics
C. Natural Language Processing
D. Machine Learning
E. Optimization

1.2 Suppose in a 2-class classification problem, our standard logistic regression model achieves 99% accuracy on the training set, but only 51% accuracy on the test set. Which of the following modifications might potentially improve our algorithm's test accuracy?

A. Use regularized logistic regression.
B. Use a polynomial hypothesis function to replace the original hypothesis function w^T x + b.
C. Add more training data.
D. Add more testing data.

1.3 Which of the following statement(s) is/are correct?

A. Backpropagation consists of a forward pass that computes the error, and a backward pass that adjusts the weights.
B. In a feedforward neural network, the information always moves in one direction.
C. A convolutional neural network is a fully connected network.
D. When training a multi-layer perceptron, we compute the error using a loss function at the output layer and pass the gradients from the output layer backwards to the input layer to update the weights.

1.4 Which of the following statement(s) is/are correct?

A. Toss a coin 4 times; then there will be 4 possible events.
B. If the probability of Event A is 0.1 and the probability of Event B is 0.01, then the information of A is smaller than that of B.
C. KL divergence is positive and non-symmetric.
D. Both Binomial and Gaussian distributions are continuous distributions.
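Note (illustrative, not examinable): the quantities in 1.4 B and C are easy to probe numerically. The sketch below uses made-up distributions p and q purely for illustration.

```python
import numpy as np

# Self-information I(A) = -log2 P(A): rarer events carry MORE information.
p_A, p_B = 0.1, 0.01
print(-np.log2(p_A), -np.log2(p_B))      # about 3.32 bits vs 6.64 bits

# KL divergence D(p || q) = sum_i p_i * log(p_i / q_i): non-negative, not symmetric.
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
kl_pq = float(np.sum(p * np.log(p / q)))
kl_qp = float(np.sum(q * np.log(q / p)))
print(kl_pq, kl_qp)                      # both >= 0, but generally unequal
```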
1.5 Which of the following statement(s) is/are correct?

A. When the number of training samples n is very large and the feature dimension d is very small, we prefer to use the closed-form solution rather than the gradient descent method.
B. Linear regression can only be used for regression tasks.
C. The decision boundary of Ridge regression is non-linear in the original feature space.
D. The decision boundary of polynomial linear regression is non-linear in the original feature space.
E. If standard linear regression is overfitting, we can try Ridge regression.
F. If standard linear regression is overfitting, we can try polynomial linear regression.

1.6 Which of the following statement(s) is/are correct?

A. Standard SVM cannot handle non-linearly separated data.
B. SVM with slack variables can handle non-linearly separated data, and its training error could be 0.
C. Kernel SVM can perfectly fit non-linearly separated data.
D. The margin is the distance from the closest points of the positive and negative classes to the decision boundary.

1.7 Which of the following statement(s) is/are correct?

A. PCA can give a non-linear projection of data.
B. Dimensionality reduction tasks can be solved by supervised or unsupervised learning methods.
C. When we project N-dimensional data points to a k-dimensional space, the dimension of the reconstructed points is lower than that of the original data.
D. In PCA, we choose the k eigenvectors with the top k largest eigenvalues to form a new matrix.
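Note (illustrative, not examinable): the eigendecomposition view behind 1.7 can be sketched in a few lines; the random data and variable names below are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 samples with d = 5 features
Xc = X - X.mean(axis=0)                  # center the data first

cov = Xc.T @ Xc / len(Xc)                # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

k = 2
W = eigvecs[:, -k:]                      # k eigenvectors with top-k eigenvalues (option D)
Z = Xc @ W                               # linear projection to k dims (cf. option A)
X_rec = Z @ W.T + X.mean(axis=0)         # reconstruction lives back in R^5 (cf. option C)
print(Z.shape, X_rec.shape)              # (200, 2) (200, 5)
```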
1.8 Which of the following statement(s) is/are correct?

A. Decision trees can only handle training data with numerical attributes.
B. When we set the minimal size of a leaf node to be a large value, then we will grow a deep decision tree.
C. When we set the maximal depth to be a small value, then we will grow a shallow decision tree.
D. When we increase the number of decision trees, the training performance of Bagging will often be better.

1.9 Which of the following would you apply unsupervised learning to?

A. Given the data of body dimensions collected from 1,000 consumers, determine the sizes of the clothes to be produced.
B. Develop a model to predict the stock market.
C. Given a large dataset of medical records of patients, including the disease name, predict the disease of a new patient from the symptoms.
D. Compress a high-resolution image to a low-resolution image.

1.10 A class has 10 students. They received marks for their mid-term quiz as follows.

Student ID  01  02  03  04  05  06  07  08  09  10
Marks       90  50  66  71  82  72  75  68  99  60

To group the students into two tutorial groups according to their marks, we use k-means. We pick student 03 as the initial centroid for Group A, and student 07 for Group B, and assign the students to the two groups using Euclidean distance.

A. We will have 4 students in Group A.
B. We will have 5 students in Group B.
C. If we change the initial centroids, the clustering result of k-means will not be changed.
D. With the above group assignments, we re-estimate the new centroids for the two groups. The new centroid of Group A is 60.
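Note (illustrative, not examinable): one iteration of k-means on these marks is a few lines of arithmetic; the sketch below reproduces the assignment and update steps described above.

```python
import numpy as np

marks = np.array([90, 50, 66, 71, 82, 72, 75, 68, 99, 60])  # students 01..10
c_A, c_B = 66.0, 75.0               # initial centroids: marks of students 03 and 07

# Assignment step: each student joins the group with the closer centroid.
in_A = np.abs(marks - c_A) < np.abs(marks - c_B)
group_A, group_B = marks[in_A], marks[~in_A]
print(len(group_A), len(group_B))   # group sizes (cf. options A and B)

# Update step: re-estimate each centroid as the mean of its group (cf. option D).
print(group_A.mean(), group_B.mean())
```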
1.11 Which of the following statement(s) is/are correct?

A. When TPR is plotted on the y-axis and FPR is plotted on the x-axis, the plot is called the ROC curve.
B. TPR + FPR = 1.
C. Accuracy = (TPR + TNR) / 2.
D. As the classification threshold increases, the FNR increases.

1.12 Which of the following statement(s) is/are correct?

A. When we increase the model complexity, bias drops and variance grows in the training set.
B. When we increase the model complexity, bias drops and variance grows in the testing set.
C. When we increase the model complexity, both bias and variance drop in the training set.
D. When we increase the model complexity, bias grows and variance drops in the testing set.
1.13 Which of the following statement(s) is/are correct?

A. If we fix the covariance matrix Σ as the identity matrix I when fitting a GMM using EM, then it is equivalent to the standard K-means algorithm.
B. It is guaranteed that the EM algorithm can improve the log-likelihood function.
C. The latent variable in latent variable models must be discrete.
D. Suppose that f is a concave function and X is a random variable; then f(E(X)) ≥ E(f(X)).

1.14 Given a linear system Xw = y, where w ∈ R^d is the variable we want to solve, and X ∈ R^{m×d} and y ∈ R^m are the data provided, which of the following statement(s) is/are correct?

A. When m = d, it is guaranteed to obtain a unique solution.
B. When m > d, this linear system is called an under-determined system, and there is no solution.
C. When m > d, this linear system is called an over-determined system, and there is no solution.
D. When m < d, this linear system is called an under-determined system, and there are infinitely many solutions.
E. When m < d, this linear system is called an over-determined system, and there are infinitely many solutions.
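Note (illustrative, not examinable): the over-/under-determined taxonomy in 1.14 can be probed numerically. A minimal sketch, assuming randomly generated (hence almost surely full-rank) data; note that a square X additionally needs to be invertible for a unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-determined (m > d): more equations than unknowns; in general no exact
# solution exists, so least squares minimizes ||Xw - y||^2 instead.
X, y = rng.normal(size=(10, 3)), rng.normal(size=10)
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(residuals)                  # nonzero: Xw = y has no exact solution here

# Under-determined (m < d): fewer equations than unknowns; when consistent there
# are infinitely many solutions, and lstsq returns the minimum-norm one.
X, y = rng.normal(size=(3, 10)), rng.normal(size=3)
w = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(X @ w, y))      # True: one of infinitely many exact solutions
```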
1.15 Suppose we want to train a classifier in a supervised learning manner, in order to automatically evaluate student assignments. In this setting, which of the following statement(s) is/are correct?

A. The collected student assignments which have not been graded by teachers can be used as the experience.
B. The task is to predict the grades of the student assignments.
C. The accuracy of the predicted grades can be used as the performance measure.
D. The accuracy of the students' answers to the assignments can be used as the performance measure.

2 Calculations and Derivations (70 points)

2.1 (5 points) Consider two discrete random variables X and Y, with X ∈ {0, 1, 2} and Y ∈ {2, 3}. P(Y = 2) = 0.4, P(Y = 3) = 0.6. Given Y, X follows a binomial distribution, i.e.,

P(X = k | Y = y) = \frac{n!}{k!(n-k)!} \left(\frac{1}{y}\right)^k \left(1 - \frac{1}{y}\right)^{n-k},

where n = 2.

1. Calculate the probability distribution of X, i.e., P(X). (1 point)

2. Calculate the conditional distribution P(Y | X = 1). (Hint: use Bayes' rule.) (2 points)

3. Calculate E(X) and E(Y | X = 1). (2 points)
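Note (illustrative, not a substitute for the written derivation): since X and Y take only a few values, all three parts can be cross-checked by enumerating the joint distribution.

```python
from math import comb

p_Y = {2: 0.4, 3: 0.6}
n = 2

def p_X_given_Y(k, y):
    # Binomial with n = 2 trials and success probability 1/y.
    return comb(n, k) * (1 / y) ** k * (1 - 1 / y) ** (n - k)

# Part 1: marginal P(X = k) via the law of total probability.
p_X = {k: sum(p_Y[y] * p_X_given_Y(k, y) for y in p_Y) for k in range(n + 1)}

# Part 2: posterior P(Y = y | X = 1) via Bayes' rule.
p_Y_given_X1 = {y: p_Y[y] * p_X_given_Y(1, y) / p_X[1] for y in p_Y}

# Part 3: the two expectations.
print(p_X, p_Y_given_X1)
print(sum(k * p for k, p in p_X.items()),
      sum(y * p for y, p in p_Y_given_X1.items()))
```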
2.2 (15 points) We define a CNN model as

f_CNN(X) = Softmax(FC_1(Conv_2(MP_1(ReLU_1(Conv_1(X)))))).    (1)

The size of the input data X is 36 × 36 × 3; the first convolutional layer Conv_1 includes 10 filters of size 8 × 8 × 3, stride = 2, padding = 1; ReLU_1 indicates the first ReLU layer; MP_1 is a 2 × 2 max pooling layer, stride = 2; the second convolutional layer Conv_2 includes 100 filters of size 5 × 5 × 10, stride = 1, padding = 0; FC_1 indicates the fully connected layer, where there are 10 output neurons; Softmax denotes the Softmax activation function. The ground-truth label of X is denoted as t, and the loss function used for training this CNN model is denoted as L(y, t).

3. Plot the computational graph (CG) of the forward pass of this CNN model. (3 points) (Hint: use z_1, z_2, z_3, z_4, z_5, z_6 to denote the activated values after Conv_1, ReLU_1, MP_1, Conv_2, FC_1, Softmax.)

4. Based on the plotted CG, write down the formulations of the back-propagation algorithm, including the forward and backward pass. (6 points) (Hint: for the forward pass, write down the process of how to get the value of the loss function L(y, t); for the backward pass, write down the process of computing the partial derivative of each parameter, like ∂L/∂w_1, ∂L/∂b_1.)
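Note (illustrative, not examinable): the feature-map sizes implied by the stated hyperparameters follow from the usual output-size formula floor((W − F + 2P) / S) + 1; the helper function below is my own, used only to sanity-check the shapes of z_1 … z_5.

```python
def out_size(w, f, s, p):
    # floor((W - F + 2P) / S) + 1 for square inputs and square filters.
    return (w - f + 2 * p) // s + 1

w = 36                        # input X: 36 x 36 x 3
w = out_size(w, 8, 2, 1)      # Conv1: 10 filters, 8 x 8 x 3, stride 2, padding 1
print(w, "x", w, "x 10")      # ReLU1 keeps this shape
w = out_size(w, 2, 2, 0)      # MP1: 2 x 2 max pooling, stride 2
print(w, "x", w, "x 10")
w = out_size(w, 5, 1, 0)      # Conv2: 100 filters, 5 x 5 x 10, stride 1, padding 0
print(w, "x", w, "x 100")
print(w * w * 100, "-> 10")   # flattened input to FC1, which has 10 output neurons
```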
2.3 (17 points) Derivation of SVM's objective function and solution according to the dual problem. Given the training data {(x_i, y_i)}_{i=1}^{n} and denoting the parameters as w, b, the objective function of SVM is formulated as follows:

min_{w,b}  (1/2) ||w||^2    (2)
s.t.  y_i (w^T x_i + b) ≥ 1,  ∀i.

1. Derive the above objective from the perspective of the large margin. (Hint: The margin is defined as the closest distance from the data points to the hyperplane. We wish to find a hyperplane that separates the data while having the largest margin among all hyperplanes. The distance of a point x to the hyperplane given by f_{w,b}(x) := w^T x + b = 0 is |f_{w,b}(x)| / ||w||.) (6 points)

2. Derive the above objective from the perspective of the hinge loss. (Hint: The objective function of SVM using the hinge loss is given by

C \sum_{i=1}^{m} hinge_loss(x_i, y_i; w, b) + (1/2) ||w||^2.

You can first write down the explicit expression of the hinge loss.) (5 points)

3. Derive the solution of w, b according to the Lagrangian function and the KKT conditions, and state when a training point is called a support vector. (Hint: first write down the Lagrangian function, then the KKT conditions, then use the KKT conditions to derive the solution.) (6 points)
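Note (for orientation only; part 3 expects the full derivation): one standard form of the hard-margin Lagrangian and the resulting KKT stationarity conditions is sketched below.

```latex
% Sketch of the standard hard-margin setup for part 3 (shape of the answer only):
L(w, b, \alpha) = \frac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} \alpha_i \big( y_i (w^\top x_i + b) - 1 \big),
  \qquad \alpha_i \ge 0.
% Stationarity (KKT):
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0.
% Complementary slackness: \alpha_i \big( y_i (w^\top x_i + b) - 1 \big) = 0, so any
% point with \alpha_i > 0 lies on the margin, y_i (w^\top x_i + b) = 1: a support vector.
```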
2.4 (10 points) Consider the figure below:

[Figure: scatter plot of red circles (positive class) and blue squares (negative class) with a black decision boundary; not reproduced here.]

The red circles represent the data points of the positive class, and the blue squares represent the negative class. We construct a model to do the prediction and get a decision boundary, shown as the black curve. The points on the top right of the decision boundary are predicted to be positive, and the other points on the bottom left of the boundary are predicted to be negative.

1. Write down the confusion matrix of this classification. (2 points)

2. Calculate the precision, recall, and accuracy of this classification method. (3 points)

3. Calculate the FNR and FPR. (2 points)

4. Suppose that the posterior probabilities of the 6 positive data points, p(y_i = + | x_i), are 0.2, 0.5, 0.7, 0.7, 0.8, and 0.9, respectively. The posterior probabilities of the 4 negative data points, p(y_i = + | x_i), are 0.1, 0.3, 0.5, and 0.6, respectively. Calculate the AUC of this model. (3 points)
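Note (illustrative cross-check, not a substitute for the working): part 4 can be verified with the pairwise-ranking interpretation of AUC, i.e., the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as 1/2.

```python
pos = [0.2, 0.5, 0.7, 0.7, 0.8, 0.9]   # p(y_i = + | x_i) for the 6 positives
neg = [0.1, 0.3, 0.5, 0.6]             # p(y_i = + | x_i) for the 4 negatives

# Count, over all positive-negative pairs, how often the positive ranks higher.
wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
print(wins / (len(pos) * len(neg)))    # AUC over all 24 pairs
```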
2.5 (15 points) Suppose that a random variable x follows the Gaussian mixture distribution p(x) = \sum_{k=1}^{K} π_k N(x | µ_k, Σ_k), where Θ = {π, µ, Σ}, π = {π_1, ..., π_K}, µ = {µ_1, ..., µ_K}, Σ = {Σ_1, ..., Σ_K}, and the latent variable z ∼ Categorical(π), with π_k ≥ 0 and \sum_k π_k = 1.

1. Likelihood Decomposition. Show that

ln p(D; Θ) ≥ L(q; Θ),  ∀q, Θ,

where

L(q; Θ) = \sum_{n=1}^{N} E_{q_n(z^{(n)})} ln \frac{p(x^{(n)}, z^{(n)}; Θ)}{q_n(z^{(n)})},

and q(z) = \prod_{n=1}^{N} q_n(z^{(n)}). Also, write down the gap between ln p(D; Θ) and L(q; Θ). (6 points)

2. E-Step Derivation. With given Θ = {π, µ, Σ}, update q(z). (3 points)

3. M-Step Derivation. Given q(z), update Θ = {π, µ, Σ}. (6 points)
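Note (for orientation only; part 1 expects the derivation): the gap asked for is a KL divergence, via the standard evidence-lower-bound identity sketched below.

```latex
% The gap between the log likelihood and L(q; \Theta) is a sum of KL divergences:
\ln p(D; \Theta) = \mathcal{L}(q; \Theta)
  + \sum_{n=1}^{N} \mathrm{KL}\!\left( q_n(z^{(n)}) \,\middle\|\, p(z^{(n)} \mid x^{(n)}; \Theta) \right).
% Since each KL term is non-negative, \ln p(D; \Theta) \ge \mathcal{L}(q; \Theta), with
% equality iff q_n(z^{(n)}) = p(z^{(n)} \mid x^{(n)}; \Theta), which is the E-step choice.
```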