
CS 189 Introduction to Machine Learning

Spring 2019 Final Exam


• Please do not open the exam before you are instructed to do so.
• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your
cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
• When you start, the first thing you should do is check that you have all 12 pages and all 6 questions. The second
thing is to please write your initials at the top right of every page after this one (e.g., write “JS” if you are Jonathan
Shewchuk).
• The exam is closed book, closed notes except your two cheat sheets.

• You have 3 hours.


• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
• The total number of points is 150. There are 26 multiple choice questions worth 3 points each, and 5 written questions
worth a total of 72 points.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice,
but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct
answers must be checked.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

Q1. [78 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.

(a) [3 pts] Which of the following algorithms can learn nonlinear decision boundaries? The decision trees use only axis-aligned splits.

• A depth-five decision tree
• Quadratic discriminant analysis (QDA)
• AdaBoost with depth-one decision trees
• Perceptron


The solutions are obvious except for AdaBoost with depth-one decision trees, which can form nonlinear boundaries because the final classifier is not actually a linear combination of the linear weak learners. Do not assume added features on an exam unless the question says otherwise: if every ML exam question carried an assumption of "you can add any extra features you want unless we explicitly say otherwise," then all of our exam answers would be wrong. As for AdaBoost, you do end up with a −1 or 1 at the end, but the decision boundary can be nonlinear.
(b) [3 pts] Which of the following classifiers are capable of achieving 100% training accuracy on the data below? The decision trees use only axis-aligned splits.

• Logistic regression
• A neural network with one hidden layer
• AdaBoost with depth-one decision trees
• AdaBoost with depth-two decision trees

AdaBoost with depth-one decision trees: false. Each weak learner will either classify the points from each pair in different classes, or classify every point in the same class. Since the meta-classifier is a weighted sum of these weak classifiers, each of which has 50% training accuracy, the meta-classifier cannot have 100% accuracy.

A neural network with one hidden layer: true; with enough hidden units, it is a universal function approximator.

Logistic regression: false; it finds a linear decision boundary, which cannot separate the data.

AdaBoost with depth-two decision trees: true; a depth-two decision tree can fully separate the data.

(c) [3 pts] Which of the following are true of support vector machines?

• Increasing the hyperparameter C tends to decrease the training error
• The hard-margin SVM is a special case of the soft-margin SVM with the hyperparameter C set to zero
• Increasing the hyperparameter C tends to decrease the margin
• Increasing the hyperparameter C tends to decrease the sensitivity to outliers

Increasing C decreases training error: true, from the lecture notes.

Hard-margin SVM as the C = 0 case: false; the hard-margin SVM is the limiting case as C tends toward infinity.

Increasing C decreases the margin: true; slack becomes more expensive, so the optimizer tolerates fewer violations and accepts a narrower margin. (Conversely, decreasing C makes slack less expensive, so you allow data points to be farther on the wrong side of the margin and make the margin bigger.)

Increasing C decreases sensitivity to outliers: false; making slack more expensive makes the classifier more sensitive to outliers, not less.
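As a quick empirical check of the effect of C (not part of the exam), here is a sketch using scikit-learn; the data, seed, and C values are illustrative. As C grows, the margin width 2/‖w‖ of a linear soft-margin SVM shrinks.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two slightly overlapping Gaussian blobs.
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Margin width is 2 / ||w||; larger C penalizes slack more, so the
    # optimizer settles for a narrower margin.
    print(f"C = {C:7.2f}   margin width = {2 / np.linalg.norm(w):.3f}")
```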

(d) [3 pts] Let r(x) be a decision rule that minimizes the risk for a three-class classifier with labels y ∈ {0, 1, 2} and an
asymmetric loss function. What is true about r(·)?

• ∀y ∈ {0, 1, 2}, ∃x : r(x) = y
• If we don't have access to the underlying data distribution P(X) or P(Y|X), we cannot exactly compute the risk of r(·)
• ∀x, r(x) is a class y that maximizes the posterior probability P(Y = y|X = x)
• If P(X = x) changes but P(Y = y|X = x) remains the same for all x, y, then r(X) still minimizes the risk

Every class is output somewhere: false; it is possible that r(x) is the same class for all x.

Maximizing the posterior: no, because the loss is asymmetric, so the risk-minimizing class need not be the most probable one.

Computing the risk requires the distributions: true; by the definition of risk, we need to be able to compute expectations over these two distributions.

Changing P(X) alone: true; given that r has no constraint, it can pick the y that minimizes risk for every X = x without trade-offs. Therefore, if only the marginal changes, that choice is not affected.

(e) [3 pts] Which of the following are true about two-class Gaussian discriminant analysis? Assume you have estimated the
parameters µ̂C , Σ̂C , π̂C for class C and µ̂D , Σ̂D , π̂D for class D.
• If µ̂C = µ̂D and π̂C = π̂D, then the LDA and QDA classifiers are identical
• If Σ̂C = I (the identity matrix) and Σ̂D = 5I, then the LDA and QDA classifiers are identical
• If Σ̂C = Σ̂D, π̂C = 1/6, and π̂D = 5/6, then the LDA and QDA classifiers are identical
• If the LDA and QDA classifiers are identical, then the posterior probability P(Y = C|X = x) is linear in x
µ̂C = µ̂D and π̂C = π̂D: false; the covariance matrices might differ, making the QDA decision function nonlinear.

Σ̂C = I and Σ̂D = 5I: false; the QDA decision function is nonlinear. Check the figure on p. 36 of Lecture 7 and look at the two sites furthest to the right, one of which has a much wider Gaussian than the other. That's the QDA classifier for this example (maybe the constant of 5 is not quite right, but close enough), and it has an elliptical decision boundary. For the same points, the LDA decision boundary would just be a straight line.

Σ̂C = Σ̂D with π̂C = 1/6 and π̂D = 5/6: correct; when the estimated covariance matrices are equal, the quadratic terms of the two discriminant functions cancel, so QDA reduces to LDA regardless of the priors.

LDA and QDA identical implies a linear posterior: no; the posterior is a logistic function, not a linear one.
(f) [3 pts] Consider an n × d design matrix X with labels y ∈ R^n. What is true of fitting this data with dual ridge regression with the polynomial kernel k(Xi, Xj) = (Xi⊤Xj + 1)^p = Φ(Xi)⊤Φ(Xj) and regularization parameter λ > 0?

• If the polynomial degree is high enough, the polynomial will fit the data exactly
• The algorithm computes Φ(Xi) and Φ(Xj) in O(d^p) time
• The algorithm solves an n × n linear system
• When n is very large, this dual algorithm is more likely to overfit than the primal algorithm with degree-p polynomial features
Solves an n × n linear system: true; see the definition of dual ridge regression.

Dual more likely to overfit: false; both algorithms give the same solution, no matter how large n is!

Fits the data exactly: false. The dual formulation of ridge regression is indeed recommended only when d > n, but here we use a kernel, so the effective number of features is d′ ≈ d^p. More importantly, ridge regression never fits the training data perfectly: the regularization term (with λ > 0) prevents overfitting the training data regardless of the degree of the polynomial features.

Computes Φ(Xi) and Φ(Xj) in O(d^p) time: false; there is no need to compute Φ at all, just the dot product, which is easily obtained in O(d) time as (Xi⊤Xj + 1)^p.
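To make the last point concrete, here is a small sketch (degree p = 2; the feature map phi below is one valid choice, ours for illustration) showing that the kernel value (x·z + 1)² equals the dot product of explicit quadratic features, while costing O(d) rather than O(d²) to evaluate:

```python
import numpy as np

def phi(x):
    # One explicit feature map matching k(x, z) = (x.z + 1)^2:
    # a constant, scaled linear terms, and all pairwise products.
    # Its length is 1 + d + d^2, so computing it takes O(d^2) time.
    return np.concatenate(([1.0], np.sqrt(2) * x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, z = rng.standard_normal(5), rng.standard_normal(5)

k_trick = (x @ z + 1) ** 2      # O(d) work
k_explicit = phi(x) @ phi(z)    # O(d^2) work
print(np.isclose(k_trick, k_explicit))  # True
```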
(g) [3 pts] Consider the kernel perceptron algorithm on an n × d design matrix X. We choose a matrix M ∈ R^{D×d} and define the feature map Φ(x) = Mx ∈ R^D and the kernel k(x, z) = Φ(x) · Φ(z). Which of the following are always true?

• The kernel matrix is XM⊤MX⊤
• If the primal perceptron algorithm terminates, then the kernel perceptron algorithm terminates
• The kernel matrix is MX⊤XM⊤
• If the kernel perceptron algorithm terminates, then the primal perceptron algorithm terminates
The kernel matrix is XM⊤MX⊤: yes, because K = Φ(X)Φ(X)⊤ and Φ(X) = XM⊤.

The kernel matrix is MX⊤XM⊤: no; algebraically, substituting Φ(X) = XM⊤ gives XM⊤MX⊤, and MX⊤XM⊤ is a D × D matrix rather than n × n.

Kernel terminates implies primal terminates: true; if the projected data is linearly separable, the original data must be linearly separable too, since u · Mx = (M⊤u) · x.

Primal terminates implies kernel terminates: false; the reverse is not true, because the map x ↦ Mx can make linearly separable data inseparable.

(h) [3 pts] Which of the following are true of decision trees? Assume splits are binary and are done so as to maximize the
information gain.

• If there are at least two classes at a given node, there exists a split such that information gain is strictly positive
• As you go down any path from the root to a leaf, the information gain at each level is non-increasing
• The deeper the decision tree is, the more likely it is to overfit
• Random forests are less likely to overfit than decision trees
Strictly positive information gain: false; recall the XOR example from section. For the first split on an XOR example, there is no binary, axis-aligned split that gives positive information gain (the entropy before and after the split is the same).

Non-increasing information gain: false; in the same example, the information gain at the second split is larger than at the first.

Deeper trees overfit more: correct.

Random forests: correct; they are less likely to overfit than decision trees.

(i) [3 pts] While solving a classification problem, you use a pure, binary decision tree constructed by the standard greedy
procedure we outlined in class. While your training accuracy is perfect, your validation accuracy is unexpectedly low.
Which of the following, in isolation, is likely to improve your validation accuracy in most real-world applications?

• Lift your data into a quadratic feature space
• Select a random subset of the features and use only those in your tree
• Normalize each feature to have variance 1
• Prune the tree, using validation to decide how to prune

Quadratic feature space: false; lifting to a more complex feature space will not generally stop you from overfitting.

Random subset of features: false; a single tree restricted to a random subset of the features just loses information, without the variance reduction that an ensemble of such trees (a random forest) would provide.

Normalizing each feature: false; this small change does not alter which axis-aligned splits are available, so it will not generally stop you from overfitting.

Pruning: correct; lowering the depth defends against overfitting.

(j) [3 pts] For the sigmoid activation function and the ReLU activation function, which of the following are true in general?

• Both activation functions are monotonically non-decreasing
• Both functions have a monotonic first derivative
• Compared to the sigmoid, the ReLU is more computationally expensive
• The sigmoid derivative s′(γ) is quadratic in s(γ)

Monotonically non-decreasing: true; simply graph the two activation functions.

Monotonic first derivative: false; the sigmoid has a non-monotonic derivative (it rises, peaks at γ = 0, then falls).

ReLU more expensive: false; the ReLU is simpler, since all positive inputs have derivative 1 and all negative inputs have derivative 0, while the sigmoid requires calculating an exponential.

s′(γ) quadratic in s(γ): true; s′(γ) = s(γ)(1 − s(γ)) = s(γ) − s(γ)², which is quadratic in s(γ).
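For reference, the last claim is a one-line computation:

```latex
s(\gamma) = \frac{1}{1 + e^{-\gamma}}, \qquad
s'(\gamma) = \frac{e^{-\gamma}}{(1 + e^{-\gamma})^2}
           = s(\gamma)\,\bigl(1 - s(\gamma)\bigr)
           = s(\gamma) - s(\gamma)^2,
```

which is quadratic in s(γ).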

(k) [3 pts] Which of the following are true in general for backpropagation?

• It is a dynamic programming algorithm
• Some of the derivatives cannot be fully computed until the backward pass
• The weights are initially set to zero
• Its running time grows exponentially in the number of layers
Dynamic programming: true; backpropagation is not a model but a quick algorithm for computing derivatives in the network, reusing intermediate results across a forward pass and a backward pass.

Derivatives in the backward pass: true; some derivatives cannot be fully computed until the backward pass.

Weights set to zero: false; the weights are set randomly.

Exponential running time: false; the running time is linear in the size of the network.


(l) [3 pts] Facets of neural networks that have (reasonable, though not perfect) analogs in human brains include

• backpropagation
• linear combinations of input values
• convolutional masks applied to many patches
• edge detectors

(m) [3 pts] Which of the following are true of the vanishing gradient problem for sigmoid units?

• Deeper neural networks tend to be more susceptible to vanishing gradients
• If a unit has the vanishing gradient problem for one training point, it has the problem for every training point
• Using ReLU units instead of sigmoid units can reduce this problem
• Networks with sigmoid units don't have this problem if they're trained with the cross-entropy loss function

Deeper networks more susceptible: true; as the number of layers goes up, the gradient is more likely to vanish during backpropagation. If one unit yields a gradient close to zero, the units in the earlier layers receive gradients that are smaller still. This is the incentive for ResNets.

Problem for every training point: false; whether a sigmoid unit saturates depends on its input, so a unit can have vanishing gradients for one training point and not another.

ReLU reduces the problem: true; ReLU is generally better since its gradient is 1 for all positive inputs and does not saturate.

Cross-entropy fixes it: false; the loss function does not prevent hidden sigmoid units from saturating, and once gradients vanish, the weights have effectively stopped changing their values.

(n) [3 pts] Suppose our input is two-dimensional sample points, with ten non-exclusive classes those points may belong to
(i.e., a point can belong to more than one class). To train a classifier, we build a fully-connected neural network (with
bias terms) that has a single hidden layer of twenty units and an output layer of ten units (one for each class). Which
statements apply?

• For the output units, softmax activations are more appropriate than sigmoid activations
• This network will have 240 trainable parameters
• For the hidden units, ReLU activations are more appropriate than linear activations
• This network will have 270 trainable parameters

Softmax will create a valid probability distribution across all the outputs, making it well suited to predicting the single class a point is most likely to belong to, but not to predicting whether or not the point is in each class. Sigmoid will give us a valid in-class probability for each class independently, allowing us to make a prediction for each class separately, so softmax outputs are not more appropriate here.

There are 2 · 20 + 20 = 60 parameters in the first layer and 20 · 10 + 10 = 210 in the second, giving us a total of 270 parameters.

With a linear activation on the hidden layer, this network reduces to a perceptron and cannot model nonlinear decision boundaries, so ReLU activations are more appropriate.

Randomly initializing the weights breaks the symmetry of the network; it's okay (and in fact standard practice) to initialize the bias terms to zero.

(o) [3 pts] Which of the following can lead to valid derivations of PCA?

• Fit the mean and covariance matrix of a Gaussian distribution to the sample data with maximum likelihood estimation
• Find the direction w that minimizes the sample variance of the projected data
• Find the direction w that minimizes the sum of projection distances
• Find the direction w that minimizes the sum of squares of projection distances

This is best explained by the lecture notes - in particular, lecture note 20 from the Spring 2019 iteration of the course.

(p) [3 pts] Write the SVD of an n × d design matrix X (with n ≥ d) as X = UDV⊤. Which of the following are true?

• The components of D are all nonnegative
• If X is a real, symmetric matrix, the SVD is always the same as the eigendecomposition
• The columns of V all have unit length and are orthogonal to each other
• The columns of D are orthogonal to each other
Components of D nonnegative: true.

SVD equals the eigendecomposition: false. Consider a symmetric matrix with negative eigenvalues, like [[−1, 0], [0, −2]], whose eigenvalues are −1 and −2. Singular values cannot be negative (by definition), so the singular values are 1 and 2. The SVD makes up for the signs of the singular values by making the right singular vectors have signs opposite those of the left singular vectors.

Columns of V: true; they have unit length and are mutually orthogonal.

Columns of D orthogonal: false. If X is not full rank, some singular values are zero, so some columns of D are zero vectors (they correspond to the null space of X), and a set containing zero vectors is not an orthogonal set of directions.
(q) [3 pts] Which of the following is true about Lloyd’s algorithm for k-means clustering?

• It is a supervised learning algorithm
• It never returns to a particular assignment of classes to sample points after changing to another one
• If run for long enough, it will always terminate
• No algorithm (Lloyd's or any other) can always find the optimal solution
k-means is an unsupervised learning algorithm; the number of clusters k is a hyperparameter.

"No algorithm can always find the optimal solution": false. The k-means optimization problem is NP-hard and Lloyd's algorithm finds only a local optimum (Lecture 21), but the statement should say "unless P = NP, no polynomial-time algorithm can always find the optimal solution." There is a trivial brute-force algorithm that finds the optimal solution.
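A minimal NumPy sketch of Lloyd's algorithm (function and variable names are ours) makes the termination claim concrete: neither the assignment step nor the update step ever increases the k-means cost, and there are finitely many possible assignments, so the loop must stop, though possibly at a local optimum.

```python
import numpy as np

def lloyd(X, k=2, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assign = None
    while True:
        # Assignment step: each point joins the cluster of its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):  # no change: local optimum
            return centers, assign
        assign = new_assign
        # Update step: each center moves to the mean of its cluster.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
centers, assign = lloyd(X)
print(np.round(centers, 2))  # one center near (-2, -2), one near (2, 2)
```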
(r) [3 pts] Which of the following are advantages of using k-medoid clustering instead of k-means?

• k-medoids is less sensitive to outliers
• Medoids make more sense than means for non-Euclidean distance metrics
• Medoids are faster to compute than means
• The k-medoids algorithm with the Euclidean distance metric has no hyperparameters, unlike k-means

Both k-means and k-medoids have k as a hyperparameter. Medoids are much more expensive to compute than means (they require calculating all pairwise distances within a cluster, rather than just summing all the points and averaging).

(s) [3 pts] We wish to cluster 2-dimensional points into two clusters, so we run Lloyd’s algorithm for k-means clustering until
convergence. Which of the following clusters could it produce? (The points inside an oval belong to the same cluster).

(Figure: four candidate two-cluster assignments of the points, shown as ovals.)

The algorithm converges when none of the points are closer to the mean of a cluster they are not currently assigned to. In the top left image, none of the points are closer to the mean of the cluster they were not assigned to, so Lloyd's algorithm could produce that clustering. In the top right image, the mean of the top cluster would be at (0, 3) and the mean of the bottom cluster would be somewhere around (0, −1.7); the point on the y-axis around (0, 1.5) is closer to the mean of the top cluster than to the mean of the bottom cluster it is currently assigned to, so at the next iteration of the algorithm that point would switch clusters, and Lloyd's algorithm has not converged.

Because finding a solution to the k-means optimization problem is NP-hard, we use Lloyd's algorithm, which is an iterative approach. While this algorithm is guaranteed to converge, it may end up at a local minimum that is not globally optimal.
(t) [3 pts] Which of the following are true of hierarchical clustering?

• The number k of clusters is a hyperparameter
• The greedy agglomerative clustering algorithm repeatedly fuses the two clusters that minimize the distance between clusters
• Complete linkage works only with the Euclidean distance metric
• During agglomerative clustering, single linkage is more sensitive to outliers than complete linkage

k as a hyperparameter: false; part of the point of a hierarchy is that you don't have to guess k in advance.

Greedy agglomerative clustering: correct.

Complete linkage requires Euclidean distances: false; complete linkage is compatible with any distance function.

Single linkage vs. complete linkage: correct; single linkage is very sensitive to outliers.

(u) [3 pts] Which of the following are true of spectral clustering?

• The Fiedler vector is the eigenvector associated with the second largest eigenvalue of the Laplacian matrix
• Nobody knows how to find the sparsest cut in polynomial time
• The relaxed optimization problem for partitioning a graph involves minimizing the Rayleigh quotient of the Laplacian matrix and an indicator vector (subject to a constraint)
• The Laplacian matrix of a graph is invertible

The Fiedler vector: false; it corresponds to the second smallest eigenvalue.

Sparsest cut: true; finding it is NP-hard.

The relaxed optimization problem: true; it minimizes the Rayleigh quotient subject to constraints.

Invertibility: false; the Laplacian is never invertible, because the all-ones vector 1 is always in its null space.

(v) [3 pts] For binary classification, which of the following statements are true of AdaBoost?

• It can be applied to neural networks
• It uses the majority vote of learners to predict the class of a data point
• The metalearner provides not just a classification, but also an estimate of the posterior probability
• The paper on AdaBoost won a Gödel Prize

For part (w):

Lower bias than a single decision tree: false. When we use a single decision tree, we usually make it very deep. If there are no two coincident points in different classes and we always refine until the leaves are pure, the bias is zero; it's hard to get lower than zero bias.

Scaling the loss function by wi: false. Decision trees don't use a loss function. They choose splits locally with a cost function based on entropy, but you don't use wi to scale the entropy. Rather, you redefine pC from "the proportion of points in S that are in class C" to "the proportion of weight in S that is in class C" before you compute the entropy.

(w) [3 pts] For binary classification, which of the following statements are true of AdaBoost with decision trees?

• It usually has lower bias than a single decision tree
• It is popular because it usually works well even before any hyperparameter tuning
• To use the weight wi of a sample point Xi when training a decision tree G, we scale the loss function L(G(Xi), yi) by wi
• It can train multiple decision trees in parallel

(x) [3 pts] Which of the following are reasons one might choose latent factor analysis (LFA) over k-means clustering to group
together n data points in Rd ?

• LFA is not sensitive to how you initialize it, whereas Lloyd's algorithm is
• LFA allows us to consider points as belonging to multiple "overlapping" clusters, whereas in k-means, each point belongs to only one cluster
• In market research, LFA can distinguish different consumer types, whereas k-means cannot
• k-means requires you to guess k in advance, whereas LFA makes it easier to infer the right number of clusters after the computation

The first choice is true due to the curse of dimensionality. The second one is false: LFA is more expensive than k-means. For I iterations and k clusters, k-means runs in O(nkId) time, whereas the SVD takes O(min(dn², d²n)) time. The third one is true: we can measure how much a user vector belongs to a particular cluster by taking its inner product with the corresponding singular vector. The fourth one is false because of the above application.

(y) [3 pts] Which of the following are true for k-nearest neighbor classification?

• It is more likely to overfit with k = 1 (1-NN) than with k = 1,000 (1,000-NN)
• In very high dimensions, exhaustively checking every training point is often faster than any widely used competing exact k-NN query algorithm
• If you have enough training points drawn from the same distribution as the test points, k-NN can achieve accuracy almost as good as the Bayes decision rule
• The optimal running time to classify a point with k-NN grows linearly with k

Overfitting with k = 1: correct; smaller k's overfit more.

Exhaustive checking in high dimensions: true; this is an empirical fact.

Accuracy approaching the Bayes decision rule: true (Fix & Hodges, 1951).

Running time linear in k: false; it's poly.

(z) [3 pts] Suppose we use the k-d tree construction and query algorithms described in class to find the approximate nearest
neighbor of a query point among n sample points. Select the true statements.

• It is possible to guarantee that the tree has O(log n) depth by our choice of splitting rule at each treenode
• Sometimes we permit the k-d tree to be unbalanced so we can choose splits with better information gain
• Querying the k-d tree is faster than querying a Voronoi diagram for sample points in R²
• Sometimes the query algorithm declines to search inside a box that's closer to the query point than the nearest neighbor it's found so far
Unbalanced trees for information gain: false; there is no entropy or notion of information gain in the k-d tree algorithm.

Declining to search a nearby box: true; if the approximate nearest neighbor found so far is good enough, the query algorithm won't search for a better one.

k-d tree vs. Voronoi in R²: it is hard to beat O(log n) query time. Too bad Voronoi diagrams are only guaranteed to be that fast in 2D, and they can only be used for 1-NN. Nevertheless, if you need to do 1-NN on a large 2D dataset and you will be doing a lot of queries once you've built the search structure, using a Voronoi diagram is actually really fast and effective. Just don't try to write the code yourself; get a library written by an expert.

Q2. [17 pts] Getting Down(hill) with the Funk Function
The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films. Competitors were given an n × d ratings matrix R; entry R_jk is user j's rating of movie k. Because users only watch a small fraction of
the movies, most entries in R are unobserved, hence filled with a default value of zero. Latent factor analysis attempts to predict
missing ratings by replacing R with a low-rank approximation, which is a truncated singular value decomposition (SVD).

(a) [4 pts] Given the SVD R = UDV⊤, write a formula for the rank-r truncated SVD R′ for comparison; make sure you explain your notation. Then write the standard restrictions (imposed by the definition of SVD) on U, D, and V.

The rank-r truncated SVD of R is R′ = Σ_{i=1}^r δi ui vi⊤, where δi is the ith diagonal component of D (the ith singular value), ui is column i of U, and vi is column i of V. The standard restrictions are U⊤U = I, V⊤V = I, and D is a diagonal matrix with nonnegative components.
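A short NumPy sketch of this formula (illustrative):

```python
import numpy as np

def truncated_svd(R, r):
    # R = U diag(s) Vt, with the singular values s sorted in decreasing order.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Sum of the first r rank-one terms s_i * u_i v_i^T.
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

R = np.arange(12, dtype=float).reshape(4, 3)
print(np.linalg.matrix_rank(truncated_svd(R, 2)))  # at most 2
```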

LFA leaves plenty of room for improvement. Simon Funk (a pseudonym, but a real person), who at one point was ranked third in the competition, developed a method called "Funk SVD." Recall that the rank-r truncated SVD R′ minimizes the Frobenius norm ‖R − R′‖_F, subject to the constraint that R′ has rank r. Mr. Funk modified this approach to learn two matrices A ∈ R^{n×r} and B ∈ R^{r×d} such that AB ≈ R. The rank of AB cannot exceed r. Let a_j be the jth row of A, let b_k be the kth column of B, and observe that (AB)_{jk} = a_j · b_k. Mr. Funk solves the problem of finding matrices A and B that minimize the objective function

L(A, B) = Σ_{j,k : R_{jk} ≠ 0} (R_{jk} − a_j · b_k)².

The key difference between this objective function and the one optimized by the truncated SVD is that the summation is over
only nonzero components of R. Instead of computing an SVD, Mr. Funk minimizes this objective with gradient descent.

(b) [2 pts] Explain why the optimal solution is not unique; that is, there is more than one pair of optimal matrices (A, B).
If (A, B) is an optimal pair, then (2A, ½B) yields the same product AB, hence the same objective value.

(c) [5 pts] Although Mr. Funk uses stochastic gradient descent, we will derive a batch gradient descent algorithm. It turns out to be easiest to write the update rule for A one row at a time. State the gradient descent rule for updating row a_j during the minimization of Mr. Funk's objective function L(A, B). Use some step size ε > 0. (Be careful that you sum only the correct terms!) (Note: there is a symmetric rule for updating b_k; the algorithm must update both A and B.)

∇_{a_j} L(A, B) = −2 Σ_{k : R_{jk} ≠ 0} (R_{jk} − a_j · b_k) b_k.

Hence the gradient descent update is

a_j ← a_j + 2ε Σ_{k : R_{jk} ≠ 0} (R_{jk} − a_j · b_k) b_k.

(You may omit the factor of 2.)
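The update vectorizes naturally over all rows of A (and all columns of B) at once. A sketch, assuming a small random ratings matrix in which 0 marks an unobserved entry (all names and constants are ours):

```python
import numpy as np

def funk_svd_step(R, A, B, eps=1e-3):
    mask = (R != 0)                # sum only over observed (nonzero) ratings
    E = mask * (R - A @ B)         # residuals R_jk - a_j . b_k, 0 if unobserved
    A_new = A + 2 * eps * E @ B.T  # a_j += 2 eps sum_k (R_jk - a_j.b_k) b_k
    B_new = B + 2 * eps * A.T @ E  # symmetric rule for the columns b_k
    return A_new, B_new

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(8, 5)).astype(float)  # 0 means "unobserved"
A = 0.1 * rng.standard_normal((8, 2))              # random init, not zero!
B = 0.1 * rng.standard_normal((2, 5))
for _ in range(2000):
    A, B = funk_svd_step(R, A, B)
# The squared error on observed entries decreases over the iterations.
print(((R - A @ B)[R != 0] ** 2).sum())
```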

(d) [3 pts] What will happen if you initialize Funk SVD by setting A ← 0 and B ← 0? Suggest a better initialization.

If the matrices are initialized to zero, the gradient is also zero, so the gradient descent rule can never make them nonzero. There are many better initializations: you could use A ← UD and B ← V⊤, or better yet, A ← UD^{1/2} and B ← D^{1/2}V⊤. A random initialization will generally work okay. Note that a choice in which all the components of a_j are the same and all the components of b_k are the same will not work.

(e) [3 pts] Consider the special case where r = 1 and the matrix R has no zero entries. In this case, what is the relationship between an optimal solution A, B and the rank-one truncated singular value decomposition?

AB = δ1 u1 v1⊤. (The rank-1 truncated SVD is δ1 u1 v1⊤, and it minimizes the Frobenius norm reconstruction error, which is equivalent to Mr. Funk's objective when R has no zeros.)

Q3. [10 pts] Decision Boundaries
In the question, you will draw the decision boundaries that classifiers would learn.

(a) [6 pts] Given the sample points below, draw and label two lines: the decision boundary learned by a hard-margin SVM
and the decision boundary learned by a soft-margin SVM. We are not specifying the hyperparameter C, but don’t make
C too extreme. (We are looking for a qualitative difference between hard- and soft-margin SVMs.) Label the two lines
clearly.
Also draw and label four dashed lines to show the margins of both SVMs.

Solution: With a soft-margin SVM, we care more about maximizing the distance from the margin to each of the data points, and less about classifying every point correctly. For these points, the way to achieve such an objective is a downward-sloped line, which achieves wide margins from most of the points, as opposed to the hard-margin SVM.

Regarding "the same center line": a horizontal line for the soft-margin SVM would not receive credit, since it does not maximize the distance from the points compared to the downward-sloping line.

(b) [4 pts] Given the sample points below, draw and label two curves: the decision boundary learned by LDA and the
decision boundary learned by QDA. Label the two curves clearly.

Solution: Draw the isocontours first. Say, three isocontours for each of the two classes, big enough to overlap each other. Then look at where the red isocontours intersect the black isocontours.

Q4. [16 pts] Kernel Principal Components Analysis
Let X be an n × d design matrix. Suppose that X has been centered, so the sample points in X have mean zero. In this problem
we consider kernel PCA and show that it equates to solving a generalized Rayleigh quotient problem.

(a) [1 pt] Fill in the blank: every principal component direction for X is an eigenvector of ______. Answer: X⊤X.

(b) [1 pt] Fill in the blank: an optimization problem can be kernelized only if its solution w is always a linear combination of the sample points. In other words, we can write it in the form w = ______. Answer: w = X⊤a (for some vector a).

(c) [4 pts] Show that every principal component direction w with a nonzero eigenvalue is a linear combination of the sample points (even when n < d).

As w is an eigenvector of X⊤X with a nonzero eigenvalue, there is a λ ≠ 0 such that X⊤Xw = λw. Setting a = (1/λ)Xw, we have X⊤a = (1/λ)X⊤Xw = w.

(d) [4 pts] Let Φ(z) be a feature map that takes a point z ∈ R^d and maps it to a point Φ(z) ∈ R^D, where D might be extremely large or even infinite. But suppose that we can compute the kernel function k(x, z) = Φ(x) · Φ(z) much more quickly than we can compute Φ(x) directly. Let Φ(X) be the n × D matrix in which each sample point is replaced by a featurized point. By our usual convention, row i of X is Xi⊤, and row i of Φ(X) is Φ(Xi)⊤.
Remind us: what is the kernel matrix K? Answer this two ways: explain the relationship between K and the kernel function k(·, ·); then write the relationship between K and Φ(X). Lastly, show that these two definitions are equivalent.

K is the n × n matrix with components Kij = k(Xi, Xj). Also, K = Φ(X)Φ(X)⊤. These two characterizations are equivalent because Kij = k(Xi, Xj) = Φ(Xi) · Φ(Xj) is the inner product of row i of Φ(X) and column j of Φ(X)⊤, which implies that K = Φ(X)Φ(X)⊤.
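A quick numerical check of the equivalence, reusing the degree-2 polynomial kernel and a matching explicit feature map (both illustrative):

```python
import numpy as np

def k(x, z):
    return (x @ z + 1) ** 2

def phi(x):  # explicit map satisfying phi(x) . phi(z) = k(x, z)
    return np.concatenate(([1.0], np.sqrt(2) * x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))                       # n = 6 points in R^3

K1 = np.array([[k(xi, xj) for xj in X] for xi in X])  # K_ij = k(X_i, X_j)
PhiX = np.array([phi(xi) for xi in X])                # row i is phi(X_i)^T
K2 = PhiX @ PhiX.T                                    # K = Phi(X) Phi(X)^T
print(np.allclose(K1, K2))                            # True
```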

(e) [2 pts] Fill in the blank: the first principal component direction of the featurized design matrix Φ(X) is any nonzero vector w ∈ R^D that maximizes the Rayleigh quotient, which is ______.

Answer: (w⊤Φ(X)⊤Φ(X)w) / (w⊤w).

(f) [4 pts] Show that the problem of maximizing this Rayleigh quotient is equivalent to maximizing

(a⊤Ba) / (a⊤Ca)

for some positive semidefinite matrices B, C ∈ R^{n×n}, where a ∈ R^n is a vector of dual weights. This expression is called a generalized Rayleigh quotient. What are the matrices B and C? For full points, express them in a form that does not require any direct computation of the feature vectors Φ, which could be extremely long.

Substituting w = Φ(X)⊤a,

(w⊤Φ(X)⊤Φ(X)w) / (w⊤w) = (a⊤Φ(X)Φ(X)⊤Φ(X)Φ(X)⊤a) / (a⊤Φ(X)Φ(X)⊤a) = (a⊤K²a) / (a⊤Ka).

Hence B = K² and C = K.
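A numerical spot-check of this identity, taking Φ to be the identity map so that K = XX⊤ (illustrative; any dual weight vector a works):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
X -= X.mean(axis=0)              # center the points, as the problem assumes
K = X @ X.T                      # kernel matrix when Phi is the identity

a = rng.standard_normal(5)       # arbitrary dual weights
w = X.T @ a                      # w = Phi(X)^T a

primal = (w @ (X.T @ X) @ w) / (w @ w)
dual = (a @ (K @ K) @ a) / (a @ K @ a)
print(np.isclose(primal, dual))  # True
```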

Q5. [12 pts] Spectral Graph Clustering
Let’s apply spectral graph clustering to this graph.

(Figure: a graph on four vertices 1, 2, 3, 4 with unit-weight edges 1-2, 1-3, 2-3, 2-4, and 3-4.)

(a) [4 pts] Write the Laplacian matrix L for this graph. All the edges have weight 1.
 2 −1 −1 0 
 
−1 3 −1 −1
 
−1 −1 3 −1
0 −1 −1 2

(b) [2 pts] Consider the minimum bisection problem, where we find an indicator vector y that minimizes y⊤Ly, subject to the balance constraint 1⊤y = 0 and the strict binary constraint ∀i, yi = 1 or yi = −1. Write an indicator vector y that represents a minimum bisection of this graph.

Any one of y = (1, −1, 1, −1)⊤, (−1, 1, −1, 1)⊤, (1, 1, −1, −1)⊤, or (−1, −1, 1, 1)⊤ will do.

(c) [4 pts] Suppose we relax (discard) the binary constraint and replace it with the weaker constraint y⊤y = constant, permitting y to have real-valued components. (We keep the balance constraint.) What indicator vector is a solution to the relaxed optimization problem? What is its eigenvalue?
Hint: Look at the symmetries of the graph. Given that the continuous values of the yi's permit some of the vertices to be at or near zero, what symmetry do you think would minimize the continuous-valued cut? Guess and then check whether it's an eigenvector.

y = (1, 0, 0, −1)⊤ or (−1, 0, 0, 1)⊤ or any nonzero multiple of these. The eigenvalue is 2.
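A short NumPy check of parts (a) through (c) (illustrative):

```python
import numpy as np

L = np.array([[ 2, -1, -1,  0],
              [-1,  3, -1, -1],
              [-1, -1,  3, -1],
              [ 0, -1, -1,  2]], dtype=float)

y = np.array([1.0, 0.0, 0.0, -1.0])
print(np.allclose(L @ y, 2 * y))  # True: eigenvector with eigenvalue 2
print(y.sum() == 0.0)             # True: satisfies the balance constraint

# Sorted spectrum of L; 0 is always an eigenvalue (the all-ones vector).
print(np.round(np.linalg.eigvalsh(L), 6))  # approximately [0, 2, 4, 4]
```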

(d) [2 pts] If we apply the sweep cut to find a cut with good sparsity, what two clusters do we get? Is it a bisection?
The sweep cut either puts vertex 1 in a subgraph by itself, or vertex 4 in a subgraph by itself. It does not choose a
bisection.

Q6. [17 pts] Learning Mixtures of Gaussians with k-Means
Let X1, . . . , Xn ∈ R^d be independent, identically distributed points sampled from a mixture of two normal (Gaussian) distributions. They are drawn independently from the probability distribution function (PDF)

p(x) = θ N1(x) + (1 − θ) N2(x), where N1(x) = (1/(√(2π))^d) e^{−‖x−µ1‖²/2} and N2(x) = (1/(√(2π))^d) e^{−‖x−µ2‖²/2}

are the PDFs for the isotropic multivariate normal distributions N(µ1, 1) and N(µ2, 1), respectively. The parameter θ ∈ (0, 1) is called the mixture proportion. In essence, we flip a biased coin to decide whether to draw a point from the first Gaussian (with probability θ) or the second (with probability 1 − θ).

Each data point is generated as follows. First draw a random Zi , which has value 1 with probability θ, and has value 2 with
probability 1 − θ. Then, draw Xi ∼ N(µZi , 1). Our learning algorithm gets Xi as an input, but does not know Zi .

Our goal is to find the maximum likelihood estimates of the three unknown distribution parameters θ ∈ (0, 1), µ1 ∈ Rd , and
µ2 ∈ Rd from the sample points X1 , . . . , Xn . Unlike MLE for one Gaussian, it is not possible to give explicit analytic formulas
for these estimates. Instead, we develop a variant of k-means clustering which (often) converges to the correct maximum
likelihood estimates of θ, µ1 , and µ2 . This variant doesn’t assign each point entirely to one cluster; rather, each point is assigned
an estimated posterior probability of coming from normal distribution 1.

(a) [4 pts] Let τi = P(Zi = 1|Xi ). That is, τi is the posterior probability that point Xi has Zi = 1. Use Bayes’ Theorem to
express τi in terms of Xi , θ, µ1 , µ2 , and the Gaussian PDFs N1 (x) and N2 (x). To help you with part (c), also write down a
similar formula for 1 − τi , which is the posterior probability that Zi = 2.
Bayes' Theorem implies that

τi = θN1(Xi) / (θN1(Xi) + (1 − θ)N2(Xi)),   1 − τi = (1 − θ)N2(Xi) / (θN1(Xi) + (1 − θ)N2(Xi)).

(b) [3 pts] Write down the log-likelihood function, ℓ(θ, µ1, µ2; X1, . . . , Xn) = ln p(X1, . . . , Xn), as a summation. Note: it doesn't simplify much.

Because the samples are i.i.d., p(X1, . . . , Xn) = Π_{i=1}^n p(Xi), so

ℓ(θ, µ1, µ2; X1, . . . , Xn) = Σ_{i=1}^n ln(θN1(Xi) + (1 − θ)N2(Xi)).

(c) [3 pts] Express ∂ℓ/∂θ in terms of θ and τi, i ∈ {1, . . . , n}, and simplify as much as possible. There should be no normal PDFs explicitly in your solution, though the τi's may implicitly use them. Hint: Recall that (ln f(x))′ = f′(x)/f(x).

∂ℓ/∂θ = Σ_{i=1}^n (τi/θ − (1 − τi)/(1 − θ)) = (1/(θ − θ²)) Σ_{i=1}^n (τi − θ) = ((Σ_{i=1}^n τi) − θn) / (θ − θ²).

(Any of these expressions is simple enough to receive full marks.)

(d) [4 pts] Express ∇_{µ1}ℓ in terms of µ1 and τi, Xi, i ∈ {1, . . . , n}. Do the same for ∇_{µ2}ℓ (but in terms of µ2 rather than µ1). Again, there should be no normal PDFs explicitly in your solution, though the τi's may implicitly use them.
Hint: It will help (and get you part marks) to first write ∇_{µ1}N1(x) as a function of N1(x), x, and µ1.

Using ∇_{µ1}N1(x) = N1(x)(x − µ1),

∇_{µ1}ℓ = Σ_{i=1}^n θ∇_{µ1}N1(Xi) / (θN1(Xi) + (1 − θ)N2(Xi))
        = Σ_{i=1}^n [θN1(Xi) / (θN1(Xi) + (1 − θ)N2(Xi))] (Xi − µ1)
        = Σ_{i=1}^n τi(Xi − µ1).

Similarly,

∇_{µ2}ℓ = Σ_{i=1}^n (1 − τi)(Xi − µ2).

(e) [3 pts] We conclude: if we know µ1 , µ2 , and θ, we can compute the posteriors τi . On the other hand, if we know the τi ’s,
we can estimate µ1 , µ2 , and θ by using the derivatives in parts (c) and (d) to find the maximum likelihood estimates. This
leads to the following k-means-like algorithm.
• Initialize τ1 , τ2 , . . . , τn to arbitrary values in the range [0, 1].
• Repeat the following two steps.
1. Update the Gaussian cluster parameters: for fixed values of τ1 , τ2 , . . . , τn , update µ1 , µ2 , and θ.
2. Update the posterior probabilities: for fixed values of µ1 , µ2 and θ, update τ1 , τ2 , . . . , τn .
In part (a), you wrote the update rule for step 2. Using your results from parts (c) and (d), write down the explicit update
formulas for step 1.

µ1 ← (Σ_{i=1}^n τiXi) / (Σ_{i=1}^n τi),   µ2 ← (Σ_{i=1}^n (1 − τi)Xi) / (Σ_{i=1}^n (1 − τi)),   θ ← (1/n) Σ_{i=1}^n τi.
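A sketch of the whole algorithm for a one-dimensional mixture (d = 1; the data, seed, and iteration count are ours). This procedure coincides with the EM algorithm for this model.

```python
import numpy as np

def N(x, mu):  # unit-variance Gaussian PDF in one dimension
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(0)
n = 1000
Z = rng.random(n) < 0.3                # true theta = 0.3
X = np.where(Z, rng.normal(-2.0, 1.0, n), rng.normal(3.0, 1.0, n))

tau = rng.random(n)                    # arbitrary init in [0, 1]
for _ in range(100):
    # Step 1: update theta, mu1, mu2 for fixed posteriors (parts (c), (d)).
    theta = tau.mean()
    mu1 = (tau * X).sum() / tau.sum()
    mu2 = ((1 - tau) * X).sum() / (1 - tau).sum()
    # Step 2: update posteriors tau_i for fixed parameters (part (a)).
    p1 = theta * N(X, mu1)
    p2 = (1 - theta) * N(X, mu2)
    tau = p1 / (p1 + p2)

# Typically recovers roughly (0.3, -2, 3), possibly with clusters swapped.
print(round(theta, 2), round(mu1, 2), round(mu2, 2))
```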
