Homework 2 v1.0
Rules:
1. Homework submission is done via the CMU Autolab system. Please package your writeup and code into a zip or tar file, e.g., let submit.zip contain writeup.pdf and ps2_code/*.m. Submit the package to https://autolab.cs.cmu.edu/courses/10701-f15.
2. As with conference submission sites, repeated submission is allowed, so please feel free to refine your answers. We will only grade the latest version.
3. Autolab may allow submission after the deadline; note, however, that this is only because of the late-day policy. Please see the course website for the policy on late submissions.
4. We recommend that you typeset your homework using appropriate software such as LaTeX. If you are writing by hand, please make sure your homework is cleanly written up and legible. The TAs will not invest undue effort to decipher bad handwriting.
5. You are allowed to collaborate on the homework, but you should write up your own solution and code.
Please indicate your collaborators in your submission.
1 Bayes Optimal Classification (20 Points) (Yan)
In classification, the loss function we usually want to minimize is the 0/1 loss:

ℓ(f(x), y) = 1{f(x) ≠ y},

where f(x), y ∈ {0, 1} (i.e., binary classification). In this problem we will consider the effect of using an asymmetric loss function:

ℓα,β(f(x), y) = α 1{f(x) = 1, y = 0} + β 1{f(x) = 0, y = 1}.

Under this loss function, the two types of errors receive different weights, determined by α, β > 0.
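As a quick numerical companion to the asymmetric loss, the following Python sketch (a hypothetical helper, not part of the assignment's MATLAB starter code) computes the empirical risk, assuming the convention that α weights predicting 1 when y = 0 and β weights predicting 0 when y = 1:

```python
import numpy as np

def asymmetric_risk(y_true, y_pred, alpha, beta):
    """Empirical risk under an asymmetric loss: alpha penalizes
    predicting 1 when the true label is 0, and beta penalizes
    predicting 0 when the true label is 1 (an assumed convention)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_pos = (y_pred == 1) & (y_true == 0)  # weighted by alpha
    false_neg = (y_pred == 0) & (y_true == 1)  # weighted by beta
    return np.mean(alpha * false_pos + beta * false_neg)
```

Setting α = β = 1 recovers the usual 0/1 misclassification rate.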
1. Determine the Bayes optimal classifier, i.e., the classifier that achieves minimum risk assuming P(x, y) is known, for the loss ℓα,β where α, β > 0.
2. Suppose that the class y = 0 is extremely uncommon (i.e., P(y = 0) is small). This means that the classifier f(x) = 1 for all x will have good risk. We may try to put the two classes on even footing by considering the balanced risk:

R = 1/2 P(f(x) = 1 | y = 0) + 1/2 P(f(x) = 0 | y = 1).

Show how this risk is equivalent to choosing a certain α, β and minimizing the risk where the loss function is ℓα,β.
3. Consider the following classification problem. I first choose the label Y ∼ Ber(1/2), which is 1 with probability 1/2. If Y = 1, then X ∼ Ber(p); otherwise, X ∼ Ber(q). Assume that p > q. What is the Bayes optimal classifier, and what is its risk?
4. Now consider the regular 0/1 loss ℓ, and assume that P(y = 0) = P(y = 1) = 1/2. Also, assume that the class-conditional densities are Gaussian with mean µ0 and covariance Σ0 under class 0, and mean µ1 and covariance Σ1 under class 1. Further, assume that µ0 = µ1.
For the following case, draw contours of the level sets of the class-conditional densities and label them with p(x | y = 0) and p(x | y = 1). Also, draw the decision boundaries obtained using the Bayes optimal classifier in each case and indicate the regions where the classifier will predict class 0 and where it will predict class 1.
Σ0 = [1 0; 0 4],   Σ1 = [4 0; 0 1].   (4)
1. Show that whitening the training data nicely decouples the features, making wi* determined by the i-th feature and the output regardless of other features. To show this, write Jλ(w) in the form

Jλ(w) = g(y) + Σ_{i=1}^{d} f(X·i, y, wi, λ),   (6)
where C is the number of classes, W is a C × (d + 1) weight matrix, and d is the dimension of the input vector x. In other words, W is a matrix whose rows are the weight vectors for each class.
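The standard multinomial logistic regression model referred to here uses the softmax parameterization P(y = c | x, W) = exp(⟨wc, x⟩) / Σ_{c′} exp(⟨wc′, x⟩). A minimal Python sketch for intuition (the leading-1 bias augmentation is an assumed convention, chosen to match the C × (d + 1) shape of W):

```python
import numpy as np

def softmax_probs(W, x):
    """Class probabilities under the softmax model. W is C x (d+1);
    x is a length-d input, augmented with a leading 1 for the bias
    term (an assumed convention matching the (d+1) columns of W)."""
    x_aug = np.concatenate(([1.0], x))   # prepend bias feature
    scores = W @ x_aug                   # one score per class
    scores -= scores.max()               # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()                   # normalize to probabilities
```

With W = 0 every class receives probability 1/C, which is a convenient sanity check.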
1. Show that in the special case where C = 2, multinomial logistic regression reduces to ordinary (binary) logistic regression.
2. In the training process of the multinomial logistic regression model, we are given a set of training data {(xi, yi)}_{i=1}^{n}, and we want to learn a set of weight vectors that maximize the conditional likelihood of the output labels {yi}_{i=1}^{n}, given the input data {xi}_{i=1}^{n} and W. That is, we want to solve the following optimization problem (assuming the data points are i.i.d.):

W* = argmax_W  Π_{i=1}^{n} P(yi | xi, W).   (8)
In order to solve this optimization problem, most numerical solvers require that we provide a function that computes the objective value at given weights, along with the gradient of that objective (i.e., its first derivative). Some solvers also require the Hessian of the objective (i.e., its second derivative). So, in order to implement the algorithm, we need to derive these functions.
(a) Derive the conditional log-likelihood function for the multinomial logistic regression model. You may denote this function as ℓ(W).
(b) Derive the gradient of ℓ(W) with respect to the weight vector of class c (wc). That is, derive ∇_{wc} ℓ(W). You may denote this function as gc(W). Note: the gradient of a function f(x) with respect to a vector x is also a vector, whose i-th entry is defined as ∂f(x)/∂x_i, where x_i is the i-th element of x.
(c) Derive the block submatrix of the Hessian with respect to the weight vectors of class c (wc) and class c′ (wc′). You may denote this function as Hc,c′(W). Note: the Hessian of a function f(x) with respect to a vector x is a matrix whose (i, j)-th entry is defined as ∂²f(x)/(∂x_i ∂x_j). In this case, we are asking for a block submatrix of the Hessian of the conditional log-likelihood function, taken with respect to only two classes c and c′. The (i, j)-th entry of the submatrix is defined as ∂²ℓ(W)/(∂w_{c,i} ∂w_{c′,j}).
4 Perceptron Mistake Bounds (20 Points) (Xun)
Suppose {(xi, yi) : xi ∈ Rn, yi ∈ {+1, −1}, i = 1, . . . , m} can be linearly separated by a margin γ > 0, i.e. there exists a unit vector w* (‖w*‖₂ = 1) such that

yi ⟨w*, xi⟩ ≥ γ,  ∀i,

where ⟨a, b⟩ = a⊤b is the dot product between two vectors. Further assume ‖xi‖₂ ≤ M, ∀i. Recall that the Perceptron algorithm starts from w(0) = 0 and updates w(t) = w(t−1) + y(t) x(t), where (x(t), y(t)) is the t-th misclassified example. We will prove that the Perceptron learns an exact classifier in finitely many steps.
3. Using the results above, show that the number of updates t is upper bounded by M²/γ².
4. True or False: when zero error is achieved, the classifier always has margin γ. Explain briefly.
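The update rule described above can be sketched directly. This is a hedged Python illustration, not the course's starter code; cycling over the data until a mistake-free pass is one possible (but not the only) schedule for visiting examples:

```python
import numpy as np

def perceptron(X, y, max_updates=1000):
    """Perceptron as described above: start from w = 0 and, whenever
    (x_i, y_i) is misclassified (y_i <w, x_i> <= 0), update
    w <- w + y_i x_i. Returns w and the number of updates made."""
    w = np.zeros(X.shape[1])
    updates = 0
    while updates < max_updates:
        made_mistake = False
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # misclassified (or on boundary)
                w = w + y_i * x_i          # Perceptron update
                updates += 1
                made_mistake = True
        if not made_mistake:               # a full clean pass: done
            return w, updates
    return w, updates
```

On linearly separable data with margin γ and ‖xi‖₂ ≤ M, the mistake bound in part 3 guarantees the loop terminates after at most M²/γ² updates.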
2. range of labels
3. range of pixel values
4. maximum and minimum ℓ2-norm of the images
2. Complete the function [f, g] = oracle_lr(w, X, y), where w is the weight vector, X is the set of images, and y is the set of labels. This function returns the objective value f and the gradient g.
3. Complete the function err = grad_check(oracle, t). This will help you check whether the oracle implementation is correct. First recall the definition of the derivative:

g(t) = lim_{h→0} [f(t + h) − f(t)] / h.   (10)

The idea is to check analytic gradients against numerical gradients. For a small h, say h ≈ 10⁻⁶, the numerical estimate

ĝ_j(t) = [f(t + h e_j) − f(t − h e_j)] / (2h) ≈ g_j(t),   (11)

for all j ∈ {1, . . . , d}, where e_j is the unit vector along the j-th coordinate. If the oracle is implemented correctly, then we should expect a small (e.g., ≈ 10⁻⁶) average error

err = (1/d) Σ_{j=1}^{d} |ĝ_j(t) − g_j(t)|.   (12)
Try running oracle_lr_test.m to see if your oracle can pass the test.
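For intuition, the central-difference check of Eqs. (10)–(12) can be sketched in Python (a hedged illustration, not the MATLAB starter code; here an oracle is any function returning a (value, gradient) pair):

```python
import numpy as np

def grad_check(oracle, t, h=1e-6):
    """Compare the oracle's analytic gradient at t against central
    differences, Eq. (11), and return the average absolute error,
    Eq. (12)."""
    t = np.asarray(t, dtype=float)
    _, g = oracle(t)                      # analytic gradient
    g_hat = np.empty_like(g)
    for j in range(t.size):
        e_j = np.zeros_like(t)
        e_j[j] = h                        # perturb only coordinate j
        f_plus, _ = oracle(t + e_j)
        f_minus, _ = oracle(t - e_j)
        g_hat[j] = (f_plus - f_minus) / (2 * h)
    return np.mean(np.abs(g_hat - g))

# sanity check on f(t) = ||t||^2 / 2, whose gradient is t itself
oracle = lambda t: (0.5 * t @ t, t)
err = grad_check(oracle, np.array([1.0, -2.0, 3.0]))
```

For a correct oracle the error is tiny (central differences are even exact for quadratics, up to floating-point noise); a buggy gradient shows up as an error many orders of magnitude larger.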
4. Complete the function w = optimize_lr(w0, X, y), where w0 is the initial value. You will implement a simple gradient descent/ascent algorithm to find the best parameter w.
5. Complete the function acc = binary_accuracy(w, X, y). This function returns the fraction of correct predictions made by classifier w on data X.
6. Run lr.m, and report the number of iterations, the final objective function value, the final ‖w‖₂², the training accuracy, and the test accuracy. You should be able to get ≥ 95% test accuracy.
7. Modify oracle_lr.m to return the ℓ2-regularized objective and gradient, with tuning parameter λ. Specifically, add/subtract (λ/2)‖w‖₂² to the objective, depending on whether the objective is the log-likelihood or the negative log-likelihood. Report the number of iterations, the final objective function value, the final ‖w‖₂², the training accuracy, and the test accuracy. Briefly summarize your observations.
(Note: you can check your implementation again with oracle_lr_test.m.)
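The regularization bookkeeping in this step is easy to get sign-wrong: the penalty must enter both the objective and the gradient with the same sign. A hedged Python sketch of the adjustment (the helper name and signature are illustrative, not part of the starter code):

```python
import numpy as np

def add_l2(f, g, w, lam, maximize=True):
    """Fold the l2 penalty (lam/2)||w||_2^2 into an oracle's output.
    If the objective is a log-likelihood being maximized, the penalty
    is subtracted; for a negative log-likelihood it is added."""
    sign = -1.0 if maximize else 1.0
    f_reg = f + sign * 0.5 * lam * (w @ w)  # penalized objective
    g_reg = g + sign * lam * w              # matching gradient term
    return f_reg, g_reg
```

The same pattern applies per class in the multinomial case, where the per-class penalties sum to (λ/2)‖W‖F².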
1. Complete the function [f, g] = oracle_mlr(W0, X, y). In particular, implement ℓ2-regularization for each class, again with λ being the tuning parameter, i.e., include (λ/2)‖W‖F² in your objective.
(Note: you can check your implementation with oracle_mlr_test.m.)