10-701 Introduction to Machine Learning

Homework 2, version 1.0 Due Oct 16, 11:59 am

Rules:

1. Homework submission is done via the CMU Autolab system. Please package your writeup and code into
a zip or tar file, e.g., let submit.zip contain writeup.pdf and ps2_code/*.m. Submit the package to
https://autolab.cs.cmu.edu/courses/10701-f15.
2. As on conference submission websites, repeated submission is allowed, so please feel free to refine your
answers. We will only grade the latest version.
3. Autolab may allow submission after the deadline; note, however, that this is only because of the late-day
policy. Please see the course website for the policy on late submissions.
4. We recommend that you typeset your homework using appropriate software such as LaTeX. If you are
writing by hand, please make sure your homework is cleanly written up and legible. The TAs will not invest
undue effort to decrypt bad handwriting.
5. You are allowed to collaborate on the homework, but you should write up your own solution and code.
Please indicate your collaborators in your submission.

1 Bayes Optimal Classification (20 Points) (Yan)
In classification, the loss function we usually want to minimize is the 0/1 loss:

\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}    (1)

where f (x), y ∈ {0, 1} (i.e., binary classification). In this problem we will consider the effect of using an
asymmetric loss function:

\ell_{\alpha,\beta}(f(x), y) = \alpha \mathbf{1}\{f(x) = 1, y = 0\} + \beta \mathbf{1}\{f(x) = 0, y = 1\}.    (2)

Under this loss function, the two types of errors receive different weights, determined by α, β > 0.
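For concreteness, the loss in Eq. (2) can be evaluated with a one-line function. The following
Octave/MATLAB sketch is only an illustration (the function name asym_loss and its arguments are
hypothetical, not part of the assignment):

% Minimal sketch: the asymmetric loss of Eq. (2) for a single prediction.
% fx and y are in {0, 1}; alpha and beta are the penalty weights.
function L = asym_loss(fx, y, alpha, beta)
  L = alpha * (fx == 1 && y == 0) + beta * (fx == 0 && y == 1);
end

% Example: with alpha = 5 and beta = 1, a false positive costs five times
% as much as a false negative:
% asym_loss(1, 0, 5, 1)   % returns 5
% asym_loss(0, 1, 5, 1)   % returns 1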

1. Determine the Bayes optimal classifier, i.e., the classifier that achieves minimum risk assuming P(x, y)
is known, for the loss ℓ_{α,β} where α, β > 0.
2. Suppose that the class y = 0 is extremely uncommon (i.e., P (y = 0) is small). This means that the
classifier f (x) = 1 for all x will have good risk. We may try to put the two classes on even footing by
considering the risk:

R = P (f (x) = 1 | y = 0) + P (f (x) = 0 | y = 1) . (3)

Show how this risk is equivalent to choosing a certain α, β and minimizing the risk where the loss
function is ℓ_{α,β}.
3. Consider the following classification problem. I first choose the label Y ∼ Ber(1/2), which is 1 with
probability 1/2. If Y = 1, then X ∼ Ber(p); otherwise, X ∼ Ber(q). Assume that p > q. What is the
Bayes optimal classifier, and what is its risk?
4. Now consider the regular 0/1 loss ℓ, and assume that P(y = 0) = P(y = 1) = 1/2. Also, assume that
the class-conditional densities are Gaussian with mean µ0 and covariance Σ0 under class 0, and mean
µ1 and covariance Σ1 under class 1. Further, assume that µ0 = µ1.
For the following case, draw contours of the level sets of the class-conditional densities and label them
with p(x | y = 0) and p(x | y = 1). Also, draw the decision boundary obtained using the Bayes optimal
classifier and indicate the regions where the classifier will predict class 0 and where it will
predict class 1.
   
\Sigma_0 = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}, \qquad \Sigma_1 = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}.    (4)

2 Regularized Linear Regression Using Lasso (20 Points) (Yan)


Lasso is a form of regularized linear regression, where the L1 norm of the parameter vector is penalized. It
is used in an attempt to obtain a sparse parameter vector, in which features of little “importance” are assigned
zero weight. But why does lasso encourage sparse parameters? In this question, you are going to examine
this.
Let X denote an n × d matrix whose rows are training points, y an n × 1 vector of corresponding
output values, w a d × 1 parameter vector, and w^* the optimal parameter vector. To make
the analysis easier we will consider the special case where the training data is whitened (i.e., X^T X = I).
For lasso regression, the optimal parameter vector is given by
w^* = \operatorname*{argmin}_w \; \tfrac{1}{2}\|y - Xw\|^2 + \lambda \|w\|_1,    (5)
where λ > 0.
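For reference, the objective in Eq. (5) is easy to evaluate numerically. The following Octave/MATLAB
sketch is only illustrative; the variable names X, y, w, and lambda are placeholders rather than provided
starter code:

% Minimal sketch: evaluate the lasso objective of Eq. (5).
% X is n-by-d, y is n-by-1, w is d-by-1, lambda > 0.
lasso_objective = @(w, X, y, lambda) ...
    0.5 * sum((y - X * w).^2) + lambda * sum(abs(w));

% Whitened data satisfies X' * X = I (up to numerical error), e.g.:
% norm(X' * X - eye(size(X, 2)), 'fro')   % should be close to 0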

1. Show that whitening the training data nicely decouples the features, making w_i^* determined by the ith
feature and the output, regardless of the other features. To show this, write J_λ(w) in the form

   J_\lambda(w) = g(y) + \sum_{i=1}^{d} f(X_{\cdot i}, y, w_i, \lambda),    (6)

where X_{\cdot i} is the ith column of X.


2. Assuming that w_i^* > 0, what is the value of w_i^* in this case?
3. Assuming that w_i^* < 0, what is the value of w_i^* in this case?
4. From 2 and 3, what is the condition for w_i^* to be 0? How can you interpret that condition?
5. Now consider ridge regression, where the regularization term is replaced by (λ/2)‖w‖₂². What is the
condition for w_i^* = 0? How does it differ from the condition you obtained in 4?

3 Multinomial Logistic Regression (20 Points) (Yan)


Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass
problems. It has the form

   p(y = c \mid x, W) = \frac{\exp(w_{c0} + w_c^\top x)}{\sum_{k=1}^{C} \exp(w_{k0} + w_k^\top x)},    (7)

where C is the number of classes, W is a C × (d + 1) weight matrix, and d is the dimension of the input
vector x. In other words, the rows of W are the weight vectors of the individual classes.
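As a sanity check on Eq. (7), the class probabilities can be computed in a few lines. The following
Octave/MATLAB sketch is only illustrative; it assumes the first column of W holds the intercepts w_{c0}
and that a leading 1 is prepended to x to absorb them:

% Minimal sketch: class probabilities of Eq. (7) for a single input x.
% W is C-by-(d+1), with the intercepts w_{c0} in its first column; x is d-by-1.
function p = softmax_probs(W, x)
  a = W * [1; x];              % scores w_{c0} + w_c' * x, one per class
  a = a - max(a);              % subtract the max for numerical stability
  p = exp(a) / sum(exp(a));    % normalize so the probabilities sum to 1
end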
1. Show that in the special case where C = 2, the multinomial logistic regression reduces to logistic
regression.
2. In the training process of the multinomial logistic regression model, we are given a set of training
data {x_i, y_i}_{i=1}^n, and we want to learn a set of weight vectors that maximizes the conditional likelihood
of the output labels {y_i}_{i=1}^n given the input data {x_i}_{i=1}^n and W. That is, we want to solve the
following optimization problem (assuming the data points are i.i.d.):

   W^* = \operatorname*{argmax}_W \prod_{i=1}^{n} P(y_i \mid x_i, W)    (8)

In order to solve this optimization problem, most numerical solvers require that we provide a function
that, given some weights, computes the objective function value and the gradient of that objective function
(i.e., its first derivative). Some solvers also require the Hessian of the objective function (i.e.,
its second derivative). So, in order to implement the algorithm we need to derive these functions.
(a) Derive the conditional log-likelihood function for the multinomial logistic regression model. You
may denote this function as ℓ(W).
(b) Derive the gradient of ℓ(W) with respect to the weight vector of class c (w_c). That is, derive
∇_{w_c} ℓ(W). You may denote this function as the gradient g_c(W). Note: The gradient of a function
f(x) with respect to a vector x is also a vector, whose i-th entry is defined as ∂f(x)/∂x_i, where x_i is
the i-th element of x.
(c) Derive the block submatrix of the Hessian with respect to the weight vectors of class c (w_c) and class
c′ (w_{c′}). You may denote this function as H_{c,c′}(W). Note: The Hessian of a function f(x) with
respect to a vector x is a matrix, whose {i, j}-th entry is defined as ∂²f(x)/(∂x_i ∂x_j). In this case, we are
asking for a block submatrix of the Hessian of the conditional log-likelihood function, taken with respect
to only two classes c and c′. The {i, j}-th entry of the submatrix is defined as ∂²ℓ(W)/(∂w_{ci} ∂w_{c′j}).

4 Perceptron Mistake Bounds (20 Points) (Xun)
Suppose {(x_i, y_i) : x_i ∈ R^n, y_i ∈ {+1, −1}, i = 1, . . . , m} can be linearly separated by a margin γ > 0, i.e.

   \exists w \in \mathbb{R}^n \ \text{s.t.} \ \|w\|_2 = 1, \ \langle y_i x_i, w \rangle \geq \gamma, \ \forall i = 1, \ldots, m,    (9)

where ⟨a, b⟩ = a^T b is the dot product between two vectors. Further assume ‖x_i‖₂ ≤ M, ∀i. Recall that
the Perceptron algorithm starts from w^(0) = 0 and updates w^(t) = w^(t−1) + y^(t) x^(t), where (x^(t), y^(t)) is the t-th
misclassified example. We will prove that the Perceptron learns an exact classifier in finitely many steps.
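For intuition, the update rule above corresponds to the following loop. This is only an illustrative
Octave/MATLAB sketch under assumed conventions (X holds one example per row, y holds labels in {+1, −1});
it is not part of the assignment:

% Minimal sketch of the Perceptron updates described above.
% X is m-by-n (one example per row), y is m-by-1 with entries in {+1, -1}.
function w = perceptron_train(X, y, max_passes)
  [m, n] = size(X);
  w = zeros(n, 1);                     % w^(0) = 0
  for pass = 1:max_passes
    mistakes = 0;
    for i = 1:m
      if y(i) * (X(i, :) * w) <= 0     % example i is misclassified
        w = w + y(i) * X(i, :)';       % w^(t) = w^(t-1) + y^(t) x^(t)
        mistakes = mistakes + 1;
      end
    end
    if mistakes == 0, break; end       % no mistakes in a full pass: done
  end
end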

1. Show that ⟨w^(t), w⟩ ≥ tγ.

2. Show that ‖w^(t)‖₂² ≤ tM².

3. Using the results above, show that the number of updates t is upper bounded by M²/γ².
4. True or False: when zero error is achieved, the classifier always has margin γ. Explain briefly.

5 Logistic Regression for Image Classification (20 Points) (Xun)


In this problem, we will implement logistic regression for image classification (almost) from scratch. We
will use MNIST (http://yann.lecun.com/exdb/mnist/), which contains a total of 70,000 handwritten digits
from 0 to 9. Although initially released in 1998, MNIST is still one of the most widely used benchmark data
sets for image classification.

5.1 Exploring the data


It is often good practice to take a careful look at the data before modeling. Run download_mnist.sh and
visualize_mnist.m, and explore the following properties of the images and labels:
1. size of each image

2. range of labels
3. range of pixel values
4. maximum and minimum ℓ₂-norm of the images

5. whether the data is sparse or dense


6. whether the label distribution is skewed or uniform
Please append your code to visualize_mnist.m.
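Once the images are loaded, properties like the ones above can be checked with a handful of commands.
The sketch below is only illustrative; the variable names X (one image per row) and y (labels) are
assumptions, not necessarily the names used in visualize_mnist.m:

% Minimal sketch: inspecting the data, assuming X is N-by-P (one image per row)
% and y is N-by-1. These variable names are placeholders.
disp(size(X));                              % image count and pixels per image
fprintf('label range: %d to %d\n', min(y), max(y));
fprintf('pixel range: %g to %g\n', min(X(:)), max(X(:)));
norms = sqrt(sum(X.^2, 2));                 % l2-norm of each image
fprintf('l2-norm range: %g to %g\n', min(norms), max(norms));
fprintf('fraction of nonzero pixels: %g\n', nnz(X) / numel(X));
hist(y, numel(unique(y)));                  % label distribution: skewed or uniform?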

5.2 Binary logistic regression


Let’s start with a simple binary logistic regression for classifying ONLY between the digits 3 and 8. In lr.m
we have outlined the training and testing process, as well as some preprocessing routines, such as selecting
3’s and 8’s and normalizing data to have zero mean and unit variance. Your goal is to perform the following
steps:

1. Complete the function s = sigmoid(a). Note that a and s could be vectors.

2. Complete the function [f, g] = oracle_lr(w, X, y), where w is the weight vector, X is the set of
images, and y is the set of labels. This function returns the objective f and the gradient g.

3. Complete the function err = grad_check(oracle, t). This will help you check whether the oracle
implementation is correct. First recall the definition of the derivative:

   g(t) = \lim_{h \to 0} \frac{f(t + h) - f(t)}{h}.    (10)

The idea is to check analytic gradients against numerical gradients. For a small h, say h ≈ 10⁻⁶, the
numerical estimate

   \hat{g}_j(t) = \frac{f(t + h e_j) - f(t - h e_j)}{2h} \approx g_j(t),    (11)

for all j ∈ {1, . . . , d}, where e_j is the unit vector along the j-th coordinate. If the oracle is implemented
correctly, then we should expect a small (e.g., ≈ 10⁻⁶) average error

   err = \frac{1}{d} \sum_{j=1}^{d} |\hat{g}_j(t) - g_j(t)|.    (12)

Try running oracle_lr_test.m to see if your oracle can pass the test. (A minimal sketch of the numerical
estimate in Eq. (11) appears after this list.)
4. Complete the function w = optimize_lr(w0, X, y), where w0 is the initial value. You will implement
a simple gradient descent/ascent algorithm to find the best parameter w.

5. Complete the function acc = binary_accuracy(w, X, y). This function returns the fraction of correct
predictions of classifier w on data X.
6. Run lr.m, report the number of iterations, final objective function value, final ‖w‖₂², training accuracy,
and test accuracy. You should be able to get ≥ 95% test accuracy.
7. Modify oracle_lr.m to return the ℓ₂-regularized objective and gradient, with tuning parameter λ. Specifically,
subtract (λ/2)‖w‖₂² from the objective if it is the log-likelihood, or add it if it is the negative
log-likelihood. Report the number of iterations, final objective function value, final ‖w‖₂²,
training accuracy, and test accuracy. Briefly summarize your observations.
(Note: you can check your implementation again with oracle_lr_test.m.)
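The central-difference estimate in Eq. (11) takes only a few lines to implement. The following is a
minimal Octave/MATLAB sketch of the idea, not the required grad_check interface from the starter code
(the function name numerical_grad_check and the fixed step size h = 1e-6 are assumptions for illustration):

% Minimal sketch of the central-difference check in Eqs. (11)-(12).
% oracle is a function handle returning [f, g]; t is the point at which to check.
function err = numerical_grad_check(oracle, t)
  h = 1e-6;
  d = numel(t);
  [~, g] = oracle(t);                      % analytic gradient from the oracle
  g_hat = zeros(d, 1);
  for j = 1:d
    e_j = zeros(d, 1); e_j(j) = 1;         % unit vector along the j-th coordinate
    f_plus  = oracle(t + h * e_j);         % f(t + h e_j)
    f_minus = oracle(t - h * e_j);         % f(t - h e_j)
    g_hat(j) = (f_plus - f_minus) / (2 * h);
  end
  err = mean(abs(g_hat - g));              % average error of Eq. (12)
end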

5.3 Multiclass logistic regression


Now let’s learn a classifier for ALL digits using multiclass logistic regression derived above. Again we have
the algorithm outline and preprocessing in mlr.m. Notice that the labels are now shifted by 1, so that they
are 1-indexed. Your goal is to perform the following steps:

1. Complete the function [f, g] = oracle_mlr(W0, X, y). In particular, implement ℓ₂-regularization
for each class, again with λ being the tuning parameter, i.e., include (λ/2)‖W‖_F² in your objective.
(Note: you can check your implementation with oracle_mlr_test.m.)

2. Complete the function W = optimize_mlr(W0, X, y).


3. Complete the function acc = multiclass_accuracy(W, X, y).
4. Run mlr.m, report the number of iterations, final objective function value, final ‖W‖_F², training accuracy,
and test accuracy. Also include the visualization of the learned weights. For this task you should
be able to get ≥ 92% test accuracy.
