ps2 Sol

XCS assignment 2 solution

Uploaded by

nasoheel

CS229 Problem Set #2 1

CS 229, Summer 2020


Problem Set #2 Solutions
HUY PHAM (https://github.com/huyfam/cs229-solutions-2020)

Due Monday, July 27 at 11:59 pm on Gradescope.

Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at https://piazza.com/stanford/summer2020/cs229. (3)
This quarter, Summer 2020, students may submit in pairs. If you do so, make sure both names
are attached to the Gradescope submission. However, students are not allowed to work with
the same partner on more than one assignment. If you missed the first lecture or are unfamiliar
with the collaboration or honor code policy, please read the policy on the course website before
starting work. (4) For the coding problems, you may not use any libraries except those defined
in the provided environment.yml file. In particular, ML-specific libraries such as scikit-learn
are not permitted. (5) To account for late days, the due date is Monday, July 27 at 11:59 pm.
If you submit after Monday, July 27 at 11:59 pm, you will begin consuming your late days. If
you wish to submit on time, submit before Monday, July 27 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly recommend
typesetting your solutions via LaTeX, and we will award one bonus point for typeset
submissions. All students must also submit a zip file of their source code to Gradescope, which
should be created using the make_zip.py script. You should make sure to (1) restrict yourself
to only using libraries included in the environment.yml file, and (2) make sure your code runs
without errors. Your submission may be evaluated by the auto-grader using a private test set,
or used for verifying the outputs reported in the writeup.

1. [15 points] Logistic Regression: Training stability


In this problem, we will be delving deeper into the workings of logistic regression. The goal of
this problem is to help you develop your skills debugging machine learning algorithms (which
can be very different from debugging software in general).
We have provided an implementation of logistic regression in src/stability/stability.py,
and two labeled datasets A and B in src/stability/ds1_a.csv and src/stability/ds1_b.csv.
Please do not modify the code for the logistic regression training algorithm for this problem.
First, run the given logistic regression code to train two different models on A and B. You can
run the code by simply executing python stability.py in the src/stability directory.

(a) [2 points] What is the most notable difference in training the logistic regression model on
datasets A and B?
Answer:
Our model converges on dataset A but fails to converge on dataset B.

(b) [5 points] Investigate why the training procedure behaves unexpectedly on dataset B, but
not on A. Provide hard evidence (in the form of math, code, plots, etc.) to corroborate
your hypothesis for the misbehavior. Remember, you should address why your explanation
does not apply to A.
Hint: The issue is not a numerical rounding or over/underflow error.
Answer:

Figure 1: These datasets differ in linear separability. (a) Dataset A isn't linearly separable. (b) Dataset B is separated by the line $x_1 + x_2 = 1$.

Let $\theta$ be a parameter vector such that dataset B is completely separated by the hyperplane
$\theta^T x = 0$. Now, consider a scaled parameter $\theta' = c \cdot \theta$ with $c > 0$. Given an example $(x, y)$, we observe that:

• If $y = 1$, then $\theta^T x > 0$ and its loss w.r.t. $\theta'$ is: $-\log\left(\frac{1}{1 + \exp(-c \cdot \theta^T x)}\right)$.
• If $y = 0$, then $\theta^T x < 0$ and its loss w.r.t. $\theta'$ is: $-\log\left(1 - \frac{1}{1 + \exp(-c \cdot \theta^T x)}\right)$.

As $c \to \infty$, in both cases the loss strictly decreases toward 0 but never reaches it. So the
total cost is a convex function whose infimum (zero) is not attained by any finite $\theta$. Gradient
descent keeps increasing $\|\theta\|$ and fails to converge, since there is no global minimizer. Dataset A
is not linearly separable, so it does not have this property: scaling $\theta$ up eventually increases the
loss on the misclassified examples, and our optimizer converged as expected.
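The scaling argument can be checked numerically. The sketch below is only an illustration: the four points are a made-up separable toy set (not dataset B), and `logistic_loss` is our own helper, not the function in the provided stability.py.

```python
import numpy as np

def logistic_loss(theta, X, y):
    """Average logistic loss for labels y in {0, 1}."""
    # Signed margin: positive exactly when the example is classified correctly.
    margins = (2.0 * y - 1.0) * (X @ theta)
    # log(1 + exp(-margin)), computed stably.
    return np.mean(np.logaddexp(0.0, -margins))

# Hypothetical toy data: an intercept column plus two features,
# separated by the line x1 + x2 = 1.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.2, 0.3],
              [1.0, 1.0, 1.0],
              [1.0, 0.9, 0.6]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([-1.0, 1.0, 1.0])  # theta^T x = x1 + x2 - 1

# Scaling a separating theta by c always lowers the loss,
# so no finite theta can be a minimizer.
losses = [logistic_loss(c * theta, X, y) for c in (1, 10, 100)]
print(losses)
```

The printed losses shrink as c grows while staying positive, which is the behavior that keeps gradient descent from settling on a separable dataset.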

(c) [5 points] For each of these possible modifications, state whether or not it would lead to
the provided training algorithm converging on datasets such as B. Justify your answers.
i. Using a different constant learning rate.
ii. Decreasing the learning rate over time (e.g. scaling the initial learning rate by $1/t^2$,
where t is the number of gradient descent iterations thus far).
iii. Linear scaling of the input features.
iv. Adding a regularization term $\|\theta\|_2^2$ to the loss function.
v. Adding zero-mean Gaussian noise to the training data or labels.
Answer:

i. Modifying the selection scheme for learning rates doesn’t change the linear separability,
hence the model wouldn’t converge.
ii. Same answer as (i).
iii. Linear transformation doesn’t remove the separability of dataset B, hence the model
still wouldn’t converge.
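iv. Regularization helps in this case. With the penalty $\|\theta\|_2^2$ added, scaling $\theta$ up is no
longer free: the cost grows without bound as $\|\theta\| \to \infty$, so it has a finite minimizer that
the training algorithm can converge to.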
v. Adding noise terms might break the separability and make the model converge.

(d) [3 points] Are support vector machines vulnerable to datasets like B? Why or why not?
Give an informal justification.
Answer:
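The Support Vector Machine is not vulnerable to separable datasets like B. The hard-margin
SVM is designed for exactly this case: it finds the maximum-margin classifier for a linearly
separable dataset. Its objective minimizes the norm of the weights subject to a fixed functional
margin, so the solution is finite.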

2. [22 points] Spam classification


In this problem, we will use the naive Bayes algorithm and an SVM to build a spam classifier.
In recent years, spam on electronic media has been a growing concern. Here, we’ll build a
classifier to distinguish between real messages, and spam messages. For this class, we will be
building a classifier to detect SMS spam messages. We will be using an SMS spam dataset
developed by Tiago A. Almeida and José María Gómez Hidalgo which is publicly available on
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection (see footnote 1).
We have split this dataset into training and testing sets and have included them in this assignment
as src/spam/spam_train.tsv and src/spam/spam_test.tsv. See src/spam/spam_readme.txt
for more details about this dataset. Please refrain from redistributing these dataset files. The
goal of this assignment is to build a classifier from scratch that can tell the difference between spam
and non-spam messages using the text of the SMS message.

(a) [5 points] Implement code for processing the spam messages into numpy arrays that can
be fed into machine learning models. Do this by completing the get_words, create_dictionary,
and transform_text functions within our provided src/spam.py. Do note the corresponding
comments for each function for instructions on what specific processing is required.
The provided code will then run your functions and save the resulting dictionary into
spam_dictionary and a sample of the resulting training matrix into
spam_sample_train_matrix.
In your writeup, report the vocabulary size after the pre-processing step. You do not need
to include any other output for this subquestion.
Answer:
The vocabulary size is 1758.
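For reference, here is a minimal sketch of the three preprocessing functions. The exact rules live in the starter-code comments, which are not reproduced here, so the choices below are assumptions: words are lowercased and split on whitespace, and a word enters the dictionary only if it occurs in at least `min_count` messages (the default of 5 is our guess).

```python
import collections
import numpy as np

def get_words(message):
    """Lowercase a message and split it on whitespace (assumed normalization)."""
    return message.lower().split()

def create_dictionary(messages, min_count=5):
    """Map each sufficiently frequent word to a column index."""
    counts = collections.Counter()
    for message in messages:
        # Count each word at most once per message.
        counts.update(set(get_words(message)))
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: index for index, word in enumerate(kept)}

def transform_text(messages, word_dictionary):
    """Build the (num_messages, vocab_size) matrix of word counts."""
    matrix = np.zeros((len(messages), len(word_dictionary)), dtype=int)
    for row, message in enumerate(messages):
        for word in get_words(message):
            column = word_dictionary.get(word)
            if column is not None:  # words outside the dictionary are ignored
                matrix[row, column] += 1
    return matrix
```

The vocabulary size reported above is then `len(word_dictionary)` for the dictionary built from the training messages.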

(b) [10 points] In this question you are going to implement a naive Bayes classifier for spam
classification with multinomial event model and Laplace smoothing.
Code your implementation by completing the fit_naive_bayes_model and
predict_from_naive_bayes_model functions in src/spam/spam.py.
Now src/spam/spam.py should be able to train a Naive Bayes model, compute your prediction
accuracy and then save your resulting predictions to spam_naive_bayes_predictions.
In your writeup, report the accuracy of the trained model on the test set.
Remark. If you implement naive Bayes the straightforward way, you will find that the
computed $p(x|y) = \prod_i p(x_i|y)$ often equals zero. This is because $p(x|y)$, which is the
product of many numbers less than one, is a very small number. The standard computer
representation of real numbers cannot handle numbers that are too small, and instead
rounds them off to zero. (This is called “underflow.”) You’ll have to find a way to compute
Naive Bayes’ predicted class labels without explicitly representing very small numbers such
as p(x|y). [Hint: Think about using logarithms.]
Answer:
The accuracy obtained on the test set is 0.957.
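A minimal sketch of the multinomial event model with Laplace smoothing, working entirely in log space to avoid the underflow described in the remark. The function names follow the problem statement, but returning the model as a tuple of four arrays/scalars is our own assumption, not necessarily what the starter code expects.

```python
import numpy as np

def fit_naive_bayes_model(matrix, labels):
    """matrix: (n, V) token counts; labels: (n,) array with values in {0, 1}."""
    vocab_size = matrix.shape[1]
    spam = matrix[labels == 1]
    ham = matrix[labels == 0]
    # Laplace smoothing: add 1 to every token count and V to every class total.
    log_phi_spam = np.log((spam.sum(axis=0) + 1) / (spam.sum() + vocab_size))
    log_phi_ham = np.log((ham.sum(axis=0) + 1) / (ham.sum() + vocab_size))
    log_prior_spam = np.log(np.mean(labels == 1))
    log_prior_ham = np.log(np.mean(labels == 0))
    return log_phi_spam, log_phi_ham, log_prior_spam, log_prior_ham

def predict_from_naive_bayes_model(model, matrix):
    """Compare log p(x|y)p(y) for both classes; products become sums of logs."""
    log_phi_spam, log_phi_ham, log_prior_spam, log_prior_ham = model
    score_spam = matrix @ log_phi_spam + log_prior_spam
    score_ham = matrix @ log_phi_ham + log_prior_ham
    return (score_spam > score_ham).astype(int)
```

Because only the comparison of the two scores matters, the shared normalizer p(x) never has to be computed.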

1 Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New

Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG’11),
Mountain View, CA, USA, 2011.

(c) [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a
particular class. We can try to get an informal sense of how indicative token i is for the
SPAM class by looking at:
 
\[ \log \frac{p(x_j = i \mid y = 1)}{p(x_j = i \mid y = 0)} = \log \frac{P(\text{token } i \mid \text{email is SPAM})}{P(\text{token } i \mid \text{email is NOTSPAM})}. \]

Complete the get_top_five_naive_bayes_words function within the provided code using
the above formula in order to obtain the 5 most indicative tokens.
Report the top five words in your writeup.
Answer:
The five most indicative tokens are "urgent!", "prize", "tone", "won", "claim" (in increasing
order).
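The ranking itself is a one-liner once the per-class log probabilities are available. In this sketch `top_indicative_tokens` is a hypothetical helper (not the starter-code function), and its inputs are assumed to be the smoothed log token probabilities for each class plus the word-to-index dictionary.

```python
import numpy as np

def top_indicative_tokens(log_phi_spam, log_phi_ham, dictionary, k=5):
    """Return the k tokens with the largest log-ratio, most indicative first."""
    log_ratio = log_phi_spam - log_phi_ham  # log of the probability ratio
    index_to_word = {index: word for word, index in dictionary.items()}
    top = np.argsort(log_ratio)[::-1][:k]
    return [index_to_word[i] for i in top]
```

Note that this helper returns tokens in decreasing order of the log-ratio.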

(d) [2 points] Support vector machines (SVMs) are an alternative machine learning model that
we discussed in class. We have provided you an SVM implementation (using a radial basis
function (RBF) kernel) within src/spam/svm.py (You should not need to modify that
code).
One important part of training an SVM parameterized by an RBF kernel (a.k.a Gaussian
kernel) is choosing an appropriate kernel radius parameter.
Complete the compute_best_svm_radius function by writing code to compute the best SVM radius
which maximizes accuracy on the validation dataset. Report the best kernel radius you
obtained in the writeup.
Answer:
The optimal SVM radius is 0.1 with test accuracy 0.968.
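The selection is a plain search over candidate radii. In the sketch below, `accuracy_for_radius` is a hypothetical callable standing in for "train the provided SVM with this radius and return its validation accuracy"; the real svm.py interface is not shown in this writeup, so we do not call it directly.

```python
def compute_best_radius(radii, accuracy_for_radius):
    """Return the radius with the highest validation accuracy.

    accuracy_for_radius is an assumed callable: radius -> accuracy in [0, 1].
    Ties are broken in favor of the radius that appears first.
    """
    best_radius, best_accuracy = None, -1.0
    for radius in radii:
        accuracy = accuracy_for_radius(radius)
        if accuracy > best_accuracy:
            best_radius, best_accuracy = radius, accuracy
    return best_radius
```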

3. [18 points] Constructing kernels


In class, we saw that by choosing a kernel $K(x, z) = \phi(x)^T \phi(z)$, we can implicitly map data to
a high dimensional space, and have a learning algorithm (e.g SVM or logistic regression) work
in that space. One way to generate kernels is to explicitly define the mapping φ to a higher
dimensional space, and then work out the corresponding K.
However in this question we are interested in direct construction of kernels. I.e., suppose we
have a function K(x, z) that we think gives an appropriate similarity measure for our learning
problem, and we are considering plugging K into the SVM as the kernel function. However for
K(x, z) to be a valid kernel, it must correspond to an inner product in some higher dimensional
space resulting from some feature mapping φ. Mercer’s theorem tells us that K(x, z) is a (Mercer)
kernel if and only if for any finite set $\{x^{(1)}, \dots, x^{(n)}\}$, the square matrix $K \in \mathbb{R}^{n \times n}$ whose entries
are given by $K_{ij} = K(x^{(i)}, x^{(j)})$ is symmetric and positive semidefinite. You can find more details
about Mercer’s theorem in the notes, though the description above is sufficient for this problem.
In this question we are interested to see which operations preserve the validity of kernels.
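Let $K_1, K_2$ be kernels over $\mathbb{R}^d \times \mathbb{R}^d$, let $a \in \mathbb{R}^+$ be a positive real number, let $f : \mathbb{R}^d \mapsto \mathbb{R}$ be a
real-valued function, let $\phi : \mathbb{R}^d \to \mathbb{R}^p$ be a function mapping from $\mathbb{R}^d$ to $\mathbb{R}^p$, let $K_3$ be a kernel
over $\mathbb{R}^p \times \mathbb{R}^p$, and let $p(x)$ be a polynomial over x with positive coefficients.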
For each of the functions K below, state whether it is necessarily a kernel. If you think it is,
prove it; if you think it isn’t, give a counter-example.

(a) [1 points] $K(x, z) = K_1(x, z) + K_2(x, z)$
(b) [1 points] $K(x, z) = K_1(x, z) - K_2(x, z)$
(c) [1 points] $K(x, z) = aK_1(x, z)$
(d) [1 points] $K(x, z) = -aK_1(x, z)$
(e) [5 points] $K(x, z) = K_1(x, z)K_2(x, z)$
(f) [3 points] $K(x, z) = f(x)f(z)$
(g) [3 points] $K(x, z) = K_3(\phi(x), \phi(z))$
(h) [3 points] $K(x, z) = p(K_1(x, z))$

[Hint: For part (e), the answer is that K is indeed a kernel. You still have to prove it, though.
(This one may be harder than the rest.) This result may also be useful for another part of the
problem.]
Answer:
By definition, for each kernel $K_i$, there must be some feature map $\phi_i$ such that $K_i(x, z) =
\langle\phi_i(x), \phi_i(z)\rangle$. Moreover, recall that a kernel function, being an inner product, must be symmetric and
satisfy $K(x, x) = \langle\phi(x), \phi(x)\rangle = \|\phi(x)\|_2^2 \ge 0$. We will use these facts for the rest of this problem.

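(a)
\begin{align}
K(x, z) &= K_1(x, z) + K_2(x, z) \tag{1} \\
&= \langle\phi_1(x), \phi_1(z)\rangle + \langle\phi_2(x), \phi_2(z)\rangle \tag{2} \\
&= \left\langle \begin{bmatrix} \phi_1(x) \\ \phi_2(x) \end{bmatrix}, \begin{bmatrix} \phi_1(z) \\ \phi_2(z) \end{bmatrix} \right\rangle \tag{3}
\end{align}

Hence, K(x, z) is a valid kernel.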



(b) This is not necessarily a kernel. For example, take $K_1(x, z) = 1$ and $K_2(x, z) = 2$, which are
constant but valid kernels. Then $K(x, x) = 1 - 2 = -1 < 0$, which violates $K(x, x) \ge 0$, so K is
not a kernel.
(c) This function can be written as $K(x, z) = a\langle\phi_1(x), \phi_1(z)\rangle = \langle\sqrt{a}\,\phi_1(x), \sqrt{a}\,\phi_1(z)\rangle$ and so is
a valid kernel.
(d) As a counter-example, take $K_1(x, z) = 1$. Then $K(x, x) = -a < 0$, so K is not a valid
kernel.
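(e) Let $\phi^{(i)}$ denote the i-th component of a feature map, and assume for simplicity that $\phi_1$ and $\phi_2$ both have p components. We have:

\begin{align}
K(x, z) &= K_1(x, z)K_2(x, z) \tag{4} \\
&= \left(\sum_{i=1}^{p} \phi_1^{(i)}(x)\phi_1^{(i)}(z)\right)\left(\sum_{j=1}^{p} \phi_2^{(j)}(x)\phi_2^{(j)}(z)\right) \tag{5} \\
&= \sum_{i=1}^{p}\sum_{j=1}^{p} \left(\phi_1^{(i)}(x)\phi_2^{(j)}(x)\right)\left(\phi_1^{(i)}(z)\phi_2^{(j)}(z)\right) \tag{6} \\
&= \sum_{i=1}^{p}\sum_{j=1}^{p} \phi_{ij}(x)\phi_{ij}(z) \tag{7} \\
&= \left\langle \begin{bmatrix} \phi_{11}(x) \\ \vdots \\ \phi_{pp}(x) \end{bmatrix}, \begin{bmatrix} \phi_{11}(z) \\ \vdots \\ \phi_{pp}(z) \end{bmatrix} \right\rangle \tag{8}
\end{align}

In (7), we defined $\phi_{ij}(\cdot) = \phi_1^{(i)}(\cdot)\phi_2^{(j)}(\cdot)$. This result implies a legitimate kernel.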
(f) Think of the real-valued function $f(\cdot)$ as a one-component feature map. Then $K(x, z) = f(x)f(z) =
\langle f(x), f(z)\rangle$ is a kernel.
(g) $K(x, z) = K_3(\phi(x), \phi(z)) = \langle\phi_3(\phi(x)), \phi_3(\phi(z))\rangle$ is a kernel with feature map $\phi_3 \circ \phi$, where $\phi_3$ is the feature map of $K_3$.
(h) Without loss of generality, we can write $p(y) = a_0 + a_1 y + \dots + a_k y^k$, where the coefficients
$a_0, a_1, \dots, a_k > 0$ and $k \in \mathbb{N}$. Then,

\begin{align}
K(x, z) &= p(K_1(x, z)) \tag{9} \\
&= a_0 + a_1 K_1(x, z) + \dots + a_k (K_1(x, z))^k \tag{10}
\end{align}

The constant term $a_0$ is a kernel by (f) with $f(x) = \sqrt{a_0}$. By (e), each power $(K_1(x, z))^j$ is a
kernel, and by (c) so is each $a_j (K_1(x, z))^j$. By (a), their sum K(x, z) is also a valid kernel.
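These closure properties can be sanity-checked numerically through the Mercer condition: the Gram matrix of a valid kernel on any finite point set must be positive semidefinite. The sketch below is only an illustration on random points, not a proof; the two kernels, the point set, and the helper names are our own choices.

```python
import numpy as np

def gram(kernel, points):
    """Gram matrix with entries kernel(x_i, x_j)."""
    return np.array([[kernel(x, z) for z in points] for x in points])

def min_eigenvalue(matrix):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(matrix).min()

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))

k1 = lambda x, z: float(x @ z)                                 # linear kernel
k2 = lambda x, z: float(np.exp(-np.sum((x - z) ** 2) / 2.0))   # RBF kernel, sigma = 1

G1, G2 = gram(k1, points), gram(k2, points)
eig_sum = min_eigenvalue(G1 + G2)     # (a): stays PSD
eig_prod = min_eigenvalue(G1 * G2)    # (e): elementwise product stays PSD
eig_diff = min_eigenvalue(G1 - G2)    # (b): can have negative eigenvalues
eig_neg = min_eigenvalue(-2.0 * G1)   # (d): negative eigenvalues
print(eig_sum, eig_prod, eig_diff, eig_neg)
```

The Gram matrix of a product kernel is the elementwise (Hadamard) product of the two Gram matrices, which is why `G1 * G2` rather than `G1 @ G2` is the right check for part (e).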

4. [15 points] Kernelizing the Perceptron


Let there be a binary classification problem with $y \in \{0, 1\}$. The perceptron uses hypotheses
of the form $h_\theta(x) = g(\theta^T x)$, where $g(z) = \mathrm{sign}(z) = 1$ if $z \ge 0$, 0 otherwise. In this problem
we will consider a stochastic gradient descent-like implementation of the perceptron algorithm
where each update to the parameters θ is made using only one training example. However, unlike
stochastic gradient descent, the perceptron algorithm will only make one pass through the entire
training set. The update rule for this version of the perceptron algorithm is given by

\[ \theta^{(i+1)} := \theta^{(i)} + \alpha\left(y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)})\right)x^{(i+1)} \]

where $\theta^{(i)}$ is the value of the parameters after the algorithm has seen the first i training examples.
Prior to seeing any training examples, $\theta^{(0)}$ is initialized to $\vec{0}$.

(a) [3 points] Let K be a kernel corresponding to some very high-dimensional feature mapping φ.
Suppose φ is so high-dimensional (say, ∞-dimensional) that it’s infeasible to ever represent
φ(x) explicitly. Describe how you would apply the “kernel trick” to the perceptron to make
it work in the high-dimensional feature space φ, but without ever explicitly computing φ(x).
[Note: You don't have to worry about the intercept term. If you like, think of φ as having
the property that $\phi_0(x) = 1$ so that this is taken care of.] Your description should specify:
i. [1 points] How you will (implicitly) represent the high-dimensional parameter vector
$\theta^{(i)}$, including how the initial value $\theta^{(0)} = 0$ is represented (note that $\theta^{(i)}$ is now a
vector whose dimension is the same as the feature vectors $\phi(x)$);
ii. [1 points] How you will efficiently make a prediction on a new input $x^{(i+1)}$. I.e., how
you will compute $h_{\theta^{(i)}}(x^{(i+1)}) = g({\theta^{(i)}}^T \phi(x^{(i+1)}))$, using your representation of $\theta^{(i)}$;
and
iii. [1 points] How you will modify the update rule given above to perform an update to θ
on a new training example $(x^{(i+1)}, y^{(i+1)})$; i.e., using the update rule corresponding to
the feature mapping φ:

\[ \theta^{(i+1)} := \theta^{(i)} + \alpha\left(y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)})\right)\phi(x^{(i+1)}) \]

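Answer:
The parameter vector $\theta^{(0)}$ is initialized as $\vec{0} = \sum_{j=1}^{n} \beta_j^{(0)} \phi(x^{(j)})$ where $\beta_j^{(0)} = 0$ for all $j = 1, \dots, n$.
At each iteration i, we add some multiple of $\phi(x^{(i)})$ to θ, thus θ is always a linear combination
of the feature vectors of the training examples and can be written as $\theta^{(i)} = \sum_{j=1}^{n} \beta_j^{(i)} \phi(x^{(j)})$.
We therefore store the coefficients $\beta_j^{(i)}$ instead of θ itself.
Given a new input $x^{(i+1)}$, the prediction is

\begin{align}
h_{\theta^{(i)}}(x^{(i+1)}) &= \mathrm{sign}\left({\theta^{(i)}}^T \phi(x^{(i+1)})\right) \tag{11} \\
&= \mathrm{sign}\left(\sum_{j=1}^{n} \beta_j^{(i)} \phi(x^{(j)})^T \phi(x^{(i+1)})\right) \tag{12} \\
&= \mathrm{sign}\left(\sum_{j=1}^{n} \beta_j^{(i)} K(x^{(i+1)}, x^{(j)})\right) \tag{13}
\end{align}

The update rule at the (i + 1)th iteration:

\begin{align}
\theta^{(i+1)} &= \theta^{(i)} + \alpha\left(y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)})\right)\phi(x^{(i+1)}) \tag{14} \\
&= \sum_{j=1}^{n} \beta_j^{(i)} \phi(x^{(j)}) + \alpha\left(y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)})\right)\phi(x^{(i+1)}) \tag{15} \\
&= \sum_{j \ne i+1} \beta_j^{(i)} \phi(x^{(j)}) + \left[\beta_{i+1}^{(i)} + \alpha\left(y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)})\right)\right]\phi(x^{(i+1)}) \tag{16}
\end{align}

All coefficients stay the same except for $\beta_{i+1}$, which needs to be updated according to the
formula $\beta_{i+1}^{(i+1)} = \beta_{i+1}^{(i)} + \alpha\left(y^{(i+1)} - h_{\theta^{(i)}}(x^{(i+1)})\right)$.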
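The representation above translates into very little code. This is a minimal sketch, not the official solution: the function names match the problem statement, but the argument lists and the choice to store the state as a list of (coefficient, example) pairs are our own assumptions about the starter code.

```python
import numpy as np

def sign(z):
    """The problem's g(z): 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def initial_state():
    # theta = 0 is represented by an empty list of (beta_j, x_j) pairs.
    return []

def predict(state, kernel, x_i):
    # theta^T phi(x_i) = sum_j beta_j K(x_j, x_i); phi is never computed.
    return sign(sum(beta * kernel(x_j, x_i) for beta, x_j in state))

def update_state(state, kernel, learning_rate, x_i, y_i):
    # Only the coefficient of the newly seen example changes.
    beta = learning_rate * (y_i - predict(state, kernel, x_i))
    state.append((beta, x_i))

def dot_kernel(a, b):
    return float(np.dot(a, b))

# One pass over two hypothetical training examples.
state = initial_state()
for x, y in [(np.array([-1.0, -1.0]), 0.0), (np.array([2.0, 1.0]), 1.0)]:
    update_state(state, dot_kernel, 0.5, x, y)
```

Storing only the examples seen so far keeps the state size equal to the number of updates, and each prediction costs one kernel evaluation per stored example.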

(b) [10 points] Implement your approach by completing the initial_state, predict, and
update_state methods of src/perceptron/perceptron.py.
We provide three functions to be used as kernel, a dot-product kernel defined as:

\[ K(x, z) = x^\top z, \tag{17} \]

a radial basis function (RBF) kernel, defined as:

\[ K(x, z) = \exp\left(-\frac{\|x - z\|_2^2}{2\sigma^2}\right), \tag{18} \]

and finally the following function:

\[ K(x, z) = \begin{cases} -1 & x = z \\ 0 & x \ne z \end{cases} \tag{19} \]

Note that the last function is not a kernel function (since its corresponding matrix
is not a PSD matrix). However, we are still interested to see what happens when
the kernel is invalid. Run src/perceptron/perceptron.py to train kernelized perceptrons
on src/perceptron/train.csv. The code will then test the perceptron on
src/perceptron/test.csv and save the resulting predictions in the src/perceptron/
folder. Plots will also be saved in src/perceptron/.
Include the three plots (corresponding to each of the kernels) in your writeup, and indicate
which plot belongs to which function.
Answer:
Figure 2: The Perceptron using different kernels. (a) Dot-product kernel. (b) RBF kernel. (c) Not-a-kernel function.

(c) [2 points] One of the choices in Q4b completely fails, one works a bit, and one works well
in classifying the points. Discuss the performance of different choices and why do they fail
or perform well?
Answer:
The dot-product kernel fits a linear decision boundary, which did not work well on our
highly non-linear dataset. Meanwhile, the RBF kernel has an infinite-dimensional
feature map and was able to learn a non-linear decision boundary (figure 2(b)). Lastly,
the not-a-kernel function failed to learn anything useful: it returns 0 for every pair of
distinct points, so the score of any point that is not in the training set is 0 and the
prediction is always g(0) = 1.

5. [30 points] Neural Networks: MNIST image classification


In this problem, you will implement a simple neural network to classify grayscale images of
handwritten digits (0 - 9) from the MNIST dataset. The dataset contains 60,000 training images
and 10,000 testing images of handwritten digits, 0 - 9. Each image is 28×28 pixels in size, and
is generally represented as a flat vector of 784 numbers. It also includes labels for each example,
a number indicating the actual digit (0 - 9) handwritten in that image. A sample of a few such
images are shown below.

The data and starter code for this problem can be found in

• src/mnist/nn.py
• src/mnist/images_train.csv
• src/mnist/labels_train.csv
• src/mnist/images_test.csv
• src/mnist/labels_test.csv

The starter code splits the set of 60,000 training images and labels into a set of 50,000 examples
as the training set, and 10,000 examples for dev set.
To start, you will implement a neural network with a single hidden layer and cross entropy loss,
and train it with the provided data set. Use the sigmoid function as activation for the hidden
layer, and softmax function for the output layer. Recall that for a single example (x, y), the
cross entropy loss is:

\[ CE(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k, \]

where $\hat{y} \in \mathbb{R}^K$ is the vector of softmax outputs from the model for the training example x, and
$y \in \mathbb{R}^K$ is the ground-truth vector for the training example x such that $y = [0, ..., 0, 1, 0, ..., 0]^\top$
contains a single 1 at the position of the correct class (also called a “one-hot” representation).
For clarity, we provide the forward propagation equations below for the neural network with a
single hidden layer. We have labeled data $(x^{(i)}, y^{(i)})_{i=1}^{n}$, where $x^{(i)} \in \mathbb{R}^d$, and $y^{(i)} \in \mathbb{R}^K$ is a
one-hot vector as described above. Let h be the number of hidden units in the neural network,
so that weight matrices $W^{[1]} \in \mathbb{R}^{d \times h}$ and $W^{[2]} \in \mathbb{R}^{h \times K}$. We also have biases $b^{[1]} \in \mathbb{R}^h$ and
$b^{[2]} \in \mathbb{R}^K$. The forward propagation equations for a single input $x^{(i)}$ then are:

\begin{align*}
a^{(i)} &= \sigma\left({W^{[1]}}^\top x^{(i)} + b^{[1]}\right) \in \mathbb{R}^h \\
z^{(i)} &= {W^{[2]}}^\top a^{(i)} + b^{[2]} \in \mathbb{R}^K \\
\hat{y}^{(i)} &= \mathrm{softmax}(z^{(i)}) \in \mathbb{R}^K
\end{align*}

where σ is the sigmoid function.


For n training examples, we average the cross entropy loss over the n examples.

\[ J(W^{[1]}, W^{[2]}, b^{[1]}, b^{[2]}) = \frac{1}{n}\sum_{i=1}^{n} CE(y^{(i)}, \hat{y}^{(i)}) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_k^{(i)} \log \hat{y}_k^{(i)}. \]

The starter code already converts labels into one hot representations for you.
Instead of batch gradient descent or stochastic gradient descent, the common practice is to use
mini-batch gradient descent for deep learning tasks. In this case, the cost function is defined as
follows:

\[ J_{MB} = \frac{1}{B}\sum_{i=1}^{B} CE(y^{(i)}, \hat{y}^{(i)}) \]

where B is the batch size, i.e. the number of training examples in each mini-batch.

(a) [5 points]
For a single input example $x^{(i)}$ with one-hot label vector $y^{(i)}$, show that

\[ \nabla_{z^{(i)}} CE(y^{(i)}, \hat{y}^{(i)}) = \hat{y}^{(i)} - y^{(i)} \in \mathbb{R}^K \]

where $z^{(i)} \in \mathbb{R}^K$ is the input to the softmax function, i.e.

\[ \hat{y}^{(i)} = \mathrm{softmax}(z^{(i)}) \]

(Note: in deep learning, $z^{(i)}$ is sometimes referred to as the "logits".)

Hint: To simplify your answer, it might be convenient to denote the true label of $x^{(i)}$ as
$l \in \{1, \dots, K\}$. Hence l is the index such that $y^{(i)} = [0, ..., 0, 1, 0, ..., 0]^\top$ contains a
single 1 at the l-th position. You may also wish to compute $\frac{\partial CE(y^{(i)}, \hat{y}^{(i)})}{\partial z_j^{(i)}}$ for $j \ne l$ and
$j = l$ separately.
Answer:
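For an input example $x^{(i)}$ of class l, the loss function is:

\begin{align}
CE(y^{(i)}, \hat{y}^{(i)}) &= -\sum_{k=1}^{K} y_k^{(i)} \log \hat{y}_k^{(i)} \tag{20} \\
&= -\log \hat{y}_l^{(i)} \tag{21} \\
&= -\log\left(\frac{\exp(z_l^{(i)})}{\sum_{k=1}^{K} \exp(z_k^{(i)})}\right) \tag{22} \\
&= \log\left(\sum_{k=1}^{K} \exp(z_k^{(i)})\right) - z_l^{(i)} \tag{23}
\end{align}

Consider each component j of the gradient. When j = l we have:

\[ \frac{\partial CE(y^{(i)}, \hat{y}^{(i)})}{\partial z_l^{(i)}} = \frac{\exp(z_l^{(i)})}{\sum_{k=1}^{K} \exp(z_k^{(i)})} - 1 = \hat{y}_l^{(i)} - y_l^{(i)} \tag{24} \]

When $j \ne l$, then:

\[ \frac{\partial CE(y^{(i)}, \hat{y}^{(i)})}{\partial z_j^{(i)}} = \frac{\exp(z_j^{(i)})}{\sum_{k=1}^{K} \exp(z_k^{(i)})} = \hat{y}_j^{(i)} - y_j^{(i)} \quad (\text{as } y_j^{(i)} = 0) \tag{25} \]

These imply that $\nabla_{z^{(i)}} CE(y^{(i)}, \hat{y}^{(i)}) = \hat{y}^{(i)} - y^{(i)}$.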
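A quick way to gain confidence in this result is a finite-difference check. The sketch below compares the analytic gradient with a central-difference estimate on a random logit vector; the dimension K = 5, the label position, and the step size are arbitrary choices of ours.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)  # shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(y, z):
    """CE between a one-hot label y and softmax(z)."""
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
K = 5
z = rng.normal(size=K)
y = np.zeros(K)
y[2] = 1.0  # one-hot label

analytic = softmax(z) - y  # the claimed gradient

eps = 1e-6
numeric = np.zeros(K)
for j in range(K):
    step = np.zeros(K)
    step[j] = eps
    # Central difference along coordinate j.
    numeric[j] = (cross_entropy(y, z + step) - cross_entropy(y, z - step)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```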

(b) [15 points]


Implement both forward-propagation and back-propagation for the above loss function
$J_{MB} = \frac{1}{B}\sum_{i=1}^{B} CE(y^{(i)}, \hat{y}^{(i)})$. Initialize the weights of the network by sampling values
from a standard normal distribution. Initialize the bias/intercept term to 0. Set the number
of hidden units to be 300, and learning rate to be 5. Set B = 1,000 (mini batch size).
This means that we train with 1,000 examples in each iteration. Therefore, for each epoch,
we need 50 iterations to cover the entire training data. The images are pre-shuffled. So you
don’t need to randomly sample the data, and can just create mini-batches sequentially.
Train the model with mini-batch gradient descent as described above. Run the training for
30 epochs. At the end of each epoch, calculate the value of loss function averaged over the
entire training set, and plot it (y-axis) against the number of epochs (x-axis). In the same
image, plot the value of the loss function averaged over the dev set, and plot it against the
number of epochs.
Similarly, in a new image, plot the accuracy (on y-axis) over the training set, measured as
the fraction of correctly classified examples, versus the number of epochs (x-axis). In the
same image, also plot the accuracy over the dev set versus number of epochs.
Submit the two plots (one for loss vs epoch, another for accuracy vs epoch) in
your writeup.
Also, at the end of 30 epochs, save the learnt parameters (i.e all the weights and biases) into
a file, so that next time you can directly initialize the parameters with these values from
the file, rather than re-training all over. You do NOT need to submit these parameters.
Hint: Be sure to vectorize your code as much as possible! Training can be very slow
otherwise.

Answer:
For clarity, let’s write down equations to be implemented in this sub-problem. We are not
Stanford students, so let’s change some notations to make our life easier. The forward
propagation computation for an example (x, y) is re-written equivalently as follows:
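\begin{align}
z^{[1]} &= W^{[1]} x + b^{[1]} \tag{26} \\
a^{[1]} &= \sigma(z^{[1]}) \tag{27} \\
z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]} \tag{28} \\
\hat{y} &= \mathrm{softmax}(z^{[2]}) \tag{29}
\end{align}
Note that our weight matrices are transposes of the weights introduced in this sub-problem
statement. Let L be the cross entropy loss of a single training example (x, y). Using matrix
calculus, we obtain the following formulas:
\begin{align}
\frac{\partial L}{\partial z^{[2]}} &= \hat{y} - y \tag{30} \\
\frac{\partial L}{\partial W^{[2]}} &= \frac{\partial L}{\partial z^{[2]}} {a^{[1]}}^T, \quad \frac{\partial L}{\partial b^{[2]}} = \frac{\partial L}{\partial z^{[2]}} \tag{31} \\
\frac{\partial L}{\partial a^{[1]}} &= {W^{[2]}}^T \frac{\partial L}{\partial z^{[2]}}, \quad \frac{\partial L}{\partial z^{[1]}} = \frac{\partial L}{\partial a^{[1]}} \odot \sigma'(z^{[1]}) \tag{32} \\
\frac{\partial L}{\partial W^{[1]}} &= \frac{\partial L}{\partial z^{[1]}} x^T, \quad \frac{\partial L}{\partial b^{[1]}} = \frac{\partial L}{\partial z^{[1]}} \tag{33}
\end{align}
Here $\odot$ denotes the elementwise product.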
After running the model for 30 epochs, our result is shown in the figure below.

Figure 3: Baseline NN model.
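Equations (26) to (33) vectorize directly over a mini-batch. The sketch below is one way to write them with NumPy, under our own assumptions: examples are stored as rows of `X`, the parameters live in a dict with keys "W1", "b1", "W2", "b2" (shapes (h, d), (h,), (K, h), (K,), following our transposed convention), and gradients are averaged over the batch. It is not the starter-code interface.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    shifted = z - z.max(axis=1, keepdims=True)  # row-wise stability shift
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def forward(X, params):
    Z1 = X @ params["W1"].T + params["b1"]    # (26), shape (B, h)
    A1 = sigmoid(Z1)                          # (27)
    Z2 = A1 @ params["W2"].T + params["b2"]   # (28), shape (B, K)
    return A1, softmax(Z2)                    # (29)

def loss(X, Y, params):
    """Mini-batch average of the cross entropy loss."""
    _, Y_hat = forward(X, params)
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))

def backward(X, Y, params):
    """Gradients of the mini-batch loss w.r.t. every parameter."""
    B = X.shape[0]
    A1, Y_hat = forward(X, params)
    dZ2 = (Y_hat - Y) / B                     # (30), with the 1/B average folded in
    dA1 = dZ2 @ params["W2"]                  # (32), left
    dZ1 = dA1 * A1 * (1.0 - A1)               # (32), right: sigma' = a(1 - a)
    return {
        "W2": dZ2.T @ A1, "b2": dZ2.sum(axis=0),   # (31)
        "W1": dZ1.T @ X, "b1": dZ1.sum(axis=0),    # (33)
    }
```

A gradient-descent step is then `params[k] -= learning_rate * grads[k]` for each key k.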

(c) [7 points] Now add a regularization term to your cross entropy loss. The loss function will
become

\[ J_{MB} = \left(\frac{1}{B}\sum_{i=1}^{B} CE(y^{(i)}, \hat{y}^{(i)})\right) + \lambda\left(\|W^{[1]}\|^2 + \|W^{[2]}\|^2\right) \]

Be careful not to regularize the bias/intercept term. Set λ to be 0.0001. Implement the
regularized version and plot the same figures as part (b). Be careful NOT to include the
regularization term to measure the loss value for plotting (i.e., regularization should only
be used for gradient calculation for the purpose of training).
Submit the two new plots obtained with regularized training (i.e loss (without
regularization term) vs epoch, and accuracy vs epoch) in your writeup.
Compare the plots obtained from the regularized model with the plots obtained
from the non-regularized model, and summarize your observations in a couple
of sentences.
As in the previous part, save the learnt parameters (weights and biases) into a different file
so that we can initialize from them next time.
Answer:

Figure 4: Regularized NN model.

Both models attained the same level of accuracy on the training set, which was
almost optimal. But the gap between the training and dev accuracy is greater in the
non-regularized baseline model. As a heuristic, this indicates that, compared to the
regularized model, the baseline suffers from higher variance. Regularization did
help in this case, and we expect better test accuracy from the regularized model (as we
will see in the next sub-problem).

(d) [3 points] All this while you should have stayed away from the test data completely. Now
that you have convinced yourself that the model is working as expected (i.e, the observations
you made in the previous part matches what you learnt in class about regularization), it is
finally time to measure the model performance on the test set. Once we measure the test
set performance, we report it (whatever value it may be), and NOT go back and refine the
model any further.

Initialize your model from the parameters saved in part (b) (i.e, the non-regularized model),
and evaluate the model performance on the test data. Repeat this using the parameters
saved in part (c) (i.e, the regularized model).
Report your test accuracy for both regularized model and non-regularized model. Briefly
(in one sentence) explain why this outcome makes sense. You should have accuracy close
to 0.92870 without regularization, and 0.96760 with regularization. Note: these accuracies
assume you implement the code with the matrix dimensions as specified in the comments,
which is not the same way as specified in your code. Even if you do not get precisely these
numbers, you should observe good accuracy and better test accuracy with regularization.
Answer:
Our model had test accuracy 0.932 without regularization and 0.9653 with regularization.
Regularized models often generalize better (after some tuning of λ),
which leads to better test accuracy.

6. [20 points] Bayesian Interpretation of Regularization


Background: In Bayesian statistics, almost every quantity is a random variable, which can
either be observed or unobserved. For instance, parameters θ are generally unobserved random
variables, and data x and y are observed random variables. The joint distribution of all the
random variables is also called the model (e.g., p(x, y, θ)). Every unknown quantity can be estimated
by conditioning the model on all the observed quantities. Such a conditional distribution
over the unobserved random variables, conditioned on the observed random variables, is called
the posterior distribution. For instance p(θ|x, y) is the posterior distribution in the machine
learning context. A consequence of this approach is that we are required to endow our model
parameters, i.e., p(θ), with a prior distribution. The prior probabilities are to be assigned before
we see the data—they capture our prior beliefs of what the model parameters might be before
observing any evidence.
In the purest Bayesian interpretation, we are required to keep the entire posterior distribution
over the parameters all the way until prediction, to come up with the posterior predictive
distribution, and the final prediction will be the expected value of the posterior predictive distribution.
However in most situations, this is computationally very expensive, and we settle for
a compromise that is less pure (in the Bayesian sense).
The compromise is to estimate a point value of the parameters (instead of the full distribution)
which is the mode of the posterior distribution. Estimating the mode of the posterior distribution
is also called maximum a posteriori estimation (MAP). That is,

\[ \theta_{MAP} = \arg\max_{\theta} p(\theta|x, y). \]

Compare this to the maximum likelihood estimation (MLE) we have seen previously:

\[ \theta_{MLE} = \arg\max_{\theta} p(y|x, \theta). \]

In this problem, we explore the connection between MAP estimation, and common regularization
techniques that are applied with MLE estimation. In particular, you will show how the choice
of prior distribution over θ (e.g., Gaussian or Laplace prior) is equivalent to different kinds of
regularization (e.g., $L_2$ or $L_1$ regularization). You will also explore how regularization strengths
affect generalization in part (d).

(a) [3 points] Show that $\theta_{MAP} = \arg\max_{\theta} p(y|x, \theta)p(\theta)$ if we assume that $p(\theta) = p(\theta|x)$. The
assumption that p(θ) = p(θ|x) will be valid for models such as linear regression where the
input x are not explicitly modeled by θ. (Note that this means x and θ are marginally
independent, but not conditionally independent when y is given.)
Answer:
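We have:

\begin{align}
\theta_{MAP} &= \arg\max_{\theta} p(\theta|x, y) \tag{34} \\
&= \arg\max_{\theta} \frac{p(\theta|x)p(y|x, \theta)}{p(y|x)} \tag{35} \\
&= \arg\max_{\theta} p(\theta)p(y|x, \theta) \tag{36}
\end{align}

Here, we can safely remove $p(y|x)$ in the denominator as it does not depend on θ. And by
assumption, $p(\theta|x) = p(\theta)$.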

(b) [5 points] Recall that $L_2$ regularization penalizes the $L_2$ norm of the parameters while
minimizing the loss (i.e., negative log likelihood in case of probabilistic models). Now we
will show that MAP estimation with a zero-mean Gaussian prior over θ, specifically
$\theta \sim \mathcal{N}(0, \eta^2 I)$, is equivalent to applying $L_2$ regularization with MLE estimation. Specifically,
show that for some scalar λ,

\[ \theta_{MAP} = \arg\min_{\theta} -\log p(y|x, \theta) + \lambda\|\theta\|_2^2. \tag{37} \]

Also, what is the value of λ?


Answer:
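Using the result from (a), we have:
\begin{align}
\theta_{MAP} &= \arg\max_{\theta} \log\left(p(\theta)p(y|x, \theta)\right) \tag{38} \\
&= \arg\min_{\theta} -\log p(y|x, \theta) - \log\left[\frac{1}{(2\pi)^{d/2}|\eta^2 I|^{1/2}} \exp\left(-\frac{\|\theta\|_2^2}{2\eta^2}\right)\right] \tag{39} \\
&= \arg\min_{\theta} -\log p(y|x, \theta) + \frac{\|\theta\|_2^2}{2\eta^2} \tag{40}
\end{align}
Then, $\lambda = 1/(2\eta^2)$.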

(c) [7 points] Now consider a specific instance, a linear regression model given by $y = \theta^T x + \epsilon$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Assume that the random noise $\epsilon^{(i)}$ is independent for every training
example $x^{(i)}$. Like before, assume a Gaussian prior on this model such that $\theta \sim \mathcal{N}(0, \eta^2 I)$.
For notation, let X be the design matrix of all the training example inputs where each row
vector is one example input, and $\vec{y}$ be the column vector of all the example outputs.
Come up with a closed form expression for θMAP .
Answer:
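Our model for the whole training set can be written as $\vec{y} = X\theta + \vec{\epsilon}$ where
$\vec{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$ and $\theta \sim \mathcal{N}(0, \eta^2 I)$. Then, $\vec{y}\,|\,X, \theta \sim \mathcal{N}(X\theta, \sigma^2 I)$. Using the result from (b),
with n the number of training examples, we have:
\begin{align}
\theta_{MAP} &= \arg\min_{\theta} -\log p(\vec{y}|X, \theta) + \frac{\|\theta\|_2^2}{2\eta^2} \tag{41} \\
&= \arg\min_{\theta} -\log\left[\frac{1}{(2\pi)^{n/2}|\sigma^2 I|^{1/2}} \exp\left(-\frac{\|\vec{y} - X\theta\|_2^2}{2\sigma^2}\right)\right] + \frac{\|\theta\|_2^2}{2\eta^2} \tag{42} \\
&= \arg\min_{\theta} \frac{\|\vec{y} - X\theta\|_2^2}{2\sigma^2} + \frac{\|\theta\|_2^2}{2\eta^2} \tag{43} \\
&= \arg\min_{\theta} \|\vec{y} - X\theta\|_2^2 + \frac{\sigma^2}{\eta^2}\|\theta\|_2^2 \tag{44}
\end{align}
Let J(θ) be the above objective function. To minimize J(θ), we compute the gradient of
J(θ) w.r.t. θ and set it to 0 at the minimizer $\hat{\theta}$. We have:
\begin{align}
\nabla_{\theta} J(\hat{\theta}) &= 2X^T(X\hat{\theta} - \vec{y}) + \frac{2\sigma^2}{\eta^2}\hat{\theta} \tag{45} \\
&= 2\left[\left(X^T X + \frac{\sigma^2}{\eta^2} I\right)\hat{\theta} - X^T \vec{y}\right] \tag{46} \\
&= 0 \tag{47}
\end{align}

This implies that $\hat{\theta}_{MAP} = \left(X^T X + \frac{\sigma^2}{\eta^2} I\right)^{-1} X^T \vec{y}$.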
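The closed form is easy to verify numerically. The sketch below computes it on synthetic data (the data-generating values and the choices σ = 0.1, η = 1 are arbitrary assumptions of ours) and checks that the gradient of the objective vanishes at the solution.

```python
import numpy as np

def theta_map(X, y, sigma, eta):
    """Closed-form MAP estimate: (X^T X + (sigma^2 / eta^2) I)^{-1} X^T y."""
    d = X.shape[1]
    A = X.T @ X + (sigma ** 2 / eta ** 2) * np.eye(d)
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

sigma, eta = 0.1, 1.0
theta_hat = theta_map(X, y, sigma, eta)

# Gradient of ||y - X theta||^2 + (sigma^2 / eta^2) ||theta||^2 at theta_hat.
grad = 2 * X.T @ (X @ theta_hat - y) + 2 * (sigma ** 2 / eta ** 2) * theta_hat
print(theta_hat, np.abs(grad).max())
```

This is exactly ridge regression with penalty weight σ²/η²: a tighter prior (smaller η) shrinks the estimate more strongly toward zero.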

(d) [5 points] Next, consider the Laplace distribution, whose density is given by

\[ f_L(z|\mu, b) = \frac{1}{2b}\exp\left(-\frac{|z - \mu|}{b}\right). \]

As before, consider a linear regression model given by $y = x^T\theta + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
Assume a Laplace prior on this model, where each parameter $\theta_i$ is marginally independent,
and is distributed as $\theta_i \sim L(0, b)$.
Show that $\theta_{MAP}$ in this case is equivalent to the solution of linear regression with $L_1$
regularization, whose loss is specified as

\[ J(\theta) = \|X\theta - \vec{y}\|_2^2 + \gamma\|\theta\|_1 \]


Also, what is the value of γ?
Note: A closed form solution for linear regression problem with L1 regularization does not
exist. To optimize this, we use gradient descent with a random initialization and solve it
numerically.
Answer:
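In this case, our model is $\vec{y} = X\theta + \vec{\epsilon}$ where $\vec{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$ and $\theta_i \sim \mathrm{Laplace}(0, b)$ for
$i = 1, \dots, d$. As before, $\vec{y}\,|\,X, \theta \sim \mathcal{N}(X\theta, \sigma^2 I)$ and the density is $p(\theta_i) = \frac{1}{2b}\exp\left(-\frac{|\theta_i|}{b}\right)$. With n
the number of training examples, we have:
\begin{align}
\theta_{MAP} &= \arg\min_{\theta} -\log\left(p(\theta)p(\vec{y}|X, \theta)\right) \tag{48} \\
&= \arg\min_{\theta} -\log p(\vec{y}|X, \theta) - \log\prod_{i=1}^{d} p(\theta_i) \tag{49} \\
&= \arg\min_{\theta} -\log\left[\frac{1}{(2\pi)^{n/2}|\sigma^2 I|^{1/2}} \exp\left(-\frac{\|\vec{y} - X\theta\|_2^2}{2\sigma^2}\right)\right] - \sum_{i=1}^{d} \log\left[\frac{1}{2b}\exp\left(-\frac{|\theta_i|}{b}\right)\right] \tag{50} \\
&= \arg\min_{\theta} \frac{\|\vec{y} - X\theta\|_2^2}{2\sigma^2} + \sum_{i=1}^{d} \frac{|\theta_i|}{b} \tag{51} \\
&= \arg\min_{\theta} \|\vec{y} - X\theta\|_2^2 + \frac{2\sigma^2}{b}\|\theta\|_1 \tag{52}
\end{align}
Then, $\gamma = 2\sigma^2/b$.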

Remark: Linear regression with L2 regularization is also commonly called Ridge regression, and
when L1 regularization is employed, is commonly called Lasso regression. These regularizations
can be applied to any Generalized Linear models just as above (by replacing log p(y|x, θ) with
the appropriate family likelihood). Regularization techniques of the above type are also called
weight decay, and shrinkage. The Gaussian and Laplace priors encourage the parameter values
to be closer to their mean (i.e., zero), which results in the shrinkage effect.
Remark: Lasso regression (i.e., L1 regularization) is known to result in sparse parameters,
where most of the parameter values are zero, with only some of them non-zero.
