Final Exam Solutions
Instructions: Write your name on the exam sheet. No electronic devices may be used
during the exam. You may consult your paper notes and/or a print copy of texts during
the exam. Sharing of notes/texts during the exam is strictly prohibited. Show your work
for all derivation questions. A suitable explanation must be provided for all questions to
earn full credit. Attempt all problems. Partial credit may be given for partially incorrect
or incomplete answers. If your answer to a question spans two pages, please make sure
to indicate that at the end of the first page of the answer. If you have questions at any
time, please raise your hand.
Points: Each of the 10 problems is worth 10 points. Total: 100 points.
1. (10 points) Optimization: The following questions concern numerical optimization.
a. (5 pts) Explain the primary difference between using gradient descent and stochastic
gradient descent to minimize the risk of a supervised machine learning model. Provide
supporting equations.
Example Solution: In gradient descent, we use all of the training data to compute the
value of the gradient of the risk
\[ R(\theta, D) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, f_\theta(x_n)) \]
with respect to the model parameters θ. This gives a gradient vector of the form
\[ \nabla R(\theta, D) = \frac{1}{N}\sum_{n=1}^{N} \nabla \ell(y_n, f_\theta(x_n)). \]
The main difference with stochastic gradient descent is that we use a subset (or batch) of
data cases of size B < N to estimate the gradient. Letting \(\mathcal{B}\) be the set of data case
indices in a batch, we have
\[ \nabla R(\theta, D) \approx \frac{1}{B}\sum_{n \in \mathcal{B}} \nabla \ell(y_n, f_\theta(x_n)). \]
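To make the contrast concrete, the following PyTorch sketch (not part of the original solution; the linear model, squared-error loss, synthetic data, and batch size B = 32 are arbitrary illustrative choices) computes a full-batch gradient and a mini-batch gradient estimate of the same risk:

import torch

# Minimal sketch: full-batch vs. mini-batch gradients of the empirical risk.
# The linear model, squared-error loss, and batch size B are illustrative.
N, D, B = 1000, 5, 32
X, y = torch.randn(N, D), torch.randn(N)
theta = torch.zeros(D, requires_grad=True)

def risk(Xb, yb, theta):
    # Average loss over whichever batch of data is passed in.
    return torch.mean((Xb @ theta - yb) ** 2)

# Gradient descent: the gradient uses all N training cases.
full_grad = torch.autograd.grad(risk(X, y, theta), theta)[0]

# Stochastic gradient descent: the gradient is estimated from a random batch of size B < N.
idx = torch.randperm(N)[:B]
batch_grad = torch.autograd.grad(risk(X[idx], y[idx], theta), theta)[0]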
b. (5 pts) Give two (2) reasons why stochastic gradient descent is usually preferred to
gradient descent when learning supervised neural network models.
Example Solution: The first reason why stochastic gradient descent is usually preferred
is speed. Using B < N requires less computation per optimization iteration. This usually
results in model learning taking much less total time compared to using all of the data.
The second reason is memory usage. Using B < N requires much less memory during
backpropagation compared to using all of the data. This can make it possible to efficiently
learn models using GPUs even when all of the data won’t fit in GPU memory at once.
2. (10 points) Capacity Control: List four (4) different approaches to prevent
overfitting when learning neural network models and briefly describe each of them.
Example Solution:
1. Weight decay (or regularization): We can add a weight decay term λ∥w∥₂² to the
optimization objective function. This can prevent the weights from becoming large
and therefore keep the learned function from becoming too complex.
2. Dropout: We can add dropout layers to the model during training. This sets hidden
unit values to zero at random with a specified probability and has been shown
experimentally to reduce overfitting.
3. Early Stopping: We can use a validation set to determine how long to run learning
for. When the validation set error stops decreasing, we can stop learning. If the
weights are initialized to small values, this generally stops the weights from becoming
large and results in less complex functions.
4. Data augmentation: We can enlarge the training set by applying label-preserving
transformations to the inputs (for example, small shifts or rotations of images). Training
on the augmented data makes it harder for the model to memorize individual training
cases and reduces overfitting. A minimal PyTorch sketch of some of these techniques is
shown below.
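The sketch below (illustrative only; the layer sizes, dropout probability, and weight decay coefficient are arbitrary assumptions) shows how weight decay and dropout would typically be specified, with early stopping left to the training loop:

import torch
import torch.nn as nn

# Minimal sketch: weight decay and dropout for a small MLP (layer sizes are
# arbitrary assumptions). Early stopping would monitor validation risk in the
# training loop and halt when it stops decreasing.
model = nn.Sequential(
    nn.Linear(900, 100), nn.ReLU(),
    nn.Dropout(p=0.5),          # randomly zeroes hidden units during training
    nn.Linear(100, 10),
)
# weight_decay adds the lambda * ||w||^2 penalty to the optimization objective.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)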
3. (10 points) Model Architectures: Suppose we are building a model for classifying
grayscale handwritten digits of size 30 × 30. There are 10 classes. Use this information
to answer the following questions. Show your work and explain your answers. It is
acceptable to leave final answers in the form of an arithmetic expression (e.g., 5 · 10 + 3, etc).
a. (5 pts) Suppose we use a two-hidden-layer MLP where the first hidden layer has 100
hidden units and the second hidden layer has 50 hidden units. Assume all hidden units
have bias parameters. How many total parameters does the model have?
Example Solution: The input layer will have size 30 × 30 = 900. The first hidden layer
has size 100. Each hidden unit in the first hidden layer will thus have 900 input weights,
plus one bias parameter. This yields a total of 901 × 100 parameters in the input to first
hidden layer. The second hidden layer has 50 hidden units, each with 100 inputs from the
first hidden layer, and a bias. This gives 101 × 50 parameters. The output layer has one
output per class, each with 50 inputs and a bias. This yields 51 × 10 parameters. The
total number of parameters needed in the network is thus 901 × 100 + 101 × 50 + 51 × 10.
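This count can be verified with a short PyTorch sketch (illustrative; it simply mirrors the layer sizes described above and sums parameter counts):

import torch.nn as nn

# Minimal sketch: the MLP described above (900 inputs, hidden layers of 100 and
# 50 units with biases, 10 outputs). Summing parameter sizes reproduces the count.
mlp = nn.Sequential(
    nn.Linear(900, 100), nn.ReLU(),
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, 10),
)
total = sum(p.numel() for p in mlp.parameters())
print(total)  # 901*100 + 101*50 + 51*10 = 95660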
b. (5 pts) Suppose we use a CNN where the input is followed immediately by a
feature extraction section consisting of two Conv2D → ReLU → MaxPool blocks. If
the first block uses 10 output channels with a kernel size of 4 × 4 and the second block
uses 20 output channels with kernels of size 4 × 4, how many parameters are in the
feature extraction section of the model? Assume that the Conv2D operation includes the
addition of a bias term for each output channel.
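A hand count for this part can be checked with the PyTorch sketch below (illustrative only; it assumes a single grayscale input channel, and the max-pooling size, which contributes no parameters, is chosen arbitrarily):

import torch.nn as nn

# Minimal sketch: two Conv2D -> ReLU -> MaxPool blocks as described above,
# assuming 1 grayscale input channel. Only the Conv2d layers carry parameters.
features = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(10, 20, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2),
)
total = sum(p.numel() for p in features.parameters())
print(total)  # (4*4*1 + 1)*10 + (4*4*10 + 1)*20 = 170 + 3220 = 3390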
4. (10 points) Maximum Likelihood Estimation: The univariate normal distribution
has an alternative parameterization in terms of a parameter τ > 0 as shown below.
Use this information to answer the following questions.
\[ \mathcal{N}(x; \mu, \tau) = \frac{\sqrt{\tau}}{\sqrt{2\pi}} \exp\!\left(-\frac{\tau}{2}(x - \mu)^2\right) \]
a. (5 pts) Write down the negative log likelihood function for this model assuming a
data set containing N instances D = {x1 , ..., xN }. Simplify to the extent possible.
Example Solution: For this model, the negative log likelihood is the average over a data
set of size N of the negative of the log of the probability density function. This gives:
\[ \mathrm{nll}(\theta, D) = -\frac{1}{N}\sum_{n=1}^{N}\left(\frac{1}{2}\log(\tau) - \frac{1}{2}\log(2\pi) - \frac{\tau}{2}(x_n - \mu)^2\right) \]
b. (5 pts) Assume the value of the mean µ is known. Derive the maximum likelihood
estimate of τ . Show your work and explain the steps in your solution.
Example Solution: We first find the partial derivative of the nll with respect to τ:
\[ \frac{\partial}{\partial \tau}\mathrm{nll}(\theta, D) = \frac{\partial}{\partial \tau}\left(-\frac{1}{N}\sum_{n=1}^{N}\left(\frac{1}{2}\log(\tau) - \frac{1}{2}\log(2\pi) - \frac{\tau}{2}(x_n - \mu)^2\right)\right) = -\frac{1}{2N}\sum_{n=1}^{N}\left(\frac{1}{\tau} - (x_n - \mu)^2\right) \]
We now set this expression equal to zero and solve to identify the stationary points of the
unconstrained NLL:
\[ -\frac{1}{2N}\sum_{n=1}^{N}\left(\frac{1}{\tau} - (x_n - \mu)^2\right) = 0 \]
\[ \frac{1}{\tau} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2 \]
\[ \hat{\tau} = \frac{1}{\frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2} \]
Next, we need to check that the value found is indeed a minimizer of the unconstrained
NLL by checking that the second derivative is positive. The second derivative of the NLL is
\[ \frac{\partial^2}{\partial \tau^2}\mathrm{nll}(\theta, D) = -\frac{1}{2N}\sum_{n=1}^{N}(-1)\frac{1}{\tau^2} = \frac{1}{2\tau^2}. \]
This term is clearly positive for any τ, so the stationary point is a minimizer.
We next need to check that the identified minimizer of the unconstrained NLL τ̂ satisfies
the constraint τ̂ > 0. We can see that (x_n − µ)² ≥ 0 for all n. We assume N > 0, so the
data set is non-empty, and that at least one x_n ≠ µ. The denominator is then an average
of non-negative values with at least one strictly positive term, so the denominator is
strictly positive and τ̂ > 0. τ̂ is thus a minimizer of the NLL subject to the positivity
constraint on τ.
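As a sanity check on this derivation, the closed-form estimate can be compared against a direct numerical minimization of the NLL (a sketch with arbitrary synthetic data; the optimizer settings are illustrative):

import math
import torch

# Minimal sketch: compare the closed-form MLE of tau with a numerical minimizer
# of the NLL (mu is assumed known; the synthetic data below are arbitrary).
torch.manual_seed(0)
mu = 2.0
x = mu + 0.5 * torch.randn(1000)

tau_hat = 1.0 / torch.mean((x - mu) ** 2)        # closed-form estimate

log_tau = torch.zeros(1, requires_grad=True)     # optimize log(tau) so tau > 0
opt = torch.optim.Adam([log_tau], lr=0.05)
for _ in range(2000):
    tau = torch.exp(log_tau)
    nll = torch.mean(-0.5 * torch.log(tau) + 0.5 * math.log(2 * math.pi)
                     + 0.5 * tau * (x - mu) ** 2)
    opt.zero_grad()
    nll.backward()
    opt.step()

print(tau_hat, torch.exp(log_tau))               # the two values should agree closely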
5. (10 points) Probabilistic Classification: Consider the following attempt at
defining a multi-class probabilistic classification model. Let C be the number of classes.
Assume that x = [x1 , ..., xD ] ∈ RD is a length D row vector. Assume each wc =
[w1c ; ...; wDc ] ∈ RD is a length D column vector. Assume each bc ∈ R is a scalar value.
Let θ be the collection of model parameters. Answer the following questions.
\[ P_\theta(Y = y \mid X = x) = \prod_{c=1}^{C} \left( \frac{x w_c + b_c}{\sum_{c'=1}^{C} x w_{c'} + b_{c'}} \right)^{\mathbb{1}[c = y]} \qquad (1) \]
b. (5 pts) What conditions could we impose on the data and model parameters to
make this a valid probabilistic classification model? Explain how these conditions fix the
problem with the model. Note: your answer cannot involve changing the definition of
the conditional probability model Pθ (Y = y|X = x).
Example Solution: We can fix this problem by requiring that the data be non-negative
reals and the parameters be non-negative reals. This will ensure that the numerator
terms x w_c + b_c are all greater than or equal to 0. As a result, the denominator will also
be greater than or equal to 0 and non-negativity will be satisfied. Normalization will also
be satisfied so long as the x w_c + b_c values are not all equal to 0, since the denominator
is then strictly positive and the class probabilities sum to 1.
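A short PyTorch sketch (with arbitrary non-negative values) illustrates that under these conditions the scores defined by Equation (1) are non-negative and normalize to 1:

import torch

# Minimal sketch: with non-negative x, w_c, and b_c, the scores x w_c + b_c are
# non-negative, and dividing by their sum yields a valid distribution over the
# C classes (the values below are arbitrary non-negative choices).
C, D = 3, 4
x = torch.rand(1, D)          # non-negative data
W = torch.rand(D, C)          # non-negative weights, one column per class
b = torch.rand(C)             # non-negative biases
scores = x @ W + b            # shape (1, C), all entries >= 0
probs = scores / scores.sum()
print(probs, probs.sum())     # non-negative entries that sum to 1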
6. (10 points) Probabilistic Regression: The von Mises distribution given below
provides a probability density over angles y measured in radians. The parameters are
θ = [κ, µ] where κ ∈ R>0 is the concentration parameter and µ ∈ R is the location
parameter. The function I() provides part of the normalization term for the model. Use
this distribution to answer the following questions.
\[ p_\theta(Y = y) = \frac{1}{2\pi I(\kappa)} \exp(\kappa \cdot \cos(y - \mu)) \qquad (2) \]
a. (5 pts) Write down a probabilistic regression model based on the von Mises distribu-
tion where both the concentration and location depend on x ∈ RD . Explain your choices.
Example Solution: Let κ(x) = exp(xw + b) and µ(x) = xv + c. Here the model parameters
are θ = [w, v, b, c] ∈ R^{2D+2}. The parameter prediction function κ(x) ensures the predicted
κ values are strictly positive since exp(z) is strictly positive for all finite z. The location
µ(x) needs no constraint since µ ∈ R, so a linear function suffices. This gives us the model:
\[ p_\theta(Y = y \mid X = x) = \frac{1}{2\pi I(\kappa(x))} \exp(\kappa(x) \cdot \cos(y - \mu(x))) \]
b. (5 pts) Write down the negative log likelihood for the model you define in part (a).
Simplify to the extent possible. Show your work.
Example Solution: The negative log likelihood for this model is the negative of the
average over a data set of size N of the log of pθ (Y = yn |X = xn ). We have:
\[ \mathrm{nll}(\theta, D) = -\frac{1}{N}\sum_{n=1}^{N} \log p_\theta(Y = y_n \mid X = x_n) \]
\[ = -\frac{1}{N}\sum_{n=1}^{N} \left( -\log(2\pi I(\kappa(x_n))) + \kappa(x_n) \cdot \cos(y_n - \mu(x_n)) \right) \]
\[ = \frac{1}{N}\sum_{n=1}^{N} \left( \log(2\pi I(\kappa(x_n))) - \kappa(x_n) \cdot \cos(y_n - \mu(x_n)) \right) \]
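If I(κ) is taken to be the modified Bessel function I₀ (an assumption, since the question only states that I() provides part of the normalization term), the NLL above can be written directly in PyTorch as a sketch:

import torch

# Minimal sketch of the NLL from part (b), assuming I(kappa) is the modified
# Bessel function I_0 (available as torch.special.i0). w, v, b, c are the
# parameters of the prediction functions kappa(x) and mu(x) from part (a).
def von_mises_nll(X, y, w, v, b, c):
    kappa = torch.exp(X @ w + b)                 # strictly positive concentration
    mu = X @ v + c                               # unconstrained location
    log_norm = torch.log(2 * torch.pi * torch.special.i0(kappa))
    return torch.mean(log_norm - kappa * torch.cos(y - mu))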
7. (10 points) Experiment Design: Suppose we are training a one hidden layer
neural network classifier. Answer the following questions.
a. (5 pts) Is it a valid experiment design to evaluate several learning rates for a given
model architecture and select the learning rate that yields the lowest training risk?
Explain your answer.
Example Solution: Yes. Configuring the learning rate with the model architecture fixed
is configuring an optimizer hyper-parameter. We can select optimizer hyper-parameters
to minimize the training risk because we are only trying to find values that produce good
solutions to the optimization problem we are solving, and that optimization problem is
precisely the minimization of the training set risk.
b. (5 pts) Is it a valid experiment design to evaluate several hidden layer sizes and
select the hidden layer size that yields the lowest training set risk? Explain your answer.
Example Solution: No. The hidden layer size is a model complexity hyper-parameter.
To select a value for it, we need to use a validation set performance metric to avoid
over-fitting.
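A minimal sketch of this workflow (the candidate sizes, input/output dimensions, and the train_model and validation_risk routines are placeholder assumptions, stubbed out here for illustration only):

import torch
import torch.nn as nn

# Minimal sketch: model complexity hyper-parameters such as the hidden layer
# size are selected using validation risk, not training risk. The training and
# validation routines below are placeholder stubs.
def train_model(model):
    pass                                         # stand-in for minimizing training risk

def validation_risk(model):
    return torch.rand(1).item()                  # stand-in for risk on held-out data

best_size, best_val = None, float("inf")
for hidden_size in [25, 50, 100, 200]:           # candidate hidden layer sizes
    model = nn.Sequential(nn.Linear(900, hidden_size), nn.ReLU(),
                          nn.Linear(hidden_size, 10))
    train_model(model)
    val = validation_risk(model)
    if val < best_val:
        best_size, best_val = hidden_size, val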
8. (10 points) Product of Marginals: Consider the product of Bernoulli marginals
model Pθ(X = x) = ∏_{d=1}^{D} θ_d^{x_d} (1 − θ_d)^{(1−x_d)}. Provide a PyTorch implementation of a
function compare(x1,x2,theta) that will return true when the joint probability of x1 is
greater than the joint probability of x2 according to the model parameters theta. Assume
x1, x2, and theta are PyTorch tensors of shape (1, D) and the compare function is called
with valid inputs only. Your function should be vectorized and work for any D for full
points. You can define additional functions to help structure your solution if needed.
Explain your answer.
Example Solution: Our code is shown below. First, the log_joint function assumes that
x and theta have the same shape and computes the log of the joint probability. This is
done for numerical stability reasons, so the model will not underflow when D becomes large.
Next, the function compare(x1,x2,theta) simply computes and compares the log of the
joint probability for each of the two values of x.
import torch

def log_joint(x, theta):
    # Log of the product-of-Bernoullis joint probability; summing log terms
    # avoids underflow when D is large.
    return torch.sum(torch.log(theta) * x + torch.log(1 - theta) * (1 - x))

def compare(x1, x2, theta):
    # True when the joint probability of x1 exceeds that of x2; log is
    # monotone, so comparing log joints is equivalent.
    return log_joint(x1, theta) > log_joint(x2, theta)
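A quick usage example with arbitrary values, relying on the compare function defined above:

# Illustrative check: under theta = 0.3 in every dimension, a vector with one
# active bit is more probable than a vector with three active bits.
theta = torch.full((1, 4), 0.3)
x1 = torch.tensor([[1., 0., 0., 0.]])   # P = 0.3 * 0.7^3
x2 = torch.tensor([[1., 1., 1., 0.]])   # P = 0.3^3 * 0.7
print(compare(x1, x2, theta))           # tensor(True)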
9. (10 points) Nonlinear Factor Analysis: Consider the non-linear factor analysis
model shown below where x ∈ RD and z ∈ RK . Suppose we wanted to re-formulate the
model using binary latent variables instead of real-valued latent variables. Answer the
following questions.
a. (5 pts) Provide an updated model definition where the K latent variables are binary.
Explain your answer.
Example Solution: To make the latent variables binary, we need to change the distribu-
tion p(Z = z) into a distribution over K binary variables. Since the original distribution
is a standard multivariate normal, which is equivalent to a product of univariate stan-
dard normal marginals, we change the distribution p(Z = z) into a product of standard
Bernoulli marginals as shown below. The rest of the model remains the same.
\[ P(Z = z) = \prod_{k=1}^{K} 0.5^{z_k}\, 0.5^{(1 - z_k)} = 0.5^K \]
b. (5 pts) Is your model learnable using direct negative log marginal likelihood mini-
mization? Explain your answer and give supporting equations.
Example Solution: Yes. The joint distribution now has discrete random variables that
can be marginalized out of the model using summation so long as pθ (X = x|Z = z) is
computable and K is not too large. We have the following NLML function:
\[ \mathrm{NLML}(\theta, D) = -\sum_{n=1}^{N} \log\!\left( \sum_{z_1=0}^{1} \cdots \sum_{z_K=0}^{1} 0.5^K\, \mathcal{N}(x_n; f_w(z), \Psi) \right) \]
To optimize the NLML, the parameters w of the generator have no constraints. We just
need to use a parameter transformation on Ψ to ensure it is a positive diagonal matrix.
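A sketch of this computation is shown below (the names nlml, f_w, and log_psi are illustrative; the generator is a placeholder linear layer, the data are synthetic, and all 2^K binary codes are enumerated, which is feasible only for small K):

import itertools
import math
import torch
import torch.nn as nn

# Minimal sketch: exact NLML for the binary-latent model by enumerating all
# 2^K codes. f_w stands in for the generator network and log_psi parameterizes
# the positive diagonal of Psi.
K, D, N = 3, 5, 100
f_w = nn.Linear(K, D)                         # stand-in for the generator f_w
log_psi = torch.zeros(D, requires_grad=True)  # Psi = diag(exp(log_psi)) > 0
X = torch.randn(N, D)                         # placeholder data

def nlml(X, f_w, log_psi):
    psi = torch.exp(log_psi)
    # All 2^K binary latent codes, shape (2^K, K).
    zs = torch.tensor(list(itertools.product([0.0, 1.0], repeat=K)))
    mus = f_w(zs)                             # (2^K, D) means, one per code z
    diff = X.unsqueeze(1) - mus.unsqueeze(0)  # (N, 2^K, D)
    log_gauss = (-0.5 * D * math.log(2 * math.pi)
                 - 0.5 * torch.sum(torch.log(psi))
                 - 0.5 * torch.sum(diff ** 2 / psi, dim=-1))  # (N, 2^K)
    log_joint = log_gauss + K * math.log(0.5)  # add log P(z) = K log 0.5
    # Marginalize z in log space, then sum negative log marginals over n.
    return -torch.sum(torch.logsumexp(log_joint, dim=1))

print(nlml(X, f_w, log_psi))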
10. (10 points) Mixture Models: Consider a mixture model with product of
Bernoulli mixture components as shown below. Assume x = [x1 , ..., xD ] ∈ {0, 1}^D. Using
marginalization and conditioning, derive an equation for predicting x1 given x2:D =
[x2 , ..., xD ] as input in terms of the parameters of this model. Show your work and explain
the steps of your derivation.
\[ P_\theta(X = x, Z = z) = P_\pi(Z = z)\, P_\phi(X = x \mid Z = z) \]
\[ P_\phi(X = x \mid Z = z) = \prod_{d=1}^{D} \phi_{dz}^{x_d} (1 - \phi_{dz})^{(1 - x_d)} \]
\[ P_\pi(Z = z) = \pi_z \]
Example Solution: To begin, we are looking for the distribution Pθ (X1 = x1 |X2:D =
x2:D ). We will predict 1 if Pθ (X1 = 1|X2:D = x2:D ) > 0.5 and 0 otherwise. We can
compute this probability distribution using conditioning and marginalization as follows.
First we apply the definition of conditional probability. Next, we expand the numerator
and denominator in terms of marginals over the joint probability distribution given by
the model. Finally, we write the required probability in terms of the model parameters.
\[ P_\theta(X_1 = 1 \mid X_{2:D} = x_{2:D}) = \frac{\sum_{z=1}^{K} P_\phi(X_1 = 1, X_{2:D} = x_{2:D} \mid Z = z)\, P_\pi(Z = z)}{\sum_{z'=1}^{K} \sum_{x'_1=0}^{1} P_\phi(X_1 = x'_1, X_{2:D} = x_{2:D} \mid Z = z')\, P_\pi(Z = z')} \]
\[ = \frac{\sum_{z=1}^{K} \pi_z\, \phi_{1z} \prod_{d=2}^{D} \phi_{dz}^{x_d} (1 - \phi_{dz})^{(1 - x_d)}}{\sum_{z'=1}^{K} \sum_{x'_1=0}^{1} \pi_{z'}\, \phi_{1z'}^{x'_1} (1 - \phi_{1z'})^{(1 - x'_1)} \prod_{d'=2}^{D} \phi_{d'z'}^{x_{d'}} (1 - \phi_{d'z'})^{(1 - x_{d'})}} \]
\[ = \frac{\sum_{z=1}^{K} \pi_z\, \phi_{1z} \prod_{d=2}^{D} \phi_{dz}^{x_d} (1 - \phi_{dz})^{(1 - x_d)}}{\sum_{z'=1}^{K} \pi_{z'} \prod_{d'=2}^{D} \phi_{d'z'}^{x_{d'}} (1 - \phi_{d'z'})^{(1 - x_{d'})} \sum_{x'_1=0}^{1} \phi_{1z'}^{x'_1} (1 - \phi_{1z'})^{(1 - x'_1)}} \]
Since \(\sum_{x'_1=0}^{1} \phi_{1z'}^{x'_1} (1 - \phi_{1z'})^{(1 - x'_1)} = (1 - \phi_{1z'}) + \phi_{1z'} = 1\), this simplifies to
\[ = \frac{\sum_{z=1}^{K} \pi_z\, \phi_{1z} \prod_{d=2}^{D} \phi_{dz}^{x_d} (1 - \phi_{dz})^{(1 - x_d)}}{\sum_{z'=1}^{K} \pi_{z'} \prod_{d'=2}^{D} \phi_{d'z'}^{x_{d'}} (1 - \phi_{d'z'})^{(1 - x_{d'})}}. \]
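This predictive rule can be written as a short PyTorch sketch (the function name and argument layout are illustrative assumptions; phi[d, z] stores P(X_d = 1 | Z = z), and the example parameters at the end are arbitrary):

import torch

# Minimal sketch: pi has shape (K,), phi has shape (D, K), and x_rest holds the
# observed values x_2, ..., x_D. Returns the predicted label for X_1 and the
# probability P(X_1 = 1 | X_{2:D} = x_{2:D}).
def predict_x1(x_rest, pi, phi):
    x = x_rest.unsqueeze(1)                              # (D-1, 1)
    lik = phi[1:] ** x * (1 - phi[1:]) ** (1 - x)        # (D-1, K) per-dimension terms
    per_z = torch.prod(lik, dim=0)                       # (K,) product over d = 2..D
    numer = torch.sum(pi * phi[0] * per_z)               # joint with X_1 = 1
    denom = torch.sum(pi * per_z)                        # marginal of X_{2:D}
    p = numer / denom
    return int(p > 0.5), p

K, D = 2, 4
pi = torch.tensor([0.6, 0.4])
phi = torch.rand(D, K)
print(predict_x1(torch.tensor([1.0, 0.0, 1.0]), pi, phi))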