Final Exam Solutions
Instructions: Write your name on the exam sheet. No electronic devices may be used
during the exam. You may consult your paper notes and/or a print copy of texts during
the exam. Sharing of notes/texts during the exam is strictly prohibited. Show your work
for all derivation questions. A suitable explanation must be provided for all questions to
earn full credit. Attempt all problems. Partial credit may be given for partially incorrect
or incomplete answers. If your answer to a question spans two pages, please make sure
to indicate that at the end of the first page of the answer. If you have questions at any
time, please raise your hand.
Points: Each of the 10 problems is worth 10 points. Total: 100 points.
1. (10 points) Optimization: The following questions concern numerical optimization.
a. (5 pts) Explain the primary difference between using gradient descent and stochastic
gradient descent to minimize the risk of a supervised machine learning model. Provide
supporting equations.
Example Solution: In gradient descent, we use all of the training data to compute the
value of the gradient of the risk
\[ R(\theta, D) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, f_\theta(x_n)) \]
with respect to the model parameters θ. This gives a gradient vector of the form
\[ \nabla R(\theta, D) = \frac{1}{N}\sum_{n=1}^{N} \nabla \ell(y_n, f_\theta(x_n)). \]
The main difference with stochastic gradient descent is that we use a subset (or batch) of
data cases of size B < N to estimate the gradient. Letting \(\mathcal{B}\) be the set of data case
indices in a batch, we have
\[ \nabla R(\theta, D) \approx \frac{1}{B}\sum_{n \in \mathcal{B}} \nabla \ell(y_n, f_\theta(x_n)). \]
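To make the contrast concrete, the following PyTorch sketch (not part of the original solution; the linear model, squared-error loss, synthetic data, and batch size B = 32 are arbitrary illustrative choices) computes a full-batch gradient and a mini-batch gradient estimate of the same risk:

import torch

# Minimal sketch: full-batch vs. mini-batch gradients of the empirical risk.
# The linear model, squared-error loss, and batch size B are illustrative.
N, D, B = 1000, 5, 32
X, y = torch.randn(N, D), torch.randn(N)
theta = torch.zeros(D, requires_grad=True)

def risk(Xb, yb, theta):
    # Average loss over whichever batch of data is passed in.
    return torch.mean((Xb @ theta - yb) ** 2)

# Gradient descent: the gradient uses all N training cases.
full_grad = torch.autograd.grad(risk(X, y, theta), theta)[0]

# Stochastic gradient descent: the gradient is estimated from a random batch of size B < N.
idx = torch.randperm(N)[:B]
batch_grad = torch.autograd.grad(risk(X[idx], y[idx], theta), theta)[0]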
b. (5 pts) Give two (2) reasons why stochastic gradient descent is usually preferred to
gradient descent when learning supervised neural network models.
Example Solution: The first reason why stochastic gradient descent is usually preferred
is speed. Using B < N requires less computation per optimization iteration. This usually
results in model learning taking much less total time compared to using all of the data.
The second reason is memory usage. Using B < N requires much less memory during
backpropagation compared to using all of the data. This can make it possible to efficiently
learn models using GPUs even when all of the data won’t fit in GPU memory at once.
2. (10 points) Capacity Control: List four (4) different approaches to prevent
overfitting when learning neural network models and briefly describe each of them.
Example Solution:
1. Weight decay (or regularization): We can add a weight decay term λ∥w∥₂² to the
optimization objective function. This can prevent the weights from becoming large
and therefore keep the learned function from becoming too complex.
2. Dropout: We can add dropout layers to the model during training. This sets hidden
unit values to zero at random with a specified probability and has been shown
experimentally to reduce overfitting.
3. Early Stopping: We can use a validation set to determine how long to run learning
for. When the validation set error stops decreasing, we can stop learning. If the
weights are initialized to small values, this generally stops the weights from becoming
large and results in less complex functions.
4. Data augmentation: We can enlarge the training set by applying label-preserving
transformations to the inputs (for example, small shifts or rotations of images). Training
on the augmented data makes it harder for the model to memorize individual training
cases and reduces overfitting. A minimal PyTorch sketch of some of these techniques is
shown below.
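The sketch below (illustrative only; the layer sizes, dropout probability, and weight decay coefficient are arbitrary assumptions) shows how weight decay and dropout would typically be specified, with early stopping left to the training loop:

import torch
import torch.nn as nn

# Minimal sketch: weight decay and dropout for a small MLP (layer sizes are
# arbitrary assumptions). Early stopping would monitor validation risk in the
# training loop and halt when it stops decreasing.
model = nn.Sequential(
    nn.Linear(900, 100), nn.ReLU(),
    nn.Dropout(p=0.5),          # randomly zeroes hidden units during training
    nn.Linear(100, 10),
)
# weight_decay adds the lambda * ||w||^2 penalty to the optimization objective.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)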
3. (10 points) Model Architectures: Suppose we are building a model for classifying
grayscale handwritten digits of size 30 × 30. There are 10 classes. Use this information
to answer the following questions. Show your work and explain your answers. It is
acceptable to leave final answers in the form of an arithmetic expression (e.g., 5 · 10 + 3, etc).
a. (5 pts) Suppose we use a two-hidden-layer MLP where the first hidden layer has 100
hidden units and the second hidden layer has 50 hidden units. Assume all hidden units
have bias parameters. How many total parameters does the model have?
Example Solution: The input layer will have size 30 × 30 = 900. The first hidden layer
has size 100. Each hidden unit in the first hidden layer will thus have 900 input weights,
plus one bias parameter. This yields a total of 901 × 100 parameters in the input to first
hidden layer. The second hidden layer has 50 hidden units, each with 100 inputs from the
first hidden layer, and a bias. This gives 101 × 50 parameters. The output layer has one
output per class, each with 50 inputs and a bias. This yields 51 × 10 parameters. The
total number of parameters needed in the network is thus 901 × 100 + 101 × 50 + 51 × 10.
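This count can be verified with a short PyTorch sketch (illustrative; it simply mirrors the layer sizes described above and sums parameter counts):

import torch.nn as nn

# Minimal sketch: the MLP described above (900 inputs, hidden layers of 100 and
# 50 units with biases, 10 outputs). Summing parameter sizes reproduces the count.
mlp = nn.Sequential(
    nn.Linear(900, 100), nn.ReLU(),
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, 10),
)
total = sum(p.numel() for p in mlp.parameters())
print(total)  # 901*100 + 101*50 + 51*10 = 95660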
b. (5 pts) Suppose we use a CNN where the input is followed immediately by a
feature extraction section consisting of two Conv2D → ReLU → MaxPool blocks. If
the first block uses 10 output channels with a kernel size of 4 × 4 and the second block
uses 20 output channels with kernels of size 4 × 4, how many parameters are in the
feature extraction section of the model? Assume that the Conv2D operation includes the
addition of a bias term for each output channel.
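A hand count for this part can be checked with the PyTorch sketch below (illustrative only; it assumes a single grayscale input channel, and the max-pooling size, which contributes no parameters, is chosen arbitrarily):

import torch.nn as nn

# Minimal sketch: two Conv2D -> ReLU -> MaxPool blocks as described above,
# assuming 1 grayscale input channel. Only the Conv2d layers carry parameters.
features = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(10, 20, kernel_size=4), nn.ReLU(), nn.MaxPool2d(2),
)
total = sum(p.numel() for p in features.parameters())
print(total)  # (4*4*1 + 1)*10 + (4*4*10 + 1)*20 = 170 + 3220 = 3390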
4. (10 points) Maximum Likelihood Estimation: The univariate normal distribution
has an alternative parameterization in terms of a parameter τ > 0 as shown below.
Use this information to answer the following questions.
\[ \mathcal{N}(x; \mu, \tau) = \frac{\sqrt{\tau}}{\sqrt{2\pi}} \exp\!\left(-\frac{\tau}{2}(x - \mu)^2\right) \]
a. (5 pts) Write down the negative log likelihood function for this model assuming a
data set containing N instances D = {x1 , ..., xN }. Simplify to the extent possible.
Example Solution: For this model, the negative log likelihood is the average over a data
set of size N of the negative of the log of the probability density function. This gives:
\[ \mathrm{nll}(\theta, D) = -\frac{1}{N}\sum_{n=1}^{N}\left(\frac{1}{2}\log(\tau) - \frac{1}{2}\log(2\pi) - \frac{\tau}{2}(x_n - \mu)^2\right) \]
b. (5 pts) Assume the value of the mean µ is known. Derive the maximum likelihood
estimate of τ . Show your work and explain the steps in your solution.
Example Solution: We first find the partial derivative of the nll with respect to τ:
\[ \frac{\partial}{\partial \tau}\mathrm{nll}(\theta, D) = \frac{\partial}{\partial \tau}\left(-\frac{1}{N}\sum_{n=1}^{N}\left(\frac{1}{2}\log(\tau) - \frac{1}{2}\log(2\pi) - \frac{\tau}{2}(x_n - \mu)^2\right)\right) = -\frac{1}{2N}\sum_{n=1}^{N}\left(\frac{1}{\tau} - (x_n - \mu)^2\right) \]
We now set this expression equal to zero and solve to identify the stationary points of the
unconstrained NLL:
\[ -\frac{1}{2N}\sum_{n=1}^{N}\left(\frac{1}{\tau} - (x_n - \mu)^2\right) = 0 \]
\[ \frac{1}{\tau} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2 \]
\[ \hat{\tau} = \frac{1}{\frac{1}{N}\sum_{n=1}^{N}(x_n - \mu)^2} \]
Next, we need to check that the value found is indeed a minimizer of the unconstrained
NLL by checking that the second derivative is positive. The second derivative of the NLL is
\[ \frac{\partial^2}{\partial \tau^2}\mathrm{nll}(\theta, D) = -\frac{1}{2N}\sum_{n=1}^{N}(-1)\frac{1}{\tau^2} = \frac{1}{2\tau^2}. \]
This term is clearly positive for any τ, so the stationary point is a minimizer.
We next need to check that the identified minimizer of the unconstrained NLL τ̂ satisfies
the constraint τ̂ > 0. We can see that (x_n − µ)² ≥ 0 for all n. We assume N > 0, so the
data set is non-empty, and that at least one x_n ≠ µ. The denominator is then an average
of non-negative values with at least one strictly positive term, so the denominator is
strictly positive and τ̂ > 0. τ̂ is thus a minimizer of the NLL subject to the positivity
constraint on τ.
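As a sanity check on this derivation, the closed-form estimate can be compared against a direct numerical minimization of the NLL (a sketch with arbitrary synthetic data; the optimizer settings are illustrative):

import math
import torch

# Minimal sketch: compare the closed-form MLE of tau with a numerical minimizer
# of the NLL (mu is assumed known; the synthetic data below are arbitrary).
torch.manual_seed(0)
mu = 2.0
x = mu + 0.5 * torch.randn(1000)

tau_hat = 1.0 / torch.mean((x - mu) ** 2)        # closed-form estimate

log_tau = torch.zeros(1, requires_grad=True)     # optimize log(tau) so tau > 0
opt = torch.optim.Adam([log_tau], lr=0.05)
for _ in range(2000):
    tau = torch.exp(log_tau)
    nll = torch.mean(-0.5 * torch.log(tau) + 0.5 * math.log(2 * math.pi)
                     + 0.5 * tau * (x - mu) ** 2)
    opt.zero_grad()
    nll.backward()
    opt.step()

print(tau_hat, torch.exp(log_tau))               # the two values should agree closely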
5. (10 points) Probabilistic Classification: Consider the following attempt at
defining a multi-class probabilistic classification model. Let C be the number of classes.
Assume that x = [x1 , ..., xD ] ∈ RD is a length D row vector. Assume each wc =
[w1c ; ...; wDc ] ∈ RD is a length D column vector. Assume each bc ∈ R is a scalar value.
Let θ be the collection of model parameters. Answer the following questions.
\[ P_\theta(Y = y \mid X = x) = \prod_{c=1}^{C} \left( \frac{x w_c + b_c}{\sum_{c'=1}^{C} x w_{c'} + b_{c'}} \right)^{\mathbb{1}[c = y]} \qquad (1) \]
b. (5 pts) What conditions could we impose on the data and model parameters to
make this a valid probabilistic classification model? Explain how these conditions fix the
problem with the model. Note: your answer cannot involve changing the definition of
the conditional probability model Pθ (Y = y|X = x).
Example Solution: We can fix this problem by requiring that the data be non-negative
reals and the parameters be non-negative reals. This will ensure that the numerator
terms x w_c + b_c are all greater than or equal to 0. As a result, the denominator will also
be greater than or equal to 0 and non-negativity will be satisfied. Normalization will also
be satisfied so long as the x w_c + b_c values are not all equal to 0, since the denominator
is then strictly positive and the class probabilities sum to 1.
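A short PyTorch sketch (with arbitrary non-negative values) illustrates that under these conditions the scores defined by Equation (1) are non-negative and normalize to 1:

import torch

# Minimal sketch: with non-negative x, w_c, and b_c, the scores x w_c + b_c are
# non-negative, and dividing by their sum yields a valid distribution over the
# C classes (the values below are arbitrary non-negative choices).
C, D = 3, 4
x = torch.rand(1, D)          # non-negative data
W = torch.rand(D, C)          # non-negative weights, one column per class
b = torch.rand(C)             # non-negative biases
scores = x @ W + b            # shape (1, C), all entries >= 0
probs = scores / scores.sum()
print(probs, probs.sum())     # non-negative entries that sum to 1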
6. (10 points) Probabilistic Regression: The von Mises distribution given below
provides a probability density over angles y measured in radians. The parameters are
θ = [κ, µ] where κ ∈ R>0 is the concentration parameter and µ ∈ R is the location
parameter. The function I() provides part of the normalization term for the model. Use
this distribution to answer the following questions.
\[ p_\theta(Y = y) = \frac{1}{2\pi I(\kappa)} \exp(\kappa \cdot \cos(y - \mu)) \qquad (2) \]
a. (5 pts) Write down a probabilistic regression model based on the von Mises distribu-
tion where both the concentration and location depend on x ∈ RD . Explain your choices.
Example Solution: Let κ(x) = exp(xw + b) and µ(x) = xv + c. Here the model parameters
are θ = [w, v, b, c] ∈ R^{2D+2}. The parameter prediction function κ(x) ensures the predicted
κ values are strictly positive since exp(z) is strictly positive for all finite z. The location
µ(x) needs no constraint since µ ∈ R, so a linear function suffices. This gives us the model:
\[ p_\theta(Y = y \mid X = x) = \frac{1}{2\pi I(\kappa(x))} \exp(\kappa(x) \cdot \cos(y - \mu(x))) \]
b. (5 pts) Write down the negative log likelihood for the model you define in part (a).
Simplify to the extent possible. Show your work.
Example Solution: The negative log likelihood for this model is the negative of the
average over a data set of size N of the log of pθ (Y = yn |X = xn ). We have:
\[ \mathrm{nll}(\theta, D) = -\frac{1}{N}\sum_{n=1}^{N} \log p_\theta(Y = y_n \mid X = x_n) \]
\[ = -\frac{1}{N}\sum_{n=1}^{N} \left( -\log(2\pi I(\kappa(x_n))) + \kappa(x_n) \cdot \cos(y_n - \mu(x_n)) \right) \]
\[ = \frac{1}{N}\sum_{n=1}^{N} \left( \log(2\pi I(\kappa(x_n))) - \kappa(x_n) \cdot \cos(y_n - \mu(x_n)) \right) \]
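If I(κ) is taken to be the modified Bessel function I₀ (an assumption, since the question only states that I() provides part of the normalization term), the NLL above can be written directly in PyTorch as a sketch:

import torch

# Minimal sketch of the NLL from part (b), assuming I(kappa) is the modified
# Bessel function I_0 (available as torch.special.i0). w, v, b, c are the
# parameters of the prediction functions kappa(x) and mu(x) from part (a).
def von_mises_nll(X, y, w, v, b, c):
    kappa = torch.exp(X @ w + b)                 # strictly positive concentration
    mu = X @ v + c                               # unconstrained location
    log_norm = torch.log(2 * torch.pi * torch.special.i0(kappa))
    return torch.mean(log_norm - kappa * torch.cos(y - mu))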
7. (10 points) Experiment Design: Suppose we are training a one hidden layer
neural network classifier. Answer the following questions.
a. (5 pts) Is it a valid experiment design to evaluate several learning rates for a given
model architecture and select the learning rate that yields the lowest training risk?
Explain your answer.
Example Solution: Yes. Configuring the learning rate with the model architecture fixed
is configuring an optimizer hyper-parameter. We can select optimizer hyper-parameters
to minimize the training risk because we are only trying to find values that produce good
solutions to the optimization problem we are solving, and that optimization problem is
precisely the minimization of the training set risk.
b. (5 pts) Is it a valid experiment design to evaluate several hidden layer sizes and
select the hidden layer size that yields the lowest training set risk? Explain your answer.
Example Solution: No. The hidden layer size is a model complexity hyper-parameter.
To select a value for it, we need to use a validation set performance metric to avoid
over-fitting.
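A minimal sketch of this workflow (the candidate sizes, input/output dimensions, and the train_model and validation_risk routines are placeholder assumptions, stubbed out here for illustration only):

import torch
import torch.nn as nn

# Minimal sketch: model complexity hyper-parameters such as the hidden layer
# size are selected using validation risk, not training risk. The training and
# validation routines below are placeholder stubs.
def train_model(model):
    pass                                         # stand-in for minimizing training risk

def validation_risk(model):
    return torch.rand(1).item()                  # stand-in for risk on held-out data

best_size, best_val = None, float("inf")
for hidden_size in [25, 50, 100, 200]:           # candidate hidden layer sizes
    model = nn.Sequential(nn.Linear(900, hidden_size), nn.ReLU(),
                          nn.Linear(hidden_size, 10))
    train_model(model)
    val = validation_risk(model)
    if val < best_val:
        best_size, best_val = hidden_size, val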
8. (10 points) Product of Marginals: Consider the product of Bernoulli marginals
model Pθ(X = x) = ∏_{d=1}^{D} θ_d^{x_d} (1 − θ_d)^{(1−x_d)}. Provide a PyTorch implementation of a
function compare(x1,x2,theta) that will return true when the joint probability of x1 is
greater than the joint probability of x2 according to the model parameters theta. Assume
x1, x2, and theta are PyTorch tensors of shape (1, D) and the compare function is called
with valid inputs only. Your function should be vectorized and work for any D for full
points. You can define additional functions to help structure your solution if needed.
Explain your answer.
Example Solution: Our code is shown below. First, the log_joint function assumes that
x and theta have the same shape and computes the log of the joint probability. This is
done for numerical stability reasons, so the model will not underflow when D becomes large.
Next, the function compare(x1,x2,theta) simply computes and compares the log of the
joint probability for each of the two values of x.
import torch

def log_joint(x, theta):
    # Log of the product-of-Bernoullis joint probability; summing log terms
    # avoids underflow when D is large.
    return torch.sum(torch.log(theta) * x + torch.log(1 - theta) * (1 - x))

def compare(x1, x2, theta):
    # True when the joint probability of x1 exceeds that of x2; log is
    # monotone, so comparing log joints is equivalent.
    return log_joint(x1, theta) > log_joint(x2, theta)
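A quick usage example with arbitrary values, relying on the compare function defined above:

# Illustrative check: under theta = 0.3 in every dimension, a vector with one
# active bit is more probable than a vector with three active bits.
theta = torch.full((1, 4), 0.3)
x1 = torch.tensor([[1., 0., 0., 0.]])   # P = 0.3 * 0.7^3
x2 = torch.tensor([[1., 1., 1., 0.]])   # P = 0.3^3 * 0.7
print(compare(x1, x2, theta))           # tensor(True)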
9. (10 points) Nonlinear Factor Analysis: Consider the non-linear factor analysis
model shown below where x ∈ RD and z ∈ RK . Suppose we wanted to re-formulate the
model using binary latent variables instead of real-valued latent variables. Answer the
following questions.
a. (5 pts) Provide an updated model definition where the K latent variables are binary.
Explain your answer.
Example Solution: To make the latent variables binary, we need to change the distribu-
tion p(Z = z) into a distribution over K binary variables. Since the original distribution
is a standard multivariate normal, which is equivalent to a product of univariate stan-
dard normal marginals, we change the distribution p(Z = z) into a product of standard
Bernoulli marginals as shown below. The rest of the model remains the same.
\[ P(Z = z) = \prod_{k=1}^{K} 0.5^{z_k}\, 0.5^{(1 - z_k)} = 0.5^K \]
b. (5 pts) Is your model learnable using direct negative log marginal likelihood mini-
mization? Explain your answer and give supporting equations.
Example Solution: Yes. The joint distribution now has discrete random variables that
can be marginalized out of the model using summation so long as pθ (X = x|Z = z) is
computable and K is not too large. We have the following NLML function:
\[ \mathrm{NLML}(\theta, D) = -\sum_{n=1}^{N} \log\!\left( \sum_{z_1=0}^{1} \cdots \sum_{z_K=0}^{1} 0.5^K\, \mathcal{N}(x_n; f_w(z), \Psi) \right) \]
To optimize the NLML, the parameters w of the generator have no constraints. We just
need to use a parameter transformation on Ψ to ensure it is a positive diagonal matrix.
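A sketch of this computation is shown below (the names nlml, f_w, and log_psi are illustrative; the generator is a placeholder linear layer, the data are synthetic, and all 2^K binary codes are enumerated, which is feasible only for small K):

import itertools
import math
import torch
import torch.nn as nn

# Minimal sketch: exact NLML for the binary-latent model by enumerating all
# 2^K codes. f_w stands in for the generator network and log_psi parameterizes
# the positive diagonal of Psi.
K, D, N = 3, 5, 100
f_w = nn.Linear(K, D)                         # stand-in for the generator f_w
log_psi = torch.zeros(D, requires_grad=True)  # Psi = diag(exp(log_psi)) > 0
X = torch.randn(N, D)                         # placeholder data

def nlml(X, f_w, log_psi):
    psi = torch.exp(log_psi)
    # All 2^K binary latent codes, shape (2^K, K).
    zs = torch.tensor(list(itertools.product([0.0, 1.0], repeat=K)))
    mus = f_w(zs)                             # (2^K, D) means, one per code z
    diff = X.unsqueeze(1) - mus.unsqueeze(0)  # (N, 2^K, D)
    log_gauss = (-0.5 * D * math.log(2 * math.pi)
                 - 0.5 * torch.sum(torch.log(psi))
                 - 0.5 * torch.sum(diff ** 2 / psi, dim=-1))  # (N, 2^K)
    log_joint = log_gauss + K * math.log(0.5)  # add log P(z) = K log 0.5
    # Marginalize z in log space, then sum negative log marginals over n.
    return -torch.sum(torch.logsumexp(log_joint, dim=1))

print(nlml(X, f_w, log_psi))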
10. (10 points) Mixture Models: Consider a mixture model with product of
Bernoulli mixture components as shown below. Assume x = [x1 , ..., xD ] ∈ {0, 1}^D. Using
marginalization and conditioning, derive an equation for predicting x1 given x2:D =
[x2 , ..., xD ] as input in terms of the parameters of this model. Show your work and explain
the steps of your derivation.
\[ P_\theta(X = x, Z = z) = P_\pi(Z = z)\, P_\phi(X = x \mid Z = z) \]
\[ P_\phi(X = x \mid Z = z) = \prod_{d=1}^{D} \phi_{dz}^{x_d} (1 - \phi_{dz})^{(1 - x_d)} \]
\[ P_\pi(Z = z) = \pi_z \]
Example Solution: To begin, we are looking for the distribution Pθ (X1 = x1 |X2:D =
x2:D ). We will predict 1 if Pθ (X1 = 1|X2:D = x2:D ) > 0.5 and 0 otherwise. We can
compute this probability distribution using conditioning and marginalization as follows.
First we apply the definition of conditional probability. Next, we expand the numerator
and denominator in terms of marginals over the joint probability distribution given by
the model. Finally, we write the required probability in terms of the model parameters.
\[ P_\theta(X_1 = 1 \mid X_{2:D} = x_{2:D}) = \frac{\sum_{z=1}^{K} P_\phi(X_1 = 1, X_{2:D} = x_{2:D} \mid Z = z)\, P_\pi(Z = z)}{\sum_{z'=1}^{K} \sum_{x'_1=0}^{1} P_\phi(X_1 = x'_1, X_{2:D} = x_{2:D} \mid Z = z')\, P_\pi(Z = z')} \]
\[ = \frac{\sum_{z=1}^{K} \pi_z\, \phi_{1z} \prod_{d=2}^{D} \phi_{dz}^{x_d} (1 - \phi_{dz})^{(1 - x_d)}}{\sum_{z'=1}^{K} \sum_{x'_1=0}^{1} \pi_{z'}\, \phi_{1z'}^{x'_1} (1 - \phi_{1z'})^{(1 - x'_1)} \prod_{d'=2}^{D} \phi_{d'z'}^{x_{d'}} (1 - \phi_{d'z'})^{(1 - x_{d'})}} \]
\[ = \frac{\sum_{z=1}^{K} \pi_z\, \phi_{1z} \prod_{d=2}^{D} \phi_{dz}^{x_d} (1 - \phi_{dz})^{(1 - x_d)}}{\sum_{z'=1}^{K} \pi_{z'} \prod_{d'=2}^{D} \phi_{d'z'}^{x_{d'}} (1 - \phi_{d'z'})^{(1 - x_{d'})} \sum_{x'_1=0}^{1} \phi_{1z'}^{x'_1} (1 - \phi_{1z'})^{(1 - x'_1)}} \]
Since \(\sum_{x'_1=0}^{1} \phi_{1z'}^{x'_1} (1 - \phi_{1z'})^{(1 - x'_1)} = (1 - \phi_{1z'}) + \phi_{1z'} = 1\), this simplifies to
\[ = \frac{\sum_{z=1}^{K} \pi_z\, \phi_{1z} \prod_{d=2}^{D} \phi_{dz}^{x_d} (1 - \phi_{dz})^{(1 - x_d)}}{\sum_{z'=1}^{K} \pi_{z'} \prod_{d'=2}^{D} \phi_{d'z'}^{x_{d'}} (1 - \phi_{d'z'})^{(1 - x_{d'})}}. \]
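This predictive rule can be written as a short PyTorch sketch (the function name and argument layout are illustrative assumptions; phi[d, z] stores P(X_d = 1 | Z = z), and the example parameters at the end are arbitrary):

import torch

# Minimal sketch: pi has shape (K,), phi has shape (D, K), and x_rest holds the
# observed values x_2, ..., x_D. Returns the predicted label for X_1 and the
# probability P(X_1 = 1 | X_{2:D} = x_{2:D}).
def predict_x1(x_rest, pi, phi):
    x = x_rest.unsqueeze(1)                              # (D-1, 1)
    lik = phi[1:] ** x * (1 - phi[1:]) ** (1 - x)        # (D-1, K) per-dimension terms
    per_z = torch.prod(lik, dim=0)                       # (K,) product over d = 2..D
    numer = torch.sum(pi * phi[0] * per_z)               # joint with X_1 = 1
    denom = torch.sum(pi * per_z)                        # marginal of X_{2:D}
    p = numer / denom
    return int(p > 0.5), p

K, D = 2, 4
pi = torch.tensor([0.6, 0.4])
phi = torch.rand(D, K)
print(predict_x1(torch.tensor([1.0, 0.0, 1.0]), pi, phi))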