Assignment 8 Solutions
Policies:
For all multiple-choice questions, note that multiple correct answers may exist. However, selecting
an incorrect option will cancel out a correct one. For example, if you select two answers, one
correct and one incorrect, you will receive zero points for that question. Similarly, if the number
of incorrect answers selected exceeds the correct ones, your score for that question will be zero.
Please note that it is not possible to receive negative marks. You must select all the correct
options to get full marks for the question.
While the syllabus initially indicated the need to submit a paragraph explaining the use of AI or
other resources in your assignments, this requirement no longer applies as we are now utilizing
eClass quizzes instead of handwritten submissions. Therefore, you are not required to submit any
explanation regarding the tools or resources (such as online tools or AI) used in completing this
quiz.
This PDF version of the questions has been provided for your convenience should you wish to print
them and work offline.
Only answers submitted through the eClass quiz system will be graded. Please do not
submit a written copy of your responses.
Question 1. [1 mark]
Let fˆERM be as defined in section 9.1.1 of the course notes. Which of the following is true?
b. It is always the case that fBayes outputs class 1 if fˆERM (x) ≥ 0.5, and class 0 otherwise.
d. fˆERM has the same closed-form solution as the ERM predictor for linear regression with the
squared loss.
Solution:
Answer: a.
a. True. The ERM predictor fˆERM(x) in logistic regression estimates the probability that a
data point x belongs to class 1. This is because the logistic function σ(x⊤w) outputs values
in the range [0, 1], which can be interpreted as probabilities.
b. False. In the binary classification setting fBayes(x) = arg max_{y∈Y} p(y|x). This is equivalent
to saying fBayes(x) = 1 if p(y = 1|x) ≥ 0.5 and fBayes(x) = 0 otherwise. Since fˆERM(x) is
only an estimate of p(y = 1|x), it is not always equal to p(y = 1|x), so thresholding fˆERM(x)
at 0.5 need not reproduce fBayes(x).
c. False. fBayes outputs class labels, while fˆERM outputs probabilities. The two are not the
same.
d. False. The ERM predictor for logistic regression does not have a closed-form solution and
typically requires iterative optimization algorithms such as gradient descent. On the other
hand, linear regression with squared loss has a closed-form solution.
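As a concrete illustration of point d, here is a minimal numpy sketch (the data, step size, and iteration count are made up for illustration): ordinary least squares is solved in closed form via the normal equations, while the logistic-regression ERM objective is minimized iteratively with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # bias column + 2 features
y = (rng.random(20) < 0.5).astype(float)                     # labels in {0, 1}

# Linear regression with the squared loss: closed form via the normal equations.
w_linear = np.linalg.solve(X.T @ X, X.T @ y)

# Logistic regression: no closed form, so minimize the binary cross-entropy
# iteratively with plain gradient descent instead.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])
for _ in range(500):
    p = sigmoid(X @ w)                 # current estimates of P(y = 1 | x)
    grad = X.T @ (p - y) / len(y)      # gradient of the average cross-entropy
    w -= 0.5 * grad                    # 0.5 is an arbitrary step size

print(w_linear, w)
```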
Question 2. [1 mark]
In class we used the following function class for logistic regression:
  F = { f | f : R^{d+1} → [0, 1], where f(x) = σ(x⊤w), and w ∈ R^{d+1} }.
Suppose that we would like to use a larger function class that contains polynomial features of the
input x. Recall that φp (x) is the degree p polynomial feature map of x. We define the new function
class as follows
  Fp = { f | f : R^{d+1} → [0, 1], where f(x) = σ(φp(x)⊤w), and w ∈ R^{p̄} },

where p̄ = (d+p choose p) is the number of features in the polynomial feature map φp(x). Is the following
statement true or false? For all p ∈ {2, 3, . . .} it holds that F ⊂ Fp ⊂ Fp+1.
Solution:
Answer: True.
Explanation: F is equivalent to F1, since φ1(x) contains exactly the linear features. The function class Fp
is a subset of Fp+1 because every feature of φp(x) also appears in φp+1(x): any f ∈ Fp can be written as a
member of Fp+1 by setting the weights on the extra degree-(p+1) features to zero. Therefore,
for each p ∈ {2, 3, . . .}, it follows that F ⊂ Fp ⊂ Fp+1.
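The zero-padding argument can be checked numerically. The sketch below assumes d = 2 and a hand-written feature ordering for φ1 and φ2 (the exact ordering in the course notes may differ); the point is only that a function in F is reproduced exactly inside F2 by putting zero weight on the quadratic features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-written degree-1 and degree-2 feature maps for a 2-dimensional input.
def phi1(x):
    x1, x2 = x
    return np.array([1.0, x1, x2])

def phi2(x):
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

x = np.array([0.3, -1.2])
w1 = np.array([0.5, -1.0, 2.0])           # some f in F (linear features only)
w2 = np.concatenate([w1, np.zeros(3)])    # same weights, zeros on the quadratic terms

# The two predictors agree for every x, so this f is also a member of F_2.
print(sigmoid(phi1(x) @ w1), sigmoid(phi2(x) @ w2))
```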
Question 3. [1 mark]
Let everything be as defined in the previous question. Let fˆERM,p(x) = σ(φp(x)⊤wERM,p) be the
ERM predictor for the function class Fp, where wERM,p is the minimizer of the estimated loss
(with the binary cross-entropy loss function). Is the following statement true or false? The binary
predictor fˆBin that outputs class 1 if fˆERM,p(x) ≥ 0.5 and class 0 otherwise can be equivalently
defined as

  fˆBin(x) = 1 if φp(x)⊤wERM,p ≥ 0, and fˆBin(x) = 0 otherwise.
Solution:
Answer: True.
Explanation: The sigmoid function σ(z) satisfies σ(z) ≥ 0.5 if and only if z ≥ 0. As a result,
fˆERM,p(x) ≥ 0.5 is equivalent to φp(x)⊤wERM,p ≥ 0. Thus, the two definitions of fˆBin are equivalent.
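A quick numerical check of this equivalence (the logit values below are arbitrary stand-ins for φp(x)⊤wERM,p):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(z) >= 0.5 exactly when z >= 0, so thresholding the probability at 0.5
# and thresholding the logit at 0 produce the same labels.
logits = np.linspace(-3.0, 3.0, 13)
labels_from_prob = (sigmoid(logits) >= 0.5).astype(int)
labels_from_logit = (logits >= 0.0).astype(int)
print(np.array_equal(labels_from_prob, labels_from_logit))  # True
```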
Question 4. [1 mark]
In class we worked out the MLE solution for binary classification. In this question we are going to
work out the MAP solution. Suppose that the setting is the same as in the MLE setting defined in
section 9.1 of the course notes. However, we will also assume that the weights w1∗ , . . . , wd∗ are i.i.d.
Gaussian random variables with mean 0 and variance 1/λ. The bias term w0∗ is also independent
of the other weights and has a uniform distribution on [−a, a], for a very large a. Which of the
following is equal to wMAP = arg max_{w∈R^{d+1}} p(w | D)?
a.

  arg min_{w∈R^{d+1}} − Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] + (λ/2) Σ_{j=1}^d wj²

b.

  arg min_{w∈R^{d+1}} ( − Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] ) · (λ/2) Σ_{j=1}^d wj²

c.

  arg min_{w∈R^{d+1}} Σ_{i=1}^n ( yi − σ(xi⊤w) )² + (λ/2) Σ_{j=1}^d wj²

d.

  arg min_{w∈R^{d+1}} − Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] + λ Σ_{j=1}^d wj log(wj)
Solution:
Answer: a.
To find the MAP estimate wMAP, we maximize the posterior probability p(w | D), which is
proportional to the product of the likelihood and the prior:

  p(w | D) ∝ p(D | w) p(w).
Prior:
The prior for wj (for j = 1, . . . , d) is:

  p(wj) = (√λ / √(2π)) exp( −(λ/2) wj² ).
Since w0 has a uniform prior over a very large interval, it contributes a constant to the posterior
and can be ignored in optimization.
The log-prior is:
  log p(w) = −(λ/2) Σ_{j=1}^d wj² + const.
Posterior:
Combining the log-likelihood and log-prior, the log-posterior (up to a constant) is:

  log p(w | D) = Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] − (λ/2) Σ_{j=1}^d wj².
MAP Estimate:
To find wMAP, we minimize the negative log-posterior:

  wMAP = arg min_{w∈R^{d+1}} − Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] + (λ/2) Σ_{j=1}^d wj²,

which is exactly option a.
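For reference, here is a minimal numpy sketch of this objective, i.e. the binary cross-entropy plus the L2 penalty from option a, with the bias w0 left unpenalized. The data, weights, and λ are synthetic placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior_gaussian(w, X, y, lam):
    """Binary cross-entropy plus the L2 penalty from option a (bias w_0 unpenalized)."""
    p = sigmoid(X @ w)
    nll = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return nll + 0.5 * lam * np.sum(w[1:] ** 2)

rng = np.random.default_rng(1)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 3))])  # bias column + 3 features
y = (rng.random(30) < 0.5).astype(float)
w = rng.normal(size=4)
print(neg_log_posterior_gaussian(w, X, y, lam=0.1))
```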
Question 5. [1 mark]
In this question, we are going to work out the MAP solution for binary classification using a Laplace
prior. Suppose that the setting is the same as in the MLE setting defined in section 9.1 of the course
notes. However, we will also assume that the weights w1∗ , . . . , wd∗ are i.i.d. Laplace random variables
with mean 0 and scale parameter 1/λ. The bias term w0∗ is also independent of the other weights
and has a uniform distribution on [−a, a] for a very large a.
Which of the following is equal to wMAP = arg max_{w∈R^{d+1}} p(w | D)?
a.

  arg min_{w∈R^{d+1}} − Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] + λ Σ_{j=1}^d |wj|

b.

  arg min_{w∈R^{d+1}} − Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] + λ Σ_{j=0}^d |wj|

c.

  arg min_{w∈R^{d+1}} Σ_{i=1}^n ( yi − σ(xi⊤w) )² + λ Σ_{j=1}^d |wj|

d.

  arg min_{w∈R^{d+1}} − Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] + λ Σ_{j=1}^d log(|wj|)
Solution:
Answer: a.
To find the MAP estimate wMAP, we need to maximize the posterior probability p(w | D), which
is proportional to the product of the likelihood and the prior:

  p(w | D) ∝ p(D | w) p(w).
Prior:
The prior for each wj (for j = 1, . . . , d) is a Laplace distribution:

  p(wj) = (λ/2) exp(−λ|wj|).
Since w0 has a uniform prior over a very large interval, its contribution to the posterior is constant
and can be ignored during optimization.
The log-prior is therefore:
  log p(w) = −λ Σ_{j=1}^d |wj| + const.
Posterior:
Combining the log-likelihood and log-prior, the log-posterior (up to a constant) is:

  log p(w | D) = Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] − λ Σ_{j=1}^d |wj|.
MAP Estimate:
To find wMAP, we minimize the negative log-posterior:

  wMAP = arg min_{w∈R^{d+1}} − Σ_{i=1}^n [ yi log σ(xi⊤w) + (1 − yi) log(1 − σ(xi⊤w)) ] + λ Σ_{j=1}^d |wj|,

which is exactly option a.
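The corresponding objective from option a, now with an L1 penalty, can be sketched the same way (synthetic data and λ again; note that |wj| is not differentiable at 0, so in practice one would use subgradient or proximal methods rather than plain gradient descent).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior_laplace(w, X, y, lam):
    """Binary cross-entropy plus the L1 penalty from option a (bias w_0 unpenalized)."""
    p = sigmoid(X @ w)
    nll = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return nll + lam * np.sum(np.abs(w[1:]))

rng = np.random.default_rng(2)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 3))])
y = (rng.random(30) < 0.5).astype(float)
w = rng.normal(size=4)
print(neg_log_posterior_laplace(w, X, y, lam=0.1))
```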
Question 6. [1 mark]
Suppose that things are as defined in the previous question. However, we assume the weights
w0∗ , w1∗ , . . . , wd∗ are all i.i.d. uniform random variables on [−a, a] for a very large a. Is the following
statement true or false? The MAP solution with this prior is equivalent to the MLE solution.
Solution:
Answer: True.
Explanation: Given that each weight wj (for j = 0, 1, . . . , d) is uniformly distributed over [−a, a],
the prior probability is:

  p(w) = p(w0) · p(w1) · · · p(wd) = (1/(2a))^{d+1}.
This prior is constant with respect to w.
The MAP estimate maximizes the posterior probability:

  wMAP = arg max_{w∈R^{d+1}} p(w | D) = arg max_{w∈R^{d+1}} p(D | w) p(w).

Since p(w) is constant, maximizing the posterior is equivalent to maximizing the likelihood:

  wMAP = arg max_{w∈R^{d+1}} p(D | w) = wMLE.
Question 7. [1 mark]
Let fˆMul be as defined in section 9.2 of the course notes. Which of the following is true?
a. fˆMul outputs a vector of probabilities where the y-th element is the probability of class y.
b. σ(x⊤wMLE,0, . . . , x⊤wMLE,K−1) outputs a vector of probabilities where the y-th element is
the probability of class y.
c. fˆERM as defined in section 9.2.1 of the course notes is approximately equal to fBayes .
Solution:
Answer: b., d.
a. False. fˆMul (x) ∈ Y, and Y is not a set of vectors.
c. False. fˆERM is a vector of probability estimates, while fBayes outputs class labels.
Question 8. [1 mark]
Let everything be as defined in the previous question. Which of the following is true?
a. If x⊤wMLE,y < x⊤wMLE,k for all k ≠ y, then fˆMul(x) = y.
b. If x⊤wMLE,y > x⊤wMLE,k for all k ≠ y, then fˆMul(x) = y.
Solution:
Answer: b., c., d.
b. True. fˆMul(x) is the y with the largest value of σy(x⊤wMLE,0, . . . , x⊤wMLE,K−1), which is
the same as the y with the largest value of x⊤wMLE,y.
d. True. If x⊤(wMLE,y − wMLE,k) = 0 then x⊤wMLE,y = x⊤wMLE,k. If you plug this into the def-
inition of p(y | x, wMLE,0, . . . , wMLE,K−1) you will see that it is equal to p(k | x, wMLE,0, . . . , wMLE,K−1).
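A small numeric sketch of these facts, using a hand-written softmax and made-up scores standing in for x⊤wMLE,0, . . . , x⊤wMLE,K−1: the softmax output is a probability vector, its argmax matches the argmax of the scores, and equal scores give equal probabilities.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.2, -0.3, 2.5, 2.5])        # made-up values of x^T w_MLE,k for K = 4
probs = softmax(scores)

print(probs, probs.sum())                        # a probability vector that sums to 1
print(np.argmax(probs) == np.argmax(scores))     # True: softmax preserves the ordering
print(np.isclose(probs[2], probs[3]))            # True: equal scores give equal probabilities
```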
Question 9. [1 mark]
Let Y = {0, 1} be the set of labels. Define the following two functions:

  h(y) = (1, 0)⊤ if y = 0, and h(y) = (0, 1)⊤ if y = 1,
  r(ŷ) = (1 − ŷ, ŷ)⊤ for ŷ ∈ (0, 1).

Is the following statement true or false? For all y ∈ Y and ŷ ∈ (0, 1), the binary cross-entropy loss
ℓbinary(ŷ, y) is equal to the multiclass cross-entropy loss ℓmulticlass(r(ŷ), h(y)).
Solution:
Answer: True.
Explanation: To verify the equivalence of the binary cross-entropy loss and the multiclass cross-
entropy loss under the given label mappings h(y) and r(ŷ), we analyze both loss functions in the
context of the provided definitions.
The binary cross-entropy loss for binary classification with labels y ∈ {0, 1} and predictions ŷ ∈
(0, 1) is defined as:
  ℓbinary(ŷ, y) = − [ y log(ŷ) + (1 − y) log(1 − ŷ) ].
On the other hand, the multiclass cross-entropy loss for multiclass classification with one-hot en-
coded labels h(y) ∈ {(1, 0)⊤, (0, 1)⊤} and predictions r(ŷ) ∈ (0, 1)² is defined as:

  ℓmulticlass(r(ŷ), h(y)) = − Σ_{j=1}^2 hj(y) log(rj(ŷ)),
where hj (y) and rj (ŷ) denote the j-th components of the vectors h(y) and r(ŷ), respectively.
Given the label mapping:

  h(y) = (1, 0)⊤ if y = 0, and h(y) = (0, 1)⊤ if y = 1,

and the prediction mapping:

  r(ŷ) = (1 − ŷ, ŷ)⊤,
we can simplify the multiclass cross-entropy loss based on the value of y:
Case 1: y = 0
When y = 0, the label mapping yields h(y) = (1, 0)⊤. Substituting into the multiclass loss function:

  ℓmulticlass(r(ŷ), h(0)) = − [ h1(0) log(r1(ŷ)) + h2(0) log(r2(ŷ)) ] = − log(r1(ŷ)) = − log(1 − ŷ) = ℓbinary(ŷ, 0).
Case 2: y = 1
When y = 1, the label mapping yields h(y) = (0, 1)⊤. Substituting into the multiclass loss function:

  ℓmulticlass(r(ŷ), h(1)) = − [ h1(1) log(r1(ŷ)) + h2(1) log(r2(ŷ)) ] = − log(r2(ŷ)) = − log(ŷ) = ℓbinary(ŷ, 1).
In both cases, the multiclass cross-entropy loss ℓmulticlass(r(ŷ), h(y)) simplifies to the binary cross-
entropy loss ℓbinary(ŷ, y). This demonstrates that under the specified label and prediction mappings,
the two loss functions yield identical values for all y ∈ Y and ŷ ∈ (0, 1).
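This equivalence is easy to confirm numerically; the sketch below just evaluates both losses on a few made-up values of ŷ.

```python
import numpy as np

def l_binary(yhat, y):
    return -(y * np.log(yhat) + (1.0 - y) * np.log(1.0 - yhat))

def l_multiclass(r, h):
    return -np.sum(h * np.log(r))

def h(y):
    return np.array([1.0, 0.0]) if y == 0 else np.array([0.0, 1.0])

def r(yhat):
    return np.array([1.0 - yhat, yhat])

# The two losses coincide for every label and every prediction tried here.
for y in (0, 1):
    for yhat in (0.1, 0.5, 0.9):
        assert np.isclose(l_binary(yhat, y), l_multiclass(r(yhat), h(y)))
print("losses match")
```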
Solution:
Answer: False. As shown in section 9.2.2 of the course notes, wMLE = wMLE,1 − wMLE,0.
Solution:
Answer: 141
Explanation: Each activation (other than the biases and the zeroth layer) has a weight vector
associated with it. Thus there are

  50 + 40 + 30 + 20 + 1 = 141

weight vectors in total.
Solution:
Answer: 8961
Explanation:
First layer: 50 · 101 = 5050 weights (50 neurons with 100 input features and 1 bias term).
Second layer: 40 · 51 = 2040 weights (40 neurons with 50 input features and 1 bias term).
Third layer: 30 · 41 = 1230 weights (30 neurons with 40 input features and 1 bias term).
Fourth layer: 20 · 31 = 620 weights (20 neurons with 30 input features and 1 bias term).
Fifth layer: 1 · 21 = 21 weights (1 neuron with 20 input features and 1 bias term).
Total number of weights: 5050 + 2040 + 1230 + 620 + 21 = 8961.
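Both counts can be reproduced with a few lines of Python; the layer sizes below are taken from the worked solution above (100 input features, hidden layers of 50, 40, 30, 20 neurons, and one output neuron).

```python
# Layer sizes from the solution above: 100 inputs, then 50, 40, 30, 20, 1 neurons.
layer_sizes = [100, 50, 40, 30, 20, 1]

# One weight vector per neuron (the input layer has none).
num_weight_vectors = sum(layer_sizes[1:])

# Each neuron has one weight per input from the previous layer plus one bias.
num_weights = sum(n_out * (n_in + 1)
                  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(num_weight_vectors, num_weights)  # 141 8961
```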
Solution:
Answer: True.
Explanation: The tanh function is a valid activation function because it is differentiable every-
where and is a function from R to (−1, 1) ⊂ R.
b. The neural network has B = 1 layer, d(1) = 1 neuron, activation h = σ, and one weight vector
w1(1) = w.
c. The neural network has B = 1 layer, d(1) = 1 neuron, activation h(z) = z, and one weight
vector w1(1) = w.
d. The neural network has B = 2 layers, d(1) = 1, d(2) = 1 neurons, activation h = σ in the first
layer, activation h(z) = z in the second layer, and two weight vectors w1(1) = w, w1(2) = (0, 1)⊤.
Solution:
Answer: b., d.
a. False. The neural network would output f(x) = σ((1, σ(x⊤w))⊤(1, 1)⊤) = σ(1 + σ(x⊤w)) ≠ σ(x⊤w).
b. True. A single neuron with sigmoid activation outputs f(x) = σ(x⊤w1(1)) = σ(x⊤w).
c. False. With the identity activation the network outputs f(x) = x⊤w, not σ(x⊤w).
d. True. The neural network would output f(x) = (1, σ(x⊤w))⊤(0, 1)⊤ = σ(x⊤w).
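A quick check of options a and d (option a's output formula is taken from the explanation above, since the option text itself is not reproduced here; x and w are made-up values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.4, -0.7])   # made-up input with a leading bias feature
w = np.array([0.2, -1.0, 0.5])   # made-up weight vector

a = sigmoid(x @ w)                                            # first-layer activation
out_d = np.array([1.0, a]) @ np.array([0.0, 1.0])             # option d: identity output layer
out_a = sigmoid(np.array([1.0, a]) @ np.array([1.0, 1.0]))    # option a, per the formula above

print(np.isclose(out_d, sigmoid(x @ w)))  # True: option d recovers sigma(x^T w)
print(np.isclose(out_a, sigmoid(x @ w)))  # False
```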
Suppose you get a feature vector x = (1, −1, 1)⊤. What is f(x)?
Solution:
Answer: 2
Explanation: The activations for the first layer are:
  a1(1) = ReLU(x⊤w1(1)) = ReLU(1 · 1 + (−1) · 1 + 1 · 1) = ReLU(1) = 1
  a2(1) = ReLU(x⊤w2(1)) = ReLU(1 · (−1) + (−1) · (−1) + 1 · (−1)) = ReLU(−1) = 0
  a3(1) = ReLU(x⊤w3(1)) = ReLU(1 · (−1) + (−1) · 0 + 1 · 1) = ReLU(0) = 0

Thus, the activation vector for the first layer is a(1) = (1, 1, 0, 0)⊤ (the leading 1 is the bias term).
The activations for the second layer are:
  a1(2) = ReLU((a(1))⊤w1(2)) = ReLU(1 · 1 + 1 · 1 + 0 · 1 + 0 · 1) = ReLU(2) = 2

Thus, f(x) = a1(2) = 2.
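The same forward pass in numpy, with the weight vectors read off the arithmetic above (the original question statement is not reproduced here, so these weights are inferred from the worked computation):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

x = np.array([1.0, -1.0, 1.0])

# First-layer weight vectors, read off the arithmetic above.
W1 = np.array([[ 1.0,  1.0,  1.0],    # w_1^(1)
               [-1.0, -1.0, -1.0],    # w_2^(1)
               [-1.0,  0.0,  1.0]])   # w_3^(1)
w2 = np.array([1.0, 1.0, 1.0, 1.0])   # w_1^(2), applied to (1, a_1, a_2, a_3)

a1 = relu(W1 @ x)                          # first-layer activations: [1, 0, 0]
a1_with_bias = np.concatenate([[1.0], a1]) # prepend the bias term
f_x = relu(a1_with_bias @ w2)              # second layer
print(f_x)                                 # 2.0
```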