Homework Assignment 8

Due: Friday, December 6, 2024, 11:59 p.m. Mountain time


Total marks: 15

Policies:
For all multiple-choice questions, note that multiple correct answers may exist. However, selecting
an incorrect option will cancel out a correct one. For example, if you select two answers, one
correct and one incorrect, you will receive zero points for that question. Similarly, if the number
of incorrect answers selected exceeds the correct ones, your score for that question will be zero.
Please note that it is not possible to receive negative marks. You must select all the correct
options to get full marks for the question.
While the syllabus initially indicated the need to submit a paragraph explaining the use of AI or
other resources in your assignments, this requirement no longer applies as we are now utilizing
eClass quizzes instead of handwritten submissions. Therefore, you are not required to submit any
explanation regarding the tools or resources (such as online tools or AI) used in completing this
quiz.
This PDF version of the questions has been provided for your convenience should you wish to print
them and work offline.
Only answers submitted through the eClass quiz system will be graded. Please do not
submit a written copy of your responses.

Question 1. [1 mark]
Let f̂_ERM be as defined in section 9.1.1 of the course notes. Which of the following is true?

a. We can think of f̂_ERM(x) as predicting the probability of x belonging to class 1.

b. It is always the case that f_Bayes outputs class 1 if f̂_ERM(x) ≥ 0.5, and class 0 otherwise.

c. f_Bayes is equal to f̂_ERM.

d. f̂_ERM has the same closed-form solution as the ERM predictor for linear regression with the squared loss.

Solution:
Answer: a.

a. True. The ERM predictor f̂_ERM(x) in logistic regression estimates the probability that a data point x belongs to class 1. This is because the logistic function σ(x^⊤ w) outputs values in the range [0, 1], which can be interpreted as probabilities.

b. False. In the binary classification setting f_Bayes(x) = arg max_{y ∈ Y} p(y | x), which is equivalent to saying f_Bayes(x) = 1 if p(y = 1 | x) ≥ 0.5 and f_Bayes(x) = 0 otherwise. Since f̂_ERM(x) is only an estimate of p(y = 1 | x), it is not always the case that f̂_ERM(x) = p(y = 1 | x), so thresholding f̂_ERM(x) at 0.5 need not reproduce f_Bayes.

c. False. f_Bayes outputs class labels, while f̂_ERM outputs probabilities. The two are not the same.


d. False. The ERM predictor for logistic regression does not have a closed-form solution and
typically requires iterative optimization algorithms such as gradient descent. On the other
hand, linear regression with squared loss has a closed-form solution.

Question 2. [1 mark]
In class we used the following function class for logistic regression:

F = { f | f : R^{d+1} → [0, 1], where f(x) = σ(x^⊤ w), and w ∈ R^{d+1} }.

Suppose that we would like to use a larger function class that contains polynomial features of the input x. Recall that φ_p(x) is the degree-p polynomial feature map of x. We define the new function class as follows:

F_p = { f | f : R^{d+1} → [0, 1], where f(x) = σ(φ_p(x)^⊤ w), and w ∈ R^{p̄} },

where p̄ = (d+p choose p) is the number of features in the polynomial feature map φ_p(x). Is the following statement true or false? For all p ∈ {2, 3, . . .} it holds that F ⊂ F_p ⊂ F_{p+1}.

Solution:
Answer: True.
Explanation: F is equivalent to F_1 since it uses linear features. The function class F_p is a subset of F_{p+1} because the monomials of degree at most p are contained among those of degree at most p+1: any f ∈ F_p can be written as an element of F_{p+1} by setting the weights on the extra, higher-degree features to zero. Therefore, for each p, it follows that F ⊂ F_p ⊂ F_{p+1}.

Question 3. [1 mark]
Let everything be as defined in the previous question. Let f̂_ERM,p(x) = σ(φ_p(x)^⊤ w_ERM,p) be the ERM predictor for the function class F_p, where w_ERM,p is the minimizer of the estimated loss (with the binary cross-entropy loss function). Is the following statement true or false? The binary predictor f̂_Bin that outputs class 1 if f̂_ERM,p(x) ≥ 0.5 and class 0 otherwise can be equivalently defined as

f̂_Bin(x) = 1 if φ_p(x)^⊤ w_ERM,p ≥ 0, and f̂_Bin(x) = 0 otherwise.

Solution:
Answer: True.
Explanation: The sigmoid function σ(z) satisfies σ(z) ≥ 0.5 if and only if z ≥ 0. As a result, f̂_ERM,p(x) ≥ 0.5 is equivalent to φ_p(x)^⊤ w_ERM,p ≥ 0. Thus, the two definitions of f̂_Bin are equivalent.
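A tiny numeric illustration of the threshold equivalence used here (a sketch; the z values below are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(z) >= 0.5)   # [False False  True  True  True]
print(z >= 0)              # same pattern: thresholding sigma(z) at 0.5 matches thresholding z at 0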

Question 4. [1 mark]
In class we worked out the MLE solution for binary classification. In this question we are going to work out the MAP solution. Suppose that the setting is the same as in the MLE setting defined in section 9.1 of the course notes. However, we will also assume that the weights w_1^*, . . . , w_d^* are i.i.d. Gaussian random variables with mean 0 and variance 1/λ. The bias term w_0^* is also independent of the other weights and has a uniform distribution on [−a, a], for a very large a. Which of the following is equal to w_MAP = arg max_{w ∈ R^{d+1}} p(w | D)?


a. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + (λ/2) Σ_{j=1}^d w_j² ]

b. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) · (λ/2) Σ_{j=1}^d w_j² ]

c. arg min_{w ∈ R^{d+1}} [ Σ_{i=1}^n ( y_i − σ(x_i^⊤ w) )² + (λ/2) Σ_{j=1}^d w_j² ]

d. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=1}^d w_j log(w_j) ]

Solution:
Answer: a.
To find the MAP estimate w_MAP, we maximize the posterior probability p(w | D), which is proportional to the product of the likelihood and the prior:

p(w | D) ∝ p(D | w) p(w).


Likelihood:
For binary classification using logistic regression, the likelihood is:

p(D | w) = Π_{i=1}^n σ(x_i^⊤ w)^{y_i} (1 − σ(x_i^⊤ w))^{1−y_i},

where σ(z) = 1/(1 + e^{−z}) is the sigmoid function.
Taking the logarithm, we get the log-likelihood:

log p(D | w) = Σ_{i=1}^n [ y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ].

Prior:
The prior for w_j (for j = 1, . . . , d) is:

p(w_j) = √(λ/(2π)) exp(−(λ/2) w_j²).

Since w_0 has a uniform prior over a very large interval, it contributes a constant to the posterior and can be ignored in optimization.
The log-prior is:


log p(w) = −(λ/2) Σ_{j=1}^d w_j² + const.

Posterior:
Combining the log-likelihood and log-prior, the log-posterior (up to a constant) is:

log p(w | D) = log p(D | w) + log p(w)
             = Σ_{i=1}^n [ y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ] − (λ/2) Σ_{j=1}^d w_j² + const.

MAP Estimate:
To find w_MAP, we minimize the negative log-posterior:

w_MAP = arg min_{w ∈ R^{d+1}} [ −log p(w | D) ]
      = arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + (λ/2) Σ_{j=1}^d w_j² ].
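As a concrete illustration of minimizing this objective, here is a minimal gradient-descent sketch; it is not taken from the course notes, and the dataset, regularization strength, step size, and iteration count are made-up placeholders. As in the derivation above, the bias w_0 is left unpenalized.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior_l2(w, X, y, lam):
    # Logistic loss plus (lam/2) * sum_{j>=1} w_j^2; the bias w_0 is not penalized.
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + 0.5 * lam * np.sum(w[1:] ** 2)

def gradient_l2(w, X, y, lam):
    grad = X.T @ (sigmoid(X @ w) - y)                    # gradient of the logistic loss
    return grad + lam * np.concatenate(([0.0], w[1:]))   # no penalty gradient for the bias

# Tiny made-up dataset; the first column of X is the bias feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)

w, lam, step = np.zeros(3), 1.0, 0.05
for _ in range(1000):
    w -= step * gradient_l2(w, X, y, lam)
print(w, neg_log_posterior_l2(w, X, y, lam))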

Question 5. [1 mark]
In this question, we are going to work out the MAP solution for binary classification using a Laplace prior. Suppose that the setting is the same as in the MLE setting defined in section 9.1 of the course notes. However, we will also assume that the weights w_1^*, . . . , w_d^* are i.i.d. Laplace random variables with mean 0 and scale parameter 1/λ. The bias term w_0^* is also independent of the other weights and has a uniform distribution on [−a, a] for a very large a.
Which of the following is equal to w_MAP = arg max_{w ∈ R^{d+1}} p(w | D)?

a. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=1}^d |w_j| ]

b. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=0}^d |w_j| ]

c. arg min_{w ∈ R^{d+1}} [ Σ_{i=1}^n ( y_i − σ(x_i^⊤ w) )² + λ Σ_{j=1}^d |w_j| ]

d. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=1}^d log(|w_j|) ]


Solution:
Answer: a.
To find the MAP estimate w_MAP, we need to maximize the posterior probability p(w | D), which is proportional to the product of the likelihood and the prior:

p(w | D) ∝ p(D | w) p(w).


Likelihood:
For binary classification using logistic regression, the likelihood is:

p(D | w) = Π_{i=1}^n σ(x_i^⊤ w)^{y_i} (1 − σ(x_i^⊤ w))^{1−y_i},

where σ(z) = 1/(1 + e^{−z}) is the sigmoid function.

Taking the logarithm, we obtain the log-likelihood:

log p(D | w) = Σ_{i=1}^n [ y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ].

Prior:
The prior for each w_j (for j = 1, . . . , d) is a Laplace distribution:

p(w_j) = (λ/2) exp(−λ |w_j|).

Since w_0 has a uniform prior over a very large interval, its contribution to the posterior is constant and can be ignored during optimization.
The log-prior is therefore:

log p(w) = −λ Σ_{j=1}^d |w_j| + const.

Posterior:
Combining the log-likelihood and log-prior, the log-posterior (up to a constant) is:

log p(w | D) = log p(D | w) + log p(w)
             = Σ_{i=1}^n [ y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ] − λ Σ_{j=1}^d |w_j| + const.

MAP Estimate:
To find w_MAP, we minimize the negative log-posterior:

w_MAP = arg min_{w ∈ R^{d+1}} [ −log p(w | D) ]
      = arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=1}^d |w_j| ].
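Compared with the sketch after Question 4, only the penalty term changes; a minimal (hypothetical) objective is shown below with a tiny made-up dataset. Because |w_j| is not differentiable at 0, this objective is typically minimized with subgradient or proximal methods rather than plain gradient descent.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior_l1(w, X, y, lam):
    # Logistic loss plus lam * sum_{j>=1} |w_j|; the bias w_0 is not penalized.
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * np.sum(np.abs(w[1:]))

X = np.column_stack([np.ones(4), np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0], [2.0, 1.0]])])
y = np.array([1.0, 0.0, 0.0, 1.0])
print(neg_log_posterior_l1(np.zeros(3), X, y, lam=0.5))   # 4 * log(2) at w = 0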


Question 6. [1 mark]
Suppose that things are as defined in the previous question. However, we assume the weights w_0^*, w_1^*, . . . , w_d^* are all i.i.d. uniform random variables on [−a, a] for a very large a. Is the following statement true or false? The MAP solution with this prior is equivalent to the MLE solution.

Solution:
Answer: True.
Explanation: Given that each weight w_j (for j = 0, 1, . . . , d) is uniformly distributed over [−a, a], the prior is:

p(w) = p(w_0) · p(w_1) ⋯ p(w_d) = (1/(2a))^{d+1}.

This prior is constant with respect to w (on its support [−a, a]^{d+1}, which contains the maximizer for a large enough a).
The MAP estimate maximizes the posterior probability:

w_MAP = arg max_w p(w | D) = arg max_w p(D | w) p(w).

Since p(w) is constant, maximizing the posterior is equivalent to maximizing the likelihood:

w_MAP = arg max_w p(D | w) = w_MLE.

Therefore, the MAP estimate coincides with the MLE estimate when using a uniform prior over a very large interval.

Question 7. [1 mark]
Let f̂_Mul be as defined in section 9.2 of the course notes. Which of the following is true?

a. f̂_Mul outputs a vector of probabilities where the y-th element is the probability of class y.

b. σ(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1) outputs a vector of probabilities where the y-th element is the probability of class y.

c. f̂_ERM as defined in section 9.2.1 of the course notes is approximately equal to f_Bayes.

d. There is no closed-form solution for w_MLE,k for any k.

Solution:
Answer: b., d.

a. False. f̂_Mul(x) ∈ Y, and Y is not a set of vectors.

b. True. By the definition of the MLE solution.

c. False. f̂_ERM is a vector of probability estimates, while f_Bayes outputs class labels.

d. True. Mentioned in the course notes.

Question 8. [1 mark]
Let everything be as defined in the previous question. Which of the following is true?


a. If x^⊤ w_MLE,y < x^⊤ w_MLE,k for all k ≠ y, then f̂_Mul(x) = y.

b. If x^⊤ w_MLE,y > x^⊤ w_MLE,k for all k ≠ y, then f̂_Mul(x) = y.

c. If σ_y(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1) > 0.5, then f̂_Mul(x) = y.

d. If x^⊤ (w_MLE,y − w_MLE,k) = 0, then

p(y | x, w_MLE,0, . . . , w_MLE,K−1) = p(k | x, w_MLE,0, . . . , w_MLE,K−1).

Solution:
Answer: b., c., d.

a. False. See explanation for b.

b. True. f̂_Mul(x) is the y with the largest value of σ_y(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1), which is the same as the y with the largest value of x^⊤ w_MLE,y.

c. True. If σ_y(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1) > 0.5, then, because the entries of the probability vector sum to 1,

σ_y(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1) > σ_k(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1)

for all k ≠ y, so f̂_Mul(x) = y.

d. True. If x^⊤ (w_MLE,y − w_MLE,k) = 0 then x^⊤ w_MLE,y = x^⊤ w_MLE,k. If you plug this into the definition of p(y | x, w_MLE,0, . . . , w_MLE,K−1) you will see that it is equal to p(k | x, w_MLE,0, . . . , w_MLE,K−1).
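A small numeric check of claims b. and d., assuming σ here denotes the usual softmax and using made-up logits in place of x^⊤ w_MLE,k:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

logits = np.array([1.0, 3.0, 3.0, -2.0])   # stand-ins for x^T w_MLE,k, k = 0, ..., 3
probs = softmax(logits)
print(probs.argmax(), probs)
# The largest probability occurs at an index with the largest logit (claim b), and the two
# equal logits (indices 1 and 2) receive exactly equal probabilities (claim d).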

Question 9. [1 mark]
Let Y = {0, 1} be the set of labels. Define the following two label functions:

h(y) = (1, 0)^⊤ if y = 0, and h(y) = (0, 1)^⊤ if y = 1.

r(ŷ) = (1 − ŷ, ŷ)^⊤.

Is the following statement true or false? For any ŷ ∈ (0, 1) and y ∈ Y (where (0, 1) is the open interval from 0 to 1), the binary cross-entropy loss with input ŷ and y is equal to the multiclass cross-entropy loss with input r(ŷ) and h(y).

Solution:
Answer: True.
Explanation: To verify the equivalence of the binary cross-entropy loss and the multiclass cross-
entropy loss under the given label mappings h(y) and r(ŷ), we analyze both loss functions in the
context of the provided definitions.
The binary cross-entropy loss for binary classification with labels y ∈ {0, 1} and predictions ŷ ∈ (0, 1) is defined as:

ℓ_binary(ŷ, y) = −[ y log(ŷ) + (1 − y) log(1 − ŷ) ].


On the other hand, the multiclass cross-entropy loss for multiclass classification with one-hot encoded labels h(y) ∈ {(1, 0)^⊤, (0, 1)^⊤} and predictions r(ŷ) ∈ (0, 1)² is defined as:

ℓ_multiclass(r(ŷ), h(y)) = −Σ_{j=1}^2 h_j(y) log(r_j(ŷ)),

where h_j(y) and r_j(ŷ) denote the j-th components of the vectors h(y) and r(ŷ), respectively.
Given the label mapping h(y) = (1, 0)^⊤ if y = 0 and h(y) = (0, 1)^⊤ if y = 1, and the prediction mapping r(ŷ) = (1 − ŷ, ŷ)^⊤, we can simplify the multiclass cross-entropy loss based on the value of y.

Case 1: y = 0
When y = 0, the label mapping yields h(y) = (1, 0)^⊤. Substituting into the multiclass loss function:

ℓ_multiclass(r(ŷ), h(0)) = −[ h_1(0) log(r_1(ŷ)) + h_2(0) log(r_2(ŷ)) ] = −log(r_1(ŷ)).

Given r(ŷ) = (1 − ŷ, ŷ)^⊤, we have r_1(ŷ) = 1 − ŷ. Therefore:

ℓ_multiclass(r(ŷ), h(0)) = −log(1 − ŷ) = ℓ_binary(ŷ, 0).

Case 2: y = 1
When y = 1, the label mapping yields h(y) = (0, 1)^⊤. Substituting into the multiclass loss function:

ℓ_multiclass(r(ŷ), h(1)) = −[ h_1(1) log(r_1(ŷ)) + h_2(1) log(r_2(ŷ)) ] = −log(r_2(ŷ)).

Given r(ŷ) = (1 − ŷ, ŷ)^⊤, we have r_2(ŷ) = ŷ. Therefore:

ℓ_multiclass(r(ŷ), h(1)) = −log(ŷ) = ℓ_binary(ŷ, 1).

In both cases, the multiclass cross-entropy loss ℓ_multiclass(r(ŷ), h(y)) simplifies to the binary cross-entropy loss ℓ_binary(ŷ, y). This demonstrates that under the specified label and prediction mappings, the two loss functions yield identical values for all y ∈ Y and ŷ ∈ (0, 1).
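The derivation can also be spot-checked numerically; a minimal sketch with a few arbitrary (ŷ, y) pairs:

import numpy as np

def bce(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_ce(r, h):
    return -np.sum(h * np.log(r))

for y in (0, 1):
    h = np.array([1.0, 0.0]) if y == 0 else np.array([0.0, 1.0])
    for y_hat in (0.1, 0.5, 0.9):
        r = np.array([1.0 - y_hat, y_hat])
        assert np.isclose(bce(y_hat, y), multiclass_ce(r, h))
print("binary and multiclass cross-entropy agree on all tested (y_hat, y) pairs")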

Question 10. [1 mark]


Suppose you are in the binary classification setting as defined in section 9.1 of the course notes and you solve for w_MLE. Now suppose that the setting is the multiclass classification setting (with K = 2) as defined in section 9.2 of the course notes and you solve for w_MLE,0, w_MLE,1. Is the following true or false? The solution for w_MLE in the binary classification setting is the same as the solution for w_MLE,1 in the multiclass classification setting.

Solution:
Answer: False. As shown in section 9.2.2 of the course notes, w_MLE = w_MLE,1 − w_MLE,0.
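A quick numeric illustration of the identity behind this answer (with made-up vectors): for K = 2, the softmax probability of class 1 depends only on the difference of the two weight vectors, which is why w_MLE corresponds to w_MLE,1 − w_MLE,0 rather than to w_MLE,1 alone.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w0, w1 = rng.normal(size=4), rng.normal(size=4)

z = np.array([x @ w0, x @ w1])
softmax_p1 = np.exp(z[1]) / np.exp(z).sum()              # two-class softmax, probability of class 1
sigmoid_of_diff = 1.0 / (1.0 + np.exp(-x @ (w1 - w0)))   # sigmoid applied to the difference
print(softmax_p1, sigmoid_of_diff)                       # identical up to floating-point error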

Question 11. [1 mark]


You are designing a neural network architecture for a binary classification problem. You decide to have B = 5 layers and d^(1) = 50, d^(2) = 40, d^(3) = 30, d^(4) = 20, d^(5) = 1 neurons in each layer respectively. The input dimension is d = 100. How many weight vectors are there in the network?


Solution:
Answer: 141
Explanation: Each activation (other than the biases and the zeroth layer) has a weight vector
associated with it. Thus there are

50 + 40 + 30 + 20 + 1 = 141

weight vectors in the network.

Question 12. [1 mark]


Let everything be as defined in the previous question. If you sum up the dimension of all the weight
vectors in the neural network you get the number of weights in the network. How many weights
are there in the network?

Solution:
Answer: 8961
Explanation:
First layer: 50 · 101 = 5050 weights (50 neurons with 100 input features and 1 bias term).
Second layer: 40 · 51 = 2040 weights (40 neurons with 50 input features and 1 bias term).
Third layer: 30 · 41 = 1230 weights (30 neurons with 40 input features and 1 bias term).
Fourth layer: 20 · 31 = 620 weights (20 neurons with 30 input features and 1 bias term).
Fifth layer: 1 · 21 = 21 weights (1 neuron with 20 input features and 1 bias term).
Total number of weights: 5050 + 2040 + 1230 + 620 + 21 = 8961.
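Both counts (141 weight vectors from Question 11 and 8961 weights here) can be reproduced with a short sketch:

layer_sizes = [100, 50, 40, 30, 20, 1]   # input dimension d, then d^(1), ..., d^(5)

num_weight_vectors = sum(layer_sizes[1:])            # one weight vector per neuron
num_weights = sum(out * (prev + 1)                   # each neuron sees prev inputs plus a bias
                  for prev, out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(num_weight_vectors, num_weights)               # 141 8961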

Question 13. [1 mark]


The tanh function is defined as tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}), where z ∈ R. Is the following statement true or false? The tanh function is a valid activation function.

Solution:
Answer: True.
Explanation: The tanh function is a valid activation function because it is differentiable every-
where and is a function from R to (−1, 1) ⊂ R.

Question 14. [1 mark]


The logistic function is defined as σ(z) = 1/(1 + e^{−z}), where z ∈ R. You would like to define the function f(x) = σ(x^⊤ w), where x ∈ R^{d+1} and w ∈ R^{d+1}, as a neural network. Which of the following is correct?

a. The neural network has B = 2 layers, d^(1) = 1, d^(2) = 1 neurons, activation h = σ, and two weight vectors w_1^(1) = w, w_1^(2) = (1, 1)^⊤.

b. The neural network has B = 1 layer, d^(1) = 1 neuron, activation h = σ, and one weight vector w_1^(1) = w.

c. The neural network has B = 1 layer, d^(1) = 1 neuron, activation h(z) = z, and one weight vector w_1^(1) = w.

d. The neural network has B = 2 layers, d^(1) = 1, d^(2) = 1 neurons, activation h = σ in the first layer, activation h(z) = z in the second layer, and two weight vectors w_1^(1) = w, w_1^(2) = (0, 1)^⊤.


Solution:
Answer: b., d.

a. False. The neural network would output f(x) = σ((1, σ(x^⊤ w))^⊤ (1, 1)^⊤) = σ(1 + σ(x^⊤ w)) ≠ σ(x^⊤ w).

b. True. The neural network would output f(x) = σ(x^⊤ w).

c. False. The neural network would output f(x) = x^⊤ w.

d. True. The neural network would output f(x) = (1, σ(x^⊤ w))^⊤ (0, 1)^⊤ = σ(x^⊤ w).
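A small numeric check of the outputs claimed for options a. and d., with made-up x and w, assuming the convention that a bias entry of 1 is prepended to the hidden layer's activation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = np.concatenate(([1.0], rng.normal(size=3)))   # x in R^{d+1}, first entry is the bias feature
w = rng.normal(size=4)

target = sigmoid(x @ w)                                                    # f(x) = sigma(x^T w)
option_a = sigmoid(np.array([1.0, sigmoid(x @ w)]) @ np.array([1.0, 1.0]))
option_d = np.array([1.0, sigmoid(x @ w)]) @ np.array([0.0, 1.0])
print(target, option_a, option_d)   # option d reproduces the target; option a does not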

Question 15. [1 mark]


You have a neural network f with B = 2 layers and d^(1) = 3, d^(2) = 1 neurons in each layer respectively. The input dimension is d = 2. You choose to use the ReLU activation function, defined as ReLU(z) = max(0, z), where z ∈ R. The weight vectors have the following values:

w_1^(1) = (1, 1, 1)^⊤   w_2^(1) = (−1, −1, −1)^⊤   w_3^(1) = (−1, 0, 1)^⊤   w_1^(2) = (1, 1, 1, 1)^⊤

Suppose you get a feature vector x = (1, −1, 1)^⊤. What is f(x)?

Solution:
Answer: 2
Explanation: The activations for the first layer are:

a_1^(1) = ReLU(x^⊤ w_1^(1)) = ReLU(1 · 1 + (−1) · 1 + 1 · 1) = ReLU(1) = 1
a_2^(1) = ReLU(x^⊤ w_2^(1)) = ReLU(1 · (−1) + (−1) · (−1) + 1 · (−1)) = ReLU(−1) = 0
a_3^(1) = ReLU(x^⊤ w_3^(1)) = ReLU(1 · (−1) + (−1) · 0 + 1 · 1) = ReLU(0) = 0

Thus, the activation vector for the first layer is a^(1) = (1, 1, 0, 0)^⊤, where the leading 1 is the bias entry.
The activation for the second layer is:

a_1^(2) = ReLU((a^(1))^⊤ w_1^(2)) = ReLU(1 · 1 + 1 · 1 + 0 · 1 + 0 · 1) = ReLU(2) = 2

Thus, f(x) = a_1^(2) = 2.
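The arithmetic above can be reproduced with a minimal forward-pass sketch (assuming, as in the solution, that a bias entry of 1 is prepended to the first layer's activations):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([1.0, -1.0, 1.0])
W1 = np.array([[1.0, 1.0, 1.0],      # w_1^(1)
               [-1.0, -1.0, -1.0],   # w_2^(1)
               [-1.0, 0.0, 1.0]])    # w_3^(1)
w2 = np.array([1.0, 1.0, 1.0, 1.0])  # w_1^(2)

a1 = np.concatenate(([1.0], relu(W1 @ x)))   # a^(1) = (1, 1, 0, 0)
print(relu(a1 @ w2))                         # f(x) = 2.0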
