Homework Assignment 8

Due: Friday, December 6, 2024, 11:59 p.m. Mountain time


Total marks: 15

Policies:
For all multiple-choice questions, note that multiple correct answers may exist. However, selecting
an incorrect option will cancel out a correct one. For example, if you select two answers, one
correct and one incorrect, you will receive zero points for that question. Similarly, if the number
of incorrect answers selected exceeds the correct ones, your score for that question will be zero.
Please note that it is not possible to receive negative marks. You must select all the correct
options to get full marks for the question.
While the syllabus initially indicated the need to submit a paragraph explaining the use of AI or
other resources in your assignments, this requirement no longer applies as we are now utilizing
eClass quizzes instead of handwritten submissions. Therefore, you are not required to submit any
explanation regarding the tools or resources (such as online tools or AI) used in completing this
quiz.
This PDF version of the questions has been provided for your convenience should you wish to print
them and work offline.
Only answers submitted through the eClass quiz system will be graded. Please do not
submit a written copy of your responses.

Question 1. [1 mark]
Let f̂_ERM be as defined in section 9.1.1 of the course notes. Which of the following is true?

a. We can think of f̂_ERM(x) as predicting the probability of x belonging to class 1.

b. It is always the case that f_Bayes outputs class 1 if f̂_ERM(x) ≥ 0.5, and class 0 otherwise.

c. f_Bayes is equal to f̂_ERM.

d. f̂_ERM has the same closed-form solution as the ERM predictor for linear regression with the squared loss.

Solution:
Answer: a.

a. True. The ERM predictor f̂_ERM(x) in logistic regression estimates the probability that a data point x belongs to class 1. This is because the logistic function σ(x^⊤ w) outputs values in the range [0, 1], which can be interpreted as probabilities.

b. False. In the binary classification setting f_Bayes(x) = arg max_{y ∈ Y} p(y | x), which is equivalent to saying f_Bayes(x) = 1 if p(y = 1 | x) ≥ 0.5 and f_Bayes(x) = 0 otherwise. Since f̂_ERM(x) is only an estimate of p(y = 1 | x), it is not always the case that f̂_ERM(x) = p(y = 1 | x), so thresholding f̂_ERM(x) at 0.5 need not reproduce f_Bayes.

c. False. f_Bayes outputs class labels, while f̂_ERM outputs probabilities. The two are not the same.


d. False. The ERM predictor for logistic regression does not have a closed-form solution and
typically requires iterative optimization algorithms such as gradient descent. On the other
hand, linear regression with squared loss has a closed-form solution.

Question 2. [1 mark]
In class we used the following function class for logistic regression:

F = { f | f : R^{d+1} → [0, 1], where f(x) = σ(x^⊤ w), and w ∈ R^{d+1} }.

Suppose that we would like to use a larger function class that contains polynomial features of the input x. Recall that φ_p(x) is the degree-p polynomial feature map of x. We define the new function class as follows:

F_p = { f | f : R^{d+1} → [0, 1], where f(x) = σ(φ_p(x)^⊤ w), and w ∈ R^{p̄} },

where p̄ = (d+p choose p) is the number of features in the polynomial feature map φ_p(x). Is the following statement true or false? For all p ∈ {2, 3, . . .} it holds that F ⊂ F_p ⊂ F_{p+1}.

Solution:
Answer: True.
Explanation: F is equivalent to F_1 since it uses linear features. The function class F_p is a subset of F_{p+1} because the monomials of degree at most p are contained among those of degree at most p+1: any f ∈ F_p can be written as an element of F_{p+1} by setting the weights on the extra, higher-degree features to zero. Therefore, for each p, it follows that F ⊂ F_p ⊂ F_{p+1}.

Question 3. [1 mark]
Let everything be as defined in the previous question. Let f̂_ERM,p(x) = σ(φ_p(x)^⊤ w_ERM,p) be the ERM predictor for the function class F_p, where w_ERM,p is the minimizer of the estimated loss (with the binary cross-entropy loss function). Is the following statement true or false? The binary predictor f̂_Bin that outputs class 1 if f̂_ERM,p(x) ≥ 0.5 and class 0 otherwise can be equivalently defined as

f̂_Bin(x) = 1 if φ_p(x)^⊤ w_ERM,p ≥ 0, and f̂_Bin(x) = 0 otherwise.

Solution:
Answer: True.
Explanation: The sigmoid function σ(z) satisfies σ(z) ≥ 0.5 if and only if z ≥ 0. As a result, f̂_ERM,p(x) ≥ 0.5 is equivalent to φ_p(x)^⊤ w_ERM,p ≥ 0. Thus, the two definitions of f̂_Bin are equivalent.
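A tiny numeric illustration of the threshold equivalence used here (a sketch; the z values below are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(sigmoid(z) >= 0.5)   # [False False  True  True  True]
print(z >= 0)              # same pattern: thresholding sigma(z) at 0.5 matches thresholding z at 0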

Question 4. [1 mark]
In class we worked out the MLE solution for binary classification. In this question we are going to work out the MAP solution. Suppose that the setting is the same as in the MLE setting defined in section 9.1 of the course notes. However, we will also assume that the weights w_1^*, . . . , w_d^* are i.i.d. Gaussian random variables with mean 0 and variance 1/λ. The bias term w_0^* is also independent of the other weights and has a uniform distribution on [−a, a], for a very large a. Which of the following is equal to w_MAP = arg max_{w ∈ R^{d+1}} p(w | D)?


a. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + (λ/2) Σ_{j=1}^d w_j² ]

b. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) · (λ/2) Σ_{j=1}^d w_j² ]

c. arg min_{w ∈ R^{d+1}} [ Σ_{i=1}^n ( y_i − σ(x_i^⊤ w) )² + (λ/2) Σ_{j=1}^d w_j² ]

d. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=1}^d w_j log(w_j) ]

Solution:
Answer: a.
To find the MAP estimate w_MAP, we maximize the posterior probability p(w | D), which is proportional to the product of the likelihood and the prior:

p(w | D) ∝ p(D | w) p(w).


Likelihood:
For binary classification using logistic regression, the likelihood is:

p(D | w) = Π_{i=1}^n σ(x_i^⊤ w)^{y_i} (1 − σ(x_i^⊤ w))^{1−y_i},

where σ(z) = 1/(1 + e^{−z}) is the sigmoid function.
Taking the logarithm, we get the log-likelihood:

log p(D | w) = Σ_{i=1}^n [ y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ].

Prior:
The prior for w_j (for j = 1, . . . , d) is:

p(w_j) = √(λ/(2π)) exp(−(λ/2) w_j²).

Since w_0 has a uniform prior over a very large interval, it contributes a constant to the posterior and can be ignored in optimization.
The log-prior is:


log p(w) = −(λ/2) Σ_{j=1}^d w_j² + const.

Posterior:
Combining the log-likelihood and log-prior, the log-posterior (up to a constant) is:

log p(w | D) = log p(D | w) + log p(w)
             = Σ_{i=1}^n [ y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ] − (λ/2) Σ_{j=1}^d w_j² + const.

MAP Estimate:
To find w_MAP, we minimize the negative log-posterior:

w_MAP = arg min_{w ∈ R^{d+1}} [ −log p(w | D) ]
      = arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + (λ/2) Σ_{j=1}^d w_j² ].
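As a concrete illustration of minimizing this objective, here is a minimal gradient-descent sketch; it is not taken from the course notes, and the dataset, regularization strength, step size, and iteration count are made-up placeholders. As in the derivation above, the bias w_0 is left unpenalized.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior_l2(w, X, y, lam):
    # Logistic loss plus (lam/2) * sum_{j>=1} w_j^2; the bias w_0 is not penalized.
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + 0.5 * lam * np.sum(w[1:] ** 2)

def gradient_l2(w, X, y, lam):
    grad = X.T @ (sigmoid(X @ w) - y)                    # gradient of the logistic loss
    return grad + lam * np.concatenate(([0.0], w[1:]))   # no penalty gradient for the bias

# Tiny made-up dataset; the first column of X is the bias feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)

w, lam, step = np.zeros(3), 1.0, 0.05
for _ in range(1000):
    w -= step * gradient_l2(w, X, y, lam)
print(w, neg_log_posterior_l2(w, X, y, lam))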

Question 5. [1 mark]
In this question, we are going to work out the MAP solution for binary classification using a Laplace prior. Suppose that the setting is the same as in the MLE setting defined in section 9.1 of the course notes. However, we will also assume that the weights w_1^*, . . . , w_d^* are i.i.d. Laplace random variables with mean 0 and scale parameter 1/λ. The bias term w_0^* is also independent of the other weights and has a uniform distribution on [−a, a] for a very large a.
Which of the following is equal to w_MAP = arg max_{w ∈ R^{d+1}} p(w | D)?

a. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=1}^d |w_j| ]

b. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=0}^d |w_j| ]

c. arg min_{w ∈ R^{d+1}} [ Σ_{i=1}^n ( y_i − σ(x_i^⊤ w) )² + λ Σ_{j=1}^d |w_j| ]

d. arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=1}^d log(|w_j|) ]


Solution:
Answer: a.
To find the MAP estimate w_MAP, we need to maximize the posterior probability p(w | D), which is proportional to the product of the likelihood and the prior:

p(w | D) ∝ p(D | w) p(w).


Likelihood:
For binary classification using logistic regression, the likelihood is:

p(D | w) = Π_{i=1}^n σ(x_i^⊤ w)^{y_i} (1 − σ(x_i^⊤ w))^{1−y_i},

where σ(z) = 1/(1 + e^{−z}) is the sigmoid function.

Taking the logarithm, we obtain the log-likelihood:

log p(D | w) = Σ_{i=1}^n [ y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ].

Prior:
The prior for each w_j (for j = 1, . . . , d) is a Laplace distribution:

p(w_j) = (λ/2) exp(−λ |w_j|).

Since w_0 has a uniform prior over a very large interval, its contribution to the posterior is constant and can be ignored during optimization.
The log-prior is therefore:

log p(w) = −λ Σ_{j=1}^d |w_j| + const.

Posterior:
Combining the log-likelihood and log-prior, the log-posterior (up to a constant) is:

log p(w | D) = log p(D | w) + log p(w)
             = Σ_{i=1}^n [ y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ] − λ Σ_{j=1}^d |w_j| + const.

MAP Estimate:
To find w_MAP, we minimize the negative log-posterior:

w_MAP = arg min_{w ∈ R^{d+1}} [ −log p(w | D) ]
      = arg min_{w ∈ R^{d+1}} [ −Σ_{i=1}^n ( y_i log σ(x_i^⊤ w) + (1 − y_i) log(1 − σ(x_i^⊤ w)) ) + λ Σ_{j=1}^d |w_j| ].
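Compared with the sketch after Question 4, only the penalty term changes; a minimal (hypothetical) objective is shown below with a tiny made-up dataset. Because |w_j| is not differentiable at 0, this objective is typically minimized with subgradient or proximal methods rather than plain gradient descent.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior_l1(w, X, y, lam):
    # Logistic loss plus lam * sum_{j>=1} |w_j|; the bias w_0 is not penalized.
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * np.sum(np.abs(w[1:]))

X = np.column_stack([np.ones(4), np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0], [2.0, 1.0]])])
y = np.array([1.0, 0.0, 0.0, 1.0])
print(neg_log_posterior_l1(np.zeros(3), X, y, lam=0.5))   # 4 * log(2) at w = 0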


Question 6. [1 mark]
Suppose that things are as defined in the previous question. However, we assume the weights w_0^*, w_1^*, . . . , w_d^* are all i.i.d. uniform random variables on [−a, a] for a very large a. Is the following statement true or false? The MAP solution with this prior is equivalent to the MLE solution.

Solution:
Answer: True.
Explanation: Given that each weight w_j (for j = 0, 1, . . . , d) is uniformly distributed over [−a, a], the prior is:

p(w) = p(w_0) · p(w_1) ⋯ p(w_d) = (1/(2a))^{d+1}.

This prior is constant with respect to w (on its support [−a, a]^{d+1}, which contains the maximizer for a large enough a).
The MAP estimate maximizes the posterior probability:

w_MAP = arg max_w p(w | D) = arg max_w p(D | w) p(w).

Since p(w) is constant, maximizing the posterior is equivalent to maximizing the likelihood:

w_MAP = arg max_w p(D | w) = w_MLE.

Therefore, the MAP estimate coincides with the MLE estimate when using a uniform prior over a very large interval.

Question 7. [1 mark]
Let f̂_Mul be as defined in section 9.2 of the course notes. Which of the following is true?

a. f̂_Mul outputs a vector of probabilities where the y-th element is the probability of class y.

b. σ(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1) outputs a vector of probabilities where the y-th element is the probability of class y.

c. f̂_ERM as defined in section 9.2.1 of the course notes is approximately equal to f_Bayes.

d. There is no closed-form solution for w_MLE,k for any k.

Solution:
Answer: b., d.

a. False. f̂_Mul(x) ∈ Y, and Y is not a set of vectors.

b. True. By the definition of the MLE solution.

c. False. f̂_ERM is a vector of probability estimates, while f_Bayes outputs class labels.

d. True. Mentioned in the course notes.

Question 8. [1 mark]
Let everything be as defined in the previous question. Which of the following is true?


a. If x^⊤ w_MLE,y < x^⊤ w_MLE,k for all k ≠ y, then f̂_Mul(x) = y.

b. If x^⊤ w_MLE,y > x^⊤ w_MLE,k for all k ≠ y, then f̂_Mul(x) = y.

c. If σ_y(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1) > 0.5, then f̂_Mul(x) = y.

d. If x^⊤ (w_MLE,y − w_MLE,k) = 0, then

p(y | x, w_MLE,0, . . . , w_MLE,K−1) = p(k | x, w_MLE,0, . . . , w_MLE,K−1).

Solution:
Answer: b., c., d.

a. False. See explanation for b.

b. True. f̂_Mul(x) is the y with the largest value of σ_y(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1), which is the same as the y with the largest value of x^⊤ w_MLE,y.

c. True. If σ_y(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1) > 0.5, then, because the entries of the probability vector sum to 1,

σ_y(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1) > σ_k(x^⊤ w_MLE,0, . . . , x^⊤ w_MLE,K−1)

for all k ≠ y, so f̂_Mul(x) = y.

d. True. If x^⊤ (w_MLE,y − w_MLE,k) = 0 then x^⊤ w_MLE,y = x^⊤ w_MLE,k. If you plug this into the definition of p(y | x, w_MLE,0, . . . , w_MLE,K−1) you will see that it is equal to p(k | x, w_MLE,0, . . . , w_MLE,K−1).
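A small numeric check of claims b. and d., assuming σ here denotes the usual softmax and using made-up logits in place of x^⊤ w_MLE,k:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

logits = np.array([1.0, 3.0, 3.0, -2.0])   # stand-ins for x^T w_MLE,k, k = 0, ..., 3
probs = softmax(logits)
print(probs.argmax(), probs)
# The largest probability occurs at an index with the largest logit (claim b), and the two
# equal logits (indices 1 and 2) receive exactly equal probabilities (claim d).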

Question 9. [1 mark]
Let Y = {0, 1} be the set of labels. Define the following two label functions:

h(y) = (1, 0)^⊤ if y = 0, and h(y) = (0, 1)^⊤ if y = 1.

r(ŷ) = (1 − ŷ, ŷ)^⊤.

Is the following statement true or false? For any ŷ ∈ (0, 1) and y ∈ Y (where (0, 1) is the open interval from 0 to 1), the binary cross-entropy loss with input ŷ and y is equal to the multiclass cross-entropy loss with input r(ŷ) and h(y).

Solution:
Answer: True.
Explanation: To verify the equivalence of the binary cross-entropy loss and the multiclass cross-
entropy loss under the given label mappings h(y) and r(ŷ), we analyze both loss functions in the
context of the provided definitions.
The binary cross-entropy loss for binary classification with labels y ∈ {0, 1} and predictions ŷ ∈ (0, 1) is defined as:

ℓ_binary(ŷ, y) = −[ y log(ŷ) + (1 − y) log(1 − ŷ) ].


On the other hand, the multiclass cross-entropy loss for multiclass classification with one-hot encoded labels h(y) ∈ {(1, 0)^⊤, (0, 1)^⊤} and predictions r(ŷ) ∈ (0, 1)² is defined as:

ℓ_multiclass(r(ŷ), h(y)) = −Σ_{j=1}^2 h_j(y) log(r_j(ŷ)),

where h_j(y) and r_j(ŷ) denote the j-th components of the vectors h(y) and r(ŷ), respectively.
Given the label mapping h(y) = (1, 0)^⊤ if y = 0 and h(y) = (0, 1)^⊤ if y = 1, and the prediction mapping r(ŷ) = (1 − ŷ, ŷ)^⊤, we can simplify the multiclass cross-entropy loss based on the value of y.

Case 1: y = 0
When y = 0, the label mapping yields h(y) = (1, 0)^⊤. Substituting into the multiclass loss function:

ℓ_multiclass(r(ŷ), h(0)) = −[ h_1(0) log(r_1(ŷ)) + h_2(0) log(r_2(ŷ)) ] = −log(r_1(ŷ)).

Given r(ŷ) = (1 − ŷ, ŷ)^⊤, we have r_1(ŷ) = 1 − ŷ. Therefore:

ℓ_multiclass(r(ŷ), h(0)) = −log(1 − ŷ) = ℓ_binary(ŷ, 0).

Case 2: y = 1
When y = 1, the label mapping yields h(y) = (0, 1)^⊤. Substituting into the multiclass loss function:

ℓ_multiclass(r(ŷ), h(1)) = −[ h_1(1) log(r_1(ŷ)) + h_2(1) log(r_2(ŷ)) ] = −log(r_2(ŷ)).

Given r(ŷ) = (1 − ŷ, ŷ)^⊤, we have r_2(ŷ) = ŷ. Therefore:

ℓ_multiclass(r(ŷ), h(1)) = −log(ŷ) = ℓ_binary(ŷ, 1).

In both cases, the multiclass cross-entropy loss ℓ_multiclass(r(ŷ), h(y)) simplifies to the binary cross-entropy loss ℓ_binary(ŷ, y). This demonstrates that under the specified label and prediction mappings, the two loss functions yield identical values for all y ∈ Y and ŷ ∈ (0, 1).
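The derivation can also be spot-checked numerically; a minimal sketch with a few arbitrary (ŷ, y) pairs:

import numpy as np

def bce(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multiclass_ce(r, h):
    return -np.sum(h * np.log(r))

for y in (0, 1):
    h = np.array([1.0, 0.0]) if y == 0 else np.array([0.0, 1.0])
    for y_hat in (0.1, 0.5, 0.9):
        r = np.array([1.0 - y_hat, y_hat])
        assert np.isclose(bce(y_hat, y), multiclass_ce(r, h))
print("binary and multiclass cross-entropy agree on all tested (y_hat, y) pairs")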

Question 10. [1 mark]


Suppose you are in the binary classification setting as defined in section 9.1 of the course notes and you solve for w_MLE. Now suppose that the setting is the multiclass classification setting (with K = 2) as defined in section 9.2 of the course notes and you solve for w_MLE,0, w_MLE,1. Is the following true or false? The solution for w_MLE in the binary classification setting is the same as the solution for w_MLE,1 in the multiclass classification setting.

Solution:
Answer: False. As shown in section 9.2.2 of the course notes, w_MLE = w_MLE,1 − w_MLE,0.
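A quick numeric illustration of the identity behind this answer (with made-up vectors): for K = 2, the softmax probability of class 1 depends only on the difference of the two weight vectors, which is why w_MLE corresponds to w_MLE,1 − w_MLE,0 rather than to w_MLE,1 alone.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w0, w1 = rng.normal(size=4), rng.normal(size=4)

z = np.array([x @ w0, x @ w1])
softmax_p1 = np.exp(z[1]) / np.exp(z).sum()              # two-class softmax, probability of class 1
sigmoid_of_diff = 1.0 / (1.0 + np.exp(-x @ (w1 - w0)))   # sigmoid applied to the difference
print(softmax_p1, sigmoid_of_diff)                       # identical up to floating-point error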

Question 11. [1 mark]


You are designing a neural network architecture for a binary classification problem. You decide to have B = 5 layers and d^(1) = 50, d^(2) = 40, d^(3) = 30, d^(4) = 20, d^(5) = 1 neurons in each layer respectively. The input dimension is d = 100. How many weight vectors are there in the network?


Solution:
Answer: 141
Explanation: Each activation (other than the biases and the zeroth layer) has a weight vector
associated with it. Thus there are

50 + 40 + 30 + 20 + 1 = 141

weight vectors in the network.

Question 12. [1 mark]


Let everything be as defined in the previous question. If you sum up the dimension of all the weight
vectors in the neural network you get the number of weights in the network. How many weights
are there in the network?

Solution:
Answer: 8961
Explanation:
First layer: 50 · 101 = 5050 weights (50 neurons with 100 input features and 1 bias term).
Second layer: 40 · 51 = 2040 weights (40 neurons with 50 input features and 1 bias term).
Third layer: 30 · 41 = 1230 weights (30 neurons with 40 input features and 1 bias term).
Fourth layer: 20 · 31 = 620 weights (20 neurons with 30 input features and 1 bias term).
Fifth layer: 1 · 21 = 21 weights (1 neuron with 20 input features and 1 bias term).
Total number of weights: 5050 + 2040 + 1230 + 620 + 21 = 8961.
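Both counts (141 weight vectors from Question 11 and 8961 weights here) can be reproduced with a short sketch:

layer_sizes = [100, 50, 40, 30, 20, 1]   # input dimension d, then d^(1), ..., d^(5)

num_weight_vectors = sum(layer_sizes[1:])            # one weight vector per neuron
num_weights = sum(out * (prev + 1)                   # each neuron sees prev inputs plus a bias
                  for prev, out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(num_weight_vectors, num_weights)               # 141 8961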

Question 13. [1 mark]


The tanh function is defined as tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}), where z ∈ R. Is the following statement true or false? The tanh function is a valid activation function.

Solution:
Answer: True.
Explanation: The tanh function is a valid activation function because it is differentiable every-
where and is a function from R to (−1, 1) ⊂ R.

Question 14. [1 mark]


The logistic function is defined as σ(z) = 1/(1 + e^{−z}), where z ∈ R. You would like to define the function f(x) = σ(x^⊤ w), where x ∈ R^{d+1} and w ∈ R^{d+1}, as a neural network. Which of the following is correct?

a. The neural network has B = 2 layers, d^(1) = 1, d^(2) = 1 neurons, activation h = σ, and two weight vectors w_1^(1) = w, w_1^(2) = (1, 1)^⊤.

b. The neural network has B = 1 layer, d^(1) = 1 neuron, activation h = σ, and one weight vector w_1^(1) = w.

c. The neural network has B = 1 layer, d^(1) = 1 neuron, activation h(z) = z, and one weight vector w_1^(1) = w.

d. The neural network has B = 2 layers, d^(1) = 1, d^(2) = 1 neurons, activation h = σ in the first layer, activation h(z) = z in the second layer, and two weight vectors w_1^(1) = w, w_1^(2) = (0, 1)^⊤.


Solution:
Answer: b., d.

a. False. The neural network would output f(x) = σ((1, σ(x^⊤ w))^⊤ (1, 1)^⊤) = σ(1 + σ(x^⊤ w)) ≠ σ(x^⊤ w).

b. True. The neural network would output f(x) = σ(x^⊤ w).

c. False. The neural network would output f(x) = x^⊤ w.

d. True. The neural network would output f(x) = (1, σ(x^⊤ w))^⊤ (0, 1)^⊤ = σ(x^⊤ w).
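A small numeric check of the outputs claimed for options a. and d., with made-up x and w, assuming the convention that a bias entry of 1 is prepended to the hidden layer's activation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = np.concatenate(([1.0], rng.normal(size=3)))   # x in R^{d+1}, first entry is the bias feature
w = rng.normal(size=4)

target = sigmoid(x @ w)                                                    # f(x) = sigma(x^T w)
option_a = sigmoid(np.array([1.0, sigmoid(x @ w)]) @ np.array([1.0, 1.0]))
option_d = np.array([1.0, sigmoid(x @ w)]) @ np.array([0.0, 1.0])
print(target, option_a, option_d)   # option d reproduces the target; option a does not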

Question 15. [1 mark]


You have a neural network f with B = 2 layers and d^(1) = 3, d^(2) = 1 neurons in each layer respectively. The input dimension is d = 2. You choose to use the ReLU activation function, defined as ReLU(z) = max(0, z), where z ∈ R. The weight vectors have the following values:

w_1^(1) = (1, 1, 1)^⊤   w_2^(1) = (−1, −1, −1)^⊤   w_3^(1) = (−1, 0, 1)^⊤   w_1^(2) = (1, 1, 1, 1)^⊤

Suppose you get a feature vector x = (1, −1, 1)^⊤. What is f(x)?

Solution:
Answer: 2
Explanation: The activations for the first layer are:

a_1^(1) = ReLU(x^⊤ w_1^(1)) = ReLU(1 · 1 + (−1) · 1 + 1 · 1) = ReLU(1) = 1
a_2^(1) = ReLU(x^⊤ w_2^(1)) = ReLU(1 · (−1) + (−1) · (−1) + 1 · (−1)) = ReLU(−1) = 0
a_3^(1) = ReLU(x^⊤ w_3^(1)) = ReLU(1 · (−1) + (−1) · 0 + 1 · 1) = ReLU(0) = 0

Thus, the activation vector for the first layer is a^(1) = (1, 1, 0, 0)^⊤, where the leading 1 is the bias entry.
The activation for the second layer is:

a_1^(2) = ReLU((a^(1))^⊤ w_1^(2)) = ReLU(1 · 1 + 1 · 1 + 0 · 1 + 0 · 1) = ReLU(2) = 2

Thus, f(x) = a_1^(2) = 2.
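The arithmetic above can be reproduced with a minimal forward-pass sketch (assuming, as in the solution, that a bias entry of 1 is prepended to the first layer's activations):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([1.0, -1.0, 1.0])
W1 = np.array([[1.0, 1.0, 1.0],      # w_1^(1)
               [-1.0, -1.0, -1.0],   # w_2^(1)
               [-1.0, 0.0, 1.0]])    # w_3^(1)
w2 = np.array([1.0, 1.0, 1.0, 1.0])  # w_1^(2)

a1 = np.concatenate(([1.0], relu(W1 @ x)))   # a^(1) = (1, 1, 0, 0)
print(relu(a1 @ w2))                         # f(x) = 2.0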
