
Franklin F. Lucero

AI534 — Written Homework Assignment 1 (30 pts + 6 bonus pts)


This written assignment covers the contents of linear regression and logistic regression. The key concepts
covered here include:

• Maximum likelihood estimation (MLE)

• Gradient descent learning

• Decision theory for probabilistic classifiers

• Maximum A Posteriori (MAP) parameter estimation

• Perceptron

1. MLE for uniform distribution. [3pt]


Given a set of IID observed samples x1 , ..., xn ∼ uniform(0, θ), we wish to estimate the parameter θ.

(a) (1 pt) Write down the likelihood function of θ.

(b) (2 pts) Derive the maximum likelihood estimate of θ, i.e., the value of θ that maximizes the
function from part (a). (Hint: The likelihood function is a monotonic function, so the maximizing
solution is at an extreme; there is no need to take a derivative in this case.)
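For intuition only (not a substitute for the derivation), the MATLAB sketch below evaluates the likelihood on a grid of candidate θ values for synthetic data. It assumes the standard fact that the density of Uniform(0, θ) is 1/θ on [0, θ], so the likelihood of n samples is θ^(−n) when θ ≥ max_i x_i and 0 otherwise; all names and numbers are illustrative.

% Numerical illustration for Problem 1 (a sketch, not the derivation itself)
rng(0);
theta_true = 3;  n = 20;
x = theta_true * rand(n, 1);                   % IID Uniform(0, theta_true) samples
thetas = linspace(0.1, 6, 500);                % grid of candidate theta values
loglik = -inf(size(thetas));                   % likelihood is 0 (log = -inf) below max(x)
feasible = thetas >= max(x);
loglik(feasible) = -n * log(thetas(feasible)); % log of theta^(-n) on the feasible region
[~, idx] = max(loglik);
fprintf('max(x) = %.3f, grid argmax of the likelihood = %.3f\n', max(x), thetas(idx));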

2. Weighted linear regression. [10pt] In class, when discussing linear regression, we assumed that the
Gaussian noise is i.i.d. (independent and identically distributed). In practice, we may have some extra
information regarding the fidelity of each data point. For example, we may know that some examples
have higher noise variance than others. To model this, we can model the noise variables ε_1, ε_2, ..., ε_n
as distinct Gaussians, i.e., ε_i ∼ N(0, σ_i^2) with known variance σ_i^2. How will this influence our linear
regression model? Let's work it out.

(a) (3pts) Write down the log likelihood function of w under this new modeling assumption.

(b) (1 pt) Show that maximizing the log likelihood is equivalent to minimizing a weighted squared
loss function J(w) = Σ_{i=1}^n a_i (w^T x_i − y_i)^2, and express each a_i in terms of σ_i.

(c) (3 pts) Take the gradient of the loss function J(w) and provide the batch gradient descent update
rule for optimizing w.

(d) (3 pts) Derive a closed-form solution to this optimization problem. Hint: begin by rewriting the
objective in matrix form using a diagonal matrix A with A(i, i) = a_i.
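As a hedged sanity-check sketch (not the requested derivation), the MATLAB snippet below fits a weighted linear regression on synthetic data two ways: with the standard weighted least-squares closed form and with batch gradient descent on J(w) = Σ_i a_i (w^T x_i − y_i)^2. It assumes inverse-variance weights a_i ∝ 1/σ_i^2 (any positive rescaling of the a_i leaves the minimizer unchanged); the data, step size, and iteration count are illustrative.

% Sanity-check sketch for Problem 2 on synthetic data
rng(0);
n = 200; d = 3;
X = [ones(n,1), randn(n, d-1)];          % design matrix with a bias column
w_true = [1; -2; 0.5];
sigma = 0.1 + 2*rand(n,1);               % known, example-specific noise standard deviations
y = X*w_true + sigma .* randn(n,1);

a = 1 ./ (sigma.^2);                     % weights (inverse variance, up to a constant)
A = diag(a);

% Closed-form weighted least squares
w_closed = (X' * A * X) \ (X' * A * y);

% Batch gradient descent on the weighted squared loss
w = zeros(d,1); eta = 1e-4;
for t = 1:5000
    grad = 2 * X' * (a .* (X*w - y));    % gradient of J(w)
    w = w - eta * grad;
end
disp([w_true, w_closed, w]);             % columns: true, closed form, gradient descent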

3. Decision theory: working with expectations. [6pt]


In this problem, you will analyze a scenario where the Maximum A-Posteriori (MAP) decision rule,
which you learned in class, is not appropriate. Instead, you’ll explore how to make decisions based on
minimizing expected costs.
Consider a spam filter that predicts whether an email is spam, using probabilistic predictions. For
this filter, there are costs associated with making errors (misclassifying emails), but these costs are not
symmetric. Misclassifying a non-spam email as spam (i.e., filtering out an important email) is more
costly than misclassifying a spam email as non-spam.
The following table shows the cost of each possible outcome:


                       true label y
 predicted label ŷ   non-spam   spam
 non-spam               0         1
 spam                  10         0

Table 1: A misclassification cost matrix for the spam filter problem.

• If the filter’s prediction is correct, there is no cost.


• If a non-spam email is classified as spam, there is a cost of 10.
• If a spam email is classified as non-spam, there is a cost of 1.

Here we will go through some questions to help you figure out how to use the probability and
misclassification costs to make predictions.

(a) (2 pts) You received an email for which the spam filter predicts that it is spam with probability p = 0.8.
We want to make the decision that minimizes the expected cost.
Question: Should you classify this particular email as spam or non-spam? [Hint: Compare the
expected cost of classifying the email as spam versus non-spam and choose the classification that
results in the lower expected cost; a small numeric sketch of this comparison is given after part (c).]

(b) (2 pts) The MAP decision rule would classify an email as spam if p > 0.5, but this rule does not
minimize expected cost in this case. We need a new rule that compares p to a different threshold
θ. The value of θ should be chosen to minimize the expected cost based on the costs in the table.
Question: What is the value of θ that works for the costs specified in Table 1? [Hint: To find
the threshold θ, set up the decision rule by comparing the expected cost of each decision, as you
did in (a), then solve for p in terms of the costs.]

(c) (2 pts) Now, imagine that the optimal decision rule would use θ = 1/5 as the threshold for
classifying an email as spam. Question: Can you provide a new cost table where this would be
the case? [Hint: Use the relationship between the costs and θ that you derived in part (b). Based
on this relationship, adjust the misclassification costs in the table to achieve θ = 1/5.]
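The following is a minimal MATLAB sketch of the expected-cost comparison hinted at in part (a), using the costs from Table 1; the variable names are illustrative, and the thresholds asked for in (b) and (c) are deliberately not computed here.

% Expected-cost comparison for Problem 3(a)
p = 0.8;                                          % predicted probability that the email is spam
cost_if_predict_spam    = (1 - p) * 10 + p * 0;   % pay 10 when the email is actually non-spam
cost_if_predict_nonspam = (1 - p) * 0  + p * 1;   % pay 1 when the email is actually spam
fprintf('E[cost | predict spam]     = %.2f\n', cost_if_predict_spam);
fprintf('E[cost | predict non-spam] = %.2f\n', cost_if_predict_nonspam);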

4. Maximum A-Posteriori Estimation. [8pt] Suppose we observe the values of n IID random vari-
ables X1 , . . . , Xn drawn from a single Bernoulli distribution with parameter θ. In other words, for
each Xi , we know that P (Xi = 1) = θ and P (Xi = 0) = 1 − θ. In the Bayesian framework, we
treat θ as a random variable, and use a prior probability distribution over θ to express our prior
knowledge/preference about θ. In this framework, X1 , . . . , Xn can be viewed as generated by:
• First, the value of θ is drawn from a given prior probability distribution
• Second, X1 , . . . , Xn are drawn independently from a Bernoulli distribution with this θ value.

In this setting, Maximum A-Posteriori (MAP) estimation is a way to estimate θ by finding the value
that maximizes the posterior probability, given both its prior distribution and the observed data. The
MAP estimate of θ is given by:

θ̂_MAP = argmax_θ̂ P(θ = θ̂ | X_1, ..., X_n)

By applying Bayes' theorem (and dropping the denominator, which does not depend on θ̂), this becomes:

θ̂_MAP = argmax_θ̂ P(X_1, ..., X_n | θ = θ̂) P(θ = θ̂) = argmax_θ̂ L(θ̂) p(θ̂)

where L(θ̂) is the likelihood function of the data given θ, and p(θ̂) is the prior distribution over θ.


Now consider using a beta distribution as the prior: θ ∼ Beta(α, β), whose PDF function is

p(θ̂) = θ̂^(α−1) (1 − θ̂)^(β−1) / B(α, β)

where B(α, β) is a normalizing constant.

(a) (3 pts) Derive the posterior distribution p(θ̂ | X_1, ..., X_n, α, β). Comparing the form of the posterior
distribution with that of the beta distribution, you will see that the posterior is also a beta distribution.
What are the updated α and β parameters for the posterior?
(b) (2 pts) Suppose we use Beta(2, 2) as the prior. What Beta distribution do we get for the posterior
after we observe 5 coin tosses, 2 of which are heads? What is the posterior distribution of θ after
we observe 50 coin tosses, 20 of which are heads? (You don't need to write out the distributions;
simply providing the α and β parameters will suffice.)
(c) (1pt) Plot the pdf function of the prior Beta(2, 2) and the two posterior distributions. You can
use any software (e.g., R, Python, Matlab) for this plot.
(d) (2 pts) Assume that θ = 0.4 is the true probability. As we observe more and more coin tosses from
this coin, how will the shape of the posterior change? Will the MAP estimate converge toward the
true value?
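The MATLAB sketch below (illustrative only) simulates coin tosses with a true θ of 0.4, updates a Beta(2, 2) prior using the same α/β update as in part (b), and reports the MAP estimate, assuming the standard fact that the mode of Beta(a, b) is (a − 1)/(a + b − 2) for a, b > 1.

% Posterior-concentration sketch for Problem 4(d)
rng(0);
theta_true = 0.4;  alpha0 = 2;  beta0 = 2;
for n = [10, 100, 1000, 10000]
    tosses = rand(n, 1) < theta_true;            % n Bernoulli(0.4) samples
    a_post = alpha0 + sum(tosses);               % alpha + number of heads
    b_post = beta0 + n - sum(tosses);            % beta  + number of tails
    map_est = (a_post - 1) / (a_post + b_post - 2);
    fprintf('n = %5d: posterior Beta(%d, %d), MAP = %.3f\n', n, a_post, b_post, map_est);
end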

5. Perceptron. [3pt] Assume a data set consists of only a single data point {(x, +1)}. How many times
would the Perceptron algorithm misclassify this point x before convergence? What if the initial weight
vector w0 was initialized randomly rather than as the all-zero vector? Derive the number of mistakes as a
function of w0 and x.

(a) (1 pt) Case 1: w0 = 0.

(b) (2 pts) Case 2: w0 ≠ 0.
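A small empirical check (not a derivation) that counts mistakes for a single positive example follows; it assumes the standard Perceptron convention of predicting sign(w^T x), counting a mistake whenever y(w^T x) ≤ 0, and updating w ← w + yx on each mistake. The example point and initial weights are arbitrary.

% Mistake-counting sketch for Problem 5
x = [2; -1];  y = 1;                   % a single hypothetical training point with label +1
w = [-5; 3];                           % case (b); set w = zeros(2, 1) for case (a)
mistakes = 0;
while y * (w' * x) <= 0
    w = w + y * x;                     % Perceptron update on a mistake
    mistakes = mistakes + 1;
end
fprintf('mistakes before convergence: %d\n', mistakes);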

6. Bonus: MLE for multi-class logistic regression. [6 pts] Consider the maximum likelihood
estimation problem for multi-class logistic regression using the soft-max function defined below:

p(y = k | x) = exp(w_k^T x) / Σ_{j=1}^K exp(w_j^T x)

We can write out the likelihood function as:


L(w) = Π_{i=1}^N Π_{k=1}^K p(y = k | x_i)^(y_ik)

where y_ik is an indicator variable taking value 1 if y_i = k and 0 otherwise.

(a) (1 pt) Provide the log-likelihood function.


(b) (5 pts) Derive the gradient of the log-likelihood function w.r.t. the weight vector w_c of class c.
[Hint: the solution to this problem is provided in the logistic regression lecture slides. You just
need to fill in the missing derivation. Note that for any example x_i, the denominator in the
softmax function, Σ_j exp(w_j^T x_i), is the same for all k; denoting it as z_i makes it simpler to work
through the derivation, but be sure to remember that z_i is a function of all the w_k's.]
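As a hedged numerical check, assuming the standard softmax gradient form ∂ log L/∂w_c = Σ_i (y_ic − p(y = c | x_i)) x_i that the hint refers to, the MATLAB sketch below compares that expression against a finite-difference estimate on random data; all data, dimensions, and names are illustrative.

% Gradient check sketch for Problem 6(b)
rng(1);
N = 50; d = 4; K = 3;
X = randn(N, d);
Y = zeros(N, K);
for i = 1:N, Y(i, randi(K)) = 1; end           % random one-hot labels
W = randn(d, K);                               % one weight vector per class (columns)

P = exp(X * W) ./ sum(exp(X * W), 2);          % softmax probabilities, one row per example
logL = sum(sum(Y .* log(P)));                  % log-likelihood

c = 2;
g_analytic = X' * (Y(:, c) - P(:, c));         % claimed gradient w.r.t. w_c

eps_ = 1e-6; j = 1;                            % finite-difference check on one coordinate
Wp = W; Wp(j, c) = Wp(j, c) + eps_;
Pp = exp(X * Wp) ./ sum(exp(X * Wp), 2);
logLp = sum(sum(Y .* log(Pp)));
fprintf('analytic %.6f vs numeric %.6f\n', g_analytic(j), (logLp - logL) / eps_);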


[Scanned handwritten solution pages (CamScanner) follow here.]
Code.m (MATLAB plotting script for Problem 4(c)):

clear;
close all;
clc;

% Grid of theta values on [0, 1]
x = 0:0.001:1;

% Prior Beta(2, 2); betapdf requires the Statistics and Machine Learning Toolbox
prior = betapdf(x, 2, 2);

% Posterior after 5 tosses with 2 heads: Beta(2+2, 2+3) = Beta(4, 5)
after1 = betapdf(x, 4, 5);

% Posterior after 50 tosses with 20 heads: Beta(2+20, 2+30) = Beta(22, 32)
after2 = betapdf(x, 22, 32);

% Plot
figure;
plot(x, prior, 'DisplayName', 'Prior Beta(2,2)', 'LineWidth', 1.5);
hold on;
plot(x, after1, 'DisplayName', 'Posterior Beta(4,5)', 'LineWidth', 1.5);
plot(x, after2, 'DisplayName', 'Posterior Beta(22,32)', 'LineWidth', 1.5);
xlabel('Theta');
ylabel('Density');
legend('show');
grid on;
hold off;
[Figure: PDFs of the prior Beta(2,2) and the posteriors Beta(4,5) and Beta(22,32), plotted as density versus θ over [0, 1].]
