
Franklin F. Lucero

AI534 — Written Homework Assignment 1 (30 pts + 6 bonus pts)


This written assignment covers the contents of linear regression and logistic regression. The key concepts
covered here include:

• Maximum likelihood estimation (MLE)

• Gradient descent learning

• Decision theory for probabilistic classifiers

• Maximum A Posteriori (MAP) parameter estimation

• Perceptron

1. MLE for uniform distribution. [3pt]


Given a set of IID observed samples x1 , ..., xn ∼ uniform(0, θ), we wish to estimate the parameter θ.

(a) (1 pt) Write down the likelihood function of θ.

(b) (2 pts) Derive the maximum likelihood estimate of θ, i.e., the value of θ that maximizes the
function from part (a). (Hint: The likelihood function is a monotonic function, so the maximizing
solution is at an extreme; there is no need to take a derivative in this case.)
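For intuition only (not a substitute for the derivation), the MATLAB sketch below evaluates the likelihood on a grid of candidate θ values for synthetic data. It assumes the standard fact that the density of Uniform(0, θ) is 1/θ on [0, θ], so the likelihood of n samples is θ^(−n) when θ ≥ max_i x_i and 0 otherwise; all names and numbers are illustrative.

% Numerical illustration for Problem 1 (a sketch, not the derivation itself)
rng(0);
theta_true = 3;  n = 20;
x = theta_true * rand(n, 1);                   % IID Uniform(0, theta_true) samples
thetas = linspace(0.1, 6, 500);                % grid of candidate theta values
loglik = -inf(size(thetas));                   % likelihood is 0 (log = -inf) below max(x)
feasible = thetas >= max(x);
loglik(feasible) = -n * log(thetas(feasible)); % log of theta^(-n) on the feasible region
[~, idx] = max(loglik);
fprintf('max(x) = %.3f, grid argmax of the likelihood = %.3f\n', max(x), thetas(idx));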

2. Weighted linear regression. [10pt] In class, when discussing linear regression, we assumed that the
Gaussian noise is i.i.d. (independent and identically distributed). In practice, we may have some extra
information regarding the fidelity of each data point. For example, we may know that some examples
have higher noise variance than others. To model this, we can model the noise variables ε_1, ε_2, ..., ε_n
as distinct Gaussians, i.e., ε_i ∼ N(0, σ_i^2) with known variance σ_i^2. How will this influence our linear
regression model? Let's work it out.

(a) (3pts) Write down the log likelihood function of w under this new modeling assumption.

(b) (1 pt) Show that maximizing the log likelihood is equivalent to minimizing a weighted squared
loss function J(w) = Σ_{i=1}^n a_i (w^T x_i − y_i)^2, and express each a_i in terms of σ_i.

(c) (3 pts) Take the gradient of the loss function J(w) and provide the batch gradient descent update
rule for optimizing w.

(d) (3 pts) Derive a closed-form solution to this optimization problem. Hint: begin by rewriting the
objective in matrix form using a diagonal matrix A with A(i, i) = a_i.
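As a hedged sanity-check sketch (not the requested derivation), the MATLAB snippet below fits a weighted linear regression on synthetic data two ways: with the standard weighted least-squares closed form and with batch gradient descent on J(w) = Σ_i a_i (w^T x_i − y_i)^2. It assumes inverse-variance weights a_i ∝ 1/σ_i^2 (any positive rescaling of the a_i leaves the minimizer unchanged); the data, step size, and iteration count are illustrative.

% Sanity-check sketch for Problem 2 on synthetic data
rng(0);
n = 200; d = 3;
X = [ones(n,1), randn(n, d-1)];          % design matrix with a bias column
w_true = [1; -2; 0.5];
sigma = 0.1 + 2*rand(n,1);               % known, example-specific noise standard deviations
y = X*w_true + sigma .* randn(n,1);

a = 1 ./ (sigma.^2);                     % weights (inverse variance, up to a constant)
A = diag(a);

% Closed-form weighted least squares
w_closed = (X' * A * X) \ (X' * A * y);

% Batch gradient descent on the weighted squared loss
w = zeros(d,1); eta = 1e-4;
for t = 1:5000
    grad = 2 * X' * (a .* (X*w - y));    % gradient of J(w)
    w = w - eta * grad;
end
disp([w_true, w_closed, w]);             % columns: true, closed form, gradient descent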

3. Decision theory: working with expectations. [6pt]


In this problem, you will analyze a scenario where the Maximum A-Posteriori (MAP) decision rule,
which you learned in class, is not appropriate. Instead, you’ll explore how to make decisions based on
minimizing expected costs.
Consider a spam filter that predicts whether an email is spam, using probabilistic predictions. For
this filter, there are costs associated with making errors (misclassifying emails), but these costs are not
symmetric. Misclassifying a non-spam email as spam (i.e., filtering out an important email) is more
costly than misclassifying a spam email as non-spam.
The following table shows the cost of each possible outcome:


                       true label y
 predicted label ŷ   non-spam   spam
 non-spam               0         1
 spam                  10         0

Table 1: A misclassification cost matrix for the spam filter problem.

• If the filter’s prediction is correct, there is no cost.


• If a non-spam email is classified as spam, there is a cost of 10.
• If a spam email is classified as non-spam, there is a cost of 1.

Here we will go through some questions to help you figure out how to use the probability and
misclassification costs to make predictions.

(a) (2 pts) You received an email for which the spam filter predicts that it is spam with probability p = 0.8.
We want to make the decision that minimizes the expected cost.
Question: Should you classify this particular email as spam or non-spam? [Hint: Compare the
expected cost of classifying the email as spam versus non-spam and choose the classification that
results in the lower expected cost; a small numeric sketch of this comparison is given after part (c).]

(b) (2 pts) The MAP decision rule would classify an email as spam if p > 0.5, but this rule does not
minimize expected cost in this case. We need a new rule that compares p to a different threshold
θ. The value of θ should be chosen to minimize the expected cost based on the costs in the table.
Question: What is the value of θ that works for the costs specified in Table 1? [Hint: To find
the threshold θ, set up the decision rule by comparing the expected cost of each decision, as you
did in (a), then solve for p in terms of the costs.]

(c) (2 pts) Now, imagine that the optimal decision rule would use θ = 1/5 as the threshold for
classifying an email as spam. Question: Can you provide a new cost table where this would be
the case? [Hint: Use the relationship between the costs and θ that you derived in part (b). Based
on this relationship, adjust the misclassification costs in the table to achieve θ = 1/5.]
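The following is a minimal MATLAB sketch of the expected-cost comparison hinted at in part (a), using the costs from Table 1; the variable names are illustrative, and the thresholds asked for in (b) and (c) are deliberately not computed here.

% Expected-cost comparison for Problem 3(a)
p = 0.8;                                          % predicted probability that the email is spam
cost_if_predict_spam    = (1 - p) * 10 + p * 0;   % pay 10 when the email is actually non-spam
cost_if_predict_nonspam = (1 - p) * 0  + p * 1;   % pay 1 when the email is actually spam
fprintf('E[cost | predict spam]     = %.2f\n', cost_if_predict_spam);
fprintf('E[cost | predict non-spam] = %.2f\n', cost_if_predict_nonspam);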

4. Maximum A-Posteriori Estimation. [8pt] Suppose we observe the values of n IID random vari-
ables X1 , . . . , Xn drawn from a single Bernoulli distribution with parameter θ. In other words, for
each Xi , we know that P (Xi = 1) = θ and P (Xi = 0) = 1 − θ. In the Bayesian framework, we
treat θ as a random variable, and use a prior probability distribution over θ to express our prior
knowledge/preference about θ. In this framework, X1 , . . . , Xn can be viewed as generated by:
• First, the value of θ is drawn from a given prior probability distribution
• Second, X1 , . . . , Xn are drawn independently from a Bernoulli distribution with this θ value.

In this setting, Maximum A-Posteriori (MAP) estimation is a way to estimate θ by finding the value
that maximizes the posterior probability, given both its prior distribution and the observed data. The
MAP estimate of θ is given by:

θ̂_MAP = argmax_θ̂ P(θ = θ̂ | X_1, ..., X_n)

By applying Bayes' theorem (and dropping the denominator, which does not depend on θ̂), this becomes:

θ̂_MAP = argmax_θ̂ P(X_1, ..., X_n | θ = θ̂) P(θ = θ̂) = argmax_θ̂ L(θ̂) p(θ̂)

where L(θ̂) is the likelihood function of the data given θ, and p(θ̂) is the prior distribution over θ.


Now consider using a beta distribution as the prior: θ ∼ Beta(α, β), whose PDF function is

p(θ̂) = θ̂^(α−1) (1 − θ̂)^(β−1) / B(α, β)

where B(α, β) is a normalizing constant.

(a) (3 pts) Derive the posterior distribution p(θ̂ | X_1, ..., X_n, α, β). Comparing the form of the posterior
distribution with that of the beta distribution, you will see that the posterior is also a beta distribution.
What are the updated α and β parameters for the posterior?
(b) (2 pts) Suppose we use Beta(2, 2) as the prior. What Beta distribution do we get for the posterior
after we observe 5 coin tosses, 2 of which are heads? What is the posterior distribution of θ after
we observe 50 coin tosses, 20 of which are heads? (You don't need to write out the distributions;
simply providing the α and β parameters will suffice.)
(c) (1pt) Plot the pdf function of the prior Beta(2, 2) and the two posterior distributions. You can
use any software (e.g., R, Python, Matlab) for this plot.
(d) (2 pts) Assume that θ = 0.4 is the true probability. As we observe more and more coin tosses from
this coin, how will the shape of the posterior change? Will the MAP estimate converge toward the
true value?
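The MATLAB sketch below (illustrative only) simulates coin tosses with a true θ of 0.4, updates a Beta(2, 2) prior using the same α/β update as in part (b), and reports the MAP estimate, assuming the standard fact that the mode of Beta(a, b) is (a − 1)/(a + b − 2) for a, b > 1.

% Posterior-concentration sketch for Problem 4(d)
rng(0);
theta_true = 0.4;  alpha0 = 2;  beta0 = 2;
for n = [10, 100, 1000, 10000]
    tosses = rand(n, 1) < theta_true;            % n Bernoulli(0.4) samples
    a_post = alpha0 + sum(tosses);               % alpha + number of heads
    b_post = beta0 + n - sum(tosses);            % beta  + number of tails
    map_est = (a_post - 1) / (a_post + b_post - 2);
    fprintf('n = %5d: posterior Beta(%d, %d), MAP = %.3f\n', n, a_post, b_post, map_est);
end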

5. Perceptron. [3pt] Assume a data set consists of only a single data point {(x, +1)}. How many times
would the Perceptron algorithm misclassify this point x before convergence? What if the initial weight
vector w0 was initialized randomly rather than as the all-zero vector? Derive the number of mistakes as a
function of w0 and x.

(a) (1 pt) Case 1: w0 = 0.

(b) (2 pts) Case 2: w0 ≠ 0.
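A small empirical check (not a derivation) that counts mistakes for a single positive example follows; it assumes the standard Perceptron convention of predicting sign(w^T x), counting a mistake whenever y(w^T x) ≤ 0, and updating w ← w + yx on each mistake. The example point and initial weights are arbitrary.

% Mistake-counting sketch for Problem 5
x = [2; -1];  y = 1;                   % a single hypothetical training point with label +1
w = [-5; 3];                           % case (b); set w = zeros(2, 1) for case (a)
mistakes = 0;
while y * (w' * x) <= 0
    w = w + y * x;                     % Perceptron update on a mistake
    mistakes = mistakes + 1;
end
fprintf('mistakes before convergence: %d\n', mistakes);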

6. Bonus: MLE for multi-class logistic regression. [6 pts] Consider the maximum likelihood
estimation problem for multi-class logistic regression using the soft-max function defined below:

p(y = k | x) = exp(w_k^T x) / Σ_{j=1}^K exp(w_j^T x)

We can write out the likelihood function as:


L(w) = Π_{i=1}^N Π_{k=1}^K p(y = k | x_i)^(y_ik)

where y_ik is an indicator variable taking value 1 if y_i = k and 0 otherwise.

(a) (1 pt) Provide the log-likelihood function.


(b) (5 pts) Derive the gradient of the log-likelihood function w.r.t. the weight vector w_c of class c.
[Hint: the solution to this problem is provided in the logistic regression lecture slides. You just
need to fill in the missing derivation. Note that for any example x_i, the denominator in the
softmax function, Σ_j exp(w_j^T x_i), is the same for all k; denoting it as z_i makes it simpler to work
through the derivation, but be sure to remember that z_i is a function of all the w_k's.]
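As a hedged numerical check, assuming the standard softmax gradient form ∂ log L/∂w_c = Σ_i (y_ic − p(y = c | x_i)) x_i that the hint refers to, the MATLAB sketch below compares that expression against a finite-difference estimate on random data; all data, dimensions, and names are illustrative.

% Gradient check sketch for Problem 6(b)
rng(1);
N = 50; d = 4; K = 3;
X = randn(N, d);
Y = zeros(N, K);
for i = 1:N, Y(i, randi(K)) = 1; end           % random one-hot labels
W = randn(d, K);                               % one weight vector per class (columns)

P = exp(X * W) ./ sum(exp(X * W), 2);          % softmax probabilities, one row per example
logL = sum(sum(Y .* log(P)));                  % log-likelihood

c = 2;
g_analytic = X' * (Y(:, c) - P(:, c));         % claimed gradient w.r.t. w_c

eps_ = 1e-6; j = 1;                            % finite-difference check on one coordinate
Wp = W; Wp(j, c) = Wp(j, c) + eps_;
Pp = exp(X * Wp) ./ sum(exp(X * Wp), 2);
logLp = sum(sum(Y .* log(Pp)));
fprintf('analytic %.6f vs numeric %.6f\n', g_analytic(j), (logLp - logL) / eps_);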


[Scanned handwritten solution pages (CamScanner) follow here.]
Code.m (MATLAB plotting script for Problem 4(c)):

clear;
close all;
clc;

% Grid of theta values on [0, 1]
x = 0:0.001:1;

% Prior Beta(2, 2); betapdf requires the Statistics and Machine Learning Toolbox
prior = betapdf(x, 2, 2);

% Posterior after 5 tosses with 2 heads: Beta(2+2, 2+3) = Beta(4, 5)
after1 = betapdf(x, 4, 5);

% Posterior after 50 tosses with 20 heads: Beta(2+20, 2+30) = Beta(22, 32)
after2 = betapdf(x, 22, 32);

% Plot
figure;
plot(x, prior, 'DisplayName', 'Prior Beta(2,2)', 'LineWidth', 1.5);
hold on;
plot(x, after1, 'DisplayName', 'Posterior Beta(4,5)', 'LineWidth', 1.5);
plot(x, after2, 'DisplayName', 'Posterior Beta(22,32)', 'LineWidth', 1.5);
xlabel('Theta');
ylabel('Density');
legend('show');
grid on;
hold off;
[Figure: PDFs of the prior Beta(2,2) and the posteriors Beta(4,5) and Beta(22,32), plotted as density versus θ over [0, 1].]
