
MLPR Tutorial¹ Sheet 3

Reminders: Attempt the tutorial questions, and ideally discuss them, before your tutorial.
You can seek clarifications and hints on the class forum. Full answers will be released.
This week has less linear algebra! Try to spend some time preparing clear explanations.

1. A Gaussian classifier:
A training set consists of one-dimensional examples from two classes. The training
examples from class 1 are {0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25} and the examples
from class 2 are {0.9, 0.8, 0.75, 1.0}.

a) Fit a one-dimensional Gaussian to each class by matching the mean and variance.
Also estimate the class probabilities π1 and π2 by matching the observed class
fractions. (This procedure fits the model with maximum likelihood: it selects the
parameters that give the training data the highest probability.) Sketch a plot of
the scores p(x, y) = P(y) p(x | y) for each class y, as functions of input location x.

b) What is the probability that the test point x = 0.6 belongs to class 1? Mark the
decision boundary/ies on your sketch, the location(s) where P(class 1 | x) =
P(class 2 | x) = 0.5. You are not required to calculate the location(s) exactly. (A
numerical check is sketched in the code after this question.)

c) Are the decisions that the model makes reasonable for very negative x and very
positive x? Are there any changes we could consider making to the model if we
wanted to change the model’s asymptotic behaviour?
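
A minimal numerical sketch (assuming NumPy; not part of the question) for checking parts
a) and b): it fits a Gaussian to each class by maximum likelihood and evaluates the
posterior P(class 1 | x) at the test point.

```python
import numpy as np

x1 = np.array([0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25])  # class 1 examples
x2 = np.array([0.9, 0.8, 0.75, 1.0])                                 # class 2 examples

def gauss_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var, evaluated at x."""
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

# Maximum-likelihood fits: the sample mean, and the variance normalized by N
# (np.var's default), not N - 1.
mu1, var1 = x1.mean(), x1.var()
mu2, var2 = x2.mean(), x2.var()

# Class probabilities estimated from the observed class fractions.
N = x1.size + x2.size
pi1, pi2 = x1.size / N, x2.size / N

# Scores p(x, y) = P(y) p(x | y) and the posterior at the test point x = 0.6.
x_test = 0.6
score1 = pi1 * gauss_pdf(x_test, mu1, var1)
score2 = pi2 * gauss_pdf(x_test, mu2, var2)
print("P(class 1 | x = 0.6) =", score1 / (score1 + score2))
```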

2. Gradient descent:
Let E(w) be a differentiable function. Consider the gradient descent procedure

w^(t+1) ← w^(t) − η ∇_w E.

a) Are the following true or false? Prepare a clear explanation, stating any necessary
assumptions:

i) Let w^(1) be the result of taking one gradient step. Then the error never gets
worse, i.e., E(w^(1)) ≤ E(w^(0)).

ii) There exists some choice of the step size η such that E(w^(1)) < E(w^(0)).

b) A common programming mistake is to forget the minus sign in either the descent
procedure or in the gradient evaluation. As a result one unintentionally writes a
procedure that does w^(t+1) ← w^(t) + η ∇_w E. What happens?
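
A tiny numerical sketch of part b), using the made-up objective E(w) = w² (so ∇E = 2w),
which is not from the sheet:

```python
def grad_E(w):
    """Gradient of E(w) = w**2."""
    return 2.0 * w

eta = 0.1
w_correct, w_flipped = 1.0, 1.0
for t in range(20):
    w_correct -= eta * grad_E(w_correct)  # gradient descent: moves towards the minimum at 0
    w_flipped += eta * grad_E(w_flipped)  # sign mistake: moves uphill every step

print(w_correct)  # shrinks towards 0
print(w_flipped)  # grows geometrically, so the 'error' E(w) blows up
```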

3. Maximum likelihood and logistic regression:


Maximum likelihood logistic regression maximizes the log probability of the labels,

∑_n log P(y^(n) | x^(n), w),

with respect to the weights w. As usual, y^(n) is a binary label at input location x^(n).
The training data is said to be linearly separable if the two classes can be completely
separated by a hyperplane. That means we can find a decision boundary

P(y^(n) = 1 | x^(n), w, b) = σ(w^T x^(n) + b) = 0.5,   where σ(a) = 1 / (1 + e^(−a)),

such that all the y = 1 labels are on one side (with probability greater than 0.5), and all
of the y ≠ 1 labels are on the other side.

a) Show that if the training data is linearly separable with a decision hyperplane
specified by w and b, the data is also separable with the boundary given by w̃
and b̃, where w̃ = cw and b̃ = cb for any scalar c > 0.

b) What consequence does the above result have for maximum likelihood training
of logistic regression for linearly separable data?
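
A quick numerical check of parts a) and b) on a made-up one-dimensional data set (assuming
NumPy; the data are not from the sheet): scaling a separating (w, b) by c > 0 leaves the
decision boundary where it is, but pushes σ(c(wx + b)) towards 0 or 1, so the log-likelihood
keeps increasing towards 0 as c grows.

```python
import numpy as np

# Linearly separable toy data: y = 1 for x > 0, y = 0 for x < 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = 1.0, 0.0  # a separating boundary at x = 0

for c in [1.0, 10.0, 100.0]:
    a = c * (w * x + b)  # activations under the scaled parameters (cw, cb)
    # Log-likelihood, written with logaddexp for numerical stability:
    # -log sigma(a) = log(1 + e^(-a)) and -log(1 - sigma(a)) = log(1 + e^(a)).
    log_lik = -np.sum(y * np.logaddexp(0, -a) + (1 - y) * np.logaddexp(0, a))
    print(c, log_lik)
```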

4. Logistic regression and maximum likelihood: (Murphy, Exercise 8.7, by Jaakkola.)


Consider the following data set (a two-dimensional training set with binary labels, shown
as a figure in the original sheet but not reproduced here):

a) Suppose that we fit a logistic regression model with a bias weight w0, that
is p(y = 1 | x, w) = σ(w0 + w1 x1 + w2 x2), by maximum likelihood, obtaining
parameters ŵ. Sketch a possible decision boundary corresponding to ŵ. Is your
answer unique? How many classification errors does your method make on the
training set?

b) Now suppose that we regularize only the w0 parameter, that is, we minimize

J0(w) = −ℓ(w) + λw0²,

where ℓ is the log-likelihood of w (the log-probability of the labels given those
parameters).
Suppose λ is a very large number, so we regularize w0 all the way to 0, but
all other parameters are unregularized. Sketch a possible decision boundary.
How many classification errors does your method make on the training set?
Hint: consider the behaviour of simple linear regression, w0 + w1 x1 + w2 x2, when
x1 = x2 = 0.

c) Now suppose that we regularize only the w1 parameter, i.e., we minimize


J1(w) = −ℓ(w) + λw1².
Again suppose λ is a very large number. Sketch a possible decision boundary.
How many classification errors does your method make on the training set?

d) Now suppose that we regularize only the w2 parameter, i.e., we minimize


J2(w) = −ℓ(w) + λw2².
Again suppose λ is a very large number. Sketch a possible decision boundary.
How many classification errors does your method make on the training set?
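
A sketch (assuming NumPy/SciPy, and using made-up two-dimensional data because the sheet's
figure is not reproduced above) of minimizing J_j(w) = −ℓ(w) + λ w_j², penalizing only one
chosen weight as in parts b)–d):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical binary-labelled 2-D points; substitute the data set from the figure.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
Phi = np.hstack([np.ones((X.shape[0], 1)), X])  # columns 1, x1, x2 match (w0, w1, w2)

def objective(w, reg_idx, lam):
    a = Phi @ w
    # Negative log-likelihood, written with logaddexp for numerical stability.
    nll = np.sum(y * np.logaddexp(0, -a) + (1 - y) * np.logaddexp(0, a))
    return nll + lam * w[reg_idx] ** 2  # penalize only the chosen weight

lam = 1e4  # a 'very large' lambda drives the chosen weight to (nearly) zero
for j in range(3):
    w_hat = minimize(objective, np.zeros(3), args=(j, lam)).x
    print("regularizing w%d: w_hat =" % j, np.round(w_hat, 3))
```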

1. Parts of this tutorial sheet are based on previous versions by Amos Storkey, Charles Sutton, and Chris Williams.

MLPR:tut3 Iain Murray, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2018/
