Tut3 Questions
Parts of this tutorial sheet are based on previous versions by Amos Storkey, Charles Sutton, and Chris Williams.
Reminders: Attempt the tutorial questions, and ideally discuss them, before your tutorial.
You can seek clarifications and hints on the class forum. Full answers will be released.
This week has less linear algebra! Try to spend some time preparing clear explanations.
1. A Gaussian classifier:
A training set consists of one-dimensional examples from two classes. The training
examples from class 1 are {0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25} and the examples
from class 2 are {0.9, 0.8, 0.75, 1.0}.
a) Fit a one-dimensional Gaussian to each class by matching the mean and variance.
Also estimate the class probabilities π1 and π2 by matching the observed class
fractions. (This procedure fits the model with maximum likelihood: it selects the
parameters that give the training data the highest probability.) Sketch a plot of
the scores p(x, y) = P(y) p(x | y) for each class y, as functions of input location x.
b) What is the probability that the test point x = 0.6 belongs to class 1? Mark the
decision boundary or boundaries on your sketch, i.e., the location(s) where
P(class 1 | x) = P(class 2 | x) = 0.5. You are not required to calculate the
location(s) exactly. (A numerical sketch for checking parts a) and b) follows after
this question.)
c) Are the decisions that the model makes reasonable for very negative x and very
positive x? Are there any changes we could consider making to the model if we
wanted to change the model’s asymptotic behaviour?
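As a rough numerical check for parts a) and b), the NumPy sketch below fits each class-conditional Gaussian by matching its mean and variance, estimates the class probabilities from the observed class fractions, and evaluates the scores and the posterior at x = 0.6. The variable names and plotting grid are illustrative choices, not part of the question.

```python
import numpy as np
import matplotlib.pyplot as plt

x1 = np.array([0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25])  # class 1
x2 = np.array([0.9, 0.8, 0.75, 1.0])                                  # class 2

# Maximum likelihood fit: match the mean and variance of each class.
# (np.var uses the 1/N estimator by default, which is the maximum likelihood choice.)
mu1, var1 = x1.mean(), x1.var()
mu2, var2 = x2.mean(), x2.var()

# Class probabilities pi1 and pi2 from the observed class fractions.
N1, N2 = len(x1), len(x2)
pi1, pi2 = N1 / (N1 + N2), N2 / (N1 + N2)

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

# Part a): scores p(x, y) = P(y) p(x | y) over a grid of input locations.
xx = np.linspace(-0.5, 1.5, 401)
plt.plot(xx, pi1 * gauss_pdf(xx, mu1, var1), label='class 1')
plt.plot(xx, pi2 * gauss_pdf(xx, mu2, var2), label='class 2')
plt.xlabel('x'); plt.legend(); plt.show()

# Part b): posterior P(class 1 | x = 0.6) by Bayes' rule.
s1 = pi1 * gauss_pdf(0.6, mu1, var1)
s2 = pi2 * gauss_pdf(0.6, mu2, var2)
print('P(class 1 | x = 0.6) =', s1 / (s1 + s2))
```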
2. Gradient descent:
Let E(w) be a differentiable function. Consider the gradient descent procedure
w(t+1) ← w(t) − η ∇w E.
a) Are the following true or false? Prepare a clear explanation, stating any necessary
assumptions:
i) Let w(1) be the result of taking one gradient step from an initial weight
vector w(0). Then the error never gets worse, i.e., E(w(1)) ≤ E(w(0)).
ii) There exists some choice of the step size η such that E(w(1) ) < E(w(0) ).
b) A common programming mistake is to forget the minus sign in either the descent
procedure or in the gradient evaluation, so that one unintentionally writes a
procedure that does w(t+1) ← w(t) + η ∇w E. What happens? (A small numerical
illustration follows after this question.)
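If it helps to experiment with parts a) and b), here is a minimal sketch using the made-up function E(w) = 0.5 w², whose gradient is w and whose minimizer is w = 0; the function and step sizes are arbitrary choices, not part of the question.

```python
# Illustrative objective: E(w) = 0.5 * w**2, so grad E(w) = w and the minimizer is w = 0.
def grad_E(w):
    return w

eta = 0.1
w_descent, w_flipped = 1.0, 1.0
for _ in range(25):
    w_descent -= eta * grad_E(w_descent)   # correct sign: gradient descent
    w_flipped += eta * grad_E(w_flipped)   # forgotten minus sign: gradient ascent
print(w_descent, w_flipped)  # descent shrinks towards 0; the flipped version grows geometrically

# Relevant to a) i): even with the correct sign, too large a step can make E worse.
w_big = 1.0
for _ in range(5):
    w_big -= 2.5 * grad_E(w_big)           # step size too large: overshoots and diverges
print(w_big, 0.5 * w_big**2)               # E has increased from E(1.0) = 0.5
```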
3. Maximum likelihood and logistic regression:
Maximum likelihood logistic regression maximizes the log-likelihood
ℓ(w) = ∑n log p(y(n) | x(n), w),
with respect to the weights w. As usual, y(n) is a binary label at input location x(n).
The training data is said to be linearly separable if the two classes can be completely
separated by a hyperplane decision boundary.
a) Show that if the training data is linearly separable with a decision hyperplane
specified by w and b, the data is also separable with the boundary given by w̃
and b̃, where w̃ = cw and b̃ = cb for any scalar c > 0.
b) What consequence does the above result have for maximum likelihood training
of logistic regression for linearly separable data? (A numerical illustration follows
after this question.)
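A quick numerical illustration of both parts, using a tiny made-up separable dataset and separating parameters (none of these numbers come from the question): scaling w and b by c > 0 leaves every point on the same side of the boundary, while the log-likelihood keeps creeping up towards 0 as c grows.

```python
import numpy as np

# Tiny illustrative separable dataset and a separating (w, b); all values are made up.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0])
b = -1.1

for c in [1.0, 10.0, 100.0]:
    a = X @ (c * w) + c * b                  # activations with the scaled parameters
    same_side = ((a > 0) == (y == 1)).all()  # the decision boundary is unchanged
    # Log-likelihood of the labels, computed stably: log sigmoid(a) = -logaddexp(0, -a).
    log_lik = -np.sum(np.logaddexp(0.0, -np.where(y == 1, a, -a)))
    print(f'c = {c:5.0f}:  still separated: {same_side},  log-likelihood = {log_lik:.6f}')
```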
4. Logistic regression with a regularized bias weight:
a) Suppose that we fit a logistic regression model with a bias weight w0, that
is p(y = 1 | x, w) = σ(w0 + w1 x1 + w2 x2), by maximum likelihood, obtaining
parameters ŵ. Sketch a possible decision boundary corresponding to ŵ. Is your
answer unique? How many classification errors does your method make on the
training set?
b) Now suppose that we regularize only the w0 parameter, that is, we minimize
J0(w) = −ℓ(w) + λw0²,
where ℓ is the log-likelihood of w (the log-probability of the labels given those
parameters).
Suppose λ is a very large number, so we regularize w0 all the way to 0, but
all other parameters are unregularized. Sketch a possible decision boundary.
How many classification errors does your method make on the training set?
(A code sketch contrasting the two settings follows after the hint.)
Hint: consider the behaviour of simple linear regression, w0 + w1 x1 + w2 x2 when
x1 = x2 = 0.
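For this last question the training set isn't reproduced in the text above, so the sketch below uses a made-up 2D dataset with a similar flavour: it is linearly separable, but not by any boundary through the origin. It fits the model by plain gradient descent on the negative log-likelihood, once with w0 free and once with w0 pinned at 0 (the limit of a very large λ). The data, step size and number of steps are all illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up 2D dataset standing in for the question's training set: it is linearly
# separable, but not by any decision boundary that passes through the origin.
X = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5],    # class 0
              [3.0, 3.0], [3.5, 3.0], [3.0, 3.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])
X_aug = np.hstack([np.ones((len(X), 1)), X])         # column of ones carries the bias w0

def fit_logreg(pin_bias_at_zero, eta=0.05, steps=20_000):
    """Gradient descent on the negative log-likelihood, with w = [w0, w1, w2]."""
    w = np.zeros(3)
    for _ in range(steps):
        grad = X_aug.T @ (sigmoid(X_aug @ w) - y)    # sum_n (sigma(a_n) - y_n) x_n
        if pin_bias_at_zero:
            grad[0] = 0.0                            # very-large-lambda limit: w0 stays at 0
        w = w - eta * grad
    return w

for pin in [False, True]:
    w = fit_logreg(pin)
    errors = np.sum(((X_aug @ w) > 0) != (y == 1))
    print(f'w0 pinned at 0: {pin!s:5}  w = {np.round(w, 2)}  '
          f'training errors = {errors}  p(y=1 | x=(0,0)) = {sigmoid(w[0]):.2f}')
```

With w0 free, maximum likelihood finds a separating boundary (and, as in question 3, would keep growing the weights if the optimizer ran for longer). With w0 pinned at 0 the boundary must pass through the origin, where the predicted probability is σ(0) = 0.5, and for a dataset like this one it can no longer separate the classes.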