6.867 Machine Learning: Mid-Term Exam, October 13, 2004
Mid-term exam
Problem 1
[Figure: three panels, A, B, and C, each plotting the noise (residual) against x over [−1, 1].]

                       A     B     C
linear regression     ( )   ( )   ( )
quadratic regression  ( )   ( )   ( )
Problem 2
Here we explore a regression model where the noise variance is a function of the input
(variance increases as a function of the input). Specifically

y = wx + ε

where the noise ε is normally distributed with mean 0 and standard deviation σx. The
value of σ is assumed known and the input x is restricted to the interval [1, 4]. We can
write the model more compactly as y ∼ N(wx, σ²x²).
If we let x vary within [1, 4] and sample outputs y from this model with some w, the
regression plot might look like
[Figure: a sample regression plot of y against x for x in [1, 4].]
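As a concrete illustration, a minimal sketch that generates such a plot (NumPy/Matplotlib; the particular values of w and σ below are arbitrary choices for illustration, not part of the problem):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative values only (not given in the problem).
w_true, sigma = 2.0, 0.5

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 4.0, size=200)                        # inputs restricted to [1, 4]
y = w_true * x + sigma * x * rng.standard_normal(x.size)   # noise std dev grows as sigma * x

plt.scatter(x, y, s=10)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Samples from y ~ N(w x, sigma^2 x^2)")
plt.show()
```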
2. Suppose we now have n training points and targets {(x1, y1), (x2, y2), . . . , (xn, yn)},
where each xi is chosen at random from [1, 4] and the corresponding yi is subsequently
sampled from yi ∼ N(w∗xi, σ²xi²) with some true underlying parameter value w∗; the
value of σ² is the same as in our model.
(a) (3 points) What is the maximum-likelihood estimate of w as a function of the
training data?
(b) (3 points) What is the variance of this estimator due to the noise in the target
outputs, as a function of n and σ², for fixed inputs x1, . . . , xn? For later use
(if you skip this part), you may denote the answer by V(n, σ²).
(c) (2 points) Which xn+1 would we choose (within [1, 4]) if we were to select the
next x as the one with the maximum variance of the prediction f(x; ŵn) = ŵn x?
(d) (T/F – 2 points) Since the variance of f(x; ŵn) only depends on x,
n, and σ², we could equally well select the next point at random from
[1, 4] and obtain the same reduction in the maximum variance.
[Figure: two conditionals, labeled (1) and (2), each plotting P(y = 1|x, ŵ) against x over [−2, 2]; the three labeled points are marked y = 0, y = 1, y = 0 from left to right.]
Figure 1: Two possible logistic regression solutions for the three labeled points.
Problem 3
Consider a simple one-dimensional logistic regression model

P(y = 1|x, w) = g(w0 + w1x)
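Here g(·) denotes the logistic (sigmoid) function. As a minimal sketch of what each plotted conditional computes (the weights below are placeholders for illustration, not the solutions shown in Figure 1):

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y1(x, w0, w1):
    """P(y = 1 | x, w) for the one-dimensional logistic regression model."""
    return g(w0 + w1 * x)

# Placeholder weights, purely for illustration.
xs = np.linspace(-2, 2, 9)
print(p_y1(xs, w0=0.0, w1=1.0))
```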
(a) (2 points) Please indicate the number of classification errors each condi-
tional makes on the labeled examples in Figure 1:
Conditional (1) makes ( ) classification errors
Conditional (2) makes ( ) classification errors
(b) (3 points) One of the conditionals in Figure 1 corresponds to the
maximum likelihood setting of the parameters ŵ based on the labeled
data in the figure. Which one is the ML solution (1 or 2)?
(c) (2 points) Would adding a regularization penalty |w1|²/2 to the log-
likelihood estimation criterion affect your choice of solution (Y/N)?
[Figure: expected log-likelihood of test labels plotted against the number of training examples (0 to 300).]
Figure 2: The expected log-likelihood of test labels as a function of the number of training
examples.
2. (4 points) We can estimate the logistic regression parameters more accurately with
more training data. Figure 2 shows the expected log-likelihood of test labels for a
simple logistic regression model as a function of the number of training examples and
labels. Mark in the figure the structural error (SE) and approximation error (AE),
where “error” is measured in terms of log-likelihood.
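One way to make the two quantities precise in terms of expected log-likelihood (this reading of the figure introduces the notation L(n), L∞, and L*, which is not in the exam itself):

```latex
% L(n): expected test log-likelihood after training on n examples (the plotted curve),
% L_\infty: the asymptote of the curve as n -> infinity (limited only by the model class),
% L^*: the best achievable expected log-likelihood under the true label distribution.
\[
  \mathrm{SE} \;=\; L^{\ast} - L_{\infty},
  \qquad
  \mathrm{AE}(n) \;=\; L_{\infty} - L(n).
\]
```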
Problem 4

[Figure: the four training points in the (x1, x2) plane; (0, 0) and (1, 1) are marked x, while (0, 1) and (1, 0) are marked o.]
Here we will look at methods for selecting input features for a logistic regression model
P(y = 1|x, w) = g(w0 + w1x1 + w2x2)
The available training examples are very simple, involving only binary-valued inputs:

  Number of copies   x1   x2   y
        10            1    1   1
        10            0    1   0
        10            1    0   0
        10            0    0   1
So, for example, there are 10 copies of x = [1, 1]T in the training set, all labeled y = 1.
The correct label is actually a deterministic function of the two features: y = 1 if x1 = x2
and zero otherwise.
We define greedy selection in this context as follows: we start with no features (train only
with w0 ) and successively try to add new features provided that each addition strictly
improves the training log-likelihood. We use no other stopping criterion.
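A minimal sketch of this greedy procedure on the table above (plain NumPy with a simple gradient-ascent fit; the learning rate, iteration count, and improvement tolerance are illustrative assumptions):

```python
import numpy as np

# Training set: 10 copies of each (x1, x2) row with its label y, as in the table above.
X_full = np.repeat(np.array([[1, 1], [0, 1], [1, 0], [0, 0]], dtype=float), 10, axis=0)
y = np.repeat(np.array([1, 0, 0, 1], dtype=float), 10)

def fit_log_likelihood(X, y, lr=0.1, iters=5000):
    """Fit logistic regression (bias plus the given columns) by gradient ascent
    and return the maximized training log-likelihood."""
    Xb = np.hstack([np.ones((len(y), 1)), X])      # prepend a bias column for w0
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)          # gradient of the mean log-likelihood
    p = np.clip(1.0 / (1.0 + np.exp(-Xb @ w)), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Greedy forward selection: add a feature only if it strictly improves
# the training log-likelihood; stop when no single addition helps.
selected = []
best_ll = fit_log_likelihood(X_full[:, []], y)     # start with w0 only
improved = True
while improved:
    improved = False
    for j in range(X_full.shape[1]):
        if j in selected:
            continue
        ll = fit_log_likelihood(X_full[:, selected + [j]], y)
        if ll > best_ll + 1e-6:
            selected, best_ll, improved = selected + [j], ll, True
            break
print("selected feature indices:", selected, "log-likelihood:", best_ll)
```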
3. (3 points) Suppose we define another possible feature to include, a function of x1
and x2 . Which of the following features, if any, would permit us to correctly classify
all the training examples when used in combination with x1 and x2 in the logistic
regression model:
( ) x1 − x2
( ) x1 x2
( ) x2²
Problem 5
[Figure 4: the four training points (0, 0), (2, 2), (h, 1), and (0, 3) plotted in the plane, with h drawn at 1/2.]
Suppose we only have four training examples in two dimensions (see Figure 4):
positive examples at x1 = [0, 0]T, x2 = [2, 2]T, and
negative examples at x3 = [h, 1]T, x4 = [0, 3]T,
where we treat h ≥ 0 as a parameter.
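To explore the geometry numerically, a minimal sketch (assuming scikit-learn; the hard-margin SVM is approximated with a large C, and 1/||w|| equals the geometric margin only when the four points are separable):

```python
import numpy as np
from sklearn.svm import SVC

def margin_for_h(h):
    """Fit a (nearly) hard-margin linear SVM to the four training points
    and return the geometric margin 1 / ||w||."""
    X = np.array([[0, 0], [2, 2], [h, 1], [0, 3]], dtype=float)
    y = np.array([1, 1, -1, -1])
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    w = clf.coef_.ravel()
    return 1.0 / np.linalg.norm(w)

print(margin_for_h(0.5))   # h = 1/2, as drawn in the figure
```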
1. (2 points) How large can h ≥ 0 be so that the training points are still
linearly separable?
3. (4 points) What is the margin achieved by the maximum-margin boundary as a
function of h?
4. (3 points) Assume that h = 1/2 (as in the figure) and that we can
only observe the x2-component of the input vectors. Without the other
component, the labeled training points reduce to (0, y = 1), (2, y = 1),
(1, y = −1), and (3, y = −1). What is the lowest order p of polynomial
kernel that would allow us to correctly classify these points?
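One way to experiment with this question (assuming scikit-learn; gamma=1 and coef0=1 give the kernel (1 + x z)^p, and the degrees looped over are just a range to try):

```python
import numpy as np
from sklearn.svm import SVC

# One-dimensional training points (the x2-components) and their labels.
X = np.array([[0.0], [2.0], [1.0], [3.0]])
y = np.array([1, 1, -1, -1])

for p in range(1, 5):
    clf = SVC(kernel="poly", degree=p, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)
    print(f"degree {p}: training accuracy {clf.score(X, y):.2f}")
```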
Additional set of figures

[This appendix repeats, for convenience, the figures from the problems above.]