
6.867 Machine learning

Mid-term exam

October 13, 2004

(2 points) Your name and MIT ID:

Problem 1
[Three plots, labeled A, B, and C. Each shows "noise" on the vertical axis (from −1 to 1) against x on the horizontal axis (from −1 to 1).]

1. (6 points) Each plot above claims to represent prediction errors as a function of x for a trained regression model based on some dataset. Some of these plots could potentially be prediction errors for linear or quadratic regression models, while others couldn't. The regression models are trained with the least squares estimation criterion. Please indicate compatible models and plots. (A short least-squares residual sketch follows the answer table below.)

A B C
linear regression ( ) ( ) ( )
quadratic regression ( ) ( ) ( )
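The following is a rough sketch of what such plots depict: the residuals (prediction errors) of least-squares fits, viewed as a function of x, on made-up data; the data-generating choices are arbitrary. One property worth recalling is that the residuals of a least-squares fit that includes a constant term always average to zero.

```python
import numpy as np

# Arbitrary illustrative data; the point is only to show how prediction
# errors of least-squares fits behave as a function of x.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 0.5 * x + 0.3 * x**2 + 0.1 * rng.standard_normal(x.size)

for degree, name in [(1, "linear"), (2, "quadratic")]:
    coeffs = np.polyfit(x, y, degree)        # least-squares fit of the given degree
    residuals = y - np.polyval(coeffs, x)    # prediction errors as a function of x
    print(f"{name:9s}: mean residual = {residuals.mean():+.2e}")
```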

Problem 2
Here we explore a regression model where the noise variance is a function of the input
(variance increases as a function of input). Specifically

y = wx + ε

where the noise ε is normally distributed with mean 0 and standard deviation σx. The
value of σ is assumed known and the input x is restricted to the interval [1, 4]. We can
write the model more compactly as y ∼ N(wx, σ²x²).
If we let x vary within [1, 4] and sample outputs y from this model with some w, the
regression plot might look like
[Example regression plot: sampled y values (roughly 0 to 10) against x for x ∈ [1, 4].]
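As a rough illustration of this sampling process, the sketch below draws points from the model with arbitrary illustrative values for w and σ (neither is specified here):

```python
import numpy as np

# y ~ N(w*x, sigma^2 * x^2): the noise standard deviation is sigma * x,
# so the scatter around the line y = w*x widens as x increases.
rng = np.random.default_rng(0)
w, sigma, n = 2.0, 0.5, 200          # illustrative values only
x = rng.uniform(1.0, 4.0, size=n)    # inputs restricted to [1, 4]
y = w * x + sigma * x * rng.standard_normal(n)
```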

1. (2 points) How is the ratio y/x distributed for a fixed (constant) x?

2. Suppose we now have n training points and targets {(x1, y1), (x2, y2), . . . , (xn, yn)},
where each xi is chosen at random from [1, 4] and the corresponding yi is subsequently
sampled from yi ∼ N(w∗xi, σ²xi²) with some true underlying parameter value w∗; the
value of σ² is the same as in our model.

(a) (3 points) What is the maximum-likelihood estimate of w as a function of the
training data?

(b) (3 points) What is the variance of this estimator due to the noise in the target
outputs as a function of n and σ² for fixed inputs x1, . . . , xn? For later utility
(if you omit this answer) you can denote the answer as V(n, σ²).

Some potentially useful relations: if z ∼ N(µ, σ²), then az ∼ N(aµ, a²σ²) for a
fixed a. If z1 ∼ N(µ1, σ1²) and z2 ∼ N(µ2, σ2²) and they are independent, then
Var(z1 + z2) = σ1² + σ2².
3. In sequential active learning we are free to choose the next training input xn+1, here
within [1, 4], for which we will then receive the corresponding noisy target yn+1, sampled
from the underlying model. Suppose we already have {(x1, y1), (x2, y2), . . . , (xn, yn)}
and are trying to figure out which xn+1 to select. The goal is to choose xn+1 so as to
help minimize the variance of the predictions f(x; ŵn) = ŵn x, where ŵn is the maximum
likelihood estimate of the parameter w based on the first n training examples.

(a) (2 points) What is the variance of f(x; ŵn) due to the noise in the training
outputs as a function of x, n, and σ² given fixed (already chosen) inputs x1, . . . , xn?

(b) (2 points) Which xn+1 would we choose (within [1, 4]) if we were to next select
x with the maximum variance of f(x; ŵn)?

(c) (T/F – 2 points) Since the variance of f(x; ŵn) only depends on x,
n, and σ², we could equally well select the next point at random from
[1, 4] and obtain the same reduction in the maximum variance.

[Plot over x ∈ [−2, 2] of two conditional probabilities (vertical axis from 0 to 1), labeled (1) P(y = 1|x, ŵ) and (2) P(y = 1|x, ŵ), together with three labeled training points marked y = 0, y = 1, y = 0 from left to right.]
Figure 1: Two possible logistic regression solutions for the three labeled points.

Problem 3
Consider a simple one dimensional logistic regression model

P(y = 1|x, w) = g(w0 + w1x)

where g(z) = 1/(1 + exp(−z)) is the logistic function.
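As a small illustration (the parameter values and point locations below are made up, not those behind Figure 1), the induced classifier predicts y = 1 exactly when g(w0 + w1x) > 1/2, i.e. when w0 + w1x > 0:

```python
import numpy as np

def g(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def errors(x, y, w0, w1):
    """Count classification errors of the rule: predict 1 iff g(w0 + w1*x) > 0.5."""
    y_hat = (g(w0 + w1 * x) > 0.5).astype(int)
    return int(np.sum(y_hat != y))

# Made-up labeled points and parameter settings, for illustration only.
x = np.array([-1.0, 0.2, 1.2])
y = np.array([0, 1, 0])
for w0, w1 in [(0.0, 1.0), (-1.0, 3.0)]:
    print(f"w0={w0:+.1f}, w1={w1:+.1f}: {errors(x, y, w0, w1)} classification errors")
```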

1. Figure 1 shows two possible conditional distributions P(y = 1|x, w), viewed as a
function of x, that we can get by changing the parameters w.

(a) (2 points) Please indicate the number of classification errors for each conditional
given the labeled examples in the same figure.
Conditional (1) makes ( ) classification errors
Conditional (2) makes ( ) classification errors
(b) (3 points) One of the conditionals in Figure 1 corresponds to the
maximum likelihood setting of the parameters ŵ based on the labeled
data in the figure. Which one is the ML solution (1 or 2)?
(c) (2 points) Would adding a regularization penalty |w1|²/2 to the log-
likelihood estimation criterion affect your choice of solution (Y/N)?

[Plot of the expected log-likelihood of test labels (vertical axis, from −1.5 to 1) against the number of training examples (horizontal axis, 0 to 300).]

Figure 2: The expected log-likelihood of test labels as a function of the number of training
examples.

2. (4 points) We can estimate the logistic regression parameters more accurately with
more training data. Figure 2 shows the expected log-likelihood of test labels for a
simple logistic regression model as a function of the number of training examples and
labels. Mark in the figure the structural error (SE) and approximation error (AE),
where “error” is measured in terms of log-likelihood.

3. (T/F – 2 points) In general for small training sets, we are likely
to reduce the approximation error by adding a regularization penalty
|w1|²/2 to the log-likelihood criterion.

[Plot in the (x1, x2) plane: the points (0,0) and (1,1) are marked "x" and the points (0,1) and (1,0) are marked "o".]

Figure 3: Equally likely input configurations in the training set

Problem 4
Here we will look at methods for selecting input features for a logistic regression model

P(y = 1|x, w) = g(w0 + w1x1 + w2x2)

The available training examples are very simple, involving only binary valued inputs:

    Number of copies   x1   x2   y
          10            1    1   1
          10            0    1   0
          10            1    0   0
          10            0    0   1

So, for example, there are 10 copies of x = [1, 1]^T in the training set, all labeled y = 1.
The correct label is actually a deterministic function of the two features: y = 1 if x1 = x2
and zero otherwise.
We define greedy selection in this context as follows: we start with no features (train only
with w0 ) and successively try to add new features provided that each addition strictly
improves the training log-likelihood. We use no other stopping criterion.
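The sketch below is one way to carry out such a comparison numerically: it maximizes the training log-likelihood for each candidate feature subset with a generic optimizer (any unregularized logistic regression fitter would serve equally well).

```python
import numpy as np
from scipy.optimize import minimize

# The four input configurations from the table above; each appears 10 times.
X_full = np.array([[1, 1], [0, 1], [1, 0], [0, 0]], dtype=float)
y = np.array([1, 0, 0, 1], dtype=float)
counts = np.full(4, 10.0)

def neg_log_likelihood(w, X):
    """Negative training log-likelihood; w[0] is the bias w0."""
    z = w[0] + X @ w[1:]
    # log P(y|x, w) = y*z - log(1 + exp(z)), written stably with logaddexp
    return -np.sum(counts * (y * z - np.logaddexp(0.0, z)))

def best_log_likelihood(feature_idx):
    """Maximize the training log-likelihood using only the listed feature columns."""
    X = X_full[:, feature_idx]
    res = minimize(neg_log_likelihood, np.zeros(1 + X.shape[1]), args=(X,))
    return -res.fun

print("w0 only      :", round(best_log_likelihood([]), 3))
print("w0 + x1      :", round(best_log_likelihood([0]), 3))
print("w0 + x2      :", round(best_log_likelihood([1]), 3))
print("w0 + x1 + x2 :", round(best_log_likelihood([0, 1]), 3))
```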

1. (2 points) Could greedy selection add either x1 or x2 in this case? Answer Y or N.

2. (2 points) What is the classification error of the training examples that
we could achieve by including both x1 and x2 in the logistic regression
model?

3. (3 points) Suppose we define another possible feature to include, a function of x1
and x2 . Which of the following features, if any, would permit us to correctly classify
all the training examples when used in combination with x1 and x2 in the logistic
regression model:

( ) x1 − x2
( ) x1 x2
( ) x2²

4. (2 points) Could the greedy selection method choose this feature as
the first feature to add when the available features are x1, x2 and your
choice of the new feature? Answer Y or N.

Problem 5

[Plot of the four training examples in the plane (both axes from 0 to 3): points at (0,0), (2,2), (h,1), and (0,3).]

Figure 4: Labeled training examples

Suppose we only have four training examples in two dimensions (see Figure 4):
positive examples at x1 = [0, 0]^T, x2 = [2, 2]^T and
negative examples at x3 = [h, 1]^T, x4 = [0, 3]^T,
where we treat h ≥ 0 as a parameter.
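The sketch below is one way to probe these questions numerically: it fits a linear SVM with a very large C as a stand-in for the hard-margin solution and reports the resulting geometric margin; the particular h values are arbitrary, and the computation is only meaningful while the points remain separable.

```python
import numpy as np
from sklearn.svm import SVC

def max_margin(h, C=1e6):
    """Fit a (nearly) hard-margin linear SVM and return the geometric margin 1/||w||."""
    X = np.array([[0.0, 0.0], [2.0, 2.0], [h, 1.0], [0.0, 3.0]])
    y = np.array([1, 1, -1, -1])
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return 1.0 / np.linalg.norm(clf.coef_[0])

# Only meaningful for h values where the four points are linearly separable.
for h in [0.0, 0.25, 0.5, 0.75]:
    print(f"h = {h:.2f}: margin ≈ {max_margin(h):.3f}")
```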

1. (2 points) How large can h ≥ 0 be so that the training points are still
linearly separable?

2. (2 points) Does the orientation of the maximum margin decision
boundary change as a function of h when the points are separable?
Answer Y or N.

3. (4 points) What is the margin achieved by the maximum margin boundary as a
function of h?

4. (3 points) Assume that h = 1/2 (as in the figure) and that we can
only observe the x2-component of the input vectors. Without the other
component, the labeled training points reduce to (0, y = 1), (2, y = 1),
(1, y = −1), and (3, y = −1). What is the lowest order p of polynomial
kernel that would allow us to correctly classify these points?

Additional set of figures
[The Problem 1 plots (A, B, C), the Problem 2 regression plot, and Figures 1–4 are repeated here for convenience.]
