Midterm2008f Sol
Midterm
Professor: Eric Xing Date: October 20, 2008
- There are 6 questions in this exam (11 pages including this cover sheet).
- This exam is open book and open notes. Computers, PDAs, and cell phones are not allowed.
- Good luck!
1 Assorted Questions [24 points]
Please explain your answer in a single line for all True/False questions
1. (True or False, 2 pts) E(X + Y ) = E(X) + E(Y ) holds for any two random variables X, Y .
Solutions: T
2. (True or False, 2 pts) V ar(X + Y ) = V ar(X) + V ar(Y ) holds for any two random variables
X, Y .
3. (True or False, 2 pts) E(XY ) = E(X) · E(Y ) holds for any two random variables X, Y .
Solutions: F (it holds if X and Y are independent, but not for arbitrary X and Y).
5. (True or False, 2 pts) In any finite space, if we are given 1 positive example and 1 negative
example as the training data set, then 1-NN (using L2 distance) is always a linear classifier.
Solutions: T
6. (True or False, 2 pts) For a binary Naive Bayes classifier, if P (xi |y), (i = 1, ..., d; y = ±1)
follows a Gaussian distribution, then the resulting classifier is always a linear classifier. Remember
that xi is the ith feature dimension, d is the dimension of the feature vector, and y is the class label.
7. (True or False, 2 pts) Both bagging and boosting perform re-sampling on the training set.
Solutions: T
8. (True or False, 2 pts) For small k, a k-nearest neighbor classifier has small variance.
9. (True or False, 3 pts) The expected extra number of bits required to send a code, if an
optimal code for the (wrong) distribution q is used, instead of the optimal code for the true
distribution p, is given by KL(q||p).
10. (5 pts) Given 4 positive examples: (0.707, 0.707), (0.707, −0.707), (−0.707, 0.707), (−0.707, −0.707)
and 4 negative examples: (3, 0), (5, 0), (4, −1), (4, 1) in 2-d space (see Fig. 1.(a)), we will use
each of the following algorithms to learn a classifier from this training data set:
(a) Naive Bayes, assuming that we have the same variance (i.e., Var(x1 |y = 0) = Var(x1 |y =
1) = Var(x2 |y = 0) = Var(x2 |y = 1) and the same class prior (i.e., P (y = 0) = P (y = 1)).
(b) Linear SVM with hard margin.
[Figure 1: (a) the training data set; (b), (c), (d) candidate decision boundaries.]
For each classifier, indicate which of the figures (b)-(d) could be the resulting decision boundary.
Hint: for each classifier, there might be more than one corresponding figure.
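For illustration, here is a minimal sketch (assuming numpy and scikit-learn are available) that fits both classifiers to the eight training points. Under the stated assumptions, the shared-variance, equal-prior Gaussian Naive Bayes reduces to a nearest-class-mean rule, so its boundary is the perpendicular bisector of the two class means; the hard-margin linear SVM is approximated by a linear SVC with a very large C.

```python
import numpy as np
from sklearn.svm import SVC  # assumed available; a very large C approximates a hard margin

pos = np.array([(0.707, 0.707), (0.707, -0.707), (-0.707, 0.707), (-0.707, -0.707)])
neg = np.array([(3, 0), (5, 0), (4, -1), (4, 1)])
X = np.vstack([pos, neg])
y = np.array([1] * 4 + [0] * 4)

# (a) Gaussian NB with equal priors and one shared variance: the posterior comparison
# reduces to picking the nearest class mean, so the decision boundary is the
# perpendicular bisector of the two class means.
mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
def nb_predict(x):
    return 1 if np.linalg.norm(x - mu_pos) < np.linalg.norm(x - mu_neg) else 0

# (b) Hard-margin linear SVM, approximated with a very large C.
svm = SVC(kernel="linear", C=1e6).fit(X, y)

print("NB boundary: perpendicular bisector of", mu_pos, "and", mu_neg)
print("SVM weights:", svm.coef_, "bias:", svm.intercept_)
```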
2 Linear Regression [16 Points]
Consider the regression problem where the two-dimensional input points x = [x1, x2]^T are constrained
to lie within the unit square: xi ∈ [−1, 1], i = 1, 2. The training and test input points x
are sampled uniformly at random within the unit square. The target outputs y are governed by
the following model
y ∼ N(x1^5 x2^5 + 20x1^2 − 30x1x2 + 5x1 − 1, 1)
We learn to predict y given x using linear regression models with 1st through 12th order polynomial
features. The models are nested in the sense that the higher order models will include all the lower
order features. The estimation criterion is the mean squared error. We first train a 1st, 2nd, 10th,
and 12th order model using n = 20 training points, and then test the predictions on a large number
of independently sampled points.
(Hint 1: Examples of 10th order polynomial features are x1^10, x1^9 x2, x1^8 x2^2, . . . .)
(Hint 2: A 1st order model involves features x1, x2, and 1. A 2nd order model involves features
x1^2, x1x2, x2^2, x1, x2, and 1. Similarly, it can be generalized to higher order models.)
1. (4 pts) Which model is likely to get the lowest training error?
Solution: The 12th order model. On training error, a higher order model can do no worse
than a lower order model, because it has an extended feature set. The most expressive model
gets the lowest training error. (Explanation is not required in exam.)
2. (4 pts) Which model is likely to get the highest training error?
Solution: The 1st order model. The least expressive model gets the highest training error.
(Explanation is not required in exam.)
3. (4 pts) If we have xi ∈ [−10, 10] instead of xi ∈ [−1, 1], which model is likely to get the lowest
test error? Use one sentence to explain your selection.
Solution: The 10th order model, because it is the closest match to the ground truth model.
Lower order models cannot capture the x1^5 x2^5 term, which is dominant in this case (high bias).
Higher order models are likely to overfit, which will result in higher test error (high variance).
The 10th order model has zero bias and reasonable variance, and is likely to give the lowest test
error.
4. (4 pts) Go back to the case where xi ∈ [−1, 1]. Among the 2nd, 10th, and 12th order model,
which one would typically get the lowest test error now? Briefly explain your selection.
Solution: The 2nd order model. The term x1^5 x2^5 is small, so the 2nd order model is a good
approximation to the ground truth model. On the other hand, the small first term, x1^5 x2^5, is
unlikely to be distinguished from noise, so the 10th order model is more likely to overfit the
data than the 2nd order model. A 12th order model is even worse. In short, the 2nd order
model has the lowest variance and a small bias, and is likely to give the lowest test error.
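For illustration, a minimal simulation of this setup (assuming numpy; the random seed and test-set size are arbitrary choices): it draws n = 20 training points, builds polynomial features as described in the hints, fits each model order by least squares, and prints training and test mean squared error.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def features(X, order):
    # All monomials x1^a * x2^b with a + b <= order (includes the constant term).
    feats = [X[:, 0] ** a * X[:, 1] ** b
             for a, b in itertools.product(range(order + 1), repeat=2)
             if a + b <= order]
    return np.column_stack(feats)

def target(X):
    mean = (X[:, 0] ** 5 * X[:, 1] ** 5 + 20 * X[:, 0] ** 2
            - 30 * X[:, 0] * X[:, 1] + 5 * X[:, 0] - 1)
    return mean + rng.normal(0.0, 1.0, size=len(X))  # unit-variance Gaussian noise

X_train = rng.uniform(-1, 1, size=(20, 2))
X_test = rng.uniform(-1, 1, size=(10000, 2))
y_train, y_test = target(X_train), target(X_test)

for order in (1, 2, 10, 12):
    Phi_tr, Phi_te = features(X_train, order), features(X_test, order)
    w, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)  # least-squares fit
    train_mse = np.mean((Phi_tr @ w - y_train) ** 2)
    test_mse = np.mean((Phi_te @ w - y_test) ** 2)
    print(f"order {order:2d}: train MSE {train_mse:8.3f}, test MSE {test_mse:10.3f}")
```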
3 Support Vector Machine for Regression (SVR) [15 pts]
Using an idea similar to that of SVM classification, we can also use support vector machines for regression
(referred to as ‘SVR’). In SVR, we are given m training examples (x1 , y1 ), (x2 , y2 ), ..., (xm , ym ), where
xi is the feature vector and yi ∈ R is the output of the ith example (i = 1, 2, ..., m).
In the linear SVR, we aim to learn a linear function f from the training examples, which takes
the following form: f = w′ x + b, where w and b are the parameters to be learned, and w′ is the
transpose of the weight vector w .
In the simplest case for linear SVR, we can formulate it as the following convex optimization
problem:
minimize :   (1/2) ||w||^2
subject to : yi − (w′xi + b) ≤ ǫ  and  (w′xi + b) − yi ≤ ǫ   (i = 1, 2, ..., m)    (1)
The intuition of eq. (1) is that we want to learn the parameters w and b so that (1) the resulting
f is as smooth as possible (i.e., small ||w||); and (2) for each training example, the prediction error
of f is at most ǫ, where ǫ ≥ 0 is a given parameter of the algorithm. Notice that the bigger ǫ is,
the smoother the function f we are looking for.
Now, given 3 training examples (1, 1), (2, 2), (3, 3) (See Fig. 3), suppose that we want to use eq. (1)
to learn a linear SVR.
Solutions: w = 1
Solutions: w = 0.5
Solutions: w = 0
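A minimal numerical check of such solutions: the sketch below solves the optimization in eq. (1) directly for the three training points, using cvxpy as an assumed off-the-shelf solver. The value of ǫ is the free input (the ǫ values used in the original sub-questions are not shown above); for example, ǫ = 0.5 yields w = 0.5 (with b = 1), ǫ = 0 yields w = 1, and ǫ = 1 yields w = 0.

```python
import cvxpy as cp
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
eps = 0.5  # the tolerance parameter; try 0, 0.5, 1 to see how w changes

w = cp.Variable()
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.square(w))      # (1/2) ||w||^2 for scalar w
constraints = [y - (w * x + b) <= eps,           # y_i - (w x_i + b) <= eps
               (w * x + b) - y <= eps]           # (w x_i + b) - y_i <= eps
cp.Problem(objective, constraints).solve()
print("w =", float(w.value), " b =", float(b.value))
```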
4 Neural Networks [20 Points]
Here is a simple 2-layer neural network with 2 hidden units and a single output unit.
Consider the linear activation function y = C · a = C · Σ_i wi xi, where C is a constant and
a = Σ_i wi xi is the weighted sum of the unit's inputs. Also, consider the non-linear logistic
activation function y = σ(a), where
σ(a) = 1 / (1 + e^(−a)).
1. (a) (3 pts) Assume all units are linear. Can this 2-layer network represent decision boundaries
that a standard regression model y = b0 + b1 x1 + b2 x2 + ǫ cannot?
Solution:
No. A network with all linear units can be reduced to a simple linear model.
(b) (3 pts) Assume the hidden units use logistic activation functions and the output unit
uses a linear activation unit. Can this network represent non-linear decision boundaries?
Solution:
Yes.
(c) (4 pts) Using logistic activation functions for both hidden and output units, it is possible
to approximate any complicated decision surface by combining many piecewise linear decision
boundaries. Explain what changes you would need to make to the above network
so you could approximate any decision boundary.
Solution:
Yes, you need additional hidden units. More hidden units lead to a more complicated
decision surface.
2. (10 pts) Consider the XOR function: y = (x1 ∧ ¬ x2 ) ∨ (x2 ∧ ¬ x1 ). Assume all units are
logistic. We can implement the XOR function using the two layer network above and the
decision rule:
predict 1 if y > 1/2
predict 0 if y < 1/2
Select the weights that implement (x1 XOR x2 ). Hint: There are many solutions, but a
simple one is where all weights come from the set {−10, 10, 100}.
6
Solution:
w10 = -10
w20 = -10
w0 = -10
w11 = 100
w12 = -100
w21 = -100
w22 = 100
w1 = 100
w2 = 100
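A quick numerical check of these weights (assuming numpy, and assuming the naming convention that wj0 is the bias of hidden unit j, wjk connects input xk to hidden unit j, and w0, w1, w2 feed the output unit, matching the network figure):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Weights from the solution above (naming convention assumed from the network figure).
w10, w11, w12 = -10, 100, -100   # hidden unit 1: bias, from x1, from x2
w20, w21, w22 = -10, -100, 100   # hidden unit 2: bias, from x1, from x2
w0, w1, w2 = -10, 100, 100       # output unit: bias, from h1, from h2

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = sigmoid(w10 + w11 * x1 + w12 * x2)   # ~ (x1 AND NOT x2)
        h2 = sigmoid(w20 + w21 * x1 + w22 * x2)   # ~ (x2 AND NOT x1)
        y = sigmoid(w0 + w1 * h1 + w2 * h2)       # output unit acts as an OR of h1, h2
        print(x1, x2, int(y > 0.5))               # prints the XOR truth table
```

With these weights, h1 approximates (x1 ∧ ¬x2), h2 approximates (x2 ∧ ¬x1), and the output unit ORs them, so the decision rule y > 1/2 reproduces XOR on all four inputs.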
5 AdaBoost
We are given N examples (xi , yi ), where yi is the label and yi = +1 or yi = −1. Let I(·) be the indicator
function, which is 1 if the condition in (·) is true and 0 otherwise. In this problem, we use the
following version of the AdaBoost algorithm:
1. Initialize the example weights: wi^1 = 1/N (i = 1, ..., N).
2. For t = 1, ..., M ,
a. Learn a weak classifier ht(x) by minimizing the weighted error function Gt, where
   Gt = Σ_{i=1}^N wi^t I(ht(xi) ≠ yi);
b. Compute the error rate of the learned weak classifier ht(x): ǫt = Σ_{i=1}^N wi^t I(ht(xi) ≠ yi);
c. Compute the weight for ht(x): αt = (1/2) ln((1 − ǫt)/ǫt);
d. Update the weight of each example: wi^{t+1} = wi^t exp{−αt yi ht(xi)} / Zt, where Zt is the
   normalization factor for wi^{t+1}: Zt = Σ_{i=1}^N wi^t exp{−αt yi ht(xi)}.
Taking the derivative of the squared error E = Σ_{i=1}^N (yi − fm(xi))^2 with respect to αm:
∂E/∂αm = ∂/∂αm Σ_{i=1}^N (yi − fm(xi))^2                          (2)
       = Σ_{i=1}^N 2 (yi − fm(xi)) ∂(yi − fm(xi))/∂αm             (3)
In this equation, fm−1 is independent of αm . Substituting this in the derivative equation, we get
∂E/∂αm = Σ_{i=1}^N 2 (yi − fm(xi)) ∂(yi − fm(xi))/∂αm                      (5)
       = Σ_{i=1}^N 2 (yi − fm(xi)) ∂(yi − αm hm(xi) − fm−1(xi))/∂αm        (6)
       = Σ_{i=1}^N 2 (yi − fm(xi)) (−hm(xi))                               (7)
In Homework 3, we proved that the training error ǫtraining of AdaBoost is upper bounded by
∏_{t=1}^M Zt, where the training error ǫtraining = (1/N) Σ_{i=1}^N I(H(xi) ≠ yi). We will now modify this
bound to rewrite it in terms of the error rates ǫt of the weak classifiers.
Prove that Zt = 2√(ǫt(1 − ǫt)), and hence that the training error ǫtraining is upper bounded by
∏_{t=1}^M [2√(ǫt(1 − ǫt))]. Show all relevant steps.
Solution:
Zt = Σ_{i=1}^N wi^t exp{−αt yi ht(xi)}                                                   (11)
   = Σ_{i=1}^N wi^t exp{−αt yi ht(xi)} (I[yi = ht(xi)] + I[yi ≠ ht(xi)])                 (12)
   = Σ_{i=1}^N wi^t exp{−αt} I[yi = ht(xi)] + Σ_{i=1}^N wi^t exp{αt} I[yi ≠ ht(xi)]      (13)
   = exp{−αt}(1 − ǫt) + exp{αt} ǫt                                                       (14)
Substituting αt = (1/2) ln((1 − ǫt)/ǫt), i.e., exp{−αt} = √(ǫt/(1 − ǫt)) and exp{αt} = √((1 − ǫt)/ǫt), we get
Zt = √(ǫt/(1 − ǫt)) (1 − ǫt) + √((1 − ǫt)/ǫt) ǫt = 2√(ǫt(1 − ǫt)).
Since this equality holds for all t, we get that ǫtraining is upper bounded by ∏_{t=1}^M [2√(ǫt(1 − ǫt))].
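As a numerical sanity check of this result, here is a sketch (assuming numpy; the synthetic data set and the decision-stump weak learner are arbitrary choices made for illustration) that runs the AdaBoost variant above and verifies at each round that Zt = 2√(ǫt(1 − ǫt)) and that the training error stays below ∏t Zt.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1, 1, size=(N, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)        # a non-linear toy target

def learn_stump(X, y, w):
    """Weighted decision stump: threshold one feature, possibly with flipped sign."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda X: sign * np.where(X[:, j] > thr, 1, -1)

w = np.full(N, 1.0 / N)                           # step 1: uniform initial weights
F = np.zeros(N)                                    # running weighted vote of the weak classifiers
bound = 1.0
for t in range(10):
    h = learn_stump(X, y, w)                       # step a
    pred = h(X)
    eps = np.sum(w * (pred != y))                  # step b: weighted error rate
    alpha = 0.5 * np.log((1 - eps) / eps)          # step c
    Z = np.sum(w * np.exp(-alpha * y * pred))      # normalization factor
    assert np.isclose(Z, 2 * np.sqrt(eps * (1 - eps)))
    w = w * np.exp(-alpha * y * pred) / Z          # step d
    F += alpha * pred
    bound *= Z
    train_err = np.mean(np.sign(F) != y)
    print(f"t={t+1}: eps={eps:.3f}  Z={Z:.3f}  train_err={train_err:.3f}  bound={bound:.3f}")
```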
6 VC Dimension and Hypothesis Spaces [15 Points]
Consider a collection of N points lying in K-dimensional space ℜ^K. For each point xk, we can
assign some label y ∈ {0, 1}. Thus for N points, there are 2^N possible labelings of the data.
Assuming we have some weight wk for each of the K inputs, we can classify a point using a linear
threshold function:
y = f( Σ_{k=1}^K wk xk )
where
f(a) = 1 if a > 0, and f(a) = 0 if a ≤ 0.
1. Assuming K = 1, what is the VC dimension of this linear threshold classifier?
Solution:
VC = 1
2. (4 pts) Assuming K = 1, how many different labelings (i.e. hypotheses) of N points can be
realized by changing the weights in f (a) ?
Solution:
2
3. Assuming K = 2, what is the VC dimension of this linear threshold classifier?
Solution:
VC = 2
4. (3 pts) Assuming K = 2, how many different labelings (i.e. hypotheses) of N points can be
realized by changing the weights in f (a) ? You can assume no two points exist on a line that
passes through the origin. Hint: Your answer should be in terms of N .
Solution:
2·N
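As an illustration of the 2·N count, a short sketch (assuming numpy) that sweeps the direction of the weight vector and collects the distinct labelings of N random points; with no two points on a common line through the origin, it finds exactly 2N labelings.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
X = rng.standard_normal((N, 2))          # generic points: no two collinear with the origin

labelings = set()
# The labeling depends only on the direction of w, so sweep the angle densely.
for theta in np.linspace(0, 2 * np.pi, 100000, endpoint=False):
    w = np.array([np.cos(theta), np.sin(theta)])
    labels = (X @ w > 0).astype(int)     # f(a) = 1 if a > 0 else 0
    labelings.add(tuple(labels))

print(len(labelings), "labelings realized;", 2 * N, "expected")
```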