07au Midterm


10-701 Midterm Exam, Fall 2007

1. Personal info:

• Name:
• Andrew account:
• E-mail address:

2. There should be 17 numbered pages in this exam (including this cover sheet).

3. You can use any material you brought: any book, class notes, your printouts of class
materials that are on the class website, including my annotated slides and relevant
readings, and Andrew Moore’s tutorials. You cannot use materials brought by other
students. Calculators are not necessary. Laptops, PDAs, phones and Internet access
are not allowed.

4. If you need more room to work out your answer to a question, use the back of the page
and clearly mark on the front of the page if we are to look at what’s on the back.

5. Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself
time to answer all of the easy ones, and avoid getting bogged down in the more difficult
ones before you have answered the easier ones.

6. Note there are extra-credit sub-questions. The grade curve will be made without
considering students’ extra credit points. The extra credit will then be used to try to
bump your grade up without affecting anyone else’s grade.

7. You have 80 minutes.

8. Good luck!

Question Topic Max. score Score


1 Short questions 20 + 0.1010 extra
2 Loss Functions 12
3 Kernel Regression 12
4 Model Selection 14
5 Support Vector Machine 12
6 Decision Trees and Ensemble Methods 30

1 [20 Points] Short Questions
The following short questions should be answered with at most two sentences, and/or a
picture. For yes/no questions, make sure to provide a short justification.

1. [2 points] Does a 2-class Gaussian Naive Bayes classifier with parameters µ_{1k}, σ_{1k}, µ_{2k},
σ_{2k} for attributes k = 1, ..., m have exactly the same representational power as logistic
regression (i.e., a linear decision boundary), given no assumptions about the variance
values σ_{ik}^2?

2. [2 points] For linearly separable data, can a small slack penalty (“C”) hurt the training
accuracy when using a linear SVM (no kernel)? If so, explain how. If not, why not?

3. [3 points] Consider running AdaBoost with Multinomial Naive Bayes as the weak
learner for two classes and k binary features. After t iterations of AdaBoost, how
many parameters do you need to remember? In other words, how many numbers do
you need to keep around to predict the label of a new example? Assume that the
weak-learner training error is non-zero at iteration t. Don’t forget to mention where
the parameters come from.

4. [2 points] In boosting, would you stop the iteration if the following happens? Justify
your answer for each case with at most two sentences.

• The error rate of the combined classifier on the original training data is 0.

• The error rate of the current weak classifier on the weighted training data is 0.

5. [4 points] Given n linearly independent feature vectors in n dimensions, show that
for any assignment to the binary labels you can always construct a linear classifier
with weight vector w which separates the points. Assume that the classifier has the
form sign(w · x). Note that a square matrix composed of linearly independent rows is
invertible.

6. [3 points] Construct a one dimensional classification dataset for which the Leave-one-out
cross-validation error of the One Nearest Neighbor algorithm is always 1. Stated
another way, the One Nearest Neighbor algorithm never correctly predicts the held-out
point.

7. [2 points] Would we expect that running AdaBoost using the ID3 decision tree learning
algorithm (without pruning) as the weak learning algorithm would have a better true
error rate than running ID3 alone (i.e., without boosting and without pruning)?
Explain.

8. [1 point] Suppose there is a coin with unknown bias p. Does there exist some value of p
for which we would expect the maximum a-posteriori estimate of p, using a Beta(4, 2)
prior, to require more coin flips before it is close to the true value of p, compared to
the number of flips required of the maximum likelihood estimate of p? Explain.
(The Beta(4, 2) distribution is given in the figure below.)
[Figure 1: the Beta(4, 2) distribution, plotted as p(θ) for θ ∈ [0, 1].]

9. [1 point] Suppose there is a coin with unknown bias p. Does there exist some value
of p for which we would expect the maximum a-posteriori estimate of p, using a
Uniform([0, 1]) prior, to require more coin flips before it is close to the true value
of p, compared to the number of flips required of the maximum likelihood estimate of
p? Explain.

10. [0.1010 extra credit] Can a linear classifier separate the positive from the negative
examples in the dataset below? Justify.

Colbert    U2    for    Loosing my religion    president

The Beatles    Nirvana    There is a season...    Grunge    Turn! Turn! Turn!

2 [12 points] Loss Functions
Generally speaking, a classifier can be written as H(x) = sign(F(x)), where H(x): R^d →
{−1, 1} and F(x): R^d → R. To obtain the parameters in F(x), we need to minimize the
loss function averaged over the training set: Σ_i L(y^i F(x^i)). Here L is a function of yF(x).
For example, for linear classifiers, F(x) = w_0 + Σ_{j=1}^d w_j x_j, and yF(x) = y(w_0 + Σ_{j=1}^d w_j x_j).
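For illustration, a minimal sketch of this setup: it evaluates the training loss Σ_i L(y^i F(x^i)) for a linear F on a made-up data set, using the sigmoid-style loss L(z) = 1/(1 + exp(z)) that appears in part 3 below; the data, weights, and function names here are hypothetical.

```python
import numpy as np

def linear_F(X, w0, w):
    # F(x) = w0 + sum_j w_j * x_j, evaluated for every row of X
    return w0 + X @ w

def sigmoid_style_loss(margin):
    # L(yF(x)) = 1 / (1 + exp(yF(x))), the loss used in part 3 below
    return 1.0 / (1.0 + np.exp(margin))

# Hypothetical toy data: four points in R^2 with labels in {-1, +1}
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3], [1.5, 1.5]])
y = np.array([1, -1, -1, 1])

w0, w = 0.1, np.array([0.5, -0.2])            # hypothetical parameter values
margins = y * linear_F(X, w0, w)              # y^i F(x^i) for each training example
print(np.mean(sigmoid_style_loss(margins)))   # averaged training loss
```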

1. [4 points] Which loss functions below are appropriate to use in classification? For the
ones that are not appropriate, explain why not. In general, what conditions does L
have to satisfy in order to be an appropriate loss function? The x axis is yF (x), and
the y axis is L(yF (x)).

[Five plots of candidate loss functions, labeled (a), (b), (c), (d), and (e); in each, the x axis is yF(x) ranging from -10 to 10 and the y axis is L(yF(x)).]

2. [4 points] Of the above loss functions appropriate to use in classification, which one is
the most robust to outliers? Justify your answer.

3. [4 points] Let F(x) = w_0 + Σ_{j=1}^d w_j x_j and L(yF(x)) = 1 / (1 + exp(yF(x))). Suppose you use
gradient descent to obtain the optimal parameters w_0 and w_j. Give the update rules
for these parameters.

3 [12 points] Kernel Regression, k-NN
1. [4 points] Sketch the fit Y given X for the dataset given below using kernel regression
with a box kernel

K(x_i, x_j) = I(−h ≤ x_i − x_j < h) = { 1 if −h ≤ x_i − x_j < h;  0 otherwise }

for h = 0.5, 2.

• h = 0.5

  [Plot of the data set: x axis from 0 to 6, y axis from 0 to 4; sketch the fit here.]

• h = 2

  [Plot of the data set: x axis from 0 to 6, y axis from 0 to 4; sketch the fit here.]
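For illustration, a minimal sketch of box-kernel regression under the definition above: the prediction at a query point is the average of the y_i whose x_i fall inside the box window around it. The 1-D data set below is made up; the exam's actual points appear only in its figure.

```python
import numpy as np

def box_kernel(xq, xi, h):
    # K(xq, xi) = I(-h <= xq - xi < h), the box kernel defined above
    d = xq - xi
    return ((-h <= d) & (d < h)).astype(float)

def kernel_regression(xq, x_train, y_train, h):
    # Prediction at xq: average of the y_i whose x_i receive non-zero kernel weight
    w = box_kernel(xq, x_train, h)
    if w.sum() == 0:
        return np.nan                      # no training points inside the window
    return np.sum(w * y_train) / np.sum(w)

# Hypothetical 1-D data set
x_train = np.array([0.5, 1.0, 2.0, 3.0, 4.5, 5.5])
y_train = np.array([1.0, 2.0, 3.5, 3.0, 1.5, 0.5])

for h in (0.5, 2.0):
    fit = [kernel_regression(xq, x_train, y_train, h)
           for xq in np.linspace(0, 6, 13)]
    print(h, np.round(fit, 2))
```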

2. [4 points] Sketch or describe a dataset where kernel regression with the box kernel
above with h = 0.5 gives the same regression values as 1-NN but not as 2-NN in the
domain x ∈ [0, 6] below.

[Empty plot: x axis from 0 to 6, y axis from 0 to 4.5; sketch or describe your dataset here.]

3. [4 points] Sketch or describe a dataset where kernel regression with the box kernel
above with h = 0.5 gives the same regression values as 2-NN but not as 1-NN in the
domain x ∈ (0, 6) below.

[Empty plot: x axis from 0 to 6, y axis from 0 to 4.5; sketch or describe your dataset here.]

4 [14 Points] Model Selection
A central theme in machine learning is model selection. In this problem you will have the
opportunity to demonstrate your understanding of various model selection techniques and
their consequences. To make things more concrete we will consider the dataset D given in (1),
consisting of n independent identically distributed observations. The features of D consist
of pairs (x^i_1, x^i_2) ∈ R^2 and the observations y^i ∈ R are continuous valued.

D = {((x^1_1, x^1_2), y^1), ((x^2_1, x^2_2), y^2), ..., ((x^n_1, x^n_2), y^n)}    (1)

Consider the abstract model given in (2). The function f_{θ1,θ2} is a mapping from the features in
R^2 to an observation in R^1 which depends on two parameters θ1 and θ2. The ε^i correspond
to the noise. Here we will assume that the ε^i ~ N(0, σ^2) are independent Gaussians with
zero mean and variance σ^2.

y^i = f_{θ1,θ2}(x^i_1, x^i_2) + ε^i    (2)
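For illustration, a minimal sketch of drawing a synthetic data set from model (2), assuming one hypothetical choice of f_{θ1,θ2}, namely f(x_1, x_2) = θ1 x_1 + θ2 x_2 (the exam leaves f abstract); all numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x1, x2, theta1, theta2):
    # One hypothetical choice for f_{theta1,theta2}; the exam leaves f abstract
    return theta1 * x1 + theta2 * x2

n, sigma = 100, 0.5
theta1, theta2 = 1.5, -0.7                   # made-up "true" parameters
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
eps = rng.normal(0.0, sigma, size=n)         # eps^i ~ N(0, sigma^2), independent
y = f(x1, x2, theta1, theta2) + eps          # observations generated by model (2)
print(y[:5])
```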

1. [4 Points] Show that the log likelihood of the data given the parameters is equal to (3).

l(D; θ1, θ2) = −(1/(2σ^2)) Σ_{i=1}^n (y^i − f_{θ1,θ2}(x^i_1, x^i_2))^2 − n log(√(2π) σ)    (3)

Recall the probability density function of the N(µ, σ^2) Gaussian distribution is given
by (4).

p(x) = (1/(√(2π) σ)) exp( −(x − µ)^2 / (2σ^2) )    (4)

2. [1 Point] If we disregard the parts that do not depend on f_{θ1,θ2} and Y, the negative of
the log-likelihood given in (3) is equivalent to what commonly used loss function?

3. [2 Points] Many common techniques used to find the maximum likelihood estimates of
θ1 and θ2 rely on our ability to compute the gradient of the log-likelihood. Compute
the gradient of the log likelihood with respect to θ1 and θ2. Express your answer in
terms of:

y^i,   f_{θ1,θ2}(x^i_1, x^i_2),   ∂f_{θ1,θ2}(x^i_1, x^i_2)/∂θ1,   ∂f_{θ1,θ2}(x^i_1, x^i_2)/∂θ2

4. [2 Points] Given the learning rate η, what update rule would you use in gradient descent
to maximize the likelihood?
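For illustration, a generic sketch of the update form: maximizing the log-likelihood by stepping along its gradient, which is equivalent to gradient descent on the negative log-likelihood. The function grad_log_lik and the toy objective below are hypothetical stand-ins for the gradient computed in the previous part.

```python
import numpy as np

def gradient_ascent_step(theta, grad_log_lik, eta):
    # theta = (theta1, theta2); eta is the learning rate.
    # To maximize l we step *along* its gradient, which is the same as
    # gradient descent on the negative log-likelihood.
    return theta + eta * grad_log_lik(theta)

# Hypothetical stand-in gradient, for l(theta) = -(theta1 - 1)^2 - (theta2 + 2)^2
def grad_log_lik(theta):
    return np.array([-2.0 * (theta[0] - 1.0), -2.0 * (theta[1] + 2.0)])

theta = np.zeros(2)
for _ in range(200):
    theta = gradient_ascent_step(theta, grad_log_lik, eta=0.1)
print(theta)   # converges toward (1, -2), the maximizer of the toy objective
```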

5. [3 Points] Suppose you are given some function h such that h(θ1, θ2) ∈ R is large when
f_{θ1,θ2} is complicated and small when f_{θ1,θ2} is simple. Use the function h along with
the negative log-likelihood to write down an expression for the regularized loss with
parameter λ.

6. [2 Points] For small and large values of λ, describe the bias-variance trade-off with
respect to the regularized loss provided in the previous part.

5 [12 points] Support Vector Machine
1. [2 points] Suppose we are using a linear SVM (i.e., no kernel), with some large C value,
and are given the following data set.

[Scatter plot of the data set; X1 axis from 1 to 5, X2 axis up to 3.]

Draw the decision boundary of linear SVM. Give a brief explanation.

2. [3 points] In the following image, circle the points such that, if that example were removed
from the training set and the SVM retrained, we would get a different decision boundary
than when training on the full sample.

[Scatter plot of the data set; X1 axis from 1 to 5, X2 axis up to 3.]

You do not need to provide a formal proof, but give a one or two sentence explanation.

3. [3 points] Suppose instead of SVM, we use regularized logistic regression to learn the
classifier. That is,

(w, b) = argmin_{w ∈ R^2, b ∈ R}  ||w||^2 / 2 − Σ_i [ 1[y^(i) = 0] ln( 1 / (1 + e^{w·x^(i)+b}) ) + 1[y^(i) = 1] ln( e^{w·x^(i)+b} / (1 + e^{w·x^(i)+b}) ) ].

In the following image, circle the points such that, if that example were removed from the
training set and regularized logistic regression rerun, we would get a different decision
boundary than when training with regularized logistic regression on the full sample.

[Scatter plot of the data set; X1 axis from 1 to 5, X2 axis up to 3.]

You do not need to provide a formal proof, but give a one or two sentence explanation.

4. [4 points] Suppose we have a kernel K(·, ·), such that there is an implicit high-dimensional
feature map φ: R^d → R^D that satisfies ∀x, z ∈ R^d, K(x, z) = φ(x) · φ(z), where
φ(x) · φ(z) = Σ_{i=1}^D φ(x)_i φ(z)_i is the dot product in the D-dimensional space.
Show how to calculate the Euclidean distance in the D-dimensional space

||φ(x) − φ(z)|| = sqrt( Σ_{i=1}^D (φ(x)_i − φ(z)_i)^2 )

without explicitly calculating the values in the D-dimensional vectors. For this ques-
tion, you should provide a formal proof.
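For illustration, a minimal sketch of one standard way to do this: expand the squared distance as ||φ(x) − φ(z)||^2 = K(x, x) − 2 K(x, z) + K(z, z), so only kernel evaluations are needed. The RBF kernel below is just an example choice, not part of the question.

```python
import numpy as np

def kernel_distance(x, z, K):
    # ||phi(x) - phi(z)||^2 = phi(x).phi(x) - 2 phi(x).phi(z) + phi(z).phi(z)
    #                       = K(x, x) - 2 K(x, z) + K(z, z)
    sq = K(x, x) - 2.0 * K(x, z) + K(z, z)
    return np.sqrt(max(sq, 0.0))   # clamp tiny negative values from round-off

def rbf_kernel(x, z, gamma=0.5):
    # Example kernel chosen only for illustration
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0, 0.5])
z = np.array([0.0, 1.0, 1.0])
print(kernel_distance(x, z, rbf_kernel))
```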

6 [30 points] Decision Trees and Ensemble Methods
An ensemble classifier H_T(x) is a collection of T weak classifiers h_t(x), each with some weight
α_t, t = 1, ..., T. Given a data point x ∈ R^d, H_T(x) predicts its label based on the weighted
majority vote of the ensemble. In the binary case where the class label is either 1 or -1,
H_T(x) = sgn(Σ_{t=1}^T α_t h_t(x)), where h_t(x): R^d → {−1, 1}, and sgn(z) = 1 if z > 0 and
sgn(z) = −1 if z ≤ 0. Boosting is an example of an ensemble classifier where the weights are
calculated based on the training error of the weak classifier on the weighted training set.
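For illustration, a minimal sketch of the weighted-majority-vote prediction rule just defined; the decision stumps and weights below are made up.

```python
import numpy as np

def sgn(z):
    # sgn(z) = 1 if z > 0, and -1 if z <= 0, as defined above
    return 1 if z > 0 else -1

def ensemble_predict(x, weak_classifiers, alphas):
    # H_T(x) = sgn( sum_t alpha_t * h_t(x) )
    return sgn(sum(a * h(x) for h, a in zip(weak_classifiers, alphas)))

# Made-up decision stumps on x in R^2 and made-up weights
h1 = lambda x: 1 if x[0] > 0 else -1
h2 = lambda x: 1 if x[1] > 0.5 else -1

print(ensemble_predict(np.array([1.0, -1.0]), [h1, h2], [0.7, 0.4]))   # prints 1
```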

1. [10 points] For the following data set,

[Scatter plot of the labeled data set; both axes range from -2 to 2.]

• Describe a binary decision tree with the minimum depth that is consistent with the
data;

• Describe an ensemble classifier H_2(x) with 2 weak classifiers that is consistent
with the data. The weak classifiers should be simple decision stumps. Specify the
weak classifiers and their weights.

2. [10 points] For the following XOR data set,

[Scatter plot of the XOR data set; both axes range from -2 to 2.]

• Describe a binary decision tree with the minimum depth that is consistent with the
data;

• Let the ensemble classifier consist of the four binary classifiers shown below (the
arrow means that the corresponding classifier classifies every data point in that
direction as +), prove that there are no weights α1 , . . . , α4 , that make the ensemble
classifier consistent with the data.

[Plot of the XOR data set with the four classifiers h1, h2, h3, h4 drawn as axis-parallel decision boundaries, each with an arrow indicating its + side; both axes range from -2 to 2.]

3. [10 points] Suppose that for each data point, the feature vector x ∈ {0, 1}^m, i.e., x
consists of m binary-valued features, the class label y ∈ {−1, 1}, and the true classifier
is a majority vote over the features, i.e., y = sgn(Σ_{i=1}^m (2x_i − 1)), where x_i is the i-th
component of the feature vector.

• Describe a binary decision tree with the minimum depth that is consistent with the
data. How many leaves does it have?

• Describe an ensemble classifier with the minimum number of weak classifiers.
Specify the weak classifiers and their weights.
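For illustration, a minimal sketch of the true classifier in this part, y = sgn(Σ_{i=1}^m (2x_i − 1)); the example feature vector is made up.

```python
import numpy as np

def true_label(x):
    # y = sgn( sum_i (2 x_i - 1) ): +1 if a majority of the m binary features
    # are 1, otherwise -1 (a tie gives -1, since sgn(0) = -1 in this exam)
    return 1 if np.sum(2 * x - 1) > 0 else -1

x = np.array([1, 0, 1, 1, 0])   # m = 5 made-up binary features
print(true_label(x))            # three of five features are 1, so y = +1
```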

