
ANDREW ID (CAPITALS):

NAME (CAPITALS):

10-701/15-781 Final, Fall 2003

- You have 3 hours.
- There are 10 questions. If you get stuck on one question, move on to others and come back to the difficult question later.
- The maximum possible total score is 100.
- Unless otherwise stated there is no need to show your working.
- Good luck!

1 Short Questions (16 points)
(a) Traditionally, when we have a real-valued input attribute during decision-tree learning we consider a binary split according to whether the attribute is above or below some threshold. Pat suggests that instead we should just have a multiway split with one branch for each of the distinct values of the attribute. From the list below choose the single biggest problem with Pat's suggestion:
(i) It is too computationally expensive.
(ii) It would probably result in a decision tree that scores badly on the training set and a test set.
(iii) It would probably result in a decision tree that scores well on the training set but badly on a test set.
(iv) It would probably result in a decision tree that scores well on a test set but badly on a training set.
(b) You have a dataset with three categorical input attributes A, B and C. There is one categorical output attribute Y. You are trying to learn a Naive Bayes Classifier for predicting Y. Which of these Bayes Net diagrams represents the naive Bayes classifier assumption?

[Figure: four candidate Bayes Net structures, labeled (i)-(iv), over the nodes A, B, C and Y.]

(c) For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):
(i) The number of hidden nodes
(ii) The learning rate
(iii) The initial choice of weights
(iv) The use of a constant-term unit input

(d) For polynomial regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:
(i) The polynomial degree
(ii) Whether we learn the weights by matrix inversion or gradient descent
(iii) The assumed variance of the Gaussian noise
(iv) The use of a constant-term unit input
(e) For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:
(i) Whether we learn the class centers by Maximum Likelihood or Gradient Descent
(ii) Whether we assume full class covariance matrices or diagonal class covariance matrices
(iii) Whether we have equal class priors or priors estimated from the data.
(iv) Whether we allow classes to have different mean vectors or we force them to share the same mean vector
(f) For Kernel Regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:
(i) Whether the kernel function is Gaussian versus triangular versus box-shaped
(ii) Whether we use Euclidean versus L1 versus L∞ metrics
(iii) The kernel width
(iv) The maximum height of the kernel function
(g) (True or False) Given two classifiers A and B, if A has a lower VC-dimension than B then A almost certainly will perform better on a test set.
(h) P(Good Movie | Includes Tom Cruise) = 0.01
P(Good Movie | Tom Cruise absent) = 0.1
P(Tom Cruise in a randomly chosen movie) = 0.01
What is P(Tom Cruise is in the movie | Not a Good Movie)?
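If it helps to see the arithmetic laid out, the quantity asked for follows from the law of total probability and Bayes' rule; below is a small numeric sketch (the variable names are shorthand introduced here for illustration).

```python
# Shorthand: C = Tom Cruise is in the movie, G = the movie is good.
p_g_given_c = 0.01       # P(Good Movie | Includes Tom Cruise)
p_g_given_not_c = 0.1    # P(Good Movie | Tom Cruise absent)
p_c = 0.01               # P(Tom Cruise in a randomly chosen movie)

# Law of total probability for P(not G), then Bayes' rule for P(C | not G).
p_not_g = (1 - p_g_given_c) * p_c + (1 - p_g_given_not_c) * (1 - p_c)
p_c_given_not_g = (1 - p_g_given_c) * p_c / p_not_g
```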

2 Markov Decision Processes (13 points)
For this question it might be helpful to recall the following geometric identities, which assume 0 ≤ γ < 1.

$$\sum_{i=0}^{k} \gamma^i = \frac{1 - \gamma^{k+1}}{1 - \gamma} \qquad\qquad \sum_{i=0}^{\infty} \gamma^i = \frac{1}{1 - \gamma}$$
The following figure shows an MDP with N states. All states have two actions (North and Right) except Sn, which can only self-loop. Unlike most MDPs, all state transitions are deterministic. Assume discount factor γ.

[Figure: a chain MDP with states s1, s2, s3, ..., s(n-1), sn. Every transition is deterministic (p = 1). States s1 through s(n-1) each have reward r = 1; state sn has reward r = 10 and can only self-loop.]
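To make the setup concrete, here is a minimal value-iteration sketch for this chain MDP. It assumes one plausible reading of the figure: North self-loops, Right moves one state to the right, and taking any action in a state earns that state's reward; the function name and these conventions are illustrative assumptions.

```python
import numpy as np

def value_iteration(n, gamma, iters):
    """Value-iteration sketch for the chain MDP above (illustration only).

    Assumed conventions: states s_0..s_{n-1}; 'North' self-loops, 'Right'
    moves one state right; reward 1 in every state except the last, which
    gives reward 10 and can only self-loop.
    """
    rewards = np.ones(n)
    rewards[-1] = 10.0
    J = np.zeros(n)                                   # J starts at 0 everywhere
    for _ in range(iters):
        J_new = np.empty(n)
        for s in range(n - 1):
            north = rewards[s] + gamma * J[s]         # stay put
            right = rewards[s] + gamma * J[s + 1]     # move right
            J_new[s] = max(north, right)
        J_new[-1] = rewards[-1] + gamma * J[-1]       # last state self-loops
        J = J_new
    return J

print(value_iteration(n=5, gamma=0.5, iters=3))
```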

For questions (a)-(e), express your answer as a finite expression (no summation signs or "..."s) in terms of n and/or γ.
(a) What is J*(Sn)?

(b) There is a unique optimal policy. What is it?

(c) What is J*(S1)?

(d) Suppose you try to solve this MDP using value iteration. What is J^1(S1)?

(e) Suppose you try to solve this MDP using value iteration. What is J^2(S1)?

(f) Suppose your computer has exact arithmetic (no rounding errors). How many iterations of value iteration will be needed before all states record their exact (correct to infinite decimal places) J* value? Pick one:
(i) Less than 2n
(ii) Between 2n and n^2
(iii) Between n^2 + 1 and 2^n
(iv) It will never happen
(g) Suppose you run policy iteration. During one step of policy iteration you compute the value of the current policy by computing the exact solution to the appropriate system of n equations in n unknowns. Suppose too that when choosing the action during the policy improvement step, ties are broken by choosing North.
Suppose policy iteration begins with all states choosing North.
How many steps of policy iteration will be needed before all states record their exact (correct to infinite decimal places) J* value? Pick one:
(i) Less than 2n
(ii) Between 2n and n^2
(iii) Between n^2 + 1 and 2^n
(iv) It will never happen

3 Reinforcement Learning (10 points)
This question uses the same MDP as the previous question, repeated here for your convenience. Again, assume γ = 1/2.

[Figure: the same chain MDP as in the previous question: deterministic transitions (p = 1), reward r = 1 in states s1 through s(n-1), reward r = 10 in state sn, which can only self-loop.]

Suppose we are discovering the optimal policy via Q-learning. We begin with a Q-table initialized with 0's everywhere:
Q(Si, North) = 0 for all i
Q(Si, Right) = 0 for all i
Because the MDP is deterministic, we run Q-learning with a learning rate α = 1. Assume we start Q-learning at state S1.
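A minimal tabular Q-learning sketch matching this setup (α = 1, Q-table initialized to 0, start in S1) is shown below. The chain dynamics and reward convention are the same assumptions as in the earlier value-iteration sketch, and the exploration policy is passed in as a function.

```python
import random

def q_learning(n, gamma, steps, explore):
    """Tabular Q-learning on the chain MDP with learning rate alpha = 1 (sketch)."""
    Q = {(s, a): 0.0 for s in range(n) for a in ("North", "Right")}
    rewards = [1.0] * (n - 1) + [10.0]      # assumed: reward of the current state
    s = 0                                    # start at S1
    for _ in range(steps):
        a = explore(Q, s)
        # 'Right' moves one state right; in the last state both actions self-loop.
        s_next = s + 1 if (a == "Right" and s < n - 1) else s
        # alpha = 1: the old Q-value is simply overwritten.
        Q[(s, a)] = rewards[s] + gamma * max(Q[(s_next, "North")],
                                             Q[(s_next, "Right")])
        s = s_next
    return Q

# Part (a): purely random exploration.
random_explore = lambda Q, s: random.choice(["North", "Right"])
Q = q_learning(n=5, gamma=0.5, steps=1000, explore=random_explore)
```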

(a) Suppose our exploration policy is to always choose a random action. How many steps do we expect to take before we first enter state Sn?
(i) O(n) steps
(ii) O(n^2) steps
(iii) O(n^3) steps
(iv) O(2^n) steps
(v) It will certainly never happen
(b) Suppose our exploration is greedy and we break ties by going North:
Choose North if Q(Si, North) ≥ Q(Si, Right)
Choose Right if Q(Si, North) < Q(Si, Right)
How many steps do we expect to take before we first enter state Sn?
(i) O(n) steps
(ii) O(n^2) steps
(iii) O(n^3) steps
(iv) O(2^n) steps
(v) It will certainly never happen

(c) Suppose our exploration is greedy and we break ties by going Right:
Choose North if Q(Si, North) > Q(Si, Right)
Choose Right if Q(Si, North) ≤ Q(Si, Right)
How many steps do we expect to take before we first enter state Sn?
(i) O(n) steps
(ii) O(n^2) steps
(iii) O(n^3) steps
(iv) O(2^n) steps
(v) It will certainly never happen
WARNING: Question (d) is only worth 1 point so you should probably just
guess the answer unless you have plenty of time.
(d) In this question we work with a similar MDP except that each state other than Sn has a punishment (-1) instead of a reward (+1). Sn keeps the same large reward (10). The new MDP is shown below:

[Figure: the modified chain MDP: deterministic transitions (p = 1), reward r = -1 in states s1 through s(n-1), reward r = 10 in state sn.]

Suppose our exploration is greedy and we break ties by going North:
Choose North if Q(Si, North) ≥ Q(Si, Right)
Choose Right if Q(Si, North) < Q(Si, Right)
How many steps do we expect to take before we first enter state Sn?
(i) O(n) steps
(ii) O(n^2) steps
(iii) O(n^3) steps
(iv) O(2^n) steps
(v) It will certainly never happen

4 Bayesian Networks (11 points)
Construction. Two astronomers in two different parts of the world make measurements M1 and M2 of the number of stars N in some small region of the sky, using their telescopes. Normally, there is a small possibility of error by up to one star in each direction. Each telescope can be, with a much smaller probability, badly out of focus (events F1 and F2). In such a case the scientist will undercount by three or more stars or, if N is less than three, fail to detect any stars at all.
For questions (a) and (b), consider the four networks shown below.

[Figure: four candidate Bayesian network structures, labeled (i)-(iv), over the nodes F1, F2, N, M1 and M2.]

(a) Which of them correctly, but not necessarily efficiently, represents the above information? Note that there may be multiple answers.

(b) Which is the best network?

Inference. A student of the Machine Learning class notices that people driving SUVs (S) consume large amounts of gas (G) and are involved in more accidents than the national average (A). He also noticed that there are two types of people that drive SUVs: people from Pennsylvania (L) and people with large families (F). After collecting some statistics, he arrives at the following Bayesian network.

[Figure: Bayesian network with edges L -> S, F -> S, S -> A, S -> G and the following probabilities:]
P(L) = 0.4        P(F) = 0.6
P(S|L,F) = 0.8    P(S|~L,F) = 0.5    P(S|L,~F) = 0.6    P(S|~L,~F) = 0.3
P(A|S) = 0.7      P(A|~S) = 0.3
P(G|S) = 0.8      P(G|~S) = 0.2
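As a sketch of how such queries can be evaluated mechanically, the following enumerates the joint distribution defined by the network above. The CPT numbers are copied from the figure; the helper name `prob` and the True/False encoding are illustrative choices, not part of the question.

```python
from itertools import product

# CPTs from the figure (True means the event occurs).
P_L = {True: 0.4, False: 0.6}
P_F = {True: 0.6, False: 0.4}
P_S = {(True, True): 0.8, (False, True): 0.5,
       (True, False): 0.6, (False, False): 0.3}   # P(S | L, F)
P_A = {True: 0.7, False: 0.3}                     # P(A | S), P(A | ~S)
P_G = {True: 0.8, False: 0.2}                     # P(G | S), P(G | ~S)

def prob(event):
    """Sum the joint probability over all assignments consistent with `event`."""
    total = 0.0
    for l, f, s, a, g in product([True, False], repeat=5):
        joint = (P_L[l] * P_F[f]
                 * (P_S[(l, f)] if s else 1 - P_S[(l, f)])
                 * (P_A[s] if a else 1 - P_A[s])
                 * (P_G[s] if g else 1 - P_G[s]))
        if all(event.get(name, val) == val
               for name, val in zip("LFSAG", (l, f, s, a, g))):
            total += joint
    return total

p_s = prob({"S": True})                                         # for part (c)
p_s_given_a = prob({"S": True, "A": True}) / prob({"A": True})  # for part (d)
```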

(c) What is P(S)?

(d) What is P(S|A)?

Consider the following Bayesian network. State whether the given conditional independences are implied by the net structure.

[Figure: a Bayesian network over the nodes A, B, C, D, E and F.]

(f) (True or False) I<A, {}, B>
(g) (True or False) I<A, {E}, D>
(h) (True or False) I<A, {F}, D>

5 Instance Based Learning (8 points)
Consider the following dataset with one real-valued input x and one binary output y. We are going to use k-NN with unweighted Euclidean distance to predict y for x.

x:  -0.1  0.7  1.0  1.6  2.0  2.5  3.2  3.5  4.1  4.9
y:   -    +    +    -    +    +    -    -    +    +
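For reference, a leave-one-out cross-validation sketch for unweighted k-NN on this one-dimensional dataset is given below; the function name and structure are just one way to set up the computation.

```python
xs = [-0.1, 0.7, 1.0, 1.6, 2.0, 2.5, 3.2, 3.5, 4.1, 4.9]
ys = ['-', '+', '+', '-', '+', '+', '-', '-', '+', '+']

def loocv_errors(k):
    """Count leave-one-out misclassifications of unweighted k-NN."""
    errors = 0
    for i in range(len(xs)):
        # Remaining points sorted by distance to the held-out point.
        rest = sorted((abs(xs[j] - xs[i]), ys[j])
                      for j in range(len(xs)) if j != i)
        votes = [label for _, label in rest[:k]]
        prediction = max(set(votes), key=votes.count)   # majority vote
        if prediction != ys[i]:
            errors += 1
    return errors

# loocv_errors(1) and loocv_errors(3) correspond to parts (a) and (b).
```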

(a) What is the leave-one-out cross-validation error of 1-NN on this dataset? Give your answer as the number of misclassifications.

(b) What is the leave-one-out cross-validation error of 3-NN on this dataset? Give your answer as the number of misclassifications.

Consider a dataset with N examples: {(xi, yi) | 1 ≤ i ≤ N}, where both xi and yi are real valued for all i. Examples are generated by yi = w0 + w1 xi + ei, where ei is a Gaussian random variable with mean 0 and standard deviation 1.

(c) We use least squares linear regression to solve for w0 and w1, that is
$$\{w_0, w_1\} = \arg\min_{\{w_0, w_1\}} \sum_{i=1}^{N} (y_i - w_0 - w_1 x_i)^2.$$
We assume the solution is unique. Which one of the following statements is true?
(i) $\sum_{i=1}^{N} (y_i - w_0 - w_1 x_i)\, y_i = 0$
(ii) $\sum_{i=1}^{N} (y_i - w_0 - w_1 x_i)\, x_i^2 = 0$
(iii) $\sum_{i=1}^{N} (y_i - w_0 - w_1 x_i)\, x_i = 0$
(iv) $\sum_{i=1}^{N} (y_i - w_0 - w_1 x_i)^2 = 0$
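For intuition, here is a small numerical sketch of the least squares fit described above, with data generated under the stated model; it uses a generic library solver rather than any particular one of the identities (i)-(iv), and the seed and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-3.0, 3.0, size=N)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=N)   # y = w0 + w1*x + unit-variance noise

# Least squares fit of (w0, w1) using the design matrix [1, x].
X = np.column_stack([np.ones(N), x])
(w0, w1), *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - w0 - w1 * x    # the residuals appearing in options (i)-(iv)
```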

(d) We change the optimization criterion to include local weights, that is
$$\{w_0, w_1\} = \arg\min_{\{w_0, w_1\}} \sum_{i=1}^{N} \alpha_i^2 (y_i - w_0 - w_1 x_i)^2$$
where $\alpha_i$ is a local weight. Which one of the following statements is true?
(i) $\sum_{i=1}^{N} \alpha_i^2 (y_i - w_0 - w_1 x_i)(x_i + \alpha_i) = 0$
(ii) $\sum_{i=1}^{N} \alpha_i (y_i - w_0 - w_1 x_i)\, x_i = 0$
(iii) $\sum_{i=1}^{N} \alpha_i^2 (y_i - w_0 - w_1 x_i)(x_i y_i + w_1) = 0$
(iv) $\sum_{i=1}^{N} \alpha_i^2 (y_i - w_0 - w_1 x_i)\, x_i = 0$
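The locally weighted criterion above can be minimized the same way; the sketch below (assuming the weights are supplied as an array `alpha`) rescales each row of the design matrix and target before calling the same solver.

```python
import numpy as np

def weighted_fit(x, y, alpha):
    """Minimize sum_i alpha_i^2 (y_i - w0 - w1*x_i)^2 (sketch only)."""
    x, y, alpha = (np.asarray(v, dtype=float) for v in (x, y, alpha))
    X = np.column_stack([np.ones_like(x), x])
    # Scaling row i by alpha_i turns the weighted problem into ordinary
    # least squares on (alpha_i * X_i, alpha_i * y_i).
    (w0, w1), *_ = np.linalg.lstsq(alpha[:, None] * X, alpha * y, rcond=None)
    return w0, w1
```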
 

6 VC-dimension (9 points)
Let H denote a hypothesis class, and VC(H) denote its VC dimension.
(a) (True or False) If there exists a set of k instances that cannot be shattered by H, then VC(H) < k.
(b) (True or False) If two hypothesis classes H1 and H2 satisfy H1 ⊆ H2, then VC(H1) ≤ VC(H2).
(c) (True or False) If three hypothesis classes H1, H2 and H3 satisfy H1 = H2 ∪ H3, then VC(H1) ≤ VC(H2) + VC(H3).

For questions (d)-(f), give VC(H). No explanation is required.

(d) H = {h_θ | 0 ≤ θ ≤ 1, h_θ(x) = 1 iff x ≥ θ, otherwise h_θ(x) = 0}.

(e) H is the set of all perceptrons in the 2D plane, i.e.
H = {h_w | h_w(x) = g(w0 + w1 x1 + w2 x2)}, where g(z) = 1 iff z ≥ 0, otherwise g(z) = 0.

(f) H is the set of all circles in the 2D plane. Points inside the circles are classified as 1, otherwise 0.

7 SVM and Kernel Methods (8 points)
(a) Kernel functions implicitly define some mapping function φ(·) that transforms an input instance x ∈ R^d to a high dimensional feature space Q by giving the form of the dot product in Q: K(xi, xj) = φ(xi) · φ(xj).
Assume we use the radial basis kernel function K(xi, xj) = exp(-(1/2)||xi - xj||^2). Thus we assume that there's some implicit unknown function φ(x) such that

$$\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j) = \exp\left(-\frac{1}{2}\|x_i - x_j\|^2\right)$$

Prove that for any two input instances xi and xj, the squared Euclidean distance of their corresponding points in the feature space Q is less than 2, i.e. prove that ||φ(xi) - φ(xj)||^2 < 2.
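A quick numerical check of the claim (not a substitute for the proof) can be done with kernel evaluations alone, since the feature-space distance can be expanded in terms of K; the random points below are arbitrary.

```python
import numpy as np

def rbf(xi, xj):
    """Radial basis kernel K(xi, xj) = exp(-1/2 * ||xi - xj||^2)."""
    return np.exp(-0.5 * np.sum((xi - xj) ** 2))

rng = np.random.default_rng(1)
for _ in range(5):
    xi, xj = rng.normal(size=3), rng.normal(size=3)
    # Squared feature-space distance expressed through kernel values only.
    dist_sq = rbf(xi, xi) + rbf(xj, xj) - 2.0 * rbf(xi, xj)
    assert dist_sq < 2.0
```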

(b) With the help of a kernel function, SVM attempts to construct a hyper-plane in the feature space Q that maximizes the margin between two classes. The classification decision of any x is made on the basis of the sign of

$$\hat{w}^T \phi(x) + \hat{w}_0 = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + \hat{w}_0 = f(x; \alpha, \hat{w}_0),$$

where ŵ and ŵ0 are parameters for the classification hyper-plane in the feature space Q, SV is the set of support vectors, and αi is the coefficient for the support vector.
Again we use the radial basis kernel function. Assume that the training instances are linearly separable in the feature space Q, and assume that the SVM finds a margin that perfectly separates the points.
(True or False) If we choose a test point xfar which is far away from any training instance xi (distance here is measured in the original space R^d), we will observe that f(xfar; α, ŵ0) ≈ ŵ0.
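For reference, evaluating f(x; α, ŵ0) above needs only the support vectors, their labels and coefficients, and the bias; the arrays in the example call below are hypothetical values chosen only to show the shapes involved.

```python
import numpy as np

def svm_decision(x, support_vectors, labels, alphas, w0):
    """f(x; alpha, w0) = sum_i alpha_i * y_i * K(x_i, x) + w0, with the RBF kernel."""
    k = np.exp(-0.5 * np.sum((support_vectors - x) ** 2, axis=1))
    return np.dot(alphas * labels, k) + w0

# Hypothetical values, just to show the call shape.
sv = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
a = np.array([0.3, 0.3])
print(svm_decision(np.array([0.2, 0.1]), sv, y, a, w0=0.05))
```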
(c) (True or False) The SVM learning algorithm is guaranteed to find the globally optimal hypothesis with respect to its objective function.
(d) (True or False) The VC dimension of a Perceptron is smaller than the VC dimension of a simple linear SVM.

(e) (True or False) After being mapped into feature space Q through a radial basis kernel function, a Perceptron may be able to achieve better classification performance than in its original space (though we can't guarantee this).
(f) (True or False) After being mapped into feature space Q through a radial basis kernel function, 1-NN using unweighted Euclidean distance may be able to achieve better classification performance than in its original space (though we can't guarantee this).

8 GMM (8 points)
Consider the classification problem illustrated in the following figure. The data points in the figure are labeled, where "o" corresponds to class 0 and "+" corresponds to class 1. We now estimate a GMM consisting of 2 Gaussians, one Gaussian per class, with the constraint that the covariance matrices are identity matrices. The mixing proportions (class frequencies) and the means of the two Gaussians are free parameters.

[Figure: a scatter plot of the labeled data points in the (x1, x2) plane; both axes range from 0 to 2.]
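Because each Gaussian is tied to one class and the covariances are fixed to the identity, the maximum likelihood fit reduces to per-class sample means and class frequencies; the sketch below assumes a hypothetical labeled array rather than the exact points in the figure.

```python
import numpy as np

def fit_identity_gmm(X, y):
    """ML fit of a 2-class GMM with identity covariances (sketch only).

    X: (N, 2) array of points; y: length-N array of class labels in {0, 1}.
    Returns the mixing proportions (class frequencies) and the class means.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    priors = {c: float(np.mean(y == c)) for c in (0, 1)}
    means = {c: X[y == c].mean(axis=0) for c in (0, 1)}
    return priors, means

def classify(x, priors, means):
    """Pick the class with the larger prior-weighted isotropic Gaussian density."""
    score = {c: np.log(priors[c]) - 0.5 * np.sum((x - means[c]) ** 2)
             for c in (0, 1)}
    return max(score, key=score.get)
```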

(a) Plot the maximum likelihood estimates of the means of the two Gaussians in the figure. Mark the means as points "x" and label them "0" and "1" according to the class.

(b) Based on the learned GMM, what is the probability of generating a new data point that belongs to class 0?

(c) How many data points are classified incorrectly?

(d) Draw the decision boundary in the same figure.

9 K-means Clustering (9 points)
There is a set S consisting of 6 points in the plane shown below: a = (0, 0), b = (8, 0), c = (16, 0), d = (0, 6), e = (8, 6), f = (16, 6). Now we run the k-means algorithm on those points with k = 3. The algorithm uses the Euclidean distance metric (i.e. the straight line distance between two points) to assign each point to its nearest centroid. Ties are broken in favor of the centroid to the left/down. Two definitions:
- A k-starting configuration is a subset of k starting points from S that form the initial centroids, e.g. {a, b, c}.
- A k-partition is a partition of S into k non-empty subsets, e.g. {a, b, e}, {c, d}, {f} is a 3-partition.
Clearly any k-partition induces a set of k centroids in the natural manner. A k-partition is called stable if a repetition of the k-means iteration with the induced centroids leaves it unchanged.
[Figure: the six points plotted in the plane, with a, b, c at y = 0 and d, e, f at y = 6, at x = 0, 8, 16 respectively.]
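A short k-means sketch on these six points may be useful for checking the table below; ties are broken toward the centroid to the left, then down, by sorting candidates on (distance, x, y). The function name and iteration cap are illustrative choices.

```python
import numpy as np

points = {'a': (0, 0), 'b': (8, 0), 'c': (16, 0),
          'd': (0, 6), 'e': (8, 6), 'f': (16, 6)}

def kmeans(starting_names, iters=10):
    """Run k-means from a 3-starting configuration and return the final partition."""
    centroids = [np.array(points[n], dtype=float) for n in starting_names]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for name, p in points.items():
            p = np.asarray(p, dtype=float)
            # Ties broken in favour of the centroid to the left, then down.
            best = min(range(len(centroids)),
                       key=lambda i: (float(np.sum((p - centroids[i]) ** 2)),
                                      centroids[i][0], centroids[i][1]))
            clusters[best].append(name)
        centroids = [np.mean([points[n] for n in cl], axis=0)
                     for cl in clusters if cl]
    return [sorted(cl) for cl in clusters if cl]

print(kmeans(['a', 'b', 'c']))
```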

(a) How many 3-starting configurations are there? (Remember, a 3-starting configuration is just a subset, of size 3, of the six datapoints.)
(b) Fill in the following table:

3-partition | Is it stable? | An example 3-starting configuration that can arrive at the 3-partition after 0 or more iterations of k-means (or write "none" if no such 3-starting configuration) | The number of unique starting configurations that can arrive at the 3-partition
{a, b, e}, {c, d}, {f} | | |
{a, b}, {d, e}, {c, f} | | |
{a, d}, {b, e}, {c, f} | | |
{a}, {d}, {b, c, e, f} | | |
{a, b}, {d}, {c, e, f} | | |
{a, b, d}, {c}, {e, f} | | |

10 Hidden Markov Models (8 points)
Consider a hidden Markov model illustrated in the figure shown below, which shows the hidden state transitions and the associated probabilities along with the initial state distribution. We assume that the state dependent outputs (coin flips) are governed by the following distributions:
P(x = heads | s = 1) = 0.51
P(x = heads | s = 2) = 0.49
P(x = tails | s = 1) = 0.49
P(x = tails | s = 2) = 0.51
In other words, our coin is slightly biased towards heads in state 1, whereas in state 2 tails is a somewhat more probable outcome.
[Figure: a two-state HMM unrolled over time t = 0, 1, 2, .... The initial distribution is P(s0 = 1) = 0.01, P(s0 = 2) = 0.99; each state self-transitions with probability 0.9 and switches to the other state with probability 0.1.]
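To make the numbers concrete, here is a brute-force sketch that scores every state sequence for a short observation sequence, using the emission table above and the initial/transition probabilities as read off the figure; the function name is illustrative.

```python
from itertools import product

init = {1: 0.01, 2: 0.99}
trans = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.1, 2: 0.9}}
emit = {1: {'heads': 0.51, 'tails': 0.49},
        2: {'heads': 0.49, 'tails': 0.51}}

def best_state_sequence(obs):
    """Exhaustive search over state sequences (fine for short observation lists)."""
    best, best_p = None, -1.0
    for states in product([1, 2], repeat=len(obs)):
        p = init[states[0]] * emit[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
        if p > best_p:
            best, best_p = states, p
    return best, best_p

print(best_state_sequence(['heads', 'heads', 'heads']))
```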
(a) Now, suppose we observe three coin flips, all resulting in heads. The sequence of observations is therefore heads, heads, heads. What is the most likely state sequence given these three observations? (It is not necessary to use the Viterbi algorithm to deduce this, nor for any subsequent questions.)

(b) What happens to the most likely state sequence if we observe a long sequence of all heads (e.g., 10^6 heads in a row)?

(c) Consider the following 3-state HMM. π1, π2 and π3 are the probabilities of starting from each state S1, S2 and S3. Give a set of values so that the resulting HMM maximizes the likelihood of the output sequence ABA.

[Figure: a 3-state HMM (states S1, S2, S3) with blank initial probabilities π1, π2, π3, blank transition probabilities, and blank output distributions over the symbols A and B, all to be filled in.]

(d) We're going to use EM to learn the parameters for the following HMM. Before the first iteration of EM we have initialized the parameters as shown in the following figure.
(True or False) For these initial values, EM will successfully converge to the model that maximizes the likelihood of the training sequence ABA.

[Figure: the same 3-state HMM with every initial probability set to 1/3 (π1 = π2 = π3 = 1/3), every transition probability set to 1/3, and every state emitting A with probability 1/3 and B with probability 2/3.]

(e) (True or False) In general, when we are trying to learn an HMM with a small number of states from a large number of observations, we can almost always increase the training data likelihood by permitting more hidden states.
