Final 2003
NAME (CAPITALS):
1 Short Questions (16 points)
(a) Traditionally, when we have a real-valued input attribute during decision-tree learning we consider a binary split according to whether the attribute is above or below some threshold. Pat suggests that instead we should just have a multiway split with one branch for each of the distinct values of the attribute. From the list below choose the single biggest problem with Pat's suggestion:
(i) It is too computationally expensive.
(ii) It would probably result in a decision tree that scores badly on the training set and a test set.
(iii) It would probably result in a decision tree that scores well on the training set but badly on a test set.
(iv) It would probably result in a decision tree that scores well on a test set but badly on a training set.
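The phenomenon this part is probing can be seen directly: a multiway split on a real-valued attribute gives nearly every training example its own branch, and an unseen value falls into no branch at all. A minimal Python sketch with made-up data:

# Hypothetical training data: (real-valued attribute, class label).
train = [(2.31, 'yes'), (4.80, 'no'), (5.17, 'yes'), (7.02, 'no')]

# Pat's multiway split: one branch per distinct attribute value.
branches = {x: label for x, label in train}

# Every training example is classified perfectly (one example per branch) ...
print(all(branches[x] == label for x, label in train))
# ... but a previously unseen value matches no branch at all.
print(4.81 in branches)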
(b) You have a dataset with three categorical input attributes A, B and C. There is one categorical output attribute Y. You are trying to learn a Naive Bayes Classifier for predicting Y. Which of these Bayes Net diagrams represents the naive Bayes classifier assumption?
[Diagrams (i)-(iv): four candidate Bayes net structures over the nodes A, B, C and Y.]
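As a reference point (this is the standard definition of the naive Bayes assumption, not something read off the diagrams above), the model factorizes the joint distribution so that the inputs are conditionally independent given the class:

% Naive Bayes factorization over inputs A, B, C and class Y.
P(A, B, C, Y) \;=\; P(Y)\, P(A \mid Y)\, P(B \mid Y)\, P(C \mid Y)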
(c) For a neural network, which one of these structural assumptions is the one that most affects the trade-off between underfitting (i.e. a high bias model) and overfitting (i.e. a high variance model):
(i) The number of hidden nodes
(ii) The learning rate
(iii) The initial choice of weights
(iv) The use of a constant-term unit input
(d) For polynomial regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:
(i) The polynomial degree
(ii) Whether we learn the weights by matrix inversion or gradient descent
(iii) The assumed variance of the Gaussian noise
(iv) The use of a constant-term unit input
(e) For a Gaussian Bayes classifier, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:
(i) Whether we learn the class centers by Maximum Likelihood or Gradient Descent
(ii) Whether we assume full class covariance matrices or diagonal class covariance matrices
(iii) Whether we have equal class priors or priors estimated from the data.
(iv) Whether we allow classes to have different mean vectors or we force them to share the same mean vector
(f) For Kernel Regression, which one of these structural assumptions is the one that most affects the trade-off between underfitting and overfitting:
(i) Whether the kernel function is Gaussian versus triangular versus box-shaped
(ii) Whether we use Euclidean versus L1 versus L∞ metrics
(iii) The kernel width
(iv) The maximum height of the kernel function
(g) (True or False) Given two classifiers A and B, if A has a lower VC-dimension than B then A almost certainly will perform better on a test set.
(h) P(Good Movie | Includes Tom Cruise) = 0.01
P(Good Movie | Tom Cruise absent) = 0.1
P(Tom Cruise in a randomly chosen movie) = 0.01
What is P(Tom Cruise is in the movie | Not a Good Movie)?
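The arithmetic can be checked with a short Python sketch of the Bayes-rule calculation, using only the three probabilities given above:

# Given quantities.
p_good_given_tc = 0.01      # P(Good Movie | Includes Tom Cruise)
p_good_given_no_tc = 0.1    # P(Good Movie | Tom Cruise absent)
p_tc = 0.01                 # P(Tom Cruise in a randomly chosen movie)

# P(Not Good) by total probability, then Bayes' rule.
p_not_good = (1 - p_good_given_tc) * p_tc + (1 - p_good_given_no_tc) * (1 - p_tc)
p_tc_given_not_good = (1 - p_good_given_tc) * p_tc / p_not_good
print(p_tc_given_not_good)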
2 Markov Decision Processes (13 points)
For this question it might be helpful to recall the following geometric identities, which assume 0 ≤ γ < 1:

    Σ_{i=0}^{k} γ^i = (1 - γ^(k+1)) / (1 - γ)        Σ_{i=0}^{∞} γ^i = 1 / (1 - γ)
The following figure shows an MDP with N states. All states have two actions (North and Right) except Sn, which can only self-loop. Unlike most MDPs, all state transitions are deterministic. Assume discount factor γ.

[Figure: a chain of states S1, S2, S3, ..., Sn-1, Sn; every transition has p = 1; states S1 through Sn-1 have reward r = 1 and Sn has reward r = 10.]

For questions (a)-(e), express your answer as a finite expression (no summation signs or "..."s) in terms of n and/or γ.
(a) What is J*(Sn)?
(c) What is J*(S1)?
(d) Suppose you try to solve this MDP using value iteration. What is J^1(S1)?
(e) Suppose you try to solve this MDP using value iteration. What is J^2(S1)?
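Parts (d) and (e) can be sanity-checked numerically. Below is a minimal value-iteration sketch for a small instance of this chain MDP; the dynamics assumed here for illustration (Right moves one state to the right, North self-loops, reward r = 1 in S1..Sn-1 and r = 10 in Sn) and the choices n = 5, γ = 1/2 are assumptions for the sketch, not part of the question.

# Value iteration on an assumed 5-state version of the chain MDP, gamma = 1/2.
n, gamma = 5, 0.5
rewards = [1.0] * (n - 1) + [10.0]

J = [0.0] * n
for it in range(1, 6):
    new_J = []
    for s in range(n):
        north = rewards[s] + gamma * J[s]                    # assumed self-loop
        right = rewards[s] + gamma * J[min(s + 1, n - 1)]    # assumed move right
        new_J.append(max(north, right))
    J = new_J
    print("after iteration", it, ":", [round(v, 3) for v in J])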
(f) Suppose your computer has exact arithmetic (no rounding errors). How many iterations of value iteration will be needed before all states record their exact (correct to infinite decimal places) J value? Pick one:
(i) Less than 2n
(ii) Between 2n and n^2
(iii) Between n^2 + 1 and 2^n
(iv) It will never happen
(g) Suppose you run policy iteration. During one step of policy iteration you compute the value of the current policy by computing the exact solution to the appropriate system of n equations in n unknowns. Suppose too that when choosing the action during the policy improvement step, ties are broken by choosing North.
Suppose policy iteration begins with all states choosing North.
How many steps of policy iteration will be needed before all states record their exact (correct to infinite decimal places) J value? Pick one:
(i) Less than 2n
(ii) Between 2n and n^2
(iii) Between n^2 + 1 and 2^n
(iv) It will never happen
3 Reinforcement Learning (10 points)
This question uses the same MDP as the previous question, repeated here for your convenience. Again, assume γ = 1/2.

[Figure: the same chain of states S1, S2, S3, ..., Sn-1, Sn; every transition has p = 1; states S1 through Sn-1 have r = 1 and Sn has r = 10.]
Suppose we are discovering the optimal policy via Q-learning. We begin with a Q-table initialized with 0's everywhere:
Q(Si, North) = 0 for all i
Q(Si, Right) = 0 for all i
Because the MDP is deterministic, we run Q-learning with a learning rate α = 1. Assume we start Q-learning at state S1.
(a) Suppose our exploration policy is to always choose a random action. How many steps do we expect to take before we first enter state Sn?
(i) O(n) steps
(ii) O(n^2) steps
(iii) O(n^3) steps
(iv) O(2^n) steps
(v) It will certainly never happen
(b) Suppose our exploration is greedy and we break ties by going North:
Choose North if Q(Si, North) ≥ Q(Si, Right)
Choose Right if Q(Si, North) < Q(Si, Right)
How many steps do we expect to take before we first enter state Sn?
(i) O(n) steps
(ii) O(n^2) steps
(iii) O(n^3) steps
(iv) O(2^n) steps
(v) It will certainly never happen
(c) Suppose our exploration is greedy and we break ties by going Right:
Choose North if Q(Si, North) > Q(Si, Right)
Choose Right if Q(Si, North) ≤ Q(Si, Right)
How many steps do we expect to take before we first enter state Sn?
(i) O(n) steps
(ii) O(n^2) steps
(iii) O(n^3) steps
(iv) O(2^n) steps
(v) It will certainly never happen
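The Q-learning update itself is easy to simulate. The sketch below uses the same assumed dynamics as the value-iteration sketch in the previous question (Right moves one state right, North self-loops, r = 1 in S1..Sn-1 and r = 10 in Sn), learning rate α = 1, and the greedy, ties-to-North exploration rule from part (b). It only demonstrates the update Q(s, a) <- r + γ max_a' Q(s', a'); it does not compute the expected step counts asked for above.

# Q-learning with alpha = 1 on an assumed 5-state chain, greedy exploration,
# ties broken toward North.
n, gamma = 5, 0.5
rewards = [1.0] * (n - 1) + [10.0]
Q = {(s, a): 0.0 for s in range(n) for a in ("North", "Right")}

s = 0
for step in range(30):
    a = "North" if Q[(s, "North")] >= Q[(s, "Right")] else "Right"
    s_next = s if (a == "North" or s == n - 1) else s + 1   # Sn only self-loops
    # Deterministic MDP and alpha = 1: overwrite with the one-step backup.
    Q[(s, a)] = rewards[s] + gamma * max(Q[(s_next, "North")], Q[(s_next, "Right")])
    s = s_next

print(Q)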
WARNING: Question (d) is only worth 1 point so you should probably just
guess the answer unless you have plenty of time.
(d) In this question we work with a similar MDP except that each state other than Sn has a punishment (-1) instead of a reward (+1). Sn keeps the same large reward (10). The new MDP is shown below:

[Figure: the same chain of states S1, ..., Sn with all transitions p = 1; states S1 through Sn-1 now have r = -1 and Sn has r = 10.]
4 Bayesian Networks (11 points)
Construction. Two astronomers in two different parts of the world make measurements M1 and M2 of the number of stars N in some small regions of the sky, using their telescopes. Normally, there is a small possibility of error by up to one star in each direction. Each telescope can be, with a much smaller probability, badly out of focus (events F1 and F2). In such a case the scientist will undercount by three or more stars or, if N is less than three, fail to detect any stars at all.
For questions (a) and (b), consider the four networks shown below.

[Diagrams (i)-(iv): four candidate Bayes net structures over the nodes N, F1, F2, M1 and M2.]
(a) Which of them correctly, but not necessarily efficiently, represents the above information? Note that there may be multiple answers.
Inference. A student of the Machine Learning class notices that people driving SUVs (S) consume large amounts of gas (G) and are involved in more accidents than the national average (A). He also noticed that there are two types of people that drive SUVs: people from Pennsylvania (L) and people with large families (F). After collecting some statistics, he arrives at the following Bayesian network.
[Network structure: L and F are parents of S; S is the parent of both A and G.]
P(L) = 0.4                P(F) = 0.6
P(S | L, F) = 0.8         P(S | ~L, F) = 0.5
P(S | L, ~F) = 0.6        P(S | ~L, ~F) = 0.3
P(A | S) = 0.7            P(G | S) = 0.8
P(A | ~S) = 0.3           P(G | ~S) = 0.2
(c) What is P(S)?
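A short Python sketch of the required computation, marginalizing over the parents L and F using the CPTs listed above:

# P(S) = sum over l, f of P(l) P(f) P(S | l, f).
p_L, p_F = 0.4, 0.6
p_S_given = {(True, True): 0.8, (False, True): 0.5,
             (True, False): 0.6, (False, False): 0.3}

p_S = sum((p_L if l else 1 - p_L) * (p_F if f else 1 - p_F) * p_S_given[(l, f)]
          for l in (True, False) for f in (True, False))
print(p_S)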
Consider the following Bayesian network. State whether the given conditional independences are implied by the net structure.

[Figure: a Bayesian network over the nodes A, B, C, D, E and F.]
5 Instance Based Learning (8 points)
Consider the following dataset with one real-valued input x and one binary output y. We are going to use k-NN with unweighted Euclidean distance to predict y for x.

    x:   -0.1   0.7   1.0   1.6   2.0   2.5   3.2   3.5   4.1   4.9
    y:     -     +     +     -     +     +     -     -     +     +
(a) What is the leave-one-out cross-validation error of 1-NN on this dataset? Give your answer as the number of misclassifications.
(b) What is the leave-one-out cross-validation error of 3-NN on this dataset? Give your answer as the number of misclassifications.
Consider a dataset with N examples {(x_i, y_i) | 1 ≤ i ≤ N}, where both x_i and y_i are real-valued for all i. Examples are generated by y_i = w_0 + w_1 x_i + e_i, where e_i is a Gaussian random variable with mean 0 and standard deviation 1.

(c) We use least-squares linear regression to solve for w_0 and w_1, that is,

    {w_0, w_1} = argmin_{w_0, w_1} Σ_{i=1}^{N} (y_i - w_0 - w_1 x_i)^2.

We assume the solution is unique. Which one of the following statements is true?
(i)   Σ_{i=1}^{N} (y_i - w_0 - w_1 x_i) y_i = 0
(ii)  Σ_{i=1}^{N} (y_i - w_0 - w_1 x_i) x_i^2 = 0
(iii) Σ_{i=1}^{N} (y_i - w_0 - w_1 x_i) x_i = 0
(iv)  Σ_{i=1}^{N} (y_i - w_0 - w_1 x_i)^2 = 0
(d) We change the optimization criterion to include local weights, that is,

    {w_0, w_1} = argmin_{w_0, w_1} Σ_{i=1}^{N} α_i^2 (y_i - w_0 - w_1 x_i)^2,

where α_i is a local weight. Which one of the following statements is true?
(i)   Σ_{i=1}^{N} α_i^2 (y_i - w_0 - w_1 x_i)(x_i + α_i) = 0
(ii)  Σ_{i=1}^{N} α_i (y_i - w_0 - w_1 x_i) x_i = 0
(iii) Σ_{i=1}^{N} α_i^2 (y_i - w_0 - w_1 x_i)(x_i y_i + w_1) = 0
(iv)  Σ_{i=1}^{N} α_i^2 (y_i - w_0 - w_1 x_i) x_i = 0
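The stationarity conditions behind (c) and (d) can be checked numerically. The numpy sketch below uses made-up data and made-up local weights; it fits the two criteria and evaluates the weighted residual-input products, which should come out (numerically) zero at the optimum.

# Checking sum_i (y_i - w0 - w1 x_i) x_i = 0 for ordinary least squares, and
# sum_i a_i^2 (y_i - w0 - w1 x_i) x_i = 0 for the locally weighted criterion.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 1.5 + 2.0 * x + rng.normal(size=20)
a = rng.uniform(0.5, 2.0, size=20)              # hypothetical local weights

X = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
w0, w1 = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.sum((y - w0 - w1 * x) * x))            # ~ 0

# Minimizing sum a_i^2 (y_i - w0 - w1 x_i)^2 is ordinary least squares on the
# rescaled problem (a_i * row_i, a_i * y_i).
w0w, w1w = np.linalg.lstsq(X * a[:, None], y * a, rcond=None)[0]
print(np.sum(a ** 2 * (y - w0w - w1w * x) * x))  # ~ 0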
6 VC-dimension (9 points)
Let H denote a hypothesis class, and VC(H) denote its VC dimension.
(a) (True or False) If there exists a set of k instances that cannot be shattered by H, then VC(H) < k.
(b) (True or False) If two hypothesis classes H1 and H2 satisfy H1 ⊆ H2, then VC(H1) ≤ VC(H2).
(c) (True or False) If three hypothesis classes H1, H2 and H3 satisfy H1 = H2 ∪ H3, then VC(H1) ≤ VC(H2) + VC(H3).
(f) H is the set of all circles in the 2D plane. Points inside the circles are classified as 1, otherwise 0.
7 SVM and Kernel Methods (8 points)
(a) Kernel functions implicitly define some mapping function φ(·) that transforms an input instance x ∈ R^d to a high-dimensional feature space Q by giving the form of the dot product in Q: K(x_i, x_j) = φ(x_i) · φ(x_j).
Assume we use the radial basis kernel function K(x_i, x_j) = exp(-½ ||x_i - x_j||^2). Thus we assume that there's some implicit unknown function φ(x) such that

    φ(x_i) · φ(x_j) = K(x_i, x_j) = exp(-½ ||x_i - x_j||^2)

Prove that for any two input instances x_i and x_j, the squared Euclidean distance of their corresponding points in the feature space Q is less than 2, i.e. prove that ||φ(x_i) - φ(x_j)||^2 < 2.
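This is not the requested proof, but the claim can be sanity-checked numerically via the kernel trick: ||φ(x_i) - φ(x_j)||^2 expands to K(x_i, x_i) - 2 K(x_i, x_j) + K(x_j, x_j), which needs no explicit φ. The random vectors below are purely illustrative.

# Numeric check that the feature-space squared distance stays below 2.
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * np.sum((a - b) ** 2))

rng = np.random.default_rng(1)
for _ in range(5):
    xi, xj = rng.normal(size=3), rng.normal(size=3)
    sq_dist = rbf(xi, xi) - 2 * rbf(xi, xj) + rbf(xj, xj)
    print(round(float(sq_dist), 4), sq_dist < 2)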
(b) With the help of a kernel function, SVM attempts to construct a hyperplane in the feature space Q that maximizes the margin between two classes. The classification decision for any x is made on the basis of the sign of

    ŵ^T φ(x) + ŵ_0 = Σ_{i∈SV} α_i y_i K(x_i, x) + ŵ_0 = f(x; α, ŵ_0),

where ŵ and ŵ_0 are parameters for the classification hyperplane in the feature space Q, SV is the set of support vectors, and α_i is the coefficient for support vector x_i.
Again we use the radial basis kernel function. Assume that the training instances are linearly separable in the feature space Q, and assume that the SVM finds a margin that perfectly separates the points.
(True or False) If we choose a test point x_far which is far away from any training instance x_i (distance here is measured in the original space R^d), we will observe that f(x_far; α, ŵ_0) ≈ ŵ_0.
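The setup in (b) can be played with directly. The sketch below uses scikit-learn (an added dependency, not part of the exam) to fit an approximately hard-margin RBF SVM on toy 1-D data, then compares the decision function at a distant point with the learned intercept; gamma = 0.5 matches K(x_i, x_j) = exp(-½ ||x_i - x_j||^2).

# RBF-kernel SVM on toy data; decision value at a far-away point vs intercept.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf", gamma=0.5, C=1e6).fit(X, y)   # large C ~ hard margin

x_far = np.array([[100.0]])
print(clf.decision_function(x_far), clf.intercept_)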
(c) (True or False) The SVM learning algorithm is guaranteed to find the globally optimal hypothesis with respect to its objective function.
(d) (True or False) The VC dimension of a Perceptron is smaller than the VC dimension of a simple linear SVM.
(e) (True or False) After being mapped into feature space Q through a radial basis kernel function, a Perceptron may be able to achieve better classification performance than in its original space (though we can't guarantee this).
(f) (True or False) After being mapped into feature space Q through a radial basis kernel function, 1-NN using unweighted Euclidean distance may be able to achieve better classification performance than in its original space (though we can't guarantee this).
8 GMM (8 points)
Consider the classification problem illustrated in the following figure. The data points in the figure are labeled, where "o" corresponds to class 0 and "+" corresponds to class 1. We now estimate a GMM consisting of 2 Gaussians, one Gaussian per class, with the constraint that the covariance matrices are identity matrices. The mixing proportions (class frequencies) and the means of the two Gaussians are free parameters.

[Figure: scatter plot of the labeled data points in the (x1, x2) plane.]
(a) Plot the maximum likelihood estimates of the means of the two Gaussians in the figure. Mark the means as points "x" and label them "0" and "1" according to the class.
(b) Based on the learned GMM, what is the probability of generating a new data point that belongs to class 0?
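Since the figure's data points are not reproduced here, the sketch below uses made-up labeled points; it only illustrates the general fact behind (a) and (b): with identity covariances and one Gaussian per class, the maximum likelihood mean of each Gaussian is the per-class sample mean, and the maximum likelihood mixing proportion is the class frequency, which is also the probability of generating a new point of that class.

# ML estimates for a per-class Gaussian mixture with identity covariances.
import numpy as np

points = np.array([[0.2, 0.3], [0.5, 0.4], [0.4, 1.0],   # hypothetical class-0 points
                   [1.4, 0.9], [1.7, 1.2]])              # hypothetical class-1 points
labels = np.array([0, 0, 0, 1, 1])

for c in (0, 1):
    print("class", c,
          "mean:", points[labels == c].mean(axis=0),
          "mixing proportion:", float(np.mean(labels == c)))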
9 K-means Clustering (9 points)
There is a set S consisting of 6 points in the plane shown below: a = (0, 0), b = (8, 0), c = (16, 0), d = (0, 6), e = (8, 6), f = (16, 6). Now we run the k-means algorithm on those points with k = 3. The algorithm uses the Euclidean distance metric (i.e. the straight-line distance between two points) to assign each point to its nearest centroid. Ties are broken in favor of the centroid to the left/down. Two definitions:

A k-starting configuration is a subset of k starting points from S that form the initial centroids, e.g. {a, b, c}.

A k-partition is a partition of S into k non-empty subsets, e.g. {a, b, e}, {c, d}, {f} is a 3-partition.

Clearly any k-partition induces a set of k centroids in the natural manner. A k-partition is called stable if a repetition of the k-means iteration with the induced centroids leaves it unchanged.
[Figure: the six points plotted in the plane; a, b, c lie on y = 0 and d, e, f on y = 6, at x = 0, 8 and 16.]
(a) How many 3-starting configurations are there? (Remember, a 3-starting configuration is just a subset, of size 3, of the six datapoints.)
(b) Fill in the following table. For each 3-partition, state whether it is stable, give an example 3-starting configuration that can arrive at the 3-partition after 0 or more iterations of k-means (or write "none" if no such 3-starting configuration exists), and give the number of unique starting configurations that can arrive at the 3-partition.

    3-partition                  Is it stable?   Example 3-starting     Number of starting
                                                 configuration          configurations
    {a, b, e}, {c, d}, {f}
    {a, b}, {d, e}, {c, f}
    {a, d}, {b, e}, {c, f}
    {a}, {d}, {b, c, e, f}
    {a, b}, {d}, {c, e, f}
    {a, b, d}, {c}, {e, f}
10 Hidden Markov Models (8 points)
Consider a hidden Markov model illustrated in the figure shown below, which shows the hidden state transitions and the associated probabilities along with the initial state distribution. We assume that the state-dependent outputs (coin flips) are governed by the following distributions:
P(x = heads | s = 1) = 0.51
P(x = heads | s = 2) = 0.49
P(x = tails | s = 1) = 0.49
P(x = tails | s = 2) = 0.51
In other words, our coin is slightly biased towards heads in state 1 whereas in state 2 tails is a somewhat more probable outcome.

[Figure: the hidden-state transition diagram with its transition probabilities and the initial state distribution.]
(b) What happens to the most likely state sequence if we observe a long sequence of all heads (e.g., 10^6 heads in a row)?
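The behaviour in (b) can be explored with a small Viterbi sketch. The emission probabilities are the ones given above; the transition matrix and initial distribution used here are assumptions filled in for illustration (the exam's transition diagram is not reproduced), chosen so that each state strongly prefers to stay where it is.

# Viterbi decoding for a two-state HMM observing a run of heads.
import numpy as np

emit = np.array([[0.51, 0.49],    # state 1: P(heads), P(tails)
                 [0.49, 0.51]])   # state 2: P(heads), P(tails)
trans = np.array([[0.9, 0.1],     # hypothetical transition matrix
                  [0.1, 0.9]])
init = np.array([0.5, 0.5])       # hypothetical initial distribution

obs = [0] * 20                    # 0 = heads, 1 = tails

logdelta = np.log(init) + np.log(emit[:, obs[0]])
back = []
for o in obs[1:]:
    scores = logdelta[:, None] + np.log(trans)   # scores[i, j]: best path ending i -> j
    back.append(scores.argmax(axis=0))
    logdelta = scores.max(axis=0) + np.log(emit[:, o])

# Backtrack the most likely state sequence (0 = state 1, 1 = state 2).
state = int(logdelta.argmax())
path = [state]
for b in reversed(back):
    state = int(b[state])
    path.append(state)
print(list(reversed(path)))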
(c) Consider the following 3-state HMM, where π1, π2 and π3 are the probabilities of starting from each state S1, S2 and S3. Give a set of values so that the resulting HMM maximizes the likelihood of the output sequence ABA.

[Figure: the 3-state HMM with states S1, S2 and S3 and outputs A and B; the initial probabilities π1, π2, π3, the transition probabilities and the emission probabilities are left blank, to be filled in.]
(d) We're going to use EM to learn the parameters for the following HMM. Before the first iteration of EM we have initialized the parameters as shown in the following figure.
(True or False) For these initial values, EM will successfully converge to the model that maximizes the likelihood of the training sequence ABA.

[Figure: the HMM with its initial parameter values.]
(e) (True or False) In general, when we are trying to learn an HMM with a small number of states from a large number of observations, we can almost always increase the training-data likelihood by permitting more hidden states.