Final2008f Solution
Final Exam
Professor: Eric Xing Date: December 8, 2008
• There are 9 questions in this exam (18 pages including this cover sheet).
• Questions are not equally difficult.
• This exam is open book and open notes. Computers, PDAs, and cell phones are not allowed.
• You have three hours.
• Good luck!
1 Assorted Questions [20 points]
1. (True or False, 2 pts) PCA and Spectral Clustering (such as Andrew Ng’s) perform eigen-decomposition on two different matrices. However, the sizes of these two matrices are the same.
Solutions: F
2. (True or False, 2 pts) The dimensionality of the feature map generated by a polynomial kernel (e.g., K(x, y) = (1 + x · y)^d) is polynomial with respect to the power d of the polynomial kernel.
Solutions: T
3. (True or False, 2 pts) Since classification is a special case of regression, logistic regression
is a special case of linear regression.
Solutions: F
4. (True or False, 2 pts) For any two variables x and y having joint distribution p(x, y), we always have H[x, y] ≥ H[x] + H[y], where H is the entropy function.
Solutions: F
5. (True or False, 2 pts) The Markov Blanket of a node x in a graph with vertex set X is the smallest set Z such that x ⊥ X \ (Z ∪ {x}) | Z.
Solutions: T
6. (True or False, 2 pts) For some directed graphs, moralization decreases the number of edges
present in the graph.
Solutions: F
7. (True or False, 2 pts) The L2 penalty in a ridge regression is equivalent to a Laplace prior
on the weights.
Solutions: F
8. (True or False, 2 pts) There is at least one set of 4 points in ℜ³ that can be shattered by the hypothesis set of all 2D planes in ℜ³.
Solutions: T
9. (True or False, 2 pts) The log-likelihood of the data will always increase through successive iterations of the expectation maximization algorithm.
Solutions: F
10. (True or False, 2 pts) One disadvantage of Q-learning is that it can only be used when the
learner has prior knowledge of how its actions affect its environment.
Solutions: F
2 Support Vector Machine (SVM) [10 pts]
1. Properties of Kernel
1.1. (2 pts) Prove that the kernel K(x1, x2) is symmetric, where x1 and x2 are the feature vectors for two examples.
Hint: your proof will not be longer than 2 or 3 lines.
Solutions: Let Φ(x1) and Φ(x2) be the feature maps for x1 and x2, respectively. Then we have K(x1, x2) = Φ(x1)′Φ(x2) = Φ(x2)′Φ(x1) = K(x2, x1).
1.2. (4 pts) Given n training examples xi (i = 1, ..., n), the kernel matrix A is an n × n square matrix, where A(i, j) = K(xi, xj). Prove that the kernel matrix A is positive semi-definite.
Hints: (1) Remember that an n × n matrix A is positive semi-definite iff for any n-dimensional vector f, we have f′Af ≥ 0. (2) For simplicity, you can prove this statement just for the following particular kernel function: K(xi, xj) = (1 + xi · xj)².
Solutions: Let Φ(xi) be the feature map for the ith example and define the matrix B = [Φ(x1), ..., Φ(xn)]. It is easy to verify that A = B′B. Then, we have f′Af = (Bf)′(Bf) = ‖Bf‖² ≥ 0.
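As a quick numerical sanity check of this argument (a sketch in Python, assuming numpy is available), one can build the kernel matrix for the particular kernel K(xi, xj) = (1 + xi · xj)² on random data and confirm that its eigenvalues are non-negative:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))          # 10 random examples in 3-d
    A = (1.0 + X @ X.T) ** 2              # A(i, j) = (1 + x_i . x_j)^2
    eigvals = np.linalg.eigvalsh(A)       # eigenvalues of the symmetric kernel matrix
    print(eigvals.min() >= -1e-8)         # True: A is positive semi-definite up to round-off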
2. Soft-Margin Linear SVM. Given the following dataset in 1-d space (Figure 1), which consists of 4 positive data points {0, 1, 2, 3} and 3 negative data points {−3, −2, −1}, suppose that we want to learn a soft-margin linear SVM for this data set. Remember that the soft-margin linear SVM can be formalized as the following constrained quadratic optimization problem. In this formulation, C is the regularization parameter, which balances the size of the margin (i.e., smaller w^t w) against the violation of the margin (i.e., smaller Σ_{i=1}^{m} εi).

argmin_{w,b}  (1/2) w^t w + C Σ_{i=1}^{m} εi

subject to:  yi(w^t xi + b) ≥ 1 − εi,   εi ≥ 0  ∀i
Figure 1: Dataset
2.1 (2 pts) If C = 0, which means that we only care about the size of the margin, how many support vectors do we have?
Solutions: 7
2.2 (2 pts) If C → ∞, which means that we only care about the violation of the margin, how many support vectors do we have?
Solutions: 2
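These two answers can be illustrated with a small sketch (assuming scikit-learn is available; since SVC requires C > 0, a very small C stands in for C = 0 and a very large C for C → ∞):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([0., 1., 2., 3., -3., -2., -1.]).reshape(-1, 1)
    y = np.array([1, 1, 1, 1, -1, -1, -1])

    for C in (1e-4, 1e4):                          # "margin only" vs. "violations only"
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, len(clf.support_))                # expect 7 and 2 support vectors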
3 Principal Component Analysis (PCA) [10 pts]
Given 3 data points in 2-d space, (1, 1), (2, 2) and (3, 3),
(a) (1 pt) What is the first principal component?
Solutions: pc = (1/√2, 1/√2)′ = (0.707, 0.707)′ (the negation is also correct)
(b) (1 pt) If we want to project the original data points into 1-d space using the principal component you chose, what is the variance of the projected data?
Solutions: 4/3 = 1.33
(c) (1 pt) For the projected data in (b), now if we represent them in the original 2-d space,
what is the reconstruction error?
Solutions: 0
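A short numerical check of parts (a)-(c) (a sketch, assuming numpy; the sign of the component may be flipped):

    import numpy as np

    X = np.array([[1., 1.], [2., 2.], [3., 3.]])
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc = Vt[0]                                       # ~[0.707, 0.707] (up to sign)
    proj = Xc @ pc                                   # 1-d projections
    print(pc, proj.var())                            # variance 4/3, using the 1/n convention
    recon = np.outer(proj, pc) + X.mean(axis=0)      # map back to 2-d
    print(np.sum((X - recon) ** 2))                  # reconstruction error 0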
1.2 (7 pts) PCA and SVD
Given 6 data points in 5-d space, (1, 1, 1, 0, 0), (−3, −3, −3, 0, 0), (2, 2, 2, 0, 0), (0, 0, 0, −1, −1),
(0, 0, 0, 2, 2), (0, 0, 0, −1, −1). We can represent these data points by a 6 × 5 matrix X, where
each row corresponds to a data point:
X = [  1   1   1   0   0
      −3  −3  −3   0   0
       2   2   2   0   0
       0   0   0  −1  −1
       0   0   0   2   2
       0   0   0  −1  −1 ]
(a) (1 pt) What is the sample mean of the data set?
Solutions: [0, 0, 0, 0, 0]
(b) (3 pts) What is the SVD of the data matrix X?
Hints: The SVD for this matrix must take the following form, where a, b, c, d, σ1, σ2 are the parameters you need to decide.

X = [  a    0 ]
    [ −3a   0 ]
    [  2a   0 ]   ×   [ σ1   0 ]   ×   [ c  c  c  0  0 ]
    [  0    b ]       [  0  σ2 ]       [ 0  0  0  d  d ]
    [  0  −2b ]
    [  0    b ]

Solutions: a = ±1/√14 = ±0.267, b = ±1/√6 = ±0.408, c = ±1/√3 = ±0.577, d = ±1/√2 = ±0.707, σ1 = 1/(a · c) = √42 = 6.48, σ2 = 1/(b · d) = √12 = 3.46.
(c) (1 pt) What is the first principal component for the original data points?
Solutions: pc = ±[c, c, c, 0, 0] = ±[0.577, 0.577, 0.577, 0, 0]. (Intuition: first, notice that the first three data points are collinear, and so are the last three; moreover, the first three data points are orthogonal to the last three. Then notice that the norms of the first three are much bigger than those of the last three; therefore, the first pc has the same direction as the first three data points.)
(d) (1 pt) If we want to project the original data points into 1-d space using the principal component you chose, what is the variance of the projected data?
Solutions: var = σ1²/6 = 7. (Intuition: we keep the first three data points and set the last three to [0, 0, 0, 0, 0] (since they are orthogonal to the pc), and then compute the variance among them.)
(e) (1 pt) For the projected data in (d), if we now represent them in the original 5-d space, what is the reconstruction error?
Solutions: rerr = σ2²/6 = 2.¹ (Intuition: since the first three data points are orthogonal to the last three, the reconstruction error is just the sum of the squared norms of the last three data points (2 + 8 + 2 = 12), divided by the total number (6) of data points if we use the average definition.)
¹ If you give the answer rerr = σ2² = 12, that is also correct. In this case, the reconstruction error is the sum (not the average) over all the data points, which is the definition used in Carlos’ lecture notes; Bishop’s book uses the average definition.
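The answers in parts (b)-(e) can be checked numerically as well (a sketch, assuming numpy; signs of the singular vectors may differ):

    import numpy as np

    X = np.array([[1, 1, 1, 0, 0],
                  [-3, -3, -3, 0, 0],
                  [2, 2, 2, 0, 0],
                  [0, 0, 0, -1, -1],
                  [0, 0, 0, 2, 2],
                  [0, 0, 0, -1, -1]], dtype=float)

    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    print(S[:2])                                      # sqrt(42) ~ 6.48 and sqrt(12) ~ 3.46
    print(Vt[0])                                      # first pc ~ [0.577, 0.577, 0.577, 0, 0]
    proj = X @ Vt[0]                                  # project onto the first pc
    print(proj.var())                                 # sigma_1^2 / 6 = 7
    recon = np.outer(proj, Vt[0])                     # rank-1 reconstruction
    print(np.mean(np.sum((X - recon) ** 2, axis=1)))  # sigma_2^2 / 6 = 2 (average definition)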
4 Linear Regression [12 Points]
Figure 2: Linear regression fits (y vs. x) for the three data points under four different regularization settings, shown in panels A-D. Two of the panels are labeled with their fitted coefficients: θ1 = 0.5000, θ0 = 0.8333 and θ1 = 0.3944, θ0 = 0.3521.
Background: In this problem we are working on linear regression with regularization on points in a 2-D space. Figure 2 plots linear regression results on the basis of three data points, (0, 1), (1, 1) and (2, 2), with different regularization penalties.
As we all know, solving a linear regression problem amounts to solving a minimization problem, that is, to compute

arg min_{θ0,θ1}  Σ_{i=1}^{n} (yi − θ1 xi − θ0)² + R(θ0, θ1)

where R represents a regularization penalty, which could be L-1 or L-2. In this problem, n = 3, (x1, y1) = (0, 1), (x2, y2) = (1, 1), and (x3, y3) = (2, 2). R(θ0, θ1) could be either λ(|θ1| + |θ0|) or λ(θ1² + θ0²).
However, in stead of computing the derivatives to get a minimum value, we could adopt a geometric
method. In this way, rather than letting the square error term and the regularization penalty term
vary simultaneously as a function of θ0 and θ1 , we can fix one and only let the other vary at a time.
Figure 3: Contour plots, in the (θ0, θ1) plane, of the decomposition for the linear regression problem with (a) L-2 regularization or (b) L-1 regularization, where the ellipses correspond to the square error term and the circles/squares correspond to the regularization penalty term.

Having an upper bound, r, on the penalty, we can replace R(θ0, θ1) by r and solve a minimization problem on the square error term for any non-negative value of r. Finally, we get the minimum value by enumerating over all possible values of r. That is,
min_{θ0,θ1} { Σ_{i=1}^{n} (yi − θ1 xi − θ0)² + R(θ0, θ1) }
    = min_{r≥0} { min_{θ0,θ1} { Σ_{i=1}^{n} (yi − θ1 xi − θ0)² | R(θ0, θ1) ≤ r } + r }
The value of (θ0, θ1) corresponding to the minimum value of the objective function can be obtained at the same time.
In Figure 3, we plot the square error term, Σ_{i=1}^{n} (yi − θ1 xi − θ0)², with ellipse contours. The circle contours in Fig 3(a) plot an L-2 penalty with λ = 5, whereas the square contours in Fig 3(b) plot an L-1 penalty with λ = 5.
To further explain how it works, the solution to

min_{θ0,θ1} { Σ_{i=1}^{n} (yi − θ1 xi − θ0)² | R(θ0, θ1) ≤ r }

is the height of the smallest ellipse contour that is tangent to (or contained in) the contour that depicts R(θ0, θ1) = r. The desired (θ0, θ1) are the coordinates of the tangent point.
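As a cross-check on the coefficients printed in Figure 2 (a sketch, assuming numpy), the L-2 (ridge) solutions can be computed in closed form, since here the penalty is applied to both θ0 and θ1:

    import numpy as np

    X = np.array([[1., 0.], [1., 1.], [1., 2.]])   # columns: intercept, x
    y = np.array([1., 1., 2.])

    for lam in (0.0, 1.0, 5.0):
        theta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
        print(lam, theta)   # lam = 0 gives (theta0, theta1) ~ (0.8333, 0.5000),
                            # lam = 5 gives ~ (0.3521, 0.3944)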
Question:
1. Please assign each plot in Figure 2 to one (and only one) of the following regularization
methods. You can get some help from Figure 3. Please answer A, B, C or D.
(a) (2 pts) No regularization (or regularization parameter equal to 0).
Σ_{i=1}^{3} (yi − θ1 xi − θ0)²
Solution: C
(b) (3 pts) L-2 regularization with λ being 5.
Σ_{i=1}^{3} (yi − θ1 xi − θ0)² + λ(θ1² + θ0²), where λ = 5
Solution: D
(c) (3 pts) L-1 regularization with λ being 5.
Σ_{i=1}^{3} (yi − θ1 xi − θ0)² + λ(|θ1| + |θ0|), where λ = 5
Solution: B
(d) (2 pts) L-2 regularization with λ being 1.
Σ_{i=1}^{3} (yi − θ1 xi − θ0)² + λ(θ1² + θ0²), where λ = 1
Solution: A
2. (2 pts) If we have many more features (that is, more xi's) and we want to perform feature selection while solving the LR problem, which kind of regularization method do we want to use? (Hint: L-1 or L-2? What about λ?)
Solution: We will choose L-1, and we will use a bigger λ when we want fewer effective features.
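A brief sketch of this effect (assuming scikit-learn; the data here are synthetic and purely illustrative): as the L-1 penalty λ grows, more coefficients are driven exactly to zero, which is what makes L-1 regularization suitable for feature selection.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    w_true = np.zeros(20)
    w_true[:3] = [2.0, -3.0, 1.5]                  # only 3 features are truly relevant
    y = X @ w_true + 0.5 * rng.normal(size=100)

    for lam in (0.01, 0.1, 1.0):
        model = Lasso(alpha=lam).fit(X, y)
        print(lam, int(np.sum(model.coef_ != 0)))  # larger penalty -> fewer nonzero weights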
5 Sampling [8 Points]
1. (2 pts) Suppose we want to compute P (B1|E1) using the naive sampling method. We draw
1,000,000 sample records in total. How many useful samples do we expect to see? (Hint: B1
means Burglary is true.)
Solution: Records with E = 1 are useful samples. So, 1000000 ∗ 0.002 = 2000.
2. (1 pt) Suppose we want to compute P (B1|J1) using the Gibbs sampling algorithm. How many different states of (B, E, A, J, M) will we observe during the process?
Solution: J is fixed at J1, so there are four free variables (B, E, A, M), each of which has two states; hence we can observe 2⁴ = 16 different states.
3. (3 pts) Suppose we want to compute P (B1|J1) using the Gibbs sampling algorithm, and we start from the state (B1,E0,A0,J1,M0). We choose variable E in the first step. What are the possible states after the first step, and what are their probabilities of occurrence, respectively?
Solution: The next possible states are (B1,E0,A0,J1,M0) and (B1,E1,A0,J1,M0), because
only E may change.
With probability 0.9983 it will become (B1,E0,A0,J1,M0), and with probability 0.0017 it will
become (B1,E1,A0,J1,M0).
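The 0.9983 / 0.0017 split comes from the Gibbs conditional P(E | B1, A0) ∝ P(E) P(A0 | B1, E); J and M have no further effect once A is fixed. A sketch of the computation, assuming the conditional probability table values of the standard burglary-alarm network (P(E1) = 0.002, P(A1|B1, E1) = 0.95, P(A1|B1, E0) = 0.94), which are not reproduced in this copy but are consistent with the numbers above:

    # Unnormalized scores for E1 and E0, with B1 and A0 fixed by the current state
    score_E1 = 0.002 * (1 - 0.95)       # P(E1) * P(A0 | B1, E1)
    score_E0 = 0.998 * (1 - 0.94)       # P(E0) * P(A0 | B1, E0)
    z = score_E1 + score_E0
    print(score_E0 / z, score_E1 / z)   # ~0.9983 and ~0.0017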
4. (2 pts) In Markov Chain Monte Carlo (MCMC), is choosing the transition probabilities to
satisfy the property of detailed balance a necessary condition for ensuring that a stationary
distribution exists? Please answer Yes or No.
Solution: No. It is a sufficient condition.
6 Expectation Maximization [10 Points]
Imagine a machine learning class where the probability that a student gets an “A” grade is P (A) =
1/2, a “B” grade P (B) = µ, a “C” grade P (C) = 2µ, and a “D” grade P (D) = 1/2 − 3µ. We are
told that c students get a “C” and d students get a “D”. We don’t know how many students got
exactly an “A” or exactly a “B”. But we do know that h students got either an “A” or a “B”. Therefore, a
and b are unknown values where a + b = h. Our goal is to use expectation maximization to obtain
a maximum likelihood estimate of µ.
1. (4 pts) Expectation step: Which formulas compute the expected values of a and b given µ?
Circle your answers.
        â = (1/2)/(1/2 + h) · µ        b̂ = µ/(1/2 + h) · µ

∗∗∗∗  â = (1/2)/(1/2 + µ) · h        b̂ = µ/(1/2 + µ) · h

        â = µ/(1/2 + µ) · h            b̂ = (1/2)/(1/2 + µ) · h

        â = (1/2)/(1 + µ²) · h         b̂ = µ/(1 + µ²) · h
2. Maximization step: Which formula computes the updated estimate µ̂ of µ, given the expected values of a and b (where b = h − a)? Circle your answer.

∗∗∗∗  µ̂ = (h − a + c) / (6(h − a + c + d))

        µ̂ = (h − a + d) / (6(h − 2a − d))

        µ̂ = (h − a) / (6(h − 2a + c))

        µ̂ = 2(h − a) / (3(h − a + c + d))
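To see why the circled M-step choice maximizes the expected complete log-likelihood, note that only its µ-dependent terms matter, namely b̂ log µ + c log(2µ) + d log(1/2 − 3µ). Setting the derivative with respect to µ to zero gives

(b̂ + c)/µ − 3d/(1/2 − 3µ) = 0   ⟹   (b̂ + c)(1/2 − 3µ) = 3dµ   ⟹   µ̂ = (b̂ + c) / (6(b̂ + c + d)),

which, with b̂ = h − â, is exactly the circled formula (written there with a in place of â).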
7 VC-Dimension and Learning Theory [10 Points]
1. (True/False, 2 pts) Can the set of all rectangles in the 2D plane (which includes non axis-
aligned rectangles) shatter a set of 5 points? Explain in 1-2 sentences.
m ≥ (1/ε) (ln(1/δ) + ln|H|)

m ≥ (1/(2ε²)) (ln(1/δ) + ln|H|)

m ≥ (1/ε) (4 log2(2/δ) + 8 VC(H) log2(13/ε))
For each of the questions below, pick the formula you would use to estimate the number of examples you would need to learn the concept. You do not need to do any computation or plug in any numbers. Explain your answer.
1. (2 pts) Consider instances with two Boolean variables {X1 , X2 }, and responses Y are given
by the XOR function. We try to learn the function f : X → Y using a 2-layer neural network.
8 Hidden Markov Models with continuous emissions (10 points)
In this question, we will study hidden Markov models with continuous emissions. We will use the notation used in class, with xi denoting the output at time i, and yi denoting the corresponding hidden state. The HMM has K states {1 . . . K}. The output for state k is obtained by sampling a Gaussian distribution parameterized by mean µk and standard deviation σk. Thus, we can write the emission probability as p(xi | yi = k, θ) = N(xi | µk, σk). θ is the set of parameters of the HMM, which includes the initial probabilities π, the transition probability matrix A, and the means and standard deviations {µ1, . . . , µK, σ1, . . . , σK}.
Write down the log-likelihood for a sequence of observations of the emissions {x1 , . . . , xn } when the
states (also observed) are {y1 , . . . , yn }.
Solution:
log p(x1, . . . , xn | y1, . . . , yn) = log Π_i p(xi | yi)        (1)
                                       = Σ_i log N(xi | µ_yi, σ_yi)        (2)
Write the forward and backward update equations for this HMM. Explain in a single line how they
are different from the updates we studied in class.
Solution:
α_t^k = N(xt | µk, σk) Σ_i α_{t−1}^i a_{i,k}        (3)

β_t^k = Σ_i a_{k,i} β_{t+1}^i N(x_{t+1} | µi, σi)        (4)

The equations are similar in form to the ones studied in class. But in this case, the output probabilities are Gaussian densities rather than multinomials, and the outputs are continuous rather than discrete.
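A minimal sketch of the forward recursion (3) with Gaussian emissions (assuming numpy and scipy are available; the function and variable names are illustrative only):

    import numpy as np
    from scipy.stats import norm

    def forward(x, pi, A, mu, sigma):
        """alpha[t, k] = p(x_1, ..., x_t, y_t = k) for a 1-d observation sequence x."""
        T, K = len(x), len(pi)
        alpha = np.zeros((T, K))
        alpha[0] = pi * norm.pdf(x[0], mu, sigma)          # Gaussian emission density per state
        for t in range(1, T):
            alpha[t] = norm.pdf(x[t], mu, sigma) * (alpha[t - 1] @ A)
        return alpha

The backward recursion in (4) is analogous, initializing β at the final time step to ones and iterating backwards.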
We are given a sequence of observations X = {x1 , . . . , xn } and the corresponding hidden states
Y = {y1 , . . . , yn }. We want to find the parameters θ for the HMM.
1. Are the update equations for Aij and πi different from the ones obtained for the HMM we
studied in class? Explain why or why not (2 points).
Solution: The update equations for Aij and πi are the same. They involve only the state
transition counts and so are independent of the form chosen for emission probabilities.
2. What are the update equations for the Gaussian parameters µk and σk ? (Hint: You do not
need to derive them. Given the hidden states, the outputs are all independent of each other,
and each is sampled from one out of K Gaussians.) (2 points)
Solution:
µk = ( Σ_i I[yi = k] xi ) / ( Σ_i I[yi = k] )        (5)

σk² = ( Σ_i I[yi = k] (xi − µk)² ) / ( Σ_i I[yi = k] )        (6)
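A sketch of these two updates for fully observed (x, y) pairs (assuming numpy; it also assumes every state appears at least once in y):

    import numpy as np

    def gaussian_mle_per_state(x, y, K):
        """ML mean and variance of the emissions for each state k, given observed states y."""
        x, y = np.asarray(x, dtype=float), np.asarray(y)
        mu = np.array([x[y == k].mean() for k in range(K)])
        var = np.array([((x[y == k] - mu[k]) ** 2).mean() for k in range(K)])
        return mu, var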
Now, we are only given a sequence of observations X = {x1, . . . , xn}. We want to find the parameters θ for the HMM. (Slides 47 and 48 of the HMM lecture describe the unsupervised learning algorithm for the HMM discussed in class.)
The unsupervised learning algorithm optimizes the expected complete log-likelihood. Why is that
a reasonable choice for the objective function? (1 point)
Solution: The expected complete log-likelihood is (up to an entropy term that does not depend on θ) a lower bound on the observed-data log-likelihood, and iteratively maximizing it is guaranteed to converge to a local optimum of that likelihood. Hence it is a reasonable choice for the objective function.
What is the expected complete log-likelihood ⟨lc(θ; x, y)⟩ for the HMM with continuous Gaussian emissions? Just write the expression; a derivation is not necessary. (1 point)
Solution:
⟨lc(θ; x, y)⟩ = Σ_n Σ_i ⟨y_{n,1}^i⟩ log πi + Σ_n Σ_{t=2}^{T} Σ_{i,j} ⟨y_{n,t−1}^i y_{n,t}^j⟩ log a_{i,j} + Σ_n Σ_{t=1}^{T} Σ_i ⟨y_{n,t}^i⟩ log N(x_{n,t} | µi, σi)        (7)
Suppose you want to find ML estimates µ̂k and σ̂k for the parameters µk and σk. Will the ML expressions have the same form as those obtained for the means and variances in a mixture of Gaussians? Explain in one line. (Hint: write down the terms in ⟨lc(θ; x, y)⟩ that are relevant to the optimization, i.e., that contain µk and σk.) (1 point)
Solution: Yes, the ML expressions will have the same form. The only relevant term in ⟨lc(θ; x, y)⟩ is the last one, which closely resembles the expected complete log-likelihood for a Gaussian mixture model. In this case the posterior term p(yt = k | x) is computed using the forward-backward algorithm rather than by simply using Bayes rule (as is done for a mixture of Gaussians).
9 Bayesian Networks

Consider the Bayesian network shown in Figure 5. All the variables are boolean.
Figure 5: A Bayesian network over the boolean variables a, b, c, d, e, f, g, with edges a → b, b → c, c → d, f → d, d → e, and g → f.
9.1 Likelihood
Write the expression for the joint likelihood of the network in its factored form. (2 points)
Solution: p(a, b, c, d, e, f, g) = p(a)p(b|a)p(c|b)p(d|c, f )p(e|d)p(f |g)p(g)
9.2 D-separation
1. Let X = {c}, Y = {b, d}, Z = {a, e, f, g}. Is X ⊥ Z|Y ? If yes, explain why. If no, show a
path from X to Z that is not blocked. (2 points)
Solution: No, X is not independent of Z given Y. The path c → d ← f is not blocked, since the v-structure at d is observed.
2. Suppose you are allowed to choose a set W such that W ⊂ Z. Then define Z* = Z \ W and Y* = Y ∪ W. What is the smallest set W such that X ⊥ Z*|Y* is true? (2 points)
Solution: W = {f} is the smallest subset that satisfies the requirement. Y* is then the Markov Blanket of node c.
From the graph, we can see that a ⊥ c, d|b. Prove using the axioms of probability that this implies
a ⊥ c|b. (2 points)
Solution: a ⊥ c, d|b means P (a, c, d|b) = P (a|b)P (c, d|b). We want to prove a ⊥ c|b using the
axioms of probability.
P(a, c|b) = Σ_d P(a, c, d|b)        (by the axiom of additivity for disjoint events)        (8)

          = Σ_d P(a|b) P(c, d|b)        (9)

          = P(a|b) Σ_d P(c, d|b)        (10)

          = P(a|b) P(c|b)        (11)
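This argument can also be checked numerically (a sketch, assuming numpy): build any joint over boolean (a, b, c, d) that satisfies P(a, c, d|b) = P(a|b)P(c, d|b), using random conditionals, and verify that P(a, c|b) = P(a|b)P(c|b) follows.

    import numpy as np

    rng = np.random.default_rng(0)
    p_b = rng.dirichlet(np.ones(2))                   # p(b)
    p_a_given_b = rng.dirichlet(np.ones(2), size=2)   # p(a | b), rows indexed by b
    p_cd_given_b = rng.dirichlet(np.ones(4), size=2)  # p(c, d | b), rows indexed by b

    joint = np.zeros((2, 2, 2, 2))                    # indexed [a, b, c, d]
    for a in range(2):
        for b in range(2):
            for c in range(2):
                for d in range(2):
                    joint[a, b, c, d] = p_b[b] * p_a_given_b[b, a] * p_cd_given_b[b, 2 * c + d]

    for b in range(2):
        p_acd = joint[:, b, :, :] / joint[:, b, :, :].sum()   # P(a, c, d | b)
        p_ac = p_acd.sum(axis=2)                              # marginalize out d
        p_a, p_c = p_ac.sum(axis=1), p_ac.sum(axis=0)
        print(np.allclose(p_ac, np.outer(p_a, p_c)))          # True: a independent of c given b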
Suppose that we do not know the directionality of the edges a − b and b − c, and we are trying to learn it by observing the conditional probability p(a|b, c). Some of the entries in the table are observed and noted. Fill in the rest of the conditional probability table so that we obtain the directionality that we see in the graph, i.e., a → b and b → c. (2 points)
Solution: We want a ⊥ c|b, i.e., P(a|b, c) = P(a|b). So we want P(a|b, c) to be the same for all values of c.