CS 221 Fall 19 Solution
CS221
Fall 2019    Name: ______________________________
(by writing my name I agree to abide by the honor code)
SUNet ID:
Read all of the following information before starting the exam:
• This test has 3 problems printed on 28 pages and is worth 150 points total. It is your
responsibility to make sure that you have all of the pages.
• Only the printed (top) side of each page will be scanned so write all your answers on
that side of the paper.
• Keep your answers precise and concise. We may award partial credit so show all your
work clearly and in order.
• Don’t spend too much time on one problem. Read through all the problems carefully
and do the easier ones first. Try to understand the problems intuitively; it really helps
to draw a picture.
• You cannot use any external aids except one double-sided 8.5" x 11" page of notes.
• Good luck!
Total Score: ______ + ______ + ______ = ______
1
1. Under Attack (50 points)
You work for a cybersecurity company, Sanymtec, to monitor online forums. Recently,
you’ve noticed an increasing number of curious sentences that look like this:
These sentences are perfectly readable by humans, but when you feed them into your machine
learning models, they are totally confused and make wildly incorrect predictions. Your
automatic fake news detection system is under attack!
a. (15 points)
As a first step, you would like to classify a sentence x as either adversarial (y = 1) or
not (y = −1). Your boss doesn’t want you to use the hinge loss because she’s worried that
the attacker might be able to more easily reverse engineer the system. So you decide to
investigate alternative loss functions.
For each of the five loss functions Loss(x, y, w) below, state whether it would be reasonable to minimize for this classification task, and briefly justify your answer:
(i) [3 points] Loss(x, y, w) = max{1 − ⌊(w · φ(x))y⌋, 0}, where ⌊a⌋ returns a rounded down to the nearest integer.
Solution No, because the gradient is 0 almost everywhere, so we cannot apply gradient
descent.
(ii) [3 points] Loss(x, y, w) = max{(w · φ(x))y − 1, 0}.
Solution No, because for this loss function, a larger margin corresponds to a larger loss,
which is the opposite of what we want.
2
(iii) [3 points]

Loss(x, y, w) =
  1 − 2(w · φ(x))y         if (w · φ(x))y ≤ 0
  (1 − (w · φ(x))y)²       if 0 < (w · φ(x))y ≤ 1
  0                        if (w · φ(x))y > 1

Note: this loss function is a "smoothed" hinge loss, which is continuously differentiable everywhere (notice that the constants are chosen so that the gradients match where the pieces meet), but still behaves like the hinge loss when the magnitude of the margin is large.

(iv) [3 points]

Loss(x, y, w) =
  max(1 − (w · φ(x)), 0) + 10   if y = +1
  max(1 + (w · φ(x)), 0) − 10   if y = −1
Solution Yes, it turns out that this loss function has the exact same gradient as the hinge
loss! The fact that the two labels have a different constant offset doesn’t matter.
∇w Loss(x, y, w) =
  −φ(x)y   if (w · φ(x))y ≤ 1
  0        otherwise.

Note that this is the simplified form of

∇w Loss(x, y, w) =
  −φ(x)   if (w · φ(x)) ≤ 1 and y = +1
  φ(x)    if (w · φ(x)) ≥ −1 and y = −1
  0       if (w · φ(x)) > 1 and y = +1
  0       if (w · φ(x)) < −1 and y = −1.
(v) [3 points] Loss(x, y, w) = σ(−(w · φ(x))y), where σ(z) = (1 + e−z )−1 is the logistic
function.
Solution Both answers are acceptable. One could argue that the loss is non-convex so that
it is difficult to optimize (even if the gradient exists and is non-zero everywhere). Of course,
non-convexity never stopped anyone, and given that this loss function is a more faithful
approximation to the zero-one loss, one could make a case that this objective is reasonable.
Its gradient is:
∇w Loss(x, y, w) = −σ(−(w · φ(x))y)(1 − σ(−(w · φ(x))y))φ(x)y.
3
b. (10 points)
Your next job is to decide which features to use in order to solve the classification problem.
Assume you have a set D of real English words.
(i) [5 points] Suppose you have a sentence x which is a string; e.g., x = "erath is falt".
Write the Python code for taking a sentence x and producing a dict representing the following two feature templates:
1. x contains word ____
2. number of words in x that are in D
Assume that words in x are separated by spaces; e.g., the words in x above are "erath", "is", "falt".
Solution The key names are arbitrarily chosen. The return value of "erath is falt" should
be: {"num dict": 2, "contains erath": 1, "contains is": 1, "contains falt": 1}. One possible
solution is shown below.
4
Due to ambiguity of the first feature template, solutions that add words in D with value
0 will also be accepted. Solutions that assumed the words in x are unique were also accepted.
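A minimal sketch of such a feature extractor (the key names are arbitrary, and the assumption that "num dict" counts the words of x found in D is ours, since the template text is ambiguous):

```python
def extract_features(x, D):
    """Sketch of the part b(i) feature extractor.

    Feature templates (key names are arbitrary, as the solution notes):
      - "contains_<word>": 1 for each word appearing in x
      - "num_dict": number of words of x found in the dictionary D
        (assumed semantics -- the original template text is ambiguous)
    """
    phi = {}
    words = x.split()  # words are separated by spaces
    phi["num_dict"] = sum(1 for w in words if w in D)
    for w in words:
        phi["contains_" + w] = 1
    return phi
```

With D = {"is", "earth", "flat"}, extract_features("erath is falt", D) yields {"num_dict": 1, "contains_erath": 1, "contains_is": 1, "contains_falt": 1}.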
5
(ii) [2 points] If k is the number of unique words that occur in the training set and |D| is
the number of words in the given set of real English words, what is the number of features
that the linear classifier will have?
Solution There is one feature for every unique word that occurs in the training set, plus one feature that captures the number of words in x that are in the dictionary. So the total is k + 1.
Due to ambiguity of the first feature template, this part will be correct if it is consistent
with the code written on the previous page. If the solution to part (i) adds words in D with
value 0, then the answer n + |D| + 1 will also be accepted where n is the number of unique
words that occur in the training set but not in D.
(iii) [3 points] Suppose that an insider leaks Sanymtec’s classification strategy to the
attackers. The classifier itself was not leaked, just the classification strategy behind it,
which reveals that Sanymtec is using a dataset of adversarial sentences to train a classifier
with the features defined in part (i).
The attackers then use this information to try modifying any fixed sentence (e.g., "climate
change is a hoax") into something readable by humans (e.g., "clmaite canhge is a
haox") but classified (incorrectly) as non-adversarial by Sanymtec. How can the attackers
achieve this?
Solution The resulting classifier only knows about adversarial words that it has seen in the training set. Any word that doesn't appear in the training set will not contribute to any of the "x contains word ____" weights. However, the space of possible adversarial words is exponentially large, so given any sentence, the adversary can find a perturbation of each word that has not appeared in the training set. If the adversary then ensures that x contains a non-adversarial word (e.g., "the"), which likely has negative weight, then only this word contributes to the classifier score, resulting in a non-adversarial prediction. Keeping the ratio of unseen words to words in D small can also help ensure that x receives a negative score.
6
Solutions that add an arbitrary type of obfuscation other than permutation are also
accepted (e.g. adding characters: "climate" -> "cloimate").
Due to ambiguity of the question, solutions that mention the problem that attackers
would need to know which adversarial permutations are used in Sanymtec’s training set are
accepted. However, solutions that do not take into account that some of the adversarial
sentences could be weighted very positively by Sanymtec’s classifier are penalized.
7
c. (10 points)
Having built a supervised classifier, you find it extremely hard to collect enough examples
of adversarial sentences. On the other hand, you have a lot of non-adversarial text lying
around.
(i) [3 points] Suppose you have a total of 100,000 training examples that consists of 100
adversarial sentences and 99,900 non-adversarial sentences. You train a classifier and get
99.9% accuracy. Is this a good, meaningful result? Explain why or why not.
Solution No, because you can get at least that level of accuracy by always predicting
non-adversarial (y = −1), which is a trivial solution.
(ii) [3 points] You decide to fit a generative model to the non-adversarial text, which is
a distribution p(x) that assigns a probability to each sentence x. For simplicity, let’s use a
unigram model:
p(x) = ∏_{i=1}^{n} pu(wi),
where w1 , w2 , ..., wn are the words in sentence x, and pu (w) is a probability distribution over
possible w’s.
Suppose you are given a single sentence "the cat in the hat" as training data. Com-
pute the maximum likelihood estimate of the unigram model pu :
w pu (w)
Solution
w pu (w)
the 2/5
cat 1/5
in 1/5
hat 1/5
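The MLE is just the relative frequency of each word token; a quick sketch (function name is ours):

```python
from collections import Counter

def unigram_mle(sentence):
    """Maximum likelihood estimate of the unigram model:
    p_u(w) = count(w) / total number of word tokens."""
    counts = Counter(sentence.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

unigram_mle("the cat in the hat") gives {"the": 0.4, "cat": 0.2, "in": 0.2, "hat": 0.2}, matching the table above.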
8
(iii) [4 points] Given an unseen sentence, your goal is to be able to predict whether that
sentence is adversarial or not. You have a labeled dataset Dtrain = {(x1 , y1 ), . . . , (xn , yn )}
and would like to use the unigram model to train your predictor.
How could you use p(x) (from the previous problem) and Dtrain to obtain a predictor
f (x) that outputs whether a sentence x is adversarial (y = 1) or not (y = −1)? Be precise
in defining f (x). Hint: define a feature vector φ(x).
f (x) =

Solution Ideally, p(x) should assign high probability to non-adversarial text and low probability to adversarial text. The only thing that's missing is the threshold, which can be tuned on Dtrain. One way to do this is to define two features φ(x) = [1, p(x)] and fit a corresponding weight vector w using standard techniques (hinge loss + SGD). Then we simply predict f (x) = sign(w · φ(x)).
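A minimal sketch of this predictor; the weights and the toy language model in the usage below are illustrative only:

```python
def make_predictor(w, p):
    """f(x) = sign(w . phi(x)) with phi(x) = [1, p(x)].

    w: 2-vector fit on Dtrain (e.g., hinge loss + SGD).
    p: the unigram language model p(x) from the previous part.
    """
    def f(x):
        score = w[0] * 1.0 + w[1] * p(x)
        return 1 if score >= 0 else -1  # y = 1 means adversarial
    return f
```

A positive bias with a negative weight on p(x) flags low-probability sentences as adversarial.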
9
d. (15 points)
You notice that the adversarial words are often close to real English words. For example,
you might see "erath" or "eatrh" as misspellings of "earth". Furthermore, the actual
number of adversarial words is rather small (it seems like the attacker just wants to reinforce
the same messages). This makes you think of another unsupervised approach to try.
Let D be the set of real English words as before and a1 , . . . , an be the list of adversarial
words you’ve found, and let dist(a, e) be the number of edits to transform some adversarial
word a to the English word e (how exactly distance is defined is unimportant).
We wish to choose K English words e1 , . . . , eK ∈ D and assign each adversarial word
ai to one of the chosen English words (zi ∈ {1, . . . , K}). Each English word e ∈ D incurs
a cost c(e) if we choose it as one of the K words. Our goal is to minimize the total cost of
choosing e1 , . . . , eK plus the total number of edits from adversarial words a1 , . . . , an to their
assigned English words ez1 , . . . , ezn .
As an example, let D = {"earth", "flat", "scientists"} with c("earth") = 1,
c("flat") = 1, c("scientists") = 2, and a1 = "erath", a2 = "falt", a3 = "eatrh". Then
with K = 2, one possible assignment (presumably the best one) is e1 = "earth",
e2 = "flat", z1 = 1, z2 = 2, z3 = 1.
(i) [3 points] Define a loss function that captures the optimization problem above:
Loss(e1 , . . . , eK , z1 , . . . , zn ) =
Solution As the problem statement says, "our goal is to minimize the total cost of choosing e1, . . . , eK plus the total number of edits from adversarial words a1, . . . , an to their assigned English words ez1 , . . . , ezn ." The total cost of choosing e1, . . . , eK is Σ_{j=1}^{K} c(ej), and the total number of edits from adversarial words to their assigned English words is Σ_{i=1}^{n} dist(ai, ezi).

Loss(e1, . . . , eK , z1, . . . , zn) = Σ_{j=1}^{K} c(ej) + Σ_{i=1}^{n} dist(ai, ezi).
10
(ii) [5 points] Derive an alternating minimization algorithm for optimizing the above
objective. We alternate between two steps. In step 1, we optimize z1 , . . . , zn . Formally write
down this update rule as an equation for each zi where 1 ≤ i ≤ n. What is the runtime?
You should specify runtime with big-Oh notation in terms of n, K and/or |D|.
Solution For each adversarial word ai, the goal is to find its cluster assignment, which can be done by finding the index of the chosen English word with the smallest edit distance to ai:

zi = arg min_{j ∈ {1,...,K}} dist(ai, ej).

For each of the n adversarial words, we loop over K clusters, so the runtime is O(nK).
(iii) [5 points] In step 2, we optimize e1 , . . . , eK . Formally write down this update rule
as an equation for each ej where 1 ≤ j ≤ K. What is the runtime? You should specify
runtime with big-Oh notation in terms of n, K and/or |D|.
Solution For each cluster j, we want to find the best English word ej to be the "centroid."
To do this, we can iterate through every English word e and calculate the "cost" of that
particular assignment. The "cost" of that particular assignment is defined as the cost of choosing e plus the sum of the edit distances between e and every adversarial word assigned to cluster j. The e with the minimum "cost" will be the new "centroid."
ej = arg min_{e ∈ D} ( c(e) + Σ_{i : zi = j} dist(ai, e) ).
For each English word e ∈ D and cluster j, we keep a running sum Se,j which is initialized to c(e). Then for each adversarial word ai and English word e, we add dist(ai, e) to Se,zi. Finally, for each j, we simply choose the e that minimizes Se,j. The runtime is O(n|D| + K|D|).
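The two alternating steps can be sketched as follows, with clusters 0-indexed and dist supplied as a function (the exam leaves the exact edit distance unspecified):

```python
def step1_assign(adv_words, chosen, dist):
    """Step 1: with e_1..e_K fixed, set z_i = argmin_j dist(a_i, e_j).  O(nK)."""
    return [min(range(len(chosen)), key=lambda j: dist(a, chosen[j]))
            for a in adv_words]

def step2_choose(adv_words, z, K, D, c, dist):
    """Step 2: with z fixed, set
    e_j = argmin_{e in D} c(e) + sum_{i: z_i = j} dist(a_i, e)."""
    chosen = []
    for j in range(K):
        members = [a for a, zi in zip(adv_words, z) if zi == j]
        chosen.append(min(D, key=lambda e: c(e) + sum(dist(a, e) for a in members)))
    return chosen
```

On the running example ("erath", "falt", "eatrh" with K = 2) this recovers e = ["earth", "flat"] and, in 0-indexed form, z = [0, 1, 0].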
(iv) [2 points] Is the above procedure guaranteed to converge to the minimum cost solution? Explain why or why not. If not, what algorithm could you use with such guarantees?
11
2. Maze (50 points)
One day, you wake up to find yourself in the middle of a corn field holding an axe and a
map (Figure 1). The corn field consists of an n × n grid of cells, where some adjacent cells
are blocked by walls of corn stalks; specifically, for any two adjacent cells (i, j) and (i0 , j 0 ),
let W ((i, j), (i0 , j 0 )) = 1 if there is a wall between the two cells and 0 otherwise. For example,
in Figure 1, W ((1, 1), (1, 2)) = 0 and W ((1, 2), (1, 3)) = 1.
You can either move to an adjacent cell if there’s no intervening wall with cost 1, or you
can use the axe to cut down a wall with cost c without changing your position. Your axe
can be used to break down at most b0 walls, and your goal is to get from your starting point
(i0 , j0 ) to the exit at (n, n) with the minimum cost.
Figure 1: An example of a corn maze. The goal is to go from the initial location (i0 , j0 ) =
(2, 2) to the exit (n, n) = (3, 3) with the minimum cost.
12
a. (15 points)
(i) [10 points] Fill out the components of the search problem corresponding to the above
maze.
• sstart = ((i0 , j0 ), b0 ).
• Actions(((i, j), b)) = {a ∈ {(−1, 0), (+1, 0), (0, −1), (0, +1)} :
(i, j) + a is in bounds and (W ((i, j), (i, j) + a) = 0 or b > 0)}.
Solution The new location is (i, j) + a, and we need to decrement our axe budget b
whenever W ((i, j), (i, j) + a) = 1.
During the exam we clarified that c > 0 and that the cost c should be read as the cost to break down a wall during a move (if there is a wall in the way of a move, we take it down at extra cost c). Some students might not have received this clarification during the exam, so we also accept the following solution in this and the following questions:

Succ(((i, j), b), a) = ((i, j) + (1 − W((i, j), (i, j) + a)) a, b − W((i, j), (i, j) + a))
Solution We always pay cost 1 for moving one cell and, additionally, we pay c if we have to break down a wall:

Cost(((i, j), b), a) = 1 + c · W((i, j), (i, j) + a).

Under the accepted alternative formulation (breaking a wall costs c without moving), the cost is

Cost(((i, j), b), a) = (1 − W((i, j), (i, j) + a)) + c · W((i, j), (i, j) + a). (3)
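A sketch of the successor and cost function under the break-in-place formulation (names and signature are ours; the budget check b > 0 is assumed to live in Actions):

```python
def succ_and_cost(state, a, W, c):
    """state = ((i, j), b); a is one of (-1,0), (1,0), (0,-1), (0,1).

    If a wall blocks direction a, the action breaks it: cost c, no
    movement, budget decreases.  Otherwise we move one cell at cost 1.
    """
    (i, j), b = state
    ni, nj = i + a[0], j + a[1]
    if W((i, j), (ni, nj)):        # a wall blocks the move
        return ((i, j), b - 1), c
    return ((ni, nj), b), 1
```

The state does not record which walls are already down, which part (ii) below argues is still sufficient for optimality.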
13
(ii) [5 points] When you use an axe to take down a wall, the wall stays down but the set
of walls which have been taken down are not tracked in the state. Why does our choice of
state still guarantee the minimum cost solution to the problem?
Solution Under the current formulation, one needs to pay each time we go through a wall.
However, the minimum cost path will never visit the same location twice (otherwise, there’s
a cycle which can be cut out to reduce the cost of the path). Therefore, there is no difference
between paying only for the first time and paying for each time.
14
b. (10 points)
Solving the search problem above is taking forever and you don’t want to be stuck in the
corn maze all day long. So you decide to use A*.
(i) [5 points] Define a consistent heuristic function h(((i, j), b)) based on finding the
minimum cost path using the relaxed state (i, j) where we assume we have an infinite axe
budget and therefore do not need to track it. Show why your choice of h is consistent
and what you would precompute so that evaluating any h(((i, j), b)) takes O(1) time and
precomputation takes O(n2 log n) time.
Solution Define h(((i, j), b)) to be the minimum cost path from (i, j) to (n, n), where we
have no budget constraint on the number of times we can use the axe. We can solve the
relaxed search problem from the exit (n, n) to all locations (i, j). This can be done using
UCS in O(n2 log n) time. Then, we just store the lookup table for each of the O(n2 ) states
to be queried during the search.
This problem only relaxes the axe budget constraint, so heuristics that relax wall or cost related constraints (such as Manhattan distance to the exit, which effectively removes all walls or sets c = 0) will not be accepted.
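The precomputation can be sketched as a single backward Dijkstra (UCS) from the exit; here passing a wall is modeled as an edge of cost 1 + c, since with an unlimited budget a wall is just an expensive edge:

```python
import heapq

def precompute_h(n, W, c):
    """h[(i, j)] = min cost from (i, j) to (n, n) with unlimited axe budget.

    One Dijkstra run over the n x n grid: O(n^2 log n).  Afterwards every
    h(((i, j), b)) query is an O(1) table lookup (b is ignored).
    """
    h = {}
    frontier = [(0, (n, n))]
    while frontier:
        d, u = heapq.heappop(frontier)
        if u in h:
            continue
        h[u] = d
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            v = (u[0] + di, u[1] + dj)
            if 1 <= v[0] <= n and 1 <= v[1] <= n and v not in h:
                heapq.heappush(frontier, (d + 1 + (c if W(u, v) else 0), v))
    return h
```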
15
(ii) [5 points] Noticing that sometimes h is the true future cost of the original search
problem, you wonder when this holds more generally. For what ranges of b0 and c would
this hold? Assume for this part that there is a path that doesn’t require breaking down any
walls.
______ ≤ b0        ______ ≤ c
Your lower bounds need not be tight, but you need to formally justify why they hold.
Solution The heuristic is exactly the future cost when the number of axe uses of the
original and relaxed search problems are identical. This happens in two cases:
1. When the budget constraint b0 is large enough, there is effectively no constraint on the number of axe uses. This happens when b0 is at least the largest number of walls that ever need to be broken down. In the worst case we only need to break down 2n walls (otherwise we could head directly to the exit at a lower cost).
2. When c is large enough, then it is not worth breaking down any walls at all. This
happens when c ≥ n2 , an upper bound on the length of the minimum cost path
without breaking down walls (which exists by assumption).
16
c. (15 points)
Having solved the search problem above, you are eager to set out on your journey through
the maze, but you realize that breaking down corn stalks is harder than you thought. Suppose
that each attempt to break down a wall has an ε > 0 probability of failing. Recall that b0 is
the maximum number of walls you can break down, not the number of attempts, and each
attempt to break down a wall has cost c.
(i) [5 points] Suppose that each attempt to break down a wall is independent (e.g., if you
fail once, the next attempt at the same wall also has probability ε of failing regardless of your
previous failures). You are interested in minimizing the expected cost of exiting the maze.
While the natural solution is to treat this as an MDP, it turns out you can still cast this
problem as a search problem. In particular, define a modified Cost(((i, j), b), a) function,
and write one sentence about why this choice gives you the optimal policy.
Solution Since a failed attempt to bring down a wall leaves you in the same state, the optimal policy will either keep on trying to break down a wall or not try at all (recall the dice game from lecture). Therefore, we can interpret an action a as saying: repeatedly break down the wall in direction a (if any) and move in that direction. The expected total cost T of the breaking attempts is given by the recurrence T = c + εT, which has the solution T = c/(1 − ε), so:

Cost(((i, j), b), a) = 1 + (c/(1 − ε)) · W((i, j), (i, j) + a). (4)

We also accept:

Cost(((i, j), b), a) = (c/(1 − ε)) · W((i, j), (i, j) + a). (5)
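A quick numerical check of the recurrence: the k-th attempt is reached with probability ε^k, so the expected cost is the geometric series Σ_k c·ε^k, whose truncation should match the closed form c/(1 − ε):

```python
def expected_break_cost(c, eps, terms=500):
    """E[total cost] of repeatedly attempting to break one wall: the k-th
    attempt is reached with probability eps**k and costs c, so the series
    sums to c / (1 - eps), matching the recurrence T = c + eps * T."""
    return sum(c * eps**k for k in range(terms))
```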
17
(ii) [5 points] Suppose instead that each attempt to break down a wall is perfectly de-
pendent (e.g., if you fail once, you will always fail to break down that wall). Let us model
this problem as an MDP. What should the states of the MDP be? What is the number of
states in the worst case as a function of b0 and n (use big-Oh notation)? In this problem
suppose b0 << n.
Solution Now when you fail to break down a wall, you have to remember that fact so you don't try again. Therefore your state must include all the walls that you've tried to break down that have failed (F) and all the walls where you have succeeded (S), since you might need to revisit locations. You have at most 2n² − 2n total walls; you need to keep track of at most b0 of them in S, and possibly all 2n² − 2n in F. The number of possible sets S is Σ_{i=0}^{b0} (2n² − 2n choose i), and the number of possible sets F is 2^(2n² − 2n). There are still n² possible locations. We don't need to store the budget left, since it can be recovered from the size of S. The number of states is therefore

O( n² · ( Σ_{i=0}^{b0} (2n² − 2n choose i) ) · 2^(2n² − 2n) ).

Another solution is to label each wall as one of succeeded, failed, or not attempted. This gives O( n² · 3^(2n² − 2n) ).

We give full credit for both solutions.
18
(iii) [5 points] Suppose the probability of successfully breaking down a wall is (1 − ε)/k, where k > 0 is the number of times you've tried to break down a wall. What should the states of the MDP be now?
Solution Now you have to keep track of how many times you've tried to break down each wall, so that at any point in time you know the failure probability of each wall. The state should be the same as in (ii), but F is now a counter instead of a set.

Another interpretation is to take k as shared across all walls. In this case the state should be the current location, the axe budget, k, and the set of broken walls S. Both interpretations are acceptable.
19
d. (10 points)
Let’s actually solve the maze! In this specific 3 × 3 maze as shown in Figure 1, the initial
location is (2, 2) and the exit is (3, 3). For simplicity assume that b0 = 1 and that your axe always succeeds (ε = 0).
3. Break down the other wall: (2,2)–(2,3)–(3,3), which also has cost 2 + c.
Therefore, the minimum cost is min(2 + c, 6). We also accept solutions that are consistent with their definitions in part a.
20
(ii) [5 points] Let’s look at the optimal policy at the initial location. For each value of c,
what are corresponding optimal actions? If there is a tie between optimal actions state all of
them. Your answer should consist of statements of the form: if c ∈ , then the optimal
actions are .
Solution Note that the three solutions above correspond to action 1 (west), action 2 (east),
and action 4 (south), respectively. There are three cases:
21
3. Faulty Accumulator (50 points)
You decide to try your hand at building hardware. Specifically, you will build a simple
circuit that takes n numbers and incrementally computes their sum. However, it turns out
hardware is hard, and in your first attempt, the accumulator occasionally gets zeroed out
randomly.
To capture this precisely, we can define the following generative model whose Bayesian network is shown in Figure 3. Let Y0 = 0 be the initial sum. For each time step i = 1, . . . , n, the circuit:
1. Reads an input number Xi, drawn uniformly from {1, 2, 3, 4}.
2. Decides to remember (Ri = 1) with probability 1 − ε or forget (Ri = 0) with probability ε.
3. Computes the new sum Yi = Ri · Yi−1 + Xi.
As an example:
1. X1 = 3, R1 = 1, Y1 = 3 (remember)
2. X2 = 2, R2 = 0, Y2 = 2 (forget)
3. X3 = 4, R3 = 1, Y3 = 6 (remember)
4. X4 = 4, R4 = 1, Y4 = 10 (remember)
Figure 3: Bayesian network of the accumulator, with inputs X1, . . . , X4, remember variables R1, . . . , R4, and sums Y0, Y1, . . . , Y4.
22
a. (10 points)
To speed things up, you want to first prune the domains of variables. Recall that when
we enforce arc consistency on a variable A with respect to a factor f , we keep a value v in
the domain of A if and only if there exist values for other variables in the scope of f such
that f evaluates to a non-zero number.
(i) [5 points] What is the domain of Yn as a function of n?
Solution The value Yn is at most the sum of n numbers, each of which can be up to 4, and at least Xn ≥ 1, so the domain of Yn is {1, . . . , 4n}.
(ii) [5 points] Consider the factor on (Y1, X2, Y2) obtained by marginalizing out R2. Suppose Y1 ∈ {1, 2} and Y2 = 3. What is the domain of X2 after enforcing arc consistency on X2?

Solution If R2 = 1, consistency requires X2 = Y2 − Y1 ∈ {1, 2}; if R2 = 0, it requires X2 = Y2 = 3. So the domain of X2 is {1, 2, 3}.
23
b. (15 points)
Now, disregarding what was done during part a, let us explore how conditioning on
evidence changes our beliefs about X2 .
(i) [5 points] Compute:
x2 P(X2 = x2 )
1
2
3
4
Solution We marginalize all other variables, which are non-ancestors of X2 , to get the
prior distribution, which is uniform as defined:
x2 P(X2 = x2 )
1 1/4
2 1/4
3 1/4
4 1/4
24
(ii) [5 points] Suppose we observe that Y2 = 3. Now what do we believe about X2 ?
x2 P(X2 = x2 | Y2 = 3)
1
2
3
4
Solution First note that Y1 = X1 deterministically, which has a uniform distribution over
{1, 2, 3, 4}. Next, let us write out the conditional distribution:
P(X2 = x2 | Y2 = 3) ∝ P(X2 = x2, Y2 = 3) = Σ_{y1, r2} p(y1) p(r2) p(x2) p(y2 = 3 | y1, x2, r2). (7)

Note that for X2 ∈ {1, 2}, we must have remembered (R2 = 1), which forces Y1 = Y2 − X2 (probability 1/4). For X2 = 3, we must have forgotten (R2 = 0), and Y1 is free to be anything (probability 1). Then normalize the probabilities.
25
(iii) [5 points] Suppose we observe Y2 = 3 and Y1 = 2. Compute
x2 P(X2 | Y2 = 3, Y1 = 2)
1
2
3
4
Solution The solution follows the same calculation as in the previous part, but where we zero out terms where Y1 ≠ 2.
x2    P(X2 = x2, Y2 = 3, Y1 = 2)     P(X2 = x2 | Y2 = 3, Y1 = 2)
1     (1/4) · (1 − ε) · (1/4) · 1    1 − ε
2     0                              0
3     (1/4) · ε · (1/4) · 1          ε
4     0                              0
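The table can be checked by brute-force enumeration, assuming the update Yi = Ri · Yi−1 + Xi (consistent with the worked example in the problem statement):

```python
def posterior_x2(y2, y1_obs=None, eps=0.1):
    """Brute-force P(X2 | Y2 = y2 [, Y1 = y1_obs]) for the accumulator:
    X_i uniform on {1,...,4}, R_2 = 1 w.p. 1 - eps, Y_2 = R_2 * Y_1 + X_2."""
    weights = {}
    for y1 in range(1, 5):                       # Y1 = X1 is uniform
        if y1_obs is not None and y1 != y1_obs:
            continue
        for x2 in range(1, 5):
            for r2, pr in ((1, 1 - eps), (0, eps)):
                if r2 * y1 + x2 == y2:
                    weights[x2] = weights.get(x2, 0.0) + 0.25 * pr * 0.25
    total = sum(weights.values())
    return {x2: w / total for x2, w in weights.items()}
```

With eps = 0.2, posterior_x2(3, y1_obs=2) returns {1: 0.8, 3: 0.2}, i.e. 1 − ε and ε, matching the table.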
26
c. (10 points)
Suppose you wish to compute the posterior distribution over all other variables given
Y1 = 3, Y2 = 2, Y3 = 6, Y4 = 10. You’re getting tired of doing probabilistic inference by
hand, so you decide to implement Gibbs sampling to do it. Suppose you start out with the
following configuration:
i 1 2 3 4
Xi 3 2 4 4
Yi 3 2 6 10
Ri 1 0 1 1
(i) [3 points] Compute the Gibbs sampling update for
P(X2 | everything else) = P(X2 | X1, X3, X4, Y1, . . . , Y4, R1, . . . , R4) = (8)

Solution In the current configuration R2 = 0, so Y2 = X2 deterministically; since Y2 = 2 is fixed, the update places all of its mass on a single value: P(X2 = 2 | everything else) = 1.
(iii) [4 points] What is the problem with running Gibbs sampling on this Bayesian net-
work? What alternative would you suggest?
Solution In order for Gibbs sampling to work, we must be able to reach any setting of the variables from any other one. Because the Gibbs updates here are mostly deterministic, we get stuck in the initial setting. One solution would be to first marginalize out the variables R1, . . . , Rn, in which case the posterior over X1, . . . , Xn factors into n separate distributions that can be sampled independently and exactly. Particle filtering would also work because of the sequential structure of the problem, but would be less efficient.
27
d. (15 points)
You are embarrassed to realize that not only is the circuit faulty, but also you have no clue how faulty it is (what ε is). You decide to estimate ε from data. Suppose you observe the following variables:
i 1 2 3 4
Xi 3 2 4 4
Yi 3 2 6 10
In particular, you do not observe R1, . . . , R4. Your goal is to find the ε which maximizes the marginal likelihood of the observed data. Let's use the EM algorithm.
(i) [5 points] Initialize ε = ε0. For the E-step, compute the posterior:
P(R1, R2, R3, R4 | X1 = 3, X2 = 2, X3 = 4, X4 = 4, Y1 = 3, Y2 = 2, Y3 = 6, Y4 = 10)

Solution The data provides no information about R1, so the posterior is identical to the prior distribution over R1: R1 = 0 with probability ε0 and R1 = 1 with probability 1 − ε0. The other three variables are completely determined from the data: R2 = 0, R3 = 1, R4 = 1.
(ii) [5 points] For the M-step, use the posterior above to compute the updated value of ε (which should be a function of ε0).

Solution There are ε0 + 1 fractional counts towards forgetting (Ri = 0) and (1 − ε0) + 2 fractional counts towards remembering (Ri = 1). Normalizing, we get ε = (ε0 + 1)/4.
(iii) [5 points] Compute what ε converges to as you run more iterations of EM. Justify your answer mathematically.

Solution In each iteration, we take ε and return (ε + 1)/4. We can compute the fixed point (what EM converges to) by solving the equation ε = (ε + 1)/4, which yields ε = 1/3.
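The fixed point can be confirmed by iterating the update ε ← (ε + 1)/4 from any starting point:

```python
def run_em(eps0, iters=60):
    """Iterate the M-step update eps <- (eps + 1) / 4.

    The map is a contraction (slope 1/4), so from any start in [0, 1]
    it converges to the fixed point eps* = 1/3."""
    eps = eps0
    for _ in range(iters):
        eps = (eps + 1) / 4
    return eps
```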
28