CS 221 Fall 19 Solution
CS221
Fall 2019    Name: ______________________________
(by writing my name I agree to abide by the honor code)
SUNet ID:
Read all of the following information before starting the exam:
• This test has 3 problems printed on 28 pages and is worth 150 points total. It is your
responsibility to make sure that you have all of the pages.
• Only the printed (top) side of each page will be scanned so write all your answers on
that side of the paper.
• Keep your answers precise and concise. We may award partial credit so show all your
work clearly and in order.
• Don’t spend too much time on one problem. Read through all the problems carefully
and do the easier ones first. Try to understand the problems intuitively; it really helps
to draw a picture.
• You cannot use any external aids except one double-sided 8.5" x 11" page of notes.
• Good luck!
Total Score: ______ + ______ + ______ = ______
1
1. Under Attack (50 points)
You work for a cybersecurity company, Sanymtec, to monitor online forums. Recently,
you’ve noticed an increasing number of curious sentences that look like this:
These sentences are perfectly readable by humans, but when you feed them into your machine
learning models, they are totally confused and make wildly incorrect predictions. Your
automatic fake news detection system is under attack!
a. (15 points)
As a first step, you would like to classify a sentence x as either adversarial (y = 1) or
not (y = −1). Your boss doesn’t want you to use the hinge loss because she’s worried that
the attacker might be able to more easily reverse engineer the system. So you decide to
investigate alternative loss functions.
For each of the five loss functions Loss(x, y, w) below, state whether it would be reasonable to minimize for this classification task, and briefly justify your answer:
(i) [3 points] Loss(x, y, w) = max{1 − ⌊(w · φ(x))y⌋, 0}, where ⌊a⌋ returns a rounded down to the nearest integer.
Solution No, because the gradient is 0 almost everywhere, so we cannot apply gradient
descent.
(ii) [3 points] Loss(x, y, w) = max{(w · φ(x))y − 1, 0}.
Solution No, because for this loss function, a larger margin corresponds to a larger loss,
which is the opposite of what we want.
2
(iii) [3 points]

Loss(x, y, w) =
  1 − 2(w · φ(x))y         if (w · φ(x))y ≤ 0
  (1 − (w · φ(x))y)²       if 0 < (w · φ(x))y ≤ 1
  0                        if (w · φ(x))y > 1

Note: this loss function is a "smoothed" hinge loss, which is continuously differentiable everywhere (notice that the constants are chosen so that the gradients match where the pieces meet), but still behaves like the hinge loss when the magnitude of the margin is large.

(iv) [3 points]

Loss(x, y, w) =
  max(1 − (w · φ(x)), 0) + 10   if y = +1
  max(1 + (w · φ(x)), 0) − 10   if y = −1
Solution Yes, it turns out that this loss function has the exact same gradient as the hinge
loss! The fact that the two labels have a different constant offset doesn’t matter.
∇w Loss(x, y, w) =
  −φ(x)y   if (w · φ(x))y ≤ 1
  0        otherwise.

Note that this is the simplified form of

∇w Loss(x, y, w) =
  −φ(x)   if (w · φ(x)) ≤ 1 and y = +1
  φ(x)    if (w · φ(x)) ≥ −1 and y = −1
  0       if (w · φ(x)) > 1 and y = +1
  0       if (w · φ(x)) < −1 and y = −1.
(v) [3 points] Loss(x, y, w) = σ(−(w · φ(x))y), where σ(z) = (1 + e−z )−1 is the logistic
function.
Solution Both answers are acceptable. One could argue that the loss is non-convex so that
it is difficult to optimize (even if the gradient exists and is non-zero everywhere). Of course,
non-convexity never stopped anyone, and given that this loss function is a more faithful
approximation to the zero-one loss, one could make a case that this objective is reasonable.
Its gradient is:
∇w Loss(x, y, w) = −σ(−(w · φ(x))y)(1 − σ(−(w · φ(x))y))φ(x)y.
3
b. (10 points)
Your next job is to decide which features to use in order to solve the classification problem.
Assume you have a set D of real English words.
(i) [5 points] Suppose you have a sentence x which is a string; e.g., x = "erath is falt".
Write the Python code for taking a sentence x and producing a dict representing the following two feature templates:
1. x contains word ____
2. number of words in x that are in D
Assume that words in x are separated by spaces; e.g., the words in x above are "erath", "is", "falt".
Solution The key names are arbitrarily chosen. The return value of "erath is falt" should
be: {"num dict": 2, "contains erath": 1, "contains is": 1, "contains falt": 1}. One possible
solution is shown below.
4
Due to ambiguity of the first feature template, solutions that add words in D with value
0 will also be accepted. Solutions that assumed the words in x are unique were also accepted.
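A minimal sketch of such a feature extractor (the key names are arbitrary, and the assumption that "num dict" counts the words of x found in D is ours, since the template text is ambiguous):

```python
def extract_features(x, D):
    """Sketch of the part b(i) feature extractor.

    Feature templates (key names are arbitrary, as the solution notes):
      - "contains_<word>": 1 for each word appearing in x
      - "num_dict": number of words of x found in the dictionary D
        (assumed semantics -- the original template text is ambiguous)
    """
    phi = {}
    words = x.split()  # words are separated by spaces
    phi["num_dict"] = sum(1 for w in words if w in D)
    for w in words:
        phi["contains_" + w] = 1
    return phi
```

With D = {"is", "earth", "flat"}, extract_features("erath is falt", D) yields {"num_dict": 1, "contains_erath": 1, "contains_is": 1, "contains_falt": 1}.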
5
(ii) [2 points] If k is the number of unique words that occur in the training set and |D| is
the number of words in the given set of real English words, what is the number of features
that the linear classifier will have?
Solution There is one feature for every unique word that occurs in the training set, plus one feature that captures the number of words in x that are in the dictionary. So the total is k + 1.
Due to ambiguity of the first feature template, this part will be correct if it is consistent
with the code written on the previous page. If the solution to part (i) adds words in D with
value 0, then the answer n + |D| + 1 will also be accepted where n is the number of unique
words that occur in the training set but not in D.
(iii) [3 points] Suppose that an insider leaks Sanymtec’s classification strategy to the
attackers. The classifier itself was not leaked, just the classification strategy behind it,
which reveals that Sanymtec is using a dataset of adversarial sentences to train a classifier
with the features defined in part (i).
The attackers then use this information to try modifying any fixed sentence (e.g., "climate
change is a hoax") into something readable by humans (e.g., "clmaite canhge is a
haox") but classified (incorrectly) as non-adversarial by Sanymtec. How can the attackers
achieve this?
Solution The resulting classifier only knows about adversarial words that it has seen in the training set. Any word that doesn't appear in the training set will not contribute to any of the "x contains word ____" weights. However, the space of possible adversarial words is exponentially large, so given any sentence, the adversary can find a perturbation of each word that has not appeared in the training set. If the adversary then ensures that x contains a non-adversarial word (e.g., "the"), which likely has negative weight, then only this word contributes to the classifier score, resulting in a non-adversarial prediction. Keeping the ratio of unseen words to words in D small can also help ensure that x receives a negative score.
6
Solutions that add an arbitrary type of obfuscation other than permutation are also
accepted (e.g. adding characters: "climate" -> "cloimate").
Due to ambiguity of the question, solutions that mention the problem that attackers
would need to know which adversarial permutations are used in Sanymtec’s training set are
accepted. However, solutions that do not take into account that some of the adversarial
sentences could be weighted very positively by Sanymtec’s classifier are penalized.
7
c. (10 points)
Having built a supervised classifier, you find it extremely hard to collect enough examples
of adversarial sentences. On the other hand, you have a lot of non-adversarial text lying
around.
(i) [3 points] Suppose you have a total of 100,000 training examples that consists of 100
adversarial sentences and 99,900 non-adversarial sentences. You train a classifier and get
99.9% accuracy. Is this a good, meaningful result? Explain why or why not.
Solution No, because you can get at least that level of accuracy by always predicting
non-adversarial (y = −1), which is a trivial solution.
(ii) [3 points] You decide to fit a generative model to the non-adversarial text, which is
a distribution p(x) that assigns a probability to each sentence x. For simplicity, let’s use a
unigram model:
p(x) = ∏_{i=1}^{n} pu(wi),
where w1 , w2 , ..., wn are the words in sentence x, and pu (w) is a probability distribution over
possible w’s.
Suppose you are given a single sentence "the cat in the hat" as training data. Com-
pute the maximum likelihood estimate of the unigram model pu :
w pu (w)
Solution
w pu (w)
the 2/5
cat 1/5
in 1/5
hat 1/5
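The MLE is just the relative frequency of each word token; a quick sketch (function name is ours):

```python
from collections import Counter

def unigram_mle(sentence):
    """Maximum likelihood estimate of the unigram model:
    p_u(w) = count(w) / total number of word tokens."""
    counts = Counter(sentence.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

unigram_mle("the cat in the hat") gives {"the": 0.4, "cat": 0.2, "in": 0.2, "hat": 0.2}, matching the table above.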
8
(iii) [4 points] Given an unseen sentence, your goal is to be able to predict whether that
sentence is adversarial or not. You have a labeled dataset Dtrain = {(x1 , y1 ), . . . , (xn , yn )}
and would like to use the unigram model to train your predictor.
How could you use p(x) (from the previous problem) and Dtrain to obtain a predictor
f (x) that outputs whether a sentence x is adversarial (y = 1) or not (y = −1)? Be precise
in defining f (x). Hint: define a feature vector φ(x).
f (x) =

Solution Ideally, p(x) should assign high probability to non-adversarial text and low probability to adversarial text. The only thing that's missing is the threshold, which can be tuned on Dtrain. One way to do this is to define two features φ(x) = [1, p(x)] and fit a corresponding weight vector w using standard techniques (hinge loss + SGD). Then we simply predict f (x) = sign(w · φ(x)).
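A minimal sketch of this predictor; the weights and the toy language model in the usage below are illustrative only:

```python
def make_predictor(w, p):
    """f(x) = sign(w . phi(x)) with phi(x) = [1, p(x)].

    w: 2-vector fit on Dtrain (e.g., hinge loss + SGD).
    p: the unigram language model p(x) from the previous part.
    """
    def f(x):
        score = w[0] * 1.0 + w[1] * p(x)
        return 1 if score >= 0 else -1  # y = 1 means adversarial
    return f
```

A positive bias with a negative weight on p(x) flags low-probability sentences as adversarial.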
9
d. (15 points)
You notice that the adversarial words are often close to real English words. For example,
you might see "erath" or "eatrh" as misspellings of "earth". Furthermore, the actual
number of adversarial words is rather small (it seems like the attacker just wants to reinforce
the same messages). This makes you think of another unsupervised approach to try.
Let D be the set of real English words as before and a1 , . . . , an be the list of adversarial
words you’ve found, and let dist(a, e) be the number of edits to transform some adversarial
word a to the English word e (how exactly distance is defined is unimportant).
We wish to choose K English words e1 , . . . , eK ∈ D and assign each adversarial word
ai to one of the chosen English words (zi ∈ {1, . . . , K}). Each English word e ∈ D incurs
a cost c(e) if we choose it as one of the K words. Our goal is to minimize the total cost of
choosing e1 , . . . , eK plus the total number of edits from adversarial words a1 , . . . , an to their
assigned English words ez1 , . . . , ezn .
As an example, let D = {"earth", "flat", "scientists"} with c("earth") = 1,
c("flat") = 1, c("scientists") = 2, and a1 = "erath", a2 = "falt", a3 = "eatrh". Then
with K = 2, one possible assignment (presumably the best one) is e1 = "earth",
e2 = "flat", z1 = 1, z2 = 2, z3 = 1.
(i) [3 points] Define a loss function that captures the optimization problem above:
Loss(e1 , . . . , eK , z1 , . . . , zn ) =
Solution As the problem statement says, "our goal is to minimize the total cost of choosing e1, . . . , eK plus the total number of edits from adversarial words a1, . . . , an to their assigned English words ez1 , . . . , ezn ." The total cost of choosing e1, . . . , eK is Σ_{j=1}^{K} c(ej), and the total number of edits from adversarial words to their assigned English words is Σ_{i=1}^{n} dist(ai, ezi).

Loss(e1, . . . , eK , z1, . . . , zn) = Σ_{j=1}^{K} c(ej) + Σ_{i=1}^{n} dist(ai, ezi).
10
(ii) [5 points] Derive an alternating minimization algorithm for optimizing the above
objective. We alternate between two steps. In step 1, we optimize z1 , . . . , zn . Formally write
down this update rule as an equation for each zi where 1 ≤ i ≤ n. What is the runtime?
You should specify runtime with big-Oh notation in terms of n, K and/or |D|.
Solution For each adversarial word ai, the goal is to find its cluster assignment, which can be done by finding the index of the chosen English word with the smallest edit distance to ai:

zi = arg min_{j ∈ {1,...,K}} dist(ai, ej).

For each of the n adversarial words, we loop over K clusters, so the runtime is O(nK).
(iii) [5 points] In step 2, we optimize e1 , . . . , eK . Formally write down this update rule
as an equation for each ej where 1 ≤ j ≤ K. What is the runtime? You should specify
runtime with big-Oh notation in terms of n, K and/or |D|.
Solution For each cluster j, we want to find the best English word ej to be the "centroid."
To do this, we can iterate through every English word e and calculate the "cost" of that
particular assignment. The "cost" of that particular assignment is defined as the cost of choosing e plus the sum of the edit distances between e and every adversarial word assigned to cluster j. The e with the minimum "cost" will be the new "centroid."
ej = arg min_{e ∈ D} ( c(e) + Σ_{i : zi = j} dist(ai, e) ).
For each English word e ∈ D and cluster j, we keep a running sum Se,j which is initialized to c(e). Then for each adversarial word ai and English word e, we add dist(ai, e) to Se,zi. Finally, for each j, we simply choose the e that minimizes Se,j. The runtime is O(n|D| + K|D|).
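The two alternating steps can be sketched as follows, with clusters 0-indexed and dist supplied as a function (the exam leaves the exact edit distance unspecified):

```python
def step1_assign(adv_words, chosen, dist):
    """Step 1: with e_1..e_K fixed, set z_i = argmin_j dist(a_i, e_j).  O(nK)."""
    return [min(range(len(chosen)), key=lambda j: dist(a, chosen[j]))
            for a in adv_words]

def step2_choose(adv_words, z, K, D, c, dist):
    """Step 2: with z fixed, set
    e_j = argmin_{e in D} c(e) + sum_{i: z_i = j} dist(a_i, e)."""
    chosen = []
    for j in range(K):
        members = [a for a, zi in zip(adv_words, z) if zi == j]
        chosen.append(min(D, key=lambda e: c(e) + sum(dist(a, e) for a in members)))
    return chosen
```

On the running example ("erath", "falt", "eatrh" with K = 2) this recovers e = ["earth", "flat"] and, in 0-indexed form, z = [0, 1, 0].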
(iv) [2 points] Is the above procedure guaranteed to converge to the minimum cost solution? Explain why or why not. If not, what algorithm could you use with such guarantees?
11
2. Maze (50 points)
One day, you wake up to find yourself in the middle of a corn field holding an axe and a
map (Figure 1). The corn field consists of an n × n grid of cells, where some adjacent cells
are blocked by walls of corn stalks; specifically, for any two adjacent cells (i, j) and (i0 , j 0 ),
let W ((i, j), (i0 , j 0 )) = 1 if there is a wall between the two cells and 0 otherwise. For example,
in Figure 1, W ((1, 1), (1, 2)) = 0 and W ((1, 2), (1, 3)) = 1.
You can either move to an adjacent cell if there’s no intervening wall with cost 1, or you
can use the axe to cut down a wall with cost c without changing your position. Your axe
can be used to break down at most b0 walls, and your goal is to get from your starting point
(i0 , j0 ) to the exit at (n, n) with the minimum cost.
Figure 1: An example of a corn maze. The goal is to go from the initial location (i0 , j0 ) =
(2, 2) to the exit (n, n) = (3, 3) with the minimum cost.
12
a. (15 points)
(i) [10 points] Fill out the components of the search problem corresponding to the above
maze.
• sstart = ((i0 , j0 ), b0 ).
• Actions(((i, j), b)) = {a ∈ {(−1, 0), (+1, 0), (0, −1), (0, +1)} :
(i, j) + a is in bounds and (W ((i, j), (i, j) + a) = 0 or b > 0)}.
Solution The new location is (i, j) + a, and we need to decrement our axe budget b
whenever W ((i, j), (i, j) + a) = 1.
During the exam we clarified that c > 0 and that the cost c should be read as the cost to break down a wall during a move (if there is a wall in the way of a move, we take it down at extra cost c). Some students might not have received this clarification during the exam, so we also accept the following solution in this and the following questions:

Succ(((i, j), b), a) = ((i, j) + (1 − W((i, j), (i, j) + a)) a, b − W((i, j), (i, j) + a))
Solution We always pay cost 1 for moving one cell and, additionally, we pay c if we have to break down a wall:

Cost(((i, j), b), a) = 1 + c · W((i, j), (i, j) + a).

Under the accepted alternative formulation (breaking a wall costs c without moving), the cost is

Cost(((i, j), b), a) = (1 − W((i, j), (i, j) + a)) + c · W((i, j), (i, j) + a). (3)
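A sketch of the successor and cost function under the break-in-place formulation (names and signature are ours; the budget check b > 0 is assumed to live in Actions):

```python
def succ_and_cost(state, a, W, c):
    """state = ((i, j), b); a is one of (-1,0), (1,0), (0,-1), (0,1).

    If a wall blocks direction a, the action breaks it: cost c, no
    movement, budget decreases.  Otherwise we move one cell at cost 1.
    """
    (i, j), b = state
    ni, nj = i + a[0], j + a[1]
    if W((i, j), (ni, nj)):        # a wall blocks the move
        return ((i, j), b - 1), c
    return ((ni, nj), b), 1
```

The state does not record which walls are already down, which part (ii) below argues is still sufficient for optimality.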
13
(ii) [5 points] When you use an axe to take down a wall, the wall stays down but the set
of walls which have been taken down are not tracked in the state. Why does our choice of
state still guarantee the minimum cost solution to the problem?
Solution Under the current formulation, one needs to pay each time we go through a wall.
However, the minimum cost path will never visit the same location twice (otherwise, there’s
a cycle which can be cut out to reduce the cost of the path). Therefore, there is no difference
between paying only for the first time and paying for each time.
14
b. (10 points)
Solving the search problem above is taking forever and you don’t want to be stuck in the
corn maze all day long. So you decide to use A*.
(i) [5 points] Define a consistent heuristic function h(((i, j), b)) based on finding the
minimum cost path using the relaxed state (i, j) where we assume we have an infinite axe
budget and therefore do not need to track it. Show why your choice of h is consistent
and what you would precompute so that evaluating any h(((i, j), b)) takes O(1) time and
precomputation takes O(n2 log n) time.
Solution Define h(((i, j), b)) to be the minimum cost path from (i, j) to (n, n), where we
have no budget constraint on the number of times we can use the axe. We can solve the
relaxed search problem from the exit (n, n) to all locations (i, j). This can be done using
UCS in O(n2 log n) time. Then, we just store the lookup table for each of the O(n2 ) states
to be queried during the search.
This problem only relaxes the axe budget constraint, so heuristics that relax wall or cost related constraints (such as Manhattan distance to the exit, which effectively removes all walls or sets c = 0) will not be accepted.
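The precomputation can be sketched as a single backward Dijkstra (UCS) from the exit; here passing a wall is modeled as an edge of cost 1 + c, since with an unlimited budget a wall is just an expensive edge:

```python
import heapq

def precompute_h(n, W, c):
    """h[(i, j)] = min cost from (i, j) to (n, n) with unlimited axe budget.

    One Dijkstra run over the n x n grid: O(n^2 log n).  Afterwards every
    h(((i, j), b)) query is an O(1) table lookup (b is ignored).
    """
    h = {}
    frontier = [(0, (n, n))]
    while frontier:
        d, u = heapq.heappop(frontier)
        if u in h:
            continue
        h[u] = d
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            v = (u[0] + di, u[1] + dj)
            if 1 <= v[0] <= n and 1 <= v[1] <= n and v not in h:
                heapq.heappush(frontier, (d + 1 + (c if W(u, v) else 0), v))
    return h
```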
15
(ii) [5 points] Noticing that sometimes h is the true future cost of the original search
problem, you wonder when this holds more generally. For what ranges of b0 and c would
this hold? Assume for this part that there is a path that doesn’t require breaking down any
walls.
______ ≤ b0        ______ ≤ c
Your lower bounds need not be tight, but you need to formally justify why they hold.
Solution The heuristic is exactly the future cost when the number of axe uses of the
original and relaxed search problems are identical. This happens in two cases:
1. When the budget constraint b0 is large enough, there is effectively no constraint on the number of axe uses. This happens when b0 is at least the largest number of walls that ever need to be broken down. In the worst case we only need to break down 2n walls (otherwise we could head directly to the exit at a lower cost).
2. When c is large enough, then it is not worth breaking down any walls at all. This
happens when c ≥ n2 , an upper bound on the length of the minimum cost path
without breaking down walls (which exists by assumption).
16
c. (15 points)
Having solved the search problem above, you are eager to set out on your journey through
the maze, but you realize that breaking down corn stalks is harder than you thought. Suppose
that each attempt to break down a wall has an ε > 0 probability of failing. Recall that b0 is
the maximum number of walls you can break down, not the number of attempts, and each
attempt to break down a wall has cost c.
(i) [5 points] Suppose that each attempt to break down a wall is independent (e.g., if you
fail once, the next attempt at the same wall also has probability ε of failing regardless of your
previous failures). You are interested in minimizing the expected cost of exiting the maze.
While the natural solution is to treat this as an MDP, it turns out you can still cast this
problem as a search problem. In particular, define a modified Cost(((i, j), b), a) function,
and write one sentence about why this choice gives you the optimal policy.
Solution Since a failed attempt to bring down a wall leaves you in the same state, the optimal policy will either keep on trying to break down a wall or not try at all (recall the dice game from lecture). Therefore, we can interpret an action a as saying: repeatedly break down the wall in direction a (if any) and move in that direction. The expected total cost T of the breaking attempts is given by the recurrence T = c + εT, which has the solution T = c/(1 − ε), so:

Cost(((i, j), b), a) = 1 + (c/(1 − ε)) · W((i, j), (i, j) + a). (4)

We also accept:

Cost(((i, j), b), a) = (c/(1 − ε)) · W((i, j), (i, j) + a). (5)
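A quick numerical check of the recurrence: the k-th attempt is reached with probability ε^k, so the expected cost is the geometric series Σ_k c·ε^k, whose truncation should match the closed form c/(1 − ε):

```python
def expected_break_cost(c, eps, terms=500):
    """E[total cost] of repeatedly attempting to break one wall: the k-th
    attempt is reached with probability eps**k and costs c, so the series
    sums to c / (1 - eps), matching the recurrence T = c + eps * T."""
    return sum(c * eps**k for k in range(terms))
```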
17
(ii) [5 points] Suppose instead that each attempt to break down a wall is perfectly de-
pendent (e.g., if you fail once, you will always fail to break down that wall). Let us model
this problem as an MDP. What should the states of the MDP be? What is the number of
states in the worst case as a function of b0 and n (use big-Oh notation)? In this problem
suppose b0 << n.
Solution Now when you fail to break down a wall, you have to remember that fact so you don't try again. Therefore your state must include all the walls that you've tried to break down that have failed (F) and all the walls where you have succeeded (S), since you might need to revisit locations. You have at most 2n² − 2n total walls; you need to keep track of at most b0 of them in S, and possibly all 2n² − 2n in F. The number of possible sets S is Σ_{i=0}^{b0} (2n² − 2n choose i), and the number of possible sets F is 2^(2n² − 2n). There are still n² possible locations. We don't need to store the budget left, since it can be recovered from the size of S. The number of states is therefore

O( n² · ( Σ_{i=0}^{b0} (2n² − 2n choose i) ) · 2^(2n² − 2n) ).

Another solution is to label each wall as one of succeeded, failed, or not attempted. This gives O( n² · 3^(2n² − 2n) ).

We give full credit for both solutions.
18
(iii) [5 points] Suppose the probability of successfully breaking down a wall is (1 − ε)/k, where k > 0 is the number of times you've tried to break down a wall. What should the states of the MDP be now?
Solution Now you have to keep track of how many times you've tried to break down each wall, so that at any point in time you know the failure probability of each wall. The state should be the same as in (ii), but F is now a counter instead of a set.

Another interpretation is to take k as shared across all walls. In this case the state should be the current location, the axe budget, k, and the set of broken walls S. Both interpretations are acceptable.
19
d. (10 points)
Let’s actually solve the maze! In this specific 3 × 3 maze as shown in Figure 1, the initial
location is (2, 2) and the exit is (3, 3). For simplicity assume that b0 = 1 and that your axe always succeeds (ε = 0).
3. Break down the other wall: (2,2)–(2,3)–(3,3), which also has cost 2 + c.
Therefore, the minimum cost is min(2 + c, 6). We also accept solutions that are consistent with their definitions in part a.
20
(ii) [5 points] Let’s look at the optimal policy at the initial location. For each value of c,
what are corresponding optimal actions? If there is a tie between optimal actions state all of
them. Your answer should consist of statements of the form: if c ∈ , then the optimal
actions are .
Solution Note that the three solutions above correspond to action 1 (west), action 2 (east),
and action 4 (south), respectively. There are three cases:
21
3. Faulty Accumulator (50 points)
You decide to try your hand at building hardware. Specifically, you will build a simple
circuit that takes n numbers and incrementally computes their sum. However, it turns out
hardware is hard, and in your first attempt, the accumulator occasionally gets zeroed out
randomly.
To capture this precisely, we can define the following generative model whose Bayesian network is shown in Figure 3. Let Y0 = 0 be the initial sum. For each time step i = 1, . . . , n, the circuit:
1. Reads an input number Xi, drawn uniformly from {1, 2, 3, 4}.
2. Decides to remember (Ri = 1) with probability 1 − ε or forget (Ri = 0) with probability ε.
3. Computes the new sum Yi = Ri · Yi−1 + Xi.
As an example:
1. X1 = 3, R1 = 1, Y1 = 3 (remember)
2. X2 = 2, R2 = 0, Y2 = 2 (forget)
3. X3 = 4, R3 = 1, Y3 = 6 (remember)
4. X4 = 4, R4 = 1, Y4 = 10 (remember)
Figure 3: Bayesian network of the accumulator, with inputs X1, . . . , X4, remember variables R1, . . . , R4, and sums Y0, Y1, . . . , Y4.
22
a. (10 points)
To speed things up, you want to first prune the domains of variables. Recall that when
we enforce arc consistency on a variable A with respect to a factor f , we keep a value v in
the domain of A if and only if there exist values for other variables in the scope of f such
that f evaluates to a non-zero number.
(i) [5 points] What is the domain of Yn as a function of n?
Solution The value Yn is at most the sum of n numbers, each of which can be up to 4, and at least Xn ≥ 1, so the domain of Yn is {1, . . . , 4n}.
(ii) [5 points] Consider the factor on (Y1, X2, Y2) obtained by marginalizing out R2. Suppose Y1 ∈ {1, 2} and Y2 = 3. What is the domain of X2 after enforcing arc consistency on X2?

Solution If R2 = 1, consistency requires X2 = Y2 − Y1 ∈ {1, 2}; if R2 = 0, it requires X2 = Y2 = 3. So the domain of X2 is {1, 2, 3}.
23
b. (15 points)
Now, disregarding what was done during part a, let us explore how conditioning on
evidence changes our beliefs about X2 .
(i) [5 points] Compute:
x2 P(X2 = x2 )
1
2
3
4
Solution We marginalize all other variables, which are non-ancestors of X2 , to get the
prior distribution, which is uniform as defined:
x2 P(X2 = x2 )
1 1/4
2 1/4
3 1/4
4 1/4
24
(ii) [5 points] Suppose we observe that Y2 = 3. Now what do we believe about X2 ?
x2 P(X2 = x2 | Y2 = 3)
1
2
3
4
Solution First note that Y1 = X1 deterministically, which has a uniform distribution over
{1, 2, 3, 4}. Next, let us write out the conditional distribution:
P(X2 = x2 | Y2 = 3) ∝ P(X2 = x2, Y2 = 3) = Σ_{y1, r2} p(y1) p(r2) p(x2) p(y2 = 3 | y1, x2, r2). (7)

Note that for X2 ∈ {1, 2}, we must have remembered (R2 = 1), which forces Y1 = Y2 − X2 (probability 1/4). For X2 = 3, we must have forgotten (R2 = 0), and Y1 is free to be anything (probability 1). Then normalize the probabilities.
25
(iii) [5 points] Suppose we observe Y2 = 3 and Y1 = 2. Compute
x2 P(X2 | Y2 = 3, Y1 = 2)
1
2
3
4
Solution The solution follows the same calculation as in the previous part, but where we zero out terms where Y1 ≠ 2.
x2    P(X2 = x2, Y2 = 3, Y1 = 2)     P(X2 = x2 | Y2 = 3, Y1 = 2)
1     (1/4) · (1 − ε) · (1/4) · 1    1 − ε
2     0                              0
3     (1/4) · ε · (1/4) · 1          ε
4     0                              0
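The table can be checked by brute-force enumeration, assuming the update Yi = Ri · Yi−1 + Xi (consistent with the worked example in the problem statement):

```python
def posterior_x2(y2, y1_obs=None, eps=0.1):
    """Brute-force P(X2 | Y2 = y2 [, Y1 = y1_obs]) for the accumulator:
    X_i uniform on {1,...,4}, R_2 = 1 w.p. 1 - eps, Y_2 = R_2 * Y_1 + X_2."""
    weights = {}
    for y1 in range(1, 5):                       # Y1 = X1 is uniform
        if y1_obs is not None and y1 != y1_obs:
            continue
        for x2 in range(1, 5):
            for r2, pr in ((1, 1 - eps), (0, eps)):
                if r2 * y1 + x2 == y2:
                    weights[x2] = weights.get(x2, 0.0) + 0.25 * pr * 0.25
    total = sum(weights.values())
    return {x2: w / total for x2, w in weights.items()}
```

With eps = 0.2, posterior_x2(3, y1_obs=2) returns {1: 0.8, 3: 0.2}, i.e. 1 − ε and ε, matching the table.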
26
c. (10 points)
Suppose you wish to compute the posterior distribution over all other variables given
Y1 = 3, Y2 = 2, Y3 = 6, Y4 = 10. You’re getting tired of doing probabilistic inference by
hand, so you decide to implement Gibbs sampling to do it. Suppose you start out with the
following configuration:
i 1 2 3 4
Xi 3 2 4 4
Yi 3 2 6 10
Ri 1 0 1 1
(i) [3 points] Compute the Gibbs sampling update for
P(X2 | everything else) = P(X2 | X1, X3, X4, Y1, . . . , Y4, R1, . . . , R4) = (8)

Solution In the current configuration R2 = 0, so Y2 = X2 deterministically; since Y2 = 2 is fixed, the update places all of its mass on a single value: P(X2 = 2 | everything else) = 1.
(iii) [4 points] What is the problem with running Gibbs sampling on this Bayesian net-
work? What alternative would you suggest?
Solution In order for Gibbs sampling to work, we must be able to reach any setting of the variables from any other one. Because the Gibbs updates here are mostly deterministic, we get stuck in the initial setting. One solution would be to first marginalize out the variables R1, . . . , Rn, in which case the posterior over X1, . . . , Xn factors into n separate distributions that can be sampled independently and exactly. Particle filtering would also work because of the sequential structure of the problem, but would be less efficient.
27
d. (15 points)
You are embarrassed to realize that not only is the circuit faulty, but also you have no clue how faulty it is (what ε is). You decide to estimate ε from data. Suppose you observe the following variables:
i 1 2 3 4
Xi 3 2 4 4
Yi 3 2 6 10
In particular, you do not observe R1, . . . , R4. Your goal is to find the ε which maximizes the marginal likelihood of the observed data. Let's use the EM algorithm.
(i) [5 points] Initialize ε = ε0. For the E-step, compute the posterior:
P(R1, R2, R3, R4 | X1 = 3, X2 = 2, X3 = 4, X4 = 4, Y1 = 3, Y2 = 2, Y3 = 6, Y4 = 10)

Solution The data provides no information about R1, so the posterior is identical to the prior distribution over R1: R1 = 0 with probability ε0 and R1 = 1 with probability 1 − ε0. The other three variables are completely determined from the data: R2 = 0, R3 = 1, R4 = 1.
(ii) [5 points] For the M-step, use the posterior above to compute the updated value of ε (which should be a function of ε0).

Solution There are ε0 + 1 fractional counts towards forgetting (Ri = 0) and (1 − ε0) + 2 fractional counts towards remembering (Ri = 1). Normalizing, we get ε = (ε0 + 1)/4.
(iii) [5 points] Compute what ε converges to as you run more iterations of EM. Justify your answer mathematically.

Solution In each iteration, we take ε and return (ε + 1)/4. We can compute the fixed point (what EM converges to) by solving the equation ε = (ε + 1)/4, which yields ε = 1/3.
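The fixed point can be confirmed by iterating the update ε ← (ε + 1)/4 from any starting point:

```python
def run_em(eps0, iters=60):
    """Iterate the M-step update eps <- (eps + 1) / 4.

    The map is a contraction (slope 1/4), so from any start in [0, 1]
    it converges to the fixed point eps* = 1/3."""
    eps = eps0
    for _ in range(iters):
        eps = (eps + 1) / 4
    return eps
```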
28