
CS 188 Spring 2019: Introduction to Artificial Intelligence, Final Exam


• You have 170 minutes. The time will be projected at the front of the room. You may not leave during the last
10 minutes of the exam.
• Do NOT open exams until told to. Write your SIDs in the top right corner of every page.
• If you need to go to the bathroom, bring us your exam, phone, and SID. We will record the time.
• In the interest of fairness, we want everyone to have access to the same information. To that end, we will not
be answering questions about the content. If a clarification is needed, it will be projected at the front of the
room. Make sure to periodically check the clarifications.
• The exam is closed book, closed laptop, and closed notes except your three-page double-sided cheat sheet. Turn
off and put away all electronics.
• We will give you two sheets of scratch paper. Please do not turn them in with your exam. Mark your answers
ON THE EXAM IN THE DESIGNATED ANSWER AREAS. We will not grade anything on scratch paper.
• For multiple choice questions:
– □ means mark ALL options that apply
– ○ means mark ONE choice
– When selecting an answer, please fill in the bubble or square COMPLETELY

First name

Last name

SID

Student to the right (SID and Name)

Student to the left (SID and Name)

Q1. Agent Testing Today! /1
Q2. Search /12
Q3. Pacman’s Treasure Hunt /14
Q4. Inexpensive Elimination /9
Q5. Sampling /10
Q6. HMM Smoothing /11
Q7. Partying Particle #No Filter(ing) /8
Q8. Double Decisions and VPI /11
Q9. Naively Fishing /11
Q10. Neural Networks and Decision Trees /13
Total /100


Q1. [1 pt] Agent Testing Today!

It’s testing time! Not only for you, but for our CS188 robots as well! Circle your favorite robot below.

Any answer was acceptable.

Q2. [12 pts] Search

(a) Consider the class of directed, m × n grid graphs as illustrated above. (We assume that m, n > 2.) Each edge
has a cost of 1. The start state is A_11 at the top left, and the goal state A_mn is at the bottom right.
(i) [2 pts] If we run Uniform-Cost graph search (breaking ties randomly), what is the maximum possible
size of the fringe? Write your answer in terms of m and n in big-O style, ignoring constants. For example,
if you think the answer is m^3 n^3 + 2, write m^3 n^3.

min{m, n}

All edge lengths are 1, so search expands outward in a manner similar to BFS: the fringe is essentially the diagonal
sweeping across the grid starting from (1,1). Its maximum size is O(min{m, n}).
(ii) [2 pts] If we run Depth-First graph search with a stack, what is the maximum possible size of the stack?

m+n

Because all the arrows go right and down, the longest path in the graph is the one from (1,1) to (m,n), which has
length O(m + n).

(b) Now answer the same questions for undirected m × n grid graphs (i.e., the links go in both directions between
neighboring nodes).
(i) [2 pts] Maximum fringe size for Uniform-Cost graph search?

min{m, n}

The nodes k steps from the start state will be exactly the same as in the directed graph, because following any
left or up arrow will just get you to an already-visited state.
(ii) [2 pts] Maximum stack size for Depth-First graph search?

mn

Here, with ties broken randomly, DFS could follow a path that visits every node in the graph before reaching
the goal—e.g., by travelling up and down the columns—hence the stack can grow to O(mn).
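The fringe-size claims in (a)(i) and (b)(i) are easy to check empirically. Below is a minimal sketch (an editorial illustration, not part of the exam; the function name and structure are my own) that runs uniform-cost graph search on an m × n unit-cost grid, with `directed` toggling between the two parts, and reports the largest fringe observed:

```python
import heapq

def max_ucs_fringe(m, n, directed=True):
    """Uniform-cost graph search from (1,1) to (m,n) on a unit-cost grid;
    returns the largest fringe size observed along the way."""
    def neighbors(i, j):
        steps = [(1, 0), (0, 1)] if directed else [(1, 0), (0, 1), (-1, 0), (0, -1)]
        for di, dj in steps:
            if 1 <= i + di <= m and 1 <= j + dj <= n:
                yield (i + di, j + dj)

    fringe, closed, peak = [(0, (1, 1))], set(), 1
    while fringe:
        cost, node = heapq.heappop(fringe)
        if node == (m, n):
            return peak
        if node in closed:
            continue
        closed.add(node)
        for nxt in neighbors(*node):
            if nxt not in closed:
                heapq.heappush(fringe, (cost + 1, nxt))
        peak = max(peak, len(fringe))
    return peak

print(max_ucs_fringe(10, 40))  # grows with min(m, n), up to constant factors
```

Note the fringe may hold duplicate entries (a node pushed from both its left and top neighbor), so the measured peak exceeds the count of distinct frontier nodes by at most a constant factor.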

(c) The following questions are concerned with an undirected grid graph G and with Manhattan distance d((i, j), (k, l)) =
|i − k| + |j − l|. Now the start and goal locations can be anywhere in the graph.
(i) [1 pt] True/False: Let G+ be a copy of G with n extra links added with edge cost 1 that connect arbitrary
non-adjacent nodes. Then Manhattan distance is an admissible heuristic for finding shortest paths in G+.
○ True ● False

4
SID:

Manhattan distance is exact for G, and shortest paths in G+ can be shorter than those in G (e.g., an added link
might go straight to the goal), so Manhattan distance can overestimate and is not admissible.
(ii) [1 pt] True/False: Let G− be a copy of G with n arbitrarily chosen links deleted, and let h+ (s, t) be the
exact cost of the shortest path from location s to location t in G+ . Then h+ (·, g) is an admissible heuristic
for finding shortest paths in G− when the goal is location g.
● True ○ False
The set of paths in G− is a subset of those in G, which is a subset of those in G+, and hence no shorter path
exists in G− than the shortest path in G+, so h+(·, g) is admissible.

5
(iii) [2 pts] Suppose that K robots are at K different locations (x_k, y_k) on the complete grid G, and can move
simultaneously. The goal is for them all to meet in one location as soon as possible, subject to the constraint
that if two robots meet in the same location en route, they immediately settle down together and cannot
move after that. Define dX to be the maximum x separation, i.e., |max{x_1, ..., x_K} − min{x_1, ..., x_K}|,
with dY defined similarly. Which of the following is the most accurate admissible heuristic for this problem?
(Select one only.)
○ ⌈dX/2⌉ + ⌈dY/2⌉
○ dX + dY
○ ⌈dX/2⌉ + ⌈dY/2⌉ + K/4
● max{⌈dX/2⌉, ⌈dY/2⌉, K/4}
○ max{⌈dX/2⌉ + ⌈dY/2⌉, K/4}
The no-meeting requirement means that at most 4 robots can arrive at the goal location in one time step,
so K/4 is a lower bound. Some robot has to travel at least ⌈dX/2⌉ steps, and some robot has to travel at
least ⌈dY/2⌉ steps, so those are lower bounds; but it may not be the same robot. (Consider the K robots
arrayed in a single vertical line and a single horizontal line, with the lines crossing in the middle.) So we
cannot add the bounds.


Q3. [14 pts] Pacman’s Treasure Hunt


Pacman is hunting for gold in a linear grid-world with cells A, B, C, D. Cell A contains the gold, but the entrance to
A is locked. Pacman can pass through if he has a key. The possible states are as follows: X_k means that Pacman
is in cell X and has the key; X_-k means that Pacman is in cell X and does not have the key. The initial state is
always C_-k.

In each state Pacman has two possible actions, left and right. These actions are deterministic but do not change the
state if Pacman tries to enter cell A without the key or runs into a wall (left from cell A or right from cell D). The
key is in cell D, and entering cell D causes the key to be picked up instantly.

If Pacman tries to enter cell A without the key, he receives a reward of −10, i.e., R(B_-k, left, B_-k) = −10. The
"exit" action from cell A receives a reward of 100. All other actions have 0 reward.

[Figure: the corridor A B C D, drawn twice: once for "Pacman has the key" and once for "Pacman does not have the key".]

(a) [2 pts] Consider the discount factor γ = 0.1 and the following policy:

State:  A_k   B_k   C_k   D_k   A_-k  B_-k  C_-k   D_-k
Action: exit  left  left  left  exit  left  right  right

Fill in V^π(B_-k) and V^π(C_-k) for this policy in the table below.

State: A_k  B_k  C_k  D_k  A_-k  B_-k             C_-k  D_-k
V:     100  10   1    0.1  100   −100/9 ≈ −11.11  0.01  N/A
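These values can be reproduced with a few lines of iterative policy evaluation. The sketch below is an editorial illustration under the stated dynamics (the state names and helper function are my own bookkeeping, not from the exam):

```python
gamma = 0.1

def evaluate_policy(iters=100):
    """Iteratively evaluate the fixed policy from part (a)."""
    V = {s: 0.0 for s in ("Ak", "Bk", "Ck", "Dk", "A-k", "B-k", "C-k")}
    for _ in range(iters):
        V["Ak"] = V["A-k"] = 100.0         # exit: reward 100, then terminal
        V["Bk"] = gamma * V["Ak"]          # left into A (has key)
        V["Ck"] = gamma * V["Bk"]
        V["Dk"] = gamma * V["Ck"]
        V["B-k"] = -10 + gamma * V["B-k"]  # left bounces off the locked gate
        V["C-k"] = gamma * V["Dk"]         # right into D picks up the key
    return V

print(evaluate_policy())  # B-k -> -11.11..., C-k -> 0.01
```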

(b) [3 pts] Now, we will redefine the MDP so that Pacman has a probability β ∈ [0, 1], on each attempt, of crashing
through the gate even without the key. So our transition function from B will be modified as follows:

T(B_-k, left, A_-k) = β and T(B_-k, left, B_-k) = 1 − β.

7
All other aspects remain the same. The immediate reward for attempting to go through the gate is −10 if
Pacman fails to go through the gate, as before, and 0 if Pacman succeeds in crashing through the gate. Which of
the following are true? (Select one or more choices.)

■ For any fixed γ < 1, there is some value β < 1 such that trying to crash through the gate is better than
fetching the key.
□ For any fixed β, there is some value of γ < 1 such that trying to crash through the gate is better than
fetching the key.
■ For β = 1/2, there is some value of γ < 1 such that trying to crash through the gate is better than fetching
the key.
□ None of the above

As β approaches 1, crashing through costs little and reaches the exit in fewer steps than fetching the key, so the
first statement holds. The second fails for arbitrary fixed β (e.g., β = 0 never gets through the gate), but for
β = 1/2 a suitable discount γ < 1 makes the shorter crash-through path win.

(c) Thus far we’ve assumed knowledge of the transition function T (s, a, s0 ). Now let’s assume we do not.

(i) [2 pts] Which of the following can be used to obtain a policy if we don't know the transition function?
□ Value Iteration followed by Policy Extraction
■ Approximate Q-learning
□ TD learning followed by Policy Extraction
■ Policy Iteration with a learned T(s, a, s′)

(ii) [1 pt] Under which conditions would one benefit from using approximate Q-learning over vanilla Q-
learning? (Select one only.)
● When the state space is very high-dimensional
○ When the transition function is known
○ When the transition function is unknown
○ When the discount factor is small

(iii) [4 pts] Suppose we choose to use Q-learning (in the absence of the transition function) and we obtain the
following observations:

s_t   a     s_{t+1}   reward
C_k   left  B_k       0
B_k   left  A_k       0
A_k   exit  terminal  100
B_-k  left  B_-k      −10

What values does the Q-function attain if we initialize the Q-values to 0 and replay the experience in the
table exactly two times? Use a learning rate, α, of 0.5 and a discount factor, γ, of 0.1.
1. Q(A_k, exit):  ○ 100  ● 75  ○ 50   ○ 0
2. Q(B_k, left):  ○ 10   ○ 5   ● 2.5  ○ 0
3. Q(C_k, left):  ○ 10   ○ 5   ○ 2.5  ● 0
4. Q(B_-k, left): ○ −10  ○ 5   ○ −2.5 ● −7.5
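These numbers can be verified by replaying the table twice with the standard update Q(s,a) ← (1 − α)Q(s,a) + α(r + γ max_a′ Q(s′,a′)). A minimal sketch (the explicit action list is an assumption; unseen state-action pairs stay at 0):

```python
from collections import defaultdict

alpha, gamma = 0.5, 0.1
Q = defaultdict(float)
transitions = [("Ck", "left", "Bk", 0), ("Bk", "left", "Ak", 0),
               ("Ak", "exit", "terminal", 100), ("B-k", "left", "B-k", -10)]
actions = ("left", "right", "exit")

for _ in range(2):                        # replay the experience twice
    for s, a, s2, r in transitions:
        future = 0.0 if s2 == "terminal" else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * future)

print(Q[("Ak", "exit")], Q[("Bk", "left")], Q[("Ck", "left")], Q[("B-k", "left")])
# 75.0 2.5 0.0 -7.5
```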

(d) Suppose we want to define the (deterministic) transition model using propositional logic instead of a table.
States are defined using proposition symbols A_t, B_t, C_t, D_t, and K_t, where, e.g., A_t means that Pacman is
in cell A at time t and K_t means that Pacman has the key. The action symbols are Left_t, Right_t, Exit_t.

(i) [1 pt] Which of the following statements are correct formulations of the successor-state axiom for A_t?
○ A_t ⇔ (A_{t−1} ⇒ (Left_{t−1} ∧ B_{t−1} ∧ K_{t−1}))
○ A_t ⇔ (A_t ∧ Left_t) ∨ (B_{t−1} ∧ Left_{t−1} ∧ K_{t−1})
● A_t ⇔ (A_{t−1} ∧ Left_{t−1}) ∨ (B_{t−1} ∧ Left_{t−1} ∧ K_{t−1})
○ A_t ⇔ (A_{t−1} ∧ Left_{t−1}) ∨ (B_t ∧ Left_t ∧ K_t)
○ A_t ⇔ (A_{t−1} ∧ Left_{t−1}) ∨ (B_{t−1} ∧ Left_{t−1} ∧ ¬K_{t−1})
(ii) [1 pt] Which of the following statements are correct formulations of the successor-state axiom for K_t?
○ K_t ⇔ K_{t−1} ∧ (C_{t−1} ∨ Right_{t−1})
○ K_t ⇔ K_t ∨ (C_t ∧ Right_t)
● K_t ⇔ K_{t−1} ∨ (C_{t−1} ∧ Right_{t−1})
○ K_t ⇔ K_{t−1} ∨ D_{t−1}

Q4. [9 pts] Inexpensive Elimination
In this problem, we will be using the Bayes Net below. Assume all random variables are binary-valued.

[Figure: the Bayes net for this question, over binary variables A, B, C, D, G (and E, used in parts (b) and (c)).]

(a) [1 pt] Consider using Variable Elimination to get P(C, G | D = d).
What is the factor generated if B is the first variable to be eliminated? Comma-separate all variables in
the resulting factor, e.g., f(A, B, C), without separating conditioned and unconditioned variables. Alphabetically
order variables in your answer, e.g., P(A) before P(B), and P(A|B) before P(A|C).

f(A, C, d, G) = Σ_b P(b|A) P(C|b) P(d|b, C) P(G|A, b, d)

(b) Suppose we want to find the optimal ordering for variable elimination such that we have the smallest sum
of factor sizes. Recall that all random variables in the graph are binary-valued (that means if there are two
factors, one over two variables and another over three variables, the sum of factor sizes is 4+8=12).
In order to pick this ordering, we consider using A* Tree Search. Our state space graph consists of states which
represent the variables we have eliminated so far, and does not take into account the order which they are
eliminated. For example, eliminating B is a transition from the start state to state B, then eliminating A will
result in state AB. Similarly, eliminating A is a transition from the start state to state A, then eliminating B
will also result in state AB. An edge represents a step in variable elimination, and has weight equal to the size
of the factor generated after eliminating the variable.
[Figure: the search graph. States: Start, A, B, E, AB, AE, BE, ABE. Known edge weights: Start→B = 8,
Start→E = 2, A→AB = 2, B→AB = 4, B→BE = 8, E→AE = 2, AB→ABE = 2, AE→ABE = 4. The weights of
Start→A, A→AE, E→BE, and BE→ABE are marked "?".]
(i) [2 pts] Yes/No: As the graph is defined, we have assumed that different elimination orderings of the same
subset of random variables will always produce the same final set of factors. Does this hold for all graphs?
● Yes ○ No
For any subset of variables, we will eventually need to join the same factors and marginalize out all of the
variables in the subset.
(ii) [4 pts] For this part, we consider possible heuristics h(s) for a generic Bayes net which has N variables
left to eliminate at state s. Each remaining variable has domain size D.
Let the set E be the costs of edges from state s (e.g., for s = A, E is the set of costs of edges from A to AE
and from A to AB). Which of the following would be admissible heuristics?
■ min E      □ max E       ■ N ∗ D
□ 2 min E    □ max(E)/N    □ None are admissible
(We could reduce the number of heuristics.) This question takes a bit of thinking, since each of the
heuristics has to be considered separately.
2. 2 min E: consider a situation with just 1 more variable to eliminate. This would predict double the
actual size.
3. N min E: consider a complete graph of 3 variables (a triangle Bayes net). The first step must generate a
factor of size D^2; the next will generate a factor of size D. N ∗ D^2 > D^2 + D.
5. max(E)/N: part 3 is still a counterexample, since the table sizes increase exponentially.
6. N ∗ D: the minimum size of a factor eliminated is D, so N ∗ D is a lower bound.
7. D^K: this is a lower bound for the given state.
8. N ∗ D^K: if we eliminate a variable, K could decrease, so this will not hold. (Think about the case-2
explanation.)

(c) [2 pts] Now let’s consider A* tree search on our Bayes Net for the query P (C, G|D = d), where d is observed
evidence.
Fill in the edge weights (a) − (d) to complete the graph.
(b)
A AB
2 2
(a)
(c)
8 4
Start B AE ABE
2
2 4
(d)

E BE
8

(a) = 4   (b) = 4   (c) = 4   (d) = 4

Solution: we label each edge with the size of the factor generated when the corresponding variable is eliminated:
Start→A = 4, Start→B = 8, Start→E = 2; A→AB = 2, A→AE = 4; B→AB = 4, B→BE = 8; E→AE = 2,
E→BE = 4; AB→ABE = 2, AE→ABE = 4, BE→ABE = 4.
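The edge weights follow mechanically from the factor bookkeeping: join every factor that mentions the eliminated variable, sum it out, and count the entries in the result. A sketch (my own helper, assuming the factor set from part (a) with the evidence D already dropped):

```python
def eliminate(factors, var):
    """Join all factors mentioning var, sum it out, and return the new
    factor list plus the size of the generated factor (binary variables)."""
    touching = [f for f in factors if var in f]
    rest = [f for f in factors if var not in f]
    new_factor = frozenset().union(*touching) - {var}
    return rest + [new_factor], 2 ** len(new_factor)

# P(A), P(B|A), P(C|B), P(d|B,C), P(G|A,B,d) with evidence d dropped:
factors = [frozenset("A"), frozenset("AB"), frozenset("BC"),
           frozenset("BC"), frozenset("ABG")]
print(eliminate(factors, "B")[1])  # 8: the f(A, C, G) factor, edge Start->B
print(eliminate(factors, "A")[1])  # 4: edge Start->A, answer (a)
```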

Q5. [10 pts] Sampling
Variables H, F, D, E, and W denote the events of being health conscious, having free time, following a healthy diet,
exercising, and having a normal body weight, respectively. If an event occurs, we denote it with a +, otherwise
−; e.g., +e denotes exercising and −e denotes not exercising.
• A person is health conscious with probability 0.8.
• A person has free time with probability 0.4.
• If someone is health conscious, they will follow a healthy diet with probability 0.9.

• If someone is health conscious and has free time, then they will exercise with probability 0.9.
• If someone is health conscious, but does not have free time, they will exercise with probability 0.4.
• If someone is not health conscious, but they do have free time, they will exercise with probability 0.3.

• If someone is neither health conscious nor has free time, then they will exercise with probability 0.1.
• If someone follows both a healthy diet and exercises, they will have a normal body weight with probability 0.9.
• If someone only follows a healthy diet and does not exercise, or vice versa, they will have a normal body weight
with probability 0.5.

• If someone neither exercises nor has a healthy diet, they will have a normal body weight with probability 0.2.

(a) [2 pts] Select the minimal set of edges that needs to be added to the following Bayesian network:

[Figure: the candidate Bayes net over nodes W, D, H, E, F.]

□ D→H    ■ F→E    □ D→E    ■ D→W
■ H→D    □ F→D    □ W→D    □ H→F
□ H→W    ■ E→W

(b) Suppose we want to estimate the probability of a person having a normal body weight given that they exercise
(i.e., P(+w | +e)), and we want to use likelihood weighting.
(i) [1 pt] We observe the following sample: (−w, −d, +e, +f, −h). What is our estimate of P(+w | +e) given
this one sample? Express your answer in decimal notation rounded to the second decimal point, or express
it as a fraction simplified to the lowest terms.

0.00
The given sample does not have W = +w.
(ii) [2 pts] Now, suppose that we observe another sample: (+w, +d, +e, +f, +h). What is our new estimate
for P(+w | +e)? Express your answer in decimal notation rounded to the second decimal point, or express
it as a fraction simplified to the lowest terms.

0.75 or 3/4
The likelihood weight for (−w, −d, +e, +f, −h) is 0.3, while the likelihood weight for (+w, +d, +e, +f, +h)
is 0.9. The answer is then given by 0.9 / (0.9 + 0.3) = 0.75.
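The same arithmetic as a short sketch, with the weight table read off the CPT P(+e | H, F) from the bullet list at the start of the question (the variable names are my own):

```python
# P(+e | H, F), from the problem statement.
P_e = {("+h", "+f"): 0.9, ("+h", "-f"): 0.4, ("-h", "+f"): 0.3, ("-h", "-f"): 0.1}

samples = [("-w", "-d", "+e", "+f", "-h"),   # weight 0.3
           ("+w", "+d", "+e", "+f", "+h")]   # weight 0.9

num = den = 0.0
for w, d, e, f, h in samples:
    weight = P_e[(h, f)]       # likelihood of the fixed evidence +e
    den += weight
    if w == "+w":
        num += weight
print(num / den)               # 0.75
```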


(iii) [1 pt] True/False: After 10 iterations, both rejection sampling and likelihood weighting would typically
compute equally accurate probability estimates. Each sample counts as an iteration, and rejecting a
sample counts as an iteration.
○ True ● False
In rejection sampling, we would most likely reject some samples, but we can use all our samples for
likelihood weighting.

(c) Suppose we now want to use Gibbs sampling to estimate P(+w | +e).
(i) [2 pts] We start with the sample (+w, +d, +e, +h, +f), and we want to resample the variable W. What is
the probability of sampling (−w, +d, +e, +h, +f)? Express your answer in decimal notation rounded to
the second decimal point, or express it as a fraction simplified to the lowest terms.

0.10
We would like to compute P(−w | +d, +e, +h, +f), which is equal to P(−w | +d, +e) by the conditional
independence assumptions. Based on the information given at the start of the question, we get
P(−w | +e, +d) = 0.1.
(ii) [1 pt] Suppose we observe the following sequence of samples via Gibbs sampling:

(+w, +d, +e, +h, +f), (−w, +d, +e, +h, +f), (−w, −d, +e, +h, +f), (−w, −d, +e, −h, +f)

What is your estimate of P(+w | +e) given these samples?

0.25
The evidence +e is satisfied by all the samples, and we have W = +w in only one of them, so the
answer is 1/4, or 0.25.
(iii) [1 pt] True/False: While estimating P(+w | +e), the following is a possible sequence of samples that can be
obtained via Gibbs sampling:

(+w, +d, +e, +h, +f), (−w, +d, +e, +h, +f), (−w, −d, +e, +h, +f), (−w, −d, +e, −h, +f), (−w, −d, −e, −h, +f)

○ True ● False
The evidence variable is kept fixed while using Gibbs sampling, which is not satisfied by the last element
of the sequence above.
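For reference, a single Gibbs update of W reduces to sampling from P(W | D, E), since D and E are W's parents and W has no children. A minimal sketch (the table and function names are my own):

```python
import random

# P(+w | D, E), from the bullet list at the start of the question.
P_w = {("+d", "+e"): 0.9, ("+d", "-e"): 0.5, ("-d", "+e"): 0.5, ("-d", "-e"): 0.2}

def resample_w(d, e):
    """One Gibbs step for W: draw from P(W | D=d, E=e)."""
    return "+w" if random.random() < P_w[(d, e)] else "-w"

print(resample_w("+d", "+e"))  # "-w" with probability 0.1, as in part (i)
```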


Q6. [11 pts] HMM Smoothing


Consider the HMM with state variables X_t and observation variables E_t. The joint distribution is given by

P(X_{1:T}, E_{1:T} = e_{1:T}) = P(X_1) ∏_{t=1}^{T−1} P(X_{t+1} | X_t) ∏_{t=1}^{T} P(E_t = e_t | X_t),

where X_{1:T} means X_1, ..., X_T and E_{1:T} = e_{1:T} means E_1 = e_1, ..., E_T = e_T. We learned how the forward
algorithm can be used to solve the filtering problem, which calculates P(X_t | E_{1:t} = e_{1:t}). We will now focus on the
smoothing problem, which calculates P(X_t | E_{1:T} = e_{1:T}), where 1 ≤ t < T, for obtaining a more informed estimate
of the past state X_t given all observed evidence E_{1:T}.

Now define the following vectors of probabilities:

• α(X_t) ≡ P(E_{1:t} = e_{1:t}, X_t), the probability of seeing evidence E_1 = e_1 through E_t = e_t and being in state X_t;
• β(X_t) ≡ P(E_{t+1:T} = e_{t+1:T} | X_t), the probability of seeing evidence E_{t+1} = e_{t+1} through E_T = e_T having
started in state X_t.

(a) [2 pts] Let us consider β(X_{T−1}). Which of the following are equivalent to P(E_T = e_T | X_{T−1})? (Select one or
more.)
■ Σ_{x_t} P(X_T = x_t | X_{T−1}) P(E_T = e_T | X_T = x_t)
■ Σ_{x_t} P(X_T = x_t | X_{T−1}) P(E_T = e_T | X_T = x_t, X_{T−1})
□ P(X_T = x_t | X_{T−1}) P(E_T = e_T | X_T = x_t)
□ P(X_T = x_t | X_{T−1}) P(E_T = e_T | X_T = x_t, X_{T−1})

P(E_T = e_T | X_{T−1}) = Σ_{x_t} P(E_T = e_T, X_T = x_t | X_{T−1})
                       = Σ_{x_t} P(X_T = x_t | X_{T−1}) P(E_T = e_T | X_T = x_t, X_{T−1})
                       = Σ_{x_t} P(X_T = x_t | X_{T−1}) P(E_T = e_T | X_T = x_t)

1st line: marginalization
2nd line: chain rule
3rd line: E_T is independent of X_{T−1} given X_T

(b) [4 pts] In lecture we covered the forward recursion for filtering. An almost identical algorithm can be derived
for computing the sequence α(X_1), ..., α(X_T). For β, we need a backward recursion. What is the appropriate
expression for β(X_t) to implement such a recursion? The expression may have up to four parts, as follows:

P(E_{t+1} = e_{t+1}, ..., E_T = e_T | X_t) = (i) (ii) (iii) (iv)

For each blank (i) through (iv), mark the appropriate subexpression. If it is possible to write the expression
for β(X_t) without a particular subexpression, mark "None."
(i) [1 pt] ○ Σ_{x_{t−1}}   ○ Σ_{x_t}   ● Σ_{x_{t+1}}   ○ None

(ii) [1 pt] ○ α(X_{t−1} = x_{t−1})   ○ α(X_t = x_t)   ○ α(X_{t+1} = x_{t+1})
○ β(X_{t−1} = x_{t−1})   ● β(X_{t+1} = x_{t+1})   ○ None

(iii) [1 pt] ○ P(X_t = x_t | X_{t−1})   ● P(X_{t+1} = x_{t+1} | X_t)
○ P(X_t | X_{t−1} = x_{t−1})   ○ P(X_{t+1} | X_t = x_t)   ○ None

(iv) [1 pt] ○ P(E_{t−1} = e_{t−1} | X_{t−1})   ○ P(E_t = e_t | X_t)   ○ P(E_{t+1} = e_{t+1} | X_{t+1})
○ P(E_{t−1} = e_{t−1} | X_{t−1} = x_{t−1})   ○ P(E_t = e_t | X_t = x_t)   ● P(E_{t+1} = e_{t+1} | X_{t+1} = x_{t+1})
○ None

P(E_{t+1:T} = e_{t+1:T} | X_t)
= Σ_{x_{t+1}} P(E_{t+1:T} = e_{t+1:T}, X_{t+1} = x_{t+1} | X_t)
= Σ_{x_{t+1}} P(E_{t+1:T} = e_{t+1:T} | X_{t+1} = x_{t+1}, X_t) P(X_{t+1} = x_{t+1} | X_t)
= Σ_{x_{t+1}} P(E_{t+1:T} = e_{t+1:T} | X_{t+1} = x_{t+1}) P(X_{t+1} = x_{t+1} | X_t)
= Σ_{x_{t+1}} P(E_{t+2:T} = e_{t+2:T} | X_{t+1} = x_{t+1}) P(E_{t+1} = e_{t+1} | X_{t+1} = x_{t+1}, E_{t+2:T} = e_{t+2:T}) P(X_{t+1} = x_{t+1} | X_t)
= Σ_{x_{t+1}} P(E_{t+2:T} = e_{t+2:T} | X_{t+1} = x_{t+1}) P(E_{t+1} = e_{t+1} | X_{t+1} = x_{t+1}) P(X_{t+1} = x_{t+1} | X_t)
= Σ_{x_{t+1}} β(X_{t+1} = x_{t+1}) P(E_{t+1} = e_{t+1} | X_{t+1} = x_{t+1}) P(X_{t+1} = x_{t+1} | X_t)

First equals sign: marginalization
Second equals sign: chain rule
Third equals sign: E_{t+1:T} is independent of X_t given X_{t+1}
Fourth equals sign: chain rule
Fifth equals sign: E_{t+1} is independent of E_{t+2:T} given X_{t+1}
Sixth equals sign: definition of β
Rearranging terms gives us the answer.
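The recursion translates directly into code. A sketch (not from the exam), assuming a time-homogeneous transition matrix trans[i][j] = P(X_{t+1} = j | X_t = i) and precomputed evidence likelihoods emit[t][i] = P(E_t = e_t | X_t = i):

```python
def backward(trans, emit):
    """Compute beta[t][i] = P(E_{t+1:T} = e_{t+1:T} | X_t = i) for all t."""
    T, K = len(emit), len(trans)
    beta = [[1.0] * K for _ in range(T)]   # beta at t = T is 1 by convention
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j]
                             for j in range(K))
    return beta                            # O(K^2 T) total work, as in part (c)
```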

(c) [1 pt] If the number of values that each X_t can take on is K and the number of timesteps is T, what is the
total computational complexity of calculating β(X_t) for all t, where 1 ≤ t ≤ T?

○ O(K)   ○ O(T)   ● O(K²T)   ○ O(KT²)   ○ O(K²T²)   ○ None
For each t, we have to sum over every value of X_{t+1} for every value of X_t.

(d) [2 pts] Which of the following expressions are equivalent to P(X_t = x_t | E_{1:T} = e_{1:T})? (Select one or more.)
□ Σ_{x′_t} α(X_t = x′_t) β(X_t = x′_t)
□ α(X_t = x_t) β(X_t = x_t)
■ α(X_t = x_t) β(X_t = x_t) / Σ_{x′_t} α(X_t = x′_t) β(X_t = x′_t)
■ α(X_t = x_t) β(X_t = x_t) / Σ_{x′_T} α(X_T = x′_T)

We observe that P(E_{1:T} = e_{1:T}) = Σ_{x′_t} α(X_t = x′_t) β(X_t = x′_t) = Σ_{x′_T} α(X_T = x′_T). We also observe
that P(E_{1:T} = e_{1:T}, X_t = x_t) = α(X_t = x_t) β(X_t = x_t). We can use Bayes' rule to compute

P(X_t = x_t | E_{1:T} = e_{1:T}) = P(E_{1:T} = e_{1:T}, X_t = x_t) / P(E_{1:T} = e_{1:T})


(e) [2 pts] If the number of values that each X_t can take on is K and the number of timesteps is T, what is the
lowest total computational complexity of calculating P(X_t | E_{1:T} = e_{1:T}) for all t, where 1 ≤ t ≤ T?

○ O(K)   ○ O(T)   ● O(K²T)   ○ O(KT²)   ○ O(K²T²)   ○ None
It takes O(K²T) time to calculate all of the β's and similarly O(K²T) time to calculate all of the α's. Based
on part (d), it is sufficient to perform a single forward pass to compute all the α's and a single backward pass
to compute all the β's in order to calculate P(X_t | E_{1:T} = e_{1:T}) for all t, where 1 ≤ t ≤ T. We cannot do better
because we need to at least compute P(E_{1:T} = e_{1:T}), which requires at least O(K²T) time.

Q7. [8 pts] Partying Particle #No Filter(ing)
Algorithm 1 Particle Filtering
1: procedure ParticleFiltering(T, N)            ▷ T: number of time steps, N: number of sampled particles
2:   x ← sample N particles from initial state distribution P(X_0)              ▷ Initialize
3:   for t ← 0 to T − 1 do                      ▷ X_t: hidden state, E_t: observed evidence
4:     x_i ← sample particle from P(X_{t+1} | X_t = x_i) for i = 1, ..., N      ▷ Time Elapse Update
5:     w_i ← P(E_{t+1} | X_{t+1} = x_i) for i = 1, ..., N                       ▷ Evidence Update
6:     x ← resample N particles according to weights w                          ▷ Particle Resampling
7:   end for
8:   return x
9: end procedure

Algorithm 1 outlines the particle filtering algorithm discussed in lecture. The variable x represents a list of N
particles, while w is a list of N weights for those particles.
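As a point of reference, Algorithm 1 is only a few lines of Python. This is a sketch, not the exam's code; sample_initial, sample_transition, and evidence_likelihood are assumed stand-ins for P(X_0), P(X_{t+1} | X_t), and P(E_{t+1} | X_{t+1}):

```python
import random

def particle_filter(T, N, sample_initial, sample_transition,
                    evidence_likelihood, evidence):
    x = [sample_initial() for _ in range(N)]                    # line 2
    for t in range(T):
        x = [sample_transition(xi) for xi in x]                 # line 4
        w = [evidence_likelihood(evidence[t], xi) for xi in x]  # line 5
        x = random.choices(x, weights=w, k=N)                   # line 6
    return x
```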
(a) Here, we consider the unweighted particles in x as approximating a distribution.
(i) [1 pt] After executing line 4, which distribution do the particles x represent?
● P(X_{t+1} | E_{1:t})   ○ P(X_{1:t+1} | E_{1:t})   ○ P(X_{t+1} | E_{1:t+1})   ○ None

(ii) [1 pt] After executing line 6, which distribution do the particles x represent?
○ P(X_{t+1} | X_t, E_{1:t+1})   ○ P(X_{1:t+1} | E_{1:t+1})   ● P(X_{t+1} | E_{1:t+1})   ○ None

1. Before line 4, x estimated P(X_t | E_{1:t}). Line 4 is the time elapse update, sampling from the distribution
P(X_{t+1} | X_t = x_i). After the update, x estimates P(X_{t+1} | E_{1:t}). Note the second choice is estimating too many
states, and the third choice depends on E_{t+1}, which has yet to be incorporated.
2. After we weight each particle by the evidence, normalize, and then resample, our new states x approximate
P(X_{t+1} | E_{1:t+1}), which is the distribution we want for HMMs.

(b) The particle filtering algorithm should return a sample-based approximation to the true posterior distribution
P (XT | E1:T ). The algorithm is consistent if and only if the approximation converges to the true distribution
as N → ∞. In this question, we present several modifications to Algorithm 1. For each modification, indicate
if the algorithm is still consistent or not consistent, and if it is consistent, indicate whether you expect it
to be more accurate in general in terms of its estimate of P (XT | E1:T ) (i.e., you would expect the estimated
distribution to be closer to the true one) or less accurate. Assume unlimited computational resources and
arbitrary precision arithmetic.

(i) [2 pts] We modify line 6 to sample 1 or 2N − 1 particles with equal probability p = 0.5 for each time step
(as opposed to a fixed number of particles N). You can assume that P(E_{t+1} | X_{t+1}) > 0 for all observations
and states. This algorithm is:
○ Consistent and More Accurate   ● Not Consistent
○ Consistent and Less Accurate

This algorithm is not consistent because, with probability tending to one as the number of time steps grows,
some step will sample only one particle. Once we sample 1 particle, the algorithm can no longer represent the
posterior distribution with arbitrary precision, even as N → ∞.
(ii) [1 pt] Replace lines 4–6 as follows:
4′: Compute a tabular representation of P(X_t = s | E_{1:t}) based on the proportion of particles in state s.
5′: Use the forward algorithm to calculate P(X_{t+1} | E_{1:t+1}) exactly from the tabular representation.
6′: Set x to be a sample of N particles from P(X_{t+1} | E_{1:t+1}).
This algorithm is:

● Consistent and More Accurate   ○ Not Consistent
○ Consistent and Less Accurate

This algorithm is consistent. It is more accurate since the particle filtering algorithm is a consistent
approximator while the forward algorithm computes the exact distribution. The forward algorithm might
be computationally intractable for large HMMs (a key reason to use particle filtering), but we are instructed
to ignore resource limitations.
(iii) [1 pt] At the start of the algorithm, we initialize each entry in w to 1. Keep line 4, but replace lines 5
and 6 with the following multiplicative update:
5′: For i = 1, ..., N do
6′:   w_i ← w_i · P(E_{t+1} | X_{t+1} = x_i).
Finally, only at the end of the T iterations, we resample x according to the cumulative weights w just
like in line 6, producing a list of particle positions. This algorithm is:
○ Consistent and More Accurate   ○ Not Consistent
● Consistent and Less Accurate

This algorithm is consistent. Indeed, it is equivalent to likelihood weighting, which is consistent. One
can also see this result by comparison to the particle filtering algorithm. The time elapse updates are
unmodified. The observation updates are weighted by P(E_t | X_t) as in particle filtering. The only difference
is that normalization and resampling happen at the end of the loop rather than as part of each observation
update. It is less accurate since, in practice, the error for fixed N grows exponentially with T; but for fixed
T it still converges as N goes to ∞.

(c) [2 pts] Suppose that instead of particle filtering we run the following algorithm on the Bayes net with T time
steps corresponding to the HMM:
1. Fix all the evidence variables E_{1:T} and initialize each X_t to a random value x_t.
2. For i = 1, ..., N do
• Choose a variable X_t uniformly at random from X_1, ..., X_T.
• Resample X_t according to the distribution P(X_t | X_{t−1} = x_{t−1}, X_{t+1} = x_{t+1}, E_t = e_t).
• Record the value of X_T as a sample.
Finally, estimate P(X_T = s | E_{1:T} = e_{1:T}) by the proportion of samples with X_T = s. This algorithm is:

● Consistent   ○ Not Consistent

This is just Gibbs sampling. It will converge to the true posterior distribution P(X_{1:T} | E_{1:T}), even though
the sample only changes when X_T is sampled.


Q8. [11 pts] Double Decisions and VPI


In both parts of this problem, you are given a decision network with two decision nodes: D1 and D2 . Your total
utility is the sum of two subparts: Utotal = U1 + U2 , where U1 only depends on decision D1 and U2 only depends on
decision D2 .

(a) Consider the following decision network:

[Figure: decision network with chance nodes F, G, A, B, decision nodes D_1, D_2, and utility nodes U_1 (fed by
D_1 and F) and U_2 (fed by D_2 and G); A reaches U_1 only through F, and B reaches U_2 only through G.]

For each subpart below, select all comparison relations that are true or could be true.

(i) [1 pt] V PI({A, B}) vs. V PI(A) + V PI(B):   □ >   ■ =   □ <
(ii) [1 pt] V PI({A, F}) vs. V PI(A) + V PI(F):  □ >   ■ =   ■ <

(i) Information about variable A only affects the decision D_1 and utility U_1. Information about variable B only
affects the decision D_2 and utility U_2. Therefore, the VPI of the two random variables is additive.
(ii) The node U_1 is d-separated from A, conditioned on F. Moreover, neither A nor F has any effect on
utility U_2. This implies that V PI({A, F}) = V PI(F). Since V PI(A) ≥ 0, it follows that
V PI({A, F}) ≤ V PI(A) + V PI(F).

(iii) [1 pt] V PI({B, G}) vs. V PI(G):   □ >   ■ =   □ <

(iii) The node U_2 is d-separated from B, conditioned on G. Moreover, neither B nor G has any effect on
utility U_1. This implies that V PI({B, G}) = V PI(G).

(b) Now consider the following decision network:

[Figure: a second decision network over the same nodes F, G, D_1, U_1, A, B, U_2, D_2, with a different arrow
structure; the d-separation properties used below differ from part (a).]

For each subpart below, select all comparison relations that are true or could be true.

(i) [2 pts] V PI({F, G}) vs. V PI(F) + V PI(G):  □ >   ■ =   □ <
(ii) [2 pts] V PI({A, F}) vs. V PI(A) + V PI(F): ■ >   ■ =   ■ <

(i) The node F is d-separated from U_2, meaning that the utility U_2 is independent of F. Likewise, the utility
U_1 does not depend on G. Therefore (absent any information about A or B), F only influences utility U_1 and
G only influences utility U_2, so their VPI is additive.
(ii) Suppose that G ⊥ A and G ⊥ F | A (recall that in a Bayes net, there may be additional independence
relations other than those implied by the graph structure). In that case, random variable A does not impact
utility U_2 at all, and we would have V PI({A, F}) ≤ V PI(A) + V PI(F), just like in part (a)(ii) above. Now
suppose instead that G ⊥ A but that F and G are not independent conditioned on A. In that case, A and F
together provide information that affects utility U_2, but not separately. We could then have V PI({A, F}) >
V PI(A) + V PI(F).

(iii) [2 pts] V PI({B, G}) vs. V PI(G):   ■ >   ■ =   □ <

(iii) V PI({B, G}) = V PI(B|G) + V PI(G). Unlike in part (a)(iii) above, V PI(B|G) can be positive because B is
no longer d-separated from utility node U_1 (conditioned on G). Since V PI(B|G) ≥ 0, V PI({B, G}) ≥ V PI(G).

(c) [2 pts] Select all statements that are true. For the first two statements, consider the decision network above.

■ If A, B are independent, then V PI(B) = 0
□ If A, B are guaranteed to be dependent, then V PI(B) > 0
■ In general, Gibbs sampling can estimate the probabilities used when calculating VPI
□ In general, particle filtering can estimate the probabilities used when calculating VPI

Since A, B are independent, V PI(B) = 0. For the second statement, even though there is a guaranteed dependence
relationship, the VPI may still be zero if the optimal decision does not change after observing the evidence (the
probabilities change, but one decision may always be much better than the other). The third statement is true, as
you can use Gibbs sampling to sample from a Bayes net. On the other hand, particle filtering is for HMMs, so you
cannot use it here.


Q9. [11 pts] Naively Fishing


Pacman has developed a hobby of fishing. Over the years, he has learned that a day can be considered fit or unfit
for fishing (Y), which results in three features: whether or not Ms. Pacman can show up (M), the temperature of
the day (T), and how high the water level is (W). Pacman models it as the following Naive Bayes classification
problem:

[Figure: Naive Bayes model with class Y and features W, T, M (edges Y → W, Y → T, Y → M).]

(a) We wish to calculate the probability a day is fit for fishing given features of the day. Consider the conditional
probability tables that Pacman has estimated over the years:

Y    P(Y)      M    Y    P(M|Y)     W     Y    P(W|Y)     T     Y    P(T|Y)
yes  0.1       yes  yes  0.5        high  yes  0.1        cold  yes  0.2
no   0.9       no   yes  0.5        low   yes  0.9        warm  yes  0.2
               yes  no   0.2        high  no   0.5        hot   yes  0.5
               no   no   0.8        low   no   0.5        cold  no   0.1
                                                          warm  no   0.2
                                                          hot   no   0.6
(i) [1 pt] Using the method of Naive Bayes, what are these conditional probabilities, calculated from the
conditional probability tables above? Fill in your final, decimal answer in the boxes below.

P(Y = yes | M = yes, T = cold, W = high) = 0.1

P(Y = no | M = yes, T = cold, W = high) = 0.9

(ii) [1 pt] Using the method of Naive Bayes, do we predict that the day is fit for fishing if Ms. Pacman is
available, the weather is cold, and the water level is high?

○ Fit for fishing   ● Not fit for fishing
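A quick numeric check of part (a), multiplying the class prior by the three feature likelihoods and normalizing (the dictionary names are my own):

```python
P_Y = {"yes": 0.1, "no": 0.9}
P_M_yes = {"yes": 0.5, "no": 0.2}    # P(M = yes | Y)
P_T_cold = {"yes": 0.2, "no": 0.1}   # P(T = cold | Y)
P_W_high = {"yes": 0.1, "no": 0.5}   # P(W = high | Y)

score = {y: P_Y[y] * P_M_yes[y] * P_T_cold[y] * P_W_high[y] for y in P_Y}
Z = sum(score.values())
print({y: s / Z for y, s in score.items()})  # {'yes': 0.1, 'no': 0.9}
```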

(b) Assume for this problem we do not have estimates for the conditional probability tables, and that Pacman is
still using a Naive Bayes model. Write down an expression for each of the following queries. Express your
solution using the conditional probabilities P(M|Y), P(T|Y), P(W|Y), and P(Y) from the Naive Bayes model.
(i) [1 pt] Pacman now wishes to find the probability of Ms. Pacman being available or not given that the
temperature is hot and the water level is low. Select all expressions that are equal to P(M | T, W).

■ [Σ_f P(f) P(M|f) P(T|f) P(W|f)] / [Σ_f P(f) P(W|f) P(T|f)]
□ [Σ_f P(M|f) P(T|f) P(W|f)] / [Σ_f P(T|f) P(W|f)]
□ Σ_f P(M|f)
□ None of the above

(ii) [2 pts] Pacman now wants to choose the class that gives the maximum P(features | class), that is,
choosing the class that maximizes the probability of seeing the features. Write an expression that is equal
to P(M, T, W | Y).

P(M, T, W | Y) = P(M|Y) P(T|Y) P(W|Y)

(iii) [2 pts] Assume that Pacman is equally likely to go fishing as he is to not, i.e., P(Y = yes) = P(Y = no).
Which method would give the correct Naive Bayes classification of whether a day is a good day for fishing
if Pacman observes values for M, T, and W?
■ arg max_y P(M, T, W | Y = y)   ■ arg max_y P(Y = y | M, T, W)   □ None of the above

(c) Assume Pacman now has the underlying Bayes net model, and the conditional probability tables from the
previous parts do not apply. Recall that predictions are made under Naive Bayes classification using the
conditional probability P(Y | W, T, M) and the Naive Bayes assumption that features are independent given
the class. We wish to explore whether the Naive Bayes model is guaranteed to be able to represent the true
distribution.
For each of the following true distributions, select 1) if the Naive Bayes modeling assumption holds and 2) if the
Naive Bayes model is guaranteed to be able to represent the true conditional probability, P(Y | W, T, M).
(i) [2 pts]

[Figure: the true-distribution Bayes net for this subpart, over Y, W, T, and M.]

Does the Naive Bayes modeling assumption hold here?   ○ Yes  ● No
Can the Naive Bayes model represent the true conditional distribution P(Y | W, T, M)?   ● Yes  ○ No

The Naive Bayes assumption does not hold, as W is not independent of M given the class Y. However,
based on the true distribution model, we can conclude that P(Y | W, M, T) = P(Y | W, T). In other words,
M is not required to model the conditional distribution. If we set P(M|Y) = constant (the same for both
values of Y), the new Bayes net model will also have P(Y | W, M, T) = P(Y | W, T).
(ii) [2 pts]

[Figure: the true-distribution Bayes net for this subpart, with intermediate nodes A and B between Y and
the features W, M, T.]

Does the Naive Bayes modeling assumption hold here?   ○ Yes  ● No
Can the Naive Bayes model represent the true conditional distribution P(Y | W, T, M)?   ○ Yes  ● No

The Naive Bayes assumption does not hold, as M is not independent of T given the class Y. As for why
Naive Bayes can't represent the true distribution... it's annoying to show, so I'm not gonna :P.


Q10. [13 pts] Neural Networks and Decision Trees


(a) Given the Boolean function represented by the accompanying truth table, indicate which of the models can
perfectly represent this function for some choice of parameters. Where no constraints are stated on the
architecture (e.g., the number of neurons, activation function), answer yes if there exists some architecture
which could represent the function.
(i) [2 pts] f(x, y) = x ⊕ y can be modelled by:

x  y  x ⊕ y
0  0  0
0  1  1
1  0  1
1  1  0

□ A neural network with a single layer (no hidden layer)
■ A neural network with two layers (a single hidden layer)
□ A decision tree of depth one
■ A decision tree of depth two
A perceptron (no-hidden-layer neural network) g(x, y) = α(β(x, y)), where β(x, y) = ax + by + c and α is
a (monotonic) activation function, cannot correctly classify the XOR function. Note β(1, 1) = a + b + c and
β(0, 0) = c. WLOG, suppose a + b + c ≥ c (we can always flip the signs of all coefficients and reflect g to make this
true). Suppose for contradiction that (0, 0) and (1, 1) are both correctly classified; then α(a + b + c) = 0 = α(c).
Since the activation function α must be monotonic, it follows that α(x) must also be 0 for all x ∈ [c, a + b + c].
Note that at least one of β(1, 0) = a + c or β(0, 1) = b + c must lie in [c, a + b + c]. Thus at least one of (1, 0)
or (0, 1) is misclassified, giving a contradiction.
By contrast, a neural network with a hidden layer has sufficient capacity to represent XOR. Indeed, the
universal approximation theorem shows that a single-hidden-layer network can approximate any function given
enough neurons. For XOR, a small network suffices: g(x, y) = sgn(B · sgn(A [x; y])), where sgn(x) is −1 when
x < 0, 0 when x = 0, and +1 when x > 0, and A = [[1, −1], [−1, 1]], B = [1, 1].
A decision tree of depth one can only split on a single variable. Since XOR depends on the values of both
variables, no tree of depth one can represent it.
A decision tree of depth two can represent any two-variable Boolean function, including XOR.
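To make the depth-two claim concrete, here is one such tree written as nested conditionals (an illustrative sketch, not from the exam; split on x at the root, then on y at each child):

```python
def xor_tree(x, y):
    """A depth-two decision tree computing x XOR y."""
    if x == 0:
        return 0 if y == 0 else 1   # leaves for the x = 0 branch
    else:
        return 1 if y == 0 else 0   # leaves for the x = 1 branch

assert [xor_tree(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```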
(ii) [2 pts] f(x, y) = ¬(x ∨ y) can be modelled by:

x  y  ¬(x ∨ y)
0  0  1
0  1  0
1  0  0
1  1  0

■ A neural network with a single layer (no hidden layer)
■ A neural network with two layers (a single hidden layer)
□ A decision tree of depth one
■ A decision tree of depth two

A single-layer neural network g(x, y) = sgn(−x − y + 1) classifies f(x, y) correctly. A two-layer neural network
can also classify it (we can always make the second layer an identity).
A decision tree of depth one cannot represent f since it depends on two variables. A decision tree of depth two
can represent it (and any other two-variable Boolean function).

(b) Ada is training a single layer (no hidden layer) neural network to predict a one-dimensional value from a
one-dimensional input:
f(x) = g(W x + b)
where g(y) = relu(y) = max(0, y). She initializes her weight and bias parameters to be:

W = −1,  b = 0.

(i) [1 pt] The derivative of the ReLU function takes the form:

relu′(y) = { s  if y > u;  t  if y < u }

Select the appropriate choice of s, t, and u below.

s is equal to:     t is equal to:     u is equal to:
○ s = 0            ● t = 0            ○ u = −1
● s = 1            ○ t = 1            ● u = 0
○ s = y            ○ t = y            ○ u = 1

ReLU is a piecewise linear function. For y ≥ 0, relu(y) = y, which has derivative 1. For y < 0,
relu(y) = 0, which has derivative 0. (Note that at y = 0 the one-sided derivatives from the left and right
differ, but this was not required to answer the question.) So:

relu′(y) = { 1  if y > 0;  0  if y < 0 }

(ii) [2 pts] Compute the following partial derivatives of f with respect to the weight W, assuming the same
parameters W and b as above.

∂f/∂W at x = −1:  −1        ∂f/∂W at x = 1:  0

Note ∂f/∂W = relu′(W x + b) · x by the chain rule. At x = −1, W x + b = 1 and so relu′(W x + b) = 1; thus
∂f/∂W = −1. At x = 1, W x + b = −1 and relu′(W x + b) = 0, so ∂f/∂W = 0.
∂f
(iii) [2 pts] For inputs x > 0, what range of values will ∂W take on for the current values of W and b?

∂f
0 ≤ ∂W ≤ 0
Note with W = −1 and b = 0, W x + b < 0 for all x > 0. Thus relu0 (W x + b) = 0 for all x > 0. Thus
∂f
∂W = 0 for all x > 0.
(iv) [1 pt] Suppose we now use gradient descent to train W and b to minimize a squared loss. Assume the
training data consists only of inputs x > 0. For the given weight initialization, which of the following
activation functions will result in the loss decreasing over time? You may find your answer to (b)(iii)
helpful.
□ Rectified Linear Unit: g(y) = relu(y)
■ Hyperbolic Tangent: g(y) = tanh y
■ Sigmoid Function: g(y) = σ(y)

When using a ReLU, (b)(iii) shows that there will never be a gradient through W for data points x > 0. By
similar reasoning, there is also never a gradient through b. Accordingly, the parameters will never update, and
so the loss will stay constant over time. This problem is known as a "dead ReLU" and is a common problem
in training neural networks.
Other activation functions such as tanh and σ do not suffer from this problem: although their gradients tend
towards zero for extreme values, they are never exactly zero.
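The dead ReLU is easy to see numerically: at W = −1, b = 0, the gradient of the squared loss with respect to W is exactly zero for every x > 0. A sketch (the function name and sample values are illustrative assumptions):

```python
def grad_W(W, b, x, target):
    """d/dW of (relu(W*x + b) - target)^2 via the chain rule."""
    pre = W * x + b
    relu_grad = 1.0 if pre > 0 else 0.0   # relu'(pre)
    out = max(0.0, pre)
    return 2 * (out - target) * relu_grad * x

print(grad_W(-1.0, 0.0, x=2.0, target=5.0))  # 0.0: no learning signal
```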

(c) [1 pt] Suppose you have found a learning rate α_b that achieves low loss when using batch gradient descent. A
learning rate α_s for stochastic gradient descent that achieves similarly low loss would be expected to be:
○ Higher than for batch gradient descent: α_s > α_b
● Lower than for batch gradient descent: α_s < α_b

The stochastic gradient estimator has higher variance than the batch gradient, which averages the gradient across
a batch of data points. Accordingly, a lower learning rate should be used for stochastic than for batch gradient
descent.


(d) Your friend Alex is training a deep neural network, AlexNet, to classify images. He observes that the training
loss is low, but that the loss on a held-out validation dataset is high. Validation loss can be improved by:
(i) [1 pt]
● Decreasing the number of layers.
○ Increasing the number of layers.
Decreasing the number of layers increases model bias but decreases variance, which has a regularizing
effect.
(ii) [1 pt]
● Decreasing the size of each hidden layer.
○ Increasing the size of each hidden layer.
Decreasing the number of parameters at each layer increases model bias but decreases variance, which has
a regularizing effect.
