
3 MDP (12 points)

1. (3 pts) In MDPs, the values of states are related by the Bellman equation:

U(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) U(s')

where R(s) is the reward associated with being in state s. Suppose now we wish the reward to
depend on actions; i.e. R(a, s) is the reward for doing a in state s. How should the Bellman
equation be rewritten to use R(a, s) instead of R(s)?

U(s) = max_a [ R(a, s) + γ Σ_{s'} P(s'|s, a) U(s') ]

2. (9 pts) Can any search problem with a finite number of states be translated into a Markov
decision problem, such that an optimal solution of the latter is also an optimal solution of
the former? If so, explain precisely how to translate the problem AND how to translate the
solution back; illustrate your answer on a 3 state search problem of your own choosing. If not
give a counterexample.
Yes, a finite search problem is a deterministic MDP:

• states in the search problem are states in the MDP;
• if Sj is a successor of Si in the search problem, call this action a_{i,j} and set P(Sj | Si, a_{i,j}) = 1; all other transition probabilities are 0;
• R(a_{i,j}, Si) = −cost(Si, Sj);
• goal nodes in the search problem become terminal states with no available actions.

To translate an optimal policy back to a search path, simply start at the start state and follow
the optimal action until the goal is reached.
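To illustrate on a small 3-state search problem of our own choosing (the state names, costs, and goal below are assumptions made for this example, not given in the original), here is a minimal sketch of the translation:

```python
# A minimal sketch of the search-problem -> deterministic-MDP translation,
# illustrated on a made-up 3-state search problem (states, costs, and goal
# are assumptions chosen for illustration only).

# Search problem: S1 is the start, S3 is the goal.
successors = {          # state -> list of (successor, step cost)
    "S1": [("S2", 2), ("S3", 5)],
    "S2": [("S3", 1)],
    "S3": [],            # goal: terminal, no successors
}

# Translation to a deterministic MDP.
P = {}      # P[(s, a)] = {s': prob}
R = {}      # R[(a, s)] = reward for doing a in s
for s, succs in successors.items():
    for s2, cost in succs:
        a = ("move", s, s2)          # action a_{i,j}
        P[(s, a)] = {s2: 1.0}        # deterministic transition
        R[(a, s)] = -cost            # reward = negative step cost

# Translating an optimal policy back into a search path:
def policy_to_path(policy, start="S1", goal="S3"):
    path, s = [start], start
    while s != goal:
        a = policy[s]                # follow the optimal action
        s = a[2]                     # deterministic next state
        path.append(s)
    return path

# e.g. the optimal policy here is S1 -> S2 -> S3 (total cost 3 < 5):
print(policy_to_path({"S1": ("move", "S1", "S2"), "S2": ("move", "S2", "S3")}))
```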

4 Reinforcement Learning (13 points)
Consider an MDP with three states, called A, B and C, arranged in a loop.
[Figure: three states A, B, C arranged in a loop (A → B → C → A). Each Move transition to the next state has probability 0.8, with a 0.2 self-loop at each state; R(C) = 1.]
There are two actions available in each state:

• Move: with probability 0.8, moves to the next state in the loop; with probability 0.2,
stays in the same state.
• Stay: with probability 1.0, stays in the same state.

There is a reward of 1 in state C and zero reward elsewhere. The agent starts in state A.
Assume that the discount factor is 0.9, that is, γ = 0.9.

1. (6 pts) Show the values of Q(a, s) for 3 iterations of the TD Q-learning algorithm (equation
21.8 in Russell & Norvig):

Q(a, s) ← Q(a, s) + α ( R(s) + γ max_{a'} Q(a', s') − Q(a, s) )
Let α = 1; note the simplification that follows: the update becomes Q(a, s) ← R(s) + γ max_{a'} Q(a', s'). Assume we always pick the Move
action and end up moving to the adjacent state. That is, we see a state-action sequence: A,
Move, B, Move, C, Move, A. The Q values start out as 0.

This wording is a bit ambiguous. If we display the values after each action within a single trial, we
get:
iter=0 iter=1 iter=2 iter=3
Q(Move,A) 0 0 0 0
Q(Stay,A) 0 0 0 0
Q(Move,B) 0 0 0 0
Q(Stay,B) 0 0 0 0
Q(Move,C) 0 0 0 1
Q(Stay,C) 0 0 0 0
If we think about the 3 actions as a trial and show what happens after each full trial, we get:
iter=0 iter=1 iter=2 iter=3
Q(Move,A) 0 0 0 0.81
Q(Stay,A) 0 0 0 0
Q(Move,B) 0 0 0.9 0.9
Q(Stay,B) 0 0 0 0
Q(Move,C) 0 1 1 1
Q(Stay,C) 0 0 0 0
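As a sanity check, here is a small Python sketch (ours, not part of the original solution) that replays the first two trials of the trial-based reading with α = 1 and reproduces the corresponding columns of the second table:

```python
# Replays TD Q-learning with alpha = 1 on the A-B-C loop: the agent always
# chooses Move and always lands in the next state. With alpha = 1 the update
# simplifies to  Q(Move, s) <- R(s) + gamma * max_a' Q(a', s').
gamma = 0.9
R = {"A": 0.0, "B": 0.0, "C": 1.0}
Q = {(a, s): 0.0 for a in ("Move", "Stay") for s in "ABC"}
nxt = {"A": "B", "B": "C", "C": "A"}

for trial in (1, 2):
    for s in "ABC":          # observed sequence A, Move, B, Move, C, Move, A
        s2 = nxt[s]
        Q[("Move", s)] = R[s] + gamma * max(Q[(a, s2)] for a in ("Move", "Stay"))
    print(f"after trial {trial}:", {s: round(Q[("Move", s)], 2) for s in "ABC"})
# after trial 1: {'A': 0.0, 'B': 0.0, 'C': 1.0}
# after trial 2: {'A': 0.0, 'B': 0.9, 'C': 1.0}
# A third trial backs the value up one more step: Q(Move, A) becomes
# 0.9 * Q(Move, B) = 0.81, as in the last column of the table above.
```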

2. (2 pts) Characterize the weakness of Q-learning demonstrated by this example. Hint. Imagine
that the chain were 100 states long.
The Q-learning updates are very local: each update propagates value back only one step per
trial. In a chain of 100 states we would have to go around roughly 100 times before every
state had a non-zero Q-value.

3. (3 pts) Why might a solution based on ADP (adaptive dynamic programming) be better than
Q-learning?
ADP updates all the states to make optimal use of whatever information we have discovered.
So, for example, all states leading to a good state would have their values increased.
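As an illustration (a minimal sketch of the ADP idea, ours rather than part of the original solution): once the agent has estimated a model of the loop MDP above, it can re-solve that model after each new observation, e.g. by value iteration, so the reward discovered at C raises the values of A and B in a single planning step rather than one trial at a time. For brevity the "learned" model below is simply the true Move/Stay model, assuming enough transitions have been observed.

```python
# ADP-style step for the A-B-C loop: solve the (estimated) model with value
# iteration. Here the "learned" model is just the true model of the Move and
# Stay actions, i.e. we assume every transition has been observed.
gamma = 0.9
R = {"A": 0.0, "B": 0.0, "C": 1.0}
move = {"A": {"B": 0.8, "A": 0.2},      # estimated P(s' | s, Move)
        "B": {"C": 0.8, "B": 0.2},
        "C": {"A": 0.8, "C": 0.2}}

U = {s: 0.0 for s in "ABC"}
for _ in range(200):                    # value iteration on the learned model
    U = {s: R[s] + gamma * max(sum(p * U[t] for t, p in move[s].items()),  # Move
                               U[s])                                       # Stay
         for s in "ABC"}
print({s: round(u, 2) for s, u in U.items()})
# All three states end up with non-zero utilities after solving the model,
# instead of waiting for reward to trickle back one state per trial.
```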

4. (2 pts) On the other hand, what are the disadvantages of ADP based approaches (compared
to Q-learning)?

• ADP is global. If there are lots of states then the cost of these updates will be very large.
• If we use ADP to learn the utility/value function directly (instead of the Q-function),
we must also learn the transition model so that we can compute the action with the best
expected utility.

11 Variable Elimination
a. 5
b. B, C, D, E, F, A, G

12 Parameter Estimation
a. 2/3
b. 3/5

13 Decision Theory
a. ((ski (.08 win not-broken 100) (.02 win broken 50) (.72 not-win not-broken 0) (.18 not-win broken -50))
    (not-ski (.2 broken -10) (.8 not-broken 0)))
b. U(ski) = 8 + 1 + 0 + -9 = 0; U(not-ski) = -2; so we ski!
c. Given perfect info about my leg, we have the tree
   ((0.2 broken ((ski (.1 win 50) (.9 not-win -50)) (not-ski -10)))
    (0.8 not-broken ((ski (.1 win 100) (.9 not-win 0)) (not-ski 0))))
   which evaluates to ((0.2 broken ((ski -40) (not-ski -10))) (0.8 not-broken ((ski 10) (not-ski 0))))
   and then to ((0.2 broken -10) (0.8 not-broken 10)).
   With perfect information I have expected utility -2 + 8 = 6, so the expected value of
   perfect info is 6 - 0 = 6.
d. Given perfect info about winning the race, we have the tree
   ((0.1 win ((ski (.2 broken 50) (.8 not-broken 100)) (not-ski (.2 broken -10) (.8 not-broken 0))))
    (0.9 not-win ((ski (.2 broken -50) (.8 not-broken 0)) (not-ski (.2 broken -10) (.8 not-broken 0)))))
   which evaluates to ((0.1 win ((ski 90) (not-ski -2))) (0.9 not-win ((ski -10) (not-ski -2))))
   and then to ((0.1 win 90) (0.9 not-win -2)) = 9 - 1.8 = 7.2, so the expected value of
   perfect info is 7.2 - 0 = 7.2.
e. Yes. You just put the win branch after the broken branch, and use the conditional
   probabilities for win given broken.
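As a quick numerical check of (b)-(d), a small sketch (ours) that reads the probabilities and utilities off the trees above, treating win and broken as independent as the trees imply:

```python
# Numerical check of the ski decision (probabilities and utilities read off
# the trees above; win and broken are treated as independent).
p_win, p_broken = 0.1, 0.2

U_SKI = {(True, False): 100, (True, True): 50,     # (win, broken) -> utility
         (False, False): 0,  (False, True): -50}

def eu(p_w, p_b):
    """Expected utility of (ski, not-ski) given P(win) and P(broken)."""
    ski = sum(pw * pb * U_SKI[(w, b)]
              for w, pw in ((True, p_w), (False, 1 - p_w))
              for b, pb in ((True, p_b), (False, 1 - p_b)))
    not_ski = p_b * (-10) + (1 - p_b) * 0
    return ski, not_ski

print(eu(p_win, p_broken))        # ~(0, -2): ski, as in (b)

# (c) perfect information about the leg: observe broken, then act optimally.
eu_leg = (p_broken * max(eu(p_win, 1.0)) +
          (1 - p_broken) * max(eu(p_win, 0.0)))
print(eu_leg, eu_leg - 0)         # ~6.0, so EVPI = 6 as in (c)

# (d) perfect information about winning: observe win, then act optimally.
eu_win = (p_win * max(eu(1.0, p_broken)) +
          (1 - p_win) * max(eu(0.0, p_broken)))
print(eu_win, eu_win - 0)         # ~7.2, so EVPI = 7.2 as in (d)
```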

14 Markov Decision Processes


a. V(s1) = .9 * .9 * 5.5 = 4.455
b. V(s2) = 5.5
c. V(s3) = 4.5
d. V(s4) = 0
e. V(s5) = 10

Problem 3 Reinforcement Learning

Deterministic world

Part A

You were to fill in a table as follows:

           Initial        Action: West   Action: East   Action: East   Action: West
           (state: MILD)  (new: HOT)     (new: MILD)    (new: COLD)    (new: MILD)
           East  West     East  West     East  West     East  West     East  West
HOT         0     0        0     0        5     0        5     0        5     0
MILD        0     0        0    10        0    10        0    10        0    10
COLD        0     0        0     0        0     0        0     0        0    -5

Part B

You were to determine the number of possible policies. There are 2^3 = 8 possible policies,
because there are 2 actions available in each of the 3 states.

Part C

You were to explain why the policy π(s) = West, for all states, is better than the policy
π(s) = East, for all states.

The policy π(s) = West, for all states, is better than the policy π(s) = East, for all states, because the value
of at least one state, in particular the state HOT, is higher for that policy.

Let π1(s) = West, for all states, and γ = 0.5. Then:

• V^π1(HOT) = 10 + γ V^π1(HOT) = 20.

Let π2(s) = East, for all states, and γ = 0.5. Then:

• V^π2(COLD) = −10 + γ V^π2(COLD) = −20,
• V^π2(MILD) = 0 + γ V^π2(COLD) = −10,
• V^π2(HOT) = 0 + γ V^π2(MILD) = −5.
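These recursions can be checked directly; a minimal sketch using only the equations above (the full transition and reward tables of the original problem are not reproduced here):

```python
# Solving the one-line recursions from Part C with gamma = 0.5.
gamma = 0.5

# pi1 = West everywhere: V(HOT) = 10 + gamma * V(HOT)
v1_hot = 10 / (1 - gamma)
print(v1_hot)                         # 20.0

# pi2 = East everywhere:
v2_cold = -10 / (1 - gamma)           # V(COLD) = -10 + gamma * V(COLD)
v2_mild = 0 + gamma * v2_cold         # V(MILD) = 0 + gamma * V(COLD)
v2_hot  = 0 + gamma * v2_mild         # V(HOT)  = 0 + gamma * V(MILD)
print(v2_cold, v2_mild, v2_hot)       # -20.0 -10.0 -5.0
```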

Nondeterministic world

Part D

D.1

You were to compute the optimal values of each state, namely V*(S1), V*(S2), V*(S3), V*(S4),
according to a given policy.

Remember that γ = 0.9.

V*(S2) = r(S2, D) + 0.9 (1.0 · V*(S2))
V*(S2) = 100 + 0.9 V*(S2)
V*(S2) = 100 / (1 − 0.9)
V*(S2) = 1000.

V*(S1) = r(S1, D) + 0.9 (1.0 · V*(S2))
V*(S1) = 0 + 0.9 × 1000
V*(S1) = 900.

V*(S3) = r(S3, D) + 0.9 (0.9 V*(S2) + 0.1 V*(S3))
V*(S3) = 0 + 0.9 (0.9 × 1000 + 0.1 V*(S3))
V*(S3) = 810 + 0.09 V*(S3)
V*(S3) = 810 / (1 − 0.09)
V*(S3) = 81000/91 ≈ 890.1.

V*(S4) = r(S4, D) + 0.9 (0.9 V*(S2) + 0.1 V*(S4))
V*(S4) = 10 + 0.9 (0.9 × 1000 + 0.1 V*(S4))
V*(S4) = 82000/91 ≈ 901.1.

D.2

You were to determine the Q-value, Q(S2,R).

Q(S2, R) = r(S2, R) + 0.9 (0.9 V*(S1) + 0.1 V*(S2))
Q(S2, R) = 100 + 0.9 (0.9 × 900 + 0.1 × 1000)
Q(S2, R) = 100 + 0.9 (810 + 100)
Q(S2, R) = 100 + 0.9 × 910
Q(S2, R) = 919.
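A small numerical check (ours) of the D.1 fixed points and the D.2 Q-value, iterating the equations above:

```python
# Iterating the D.1 equations (gamma = 0.9) to their fixed points, then the
# D.2 Q-value.
v1 = v2 = v3 = v4 = 0.0
for _ in range(500):
    v2 = 100 + 0.9 * (1.0 * v2)               # V*(S2)
    v1 = 0   + 0.9 * (1.0 * v2)               # V*(S1)
    v3 = 0   + 0.9 * (0.9 * v2 + 0.1 * v3)    # V*(S3)
    v4 = 10  + 0.9 * (0.9 * v2 + 0.1 * v4)    # V*(S4)
print(round(v2), round(v1), round(v3, 1), round(v4, 1))
# 1000 900 890.1 901.1   (81000/91 ≈ 890.1, 82000/91 ≈ 901.1)

q_s2_r = 100 + 0.9 * (0.9 * v1 + 0.1 * v2)    # Q(S2, R)
print(round(q_s2_r))                          # 919
```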

Part E

E.1

You were to determine the problem with a Q-learning agent that always takes the action whose estimated
Q-value is currently the highest.

The agent will probably not learn the optimal policy: without exploring, it commits to the first
policy it happens to find.

E.2

You were to determine the problem with a Q-learning agent that ignores its current estimates of Q in order
to explore everywhere.

The agent will take a very long time to converge to the optimal policy and, because it never
exploits what it has learned, it will not improve its performance while it is actively learning.

Part F

You were to consider the Q-learning training rule for a nondeterministic Markov decision process:

Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ],

where α_n = 1 / (1 + visits_n(s, a)), and visits_n(s, a) is the number of times the state–action
pair (s, a) has been visited up to and including iteration n.

Then you were to answer with True (T) and False (F):

• αn decreases as the number of times the learner visits the transition s, a increases.
ANSWER: True.

• The weighted sum through αn makes the Q-values oscillate as a function of the nondeterministic
transition and therefore not converge.
ANSWER: False.

• If the world is deterministic, the Q-learning training rule given above converges to the same
result as the specific rule for deterministic worlds.
ANSWER: True.
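For concreteness, a sketch (ours) of this training rule on a made-up nondeterministic transition, showing α_n shrinking as visits accumulate; the state names and probabilities here are illustrative assumptions only:

```python
import random

# Sketch of the Part F training rule with the decaying learning rate
# alpha_n = 1 / (1 + visits_n(s, a)). The tiny environment below (one state
# "X" whose action "go" succeeds with probability 0.8) is a made-up stand-in,
# not the MDP from the earlier parts.
gamma = 0.9
Q, visits = {}, {}

def update(s, a, r, s2, actions):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return alpha

random.seed(0)
for n in range(1, 6):
    if random.random() < 0.8:                     # nondeterministic outcome
        alpha = update("X", "go", 1.0, "Y", ["go"])
    else:
        alpha = update("X", "go", 0.0, "X", ["go"])
    print(n, round(alpha, 3), round(Q[("X", "go")], 3))
# alpha falls as 1/2, 1/3, 1/4, ... with each visit to (X, go), so the noisy
# targets are averaged rather than oscillating, which is why the rule can
# converge even though individual transitions are random.
```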

Problem 4 Odds and Ends

Circle T or F, for True or False. It will pay to guess if you are not sure. Roughly, you get no points until
you have about half right; you do not have to get them all right to get full credit.

Parents(Speed(Camaro1)) = {EngineSize(Camaro1), Mood(John)}
Parents(Speed(Pinto2)) = {EngineSize(Pinto2), Mood(Mary)}

Parents(Mood(John)) = {BankBalance(John)}
Parents(Mood(Mary)) = {BankBalance(Mary)}

Parents(BankBalance(John)) = {Employer(John)}
Parents(BankBalance(Mary)) = {Employer(Mary)}

(c) (2 pts) Say what would have to change in your model if the mood of the car’s owner
also depended on how comfortable the car’s seats were.

Parents(Mood(person)) = {BankBalance(person), Seats(CarOf(person))}

(d) (1 pt) Are the speeds of Camaro1 and Pinto2 independent, assuming we don’t know who
John and Mary work for?

Yes

(e) (1 pt) Are the speeds of Camaro1 and Pinto2 independent given that John and Mary
both work for Yoyodyne?

No

4. (10 points)
Consider a house-cleaning robot. It can be either in the living room or at its charging station.
The living room can be clean or dirty. So there are four states: LD (in the living room, dirty),
LC (in the living room, clean), CD (at the charger, dirty), and CC (at the charger, clean).
The robot can either choose to suck up dirt or return to its charger. Reward for being in the
charging station when the living room is clean is 0; reward for being in the charging station
when the living room is dirty is -10; reward for other states is -1. Assume also that after the
robot has gotten a -10 penalty for entering the charging station when the living room is still
dirty, it will get rewards of 0 thereafter, no matter what it does.
Assume that if the robot decides to suck up dirt while it is in the living room, then the
probability of going from a dirty to a clean floor is 0.5. The return action always takes the
robot to the charging station, leaving the dirtiness of the room unchanged. The discount
factor is 0.8.

(a) (1 pt) What is V ∗ (CC) (the value of being in the CC state)?

0
(b) (1 pt) What is V ∗ (CD) ?

-10

(c) (2 pts) Write the Bellman equation for V ∗ (LC).

V*(LC) = −1 + 0.8 max_a Σ_s T(LC, a, s) V(s)

(d) (2 pts) What is the value of V ∗ (LC)?

-1

(e) (2 pts) Write the Bellman equation for V ∗ (LD) and simplify it as much as possible.

V*(LD) = −1 + 0.8 max_a Σ_s T(LD, a, s) V(s)
V*(LD) = −1 + 0.8 max{ goBack, suckUp }
V*(LD) = −1 + 0.8 max{ V(CD), 0.5 V(LC) + 0.5 V(LD) }
V*(LD) = −1 + 0.8 max{ −10, 0.5 · (−1) + 0.5 V(LD) }
V*(LD) = −1 + 0.8 max{ −10, −0.5 + 0.5 V(LD) }

(f) (2 pts) If V0 (LD) = 0 (that is, the initial value assigned to this state is 0), what is
V1 (LD), the value of LD with one step to go (computed via one iteration of value
iteration)?

V0(LD) = 0
V1(LD) = −1 + 0.8 max{ −10, −0.5 + 0.5 V0(LD) }
V1(LD) = −1 + 0.8 max{ −10, −0.5 }
V1(LD) = −1 + 0.8 · (−0.5)
V1(LD) = −1 − 0.4
V1(LD) = −1.4
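A small value-iteration sketch (ours) over the four states confirms these numbers, treating CD and CC as absorbing per the problem statement:

```python
# Value iteration for the cleaning-robot MDP (gamma = 0.8).
# States: LD, LC, CD, CC. CD and CC are treated as absorbing: the -10 for
# entering CD is paid once and reward is 0 thereafter, so V(CD) = -10 and
# V(CC) = 0 are held fixed.
gamma = 0.8
V = {"LD": 0.0, "LC": 0.0, "CD": -10.0, "CC": 0.0}

for i in range(100):
    V["LC"] = -1 + gamma * max(V["CC"],                        # return to charger
                               V["LC"])                        # suck (room already clean)
    V["LD"] = -1 + gamma * max(V["CD"],                        # return while dirty
                               0.5 * V["LC"] + 0.5 * V["LD"])  # suck: cleans w.p. 0.5
    if i == 0:
        print("after one sweep:", round(V["LD"], 2))           # -1.4, matching part (f)
print({s: round(v, 2) for s, v in V.items()})
# {'LD': -2.33, 'LC': -1.0, 'CD': -10.0, 'CC': 0.0}
```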

3. (8 points) True or False?

(a) The resolution/refutation proof strategy for propositional logic runs in polynomial
time in the size of the input.
Solution:
False. The result of converting to clausal form alone can be exponentially larger than the input.

(b) GraphPlan runs in time polynomial in plan depth.


Solution:
False. While creating the plan graph takes time polynomial in its size, searching the graph for a plan can still take exponential time in the worst case.

(c) Iterative deepening requires less space than breadth-first search.


Solution:
True. Iterative deepening uses only as much space as depth-first search.

(d) In first-order logic, if a formula is entailed by a theory, it can always be proven using
the resolution/refutation proof strategy.
Solution:
True, resolution-refutation is a complete proof procedure for first-order logic.

4. (30 points) A robot has to deliver identical packages to locations A, B, and C, in an office
environment. Assume it starts off holding all three packages. The environment is represented
as a grid of squares, some of which are free (so the robot can move into them) and some of
which are occupied (by walls, doors, etc.). The robot can move into neighboring squares,
and can pick up and drop packages if they are in the same square as the robot.

(a) (4 points) Formulate this problem as a search problem, specifying the state space, action
space, goal test, and cost function.
Solution:
The state space needs to include enough information so that, by looking at the current values of
the state features, the robot knows what it needs to do. For this task, the robot needs to be able to
determine its position in the grid and which packages it has already delivered.
• State: { (x,y), deliveredA, deliveredB, deliveredC }
• Action space: MoveN, MoveE, MoveS, MoveW, DropA, DropB, DropC.
• Goal test: { (x,y), delA, delB, delC } = { (any x, any y), 1, 1, 1 }
• Cost function: cost of 1 for each action taken.

Common Mistakes:
• Including only the number of packages in the state space: this is insufficient because,
if we know we are holding 2 packages, we don’t know if, for example, we’ve already
been to A, or if we should try to go there next.
• Specifying a goal test that couldn’t be derived from the state space that was described.
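A minimal sketch of this formulation in Python (ours; the obstacle set and drop-off coordinates are placeholder assumptions, not given in the problem):

```python
# Sketch of the delivery search problem from part (a).
# State: (x, y, deliveredA, deliveredB, deliveredC); every action costs 1.
from typing import Iterator, Tuple

State = Tuple[int, int, bool, bool, bool]
OCCUPIED = {(1, 1)}                                  # placeholder obstacle set
DROPOFF = {"A": (0, 2), "B": (3, 0), "C": (4, 4)}    # placeholder coordinates

def goal_test(s: State) -> bool:
    return s[2] and s[3] and s[4]        # delivered to A, B, and C

def successors(s: State) -> Iterator[Tuple[str, State, int]]:
    x, y, dA, dB, dC = s
    for name, (dx, dy) in {"MoveN": (0, 1), "MoveS": (0, -1),
                           "MoveE": (1, 0), "MoveW": (-1, 0)}.items():
        if (x + dx, y + dy) not in OCCUPIED:
            yield name, (x + dx, y + dy, dA, dB, dC), 1
    for pkg, delivered in (("A", dA), ("B", dB), ("C", dC)):
        if not delivered and (x, y) == DROPOFF[pkg]:
            new = {"A": (True, dB, dC), "B": (dA, True, dC), "C": (dA, dB, True)}[pkg]
            yield "Drop" + pkg, (x, y, *new), 1
```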

(b) (3 points) Give a non-trivial, polynomial-time, admissible heuristic for this domain, or
argue that there isn’t one.
Solution:
There were many acceptable solutions. Some examples:
• Straight-line or Manhattan distance to the farthest unvisited drop-off location.
• Straight-line or Manhattan distance to the nearest unvisited drop-off location (a slightly
looser underestimate than the previous one).
• Number of packages left to deliver. (Not as “good” as the above two, but still
acceptable.)
Common Mistakes:
• Trying to solve the Travelling Salesman Problem: in the general case, this is not
polynomial in the number of drop-off locations.
• Summing the distances between the current robot location and all unvisited drop-off
locations: this can overestimate the distance.
• Summing up the perimeter of the convex hull of the robot location and drop-off
locations: this can overestimate the distance.
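As a concrete sketch of the first acceptable heuristic above (Manhattan distance to the farthest unvisited drop-off location), reusing the placeholder State and DROPOFF from the earlier sketch:

```python
def h_farthest(s: State) -> int:
    """Manhattan distance to the farthest still-undelivered drop-off location.

    Admissible: the robot must eventually reach that square, and every move
    covers at most one unit of Manhattan distance.
    """
    x, y, dA, dB, dC = s
    remaining = [DROPOFF[p] for p, done in (("A", dA), ("B", dB), ("C", dC)) if not done]
    return max((abs(x - px) + abs(y - py) for px, py in remaining), default=0)
```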

(c) (3 points) Package A has to be delivered to the boss, and it’s important that it be done
quickly. But we should deliver the other packages promptly as well. How could we
encode this in the problem?
Solution:
The idea of putting “pressure” on delivering to the boss is correctly handled in the cost
function. We want to emphasize delivery to A, but we don’t want to ignore B if we happen to
pass by it. Please note that to encode something into the “problem,” it must be encoded into
one of the problem components: the state space, action space, goal test, or cost function. For
example: while we haven’t delivered to A, each action costs 2; after that, each action costs 1.
Common Mistakes:
• Decreasing the cost instead of increasing the cost.
• Constraining the order of the problem; requiring delivery to A before considering
the other two locations at all.
• Modifying the search heuristic instead of the cost function.

(d) (3 points) Now, consider the case where the robot doesn’t start with the packages, but
it has to pick up a package from location 1 to deliver to location A, a package from
location 2 to deliver to B, and from 3 to deliver to C. What is an appropriate state space
for this problem?
Solution:
Again, multiple solutions were possible. One example:
• State: { (x,y), deliveredA, deliveredB, deliveredC, holdingA, holdingB, holdingC }
Common Mistakes:
• Again, just the number of packages and the number of visited locations are insufficient.

(e) (3 points) What is a good admissible heuristic?


Solution:
• The same heuristic as before. (Not so “good,” but acceptable.)
• The maximum, over undelivered packages i, of (the distance from the robot’s location to
the unvisited pickup location s_i, plus the distance from s_i to the corresponding delivery location).

(f) (4 points) This problem can also be treated as a planning problem.


Describe the effects of the action of moving north using situation calculus axioms. Assume
that we are representing the location of the robot using integer coordinates for its x, y
location in the grid, and that we can name the next integer coordinate greater than x as
x + 1. Additionally, assume that we have a predicate occupied that applies to x, y pairs,
indicating which locations the robot may not pass through.
Solution:

∀x, y, s. atrobot(x, y, s) ∧ ¬occupied(x, y + 1) → atrobot(x, y + 1, result(moveNorth, s))

A little variability in this answer was acceptable, but the idea of result as a function from
actions and situations to situations was crucial.

(g) (3 points) Describe what frame axioms would be necessary for this domain (but don’t
write them down).
Solution:
• When the robot moves, the positions of the (non-held) packages and the delivery
locations don’t change; which packages are held doesn’t change.
• When the robot picks up or drops off a package, the robot’s location doesn’t change;
the locations of packages and delivery locations don’t change.

Some people mentioned that which squares are occupied doesn’t change. Because these
facts never change (unless we have moving obstacles), it would probably be best to
model them without a situation argument, thereby obviating the need for frame axioms
for occupied.

(h) (3 points) Either provide a description of the move operation in the STRIPS language,
or say why it is difficult to do so.
Solution:
There were a couple of different acceptable answers here. The most common one was
some variation on :
move(x,y):
Pre: at(x), ¬ occupied(y), neighbor(x,y)
Effect: ¬ at(x), at(y)
This requires some mention of the fact that it might be a big pain to specify the neighbor
relation.
Another answer was
moveNorth(x,y):
Pre: at(x,y), ¬ occupied(x,y+1)
Effect: ¬ at(x,y), at(x,y+1)
This requires some mention, at least, of the fact that you’d need 4 operators. I gave
full credit for this, but, in general, it’s not legal to use functions in STRIPS, so you can’t
write “y+1” in the effects.
One more answer, saying that it was too hard, because you’d need to have a specific
operator for each possible location (due to the fact that you can’t have functions and/or
that the “neighbor” relation would be hard to specify) was also okay.

(i) (2 points) When does it make sense to treat a problem as a planning problem rather
than as a basic state-space search problem?
Solution:
There are a lot of decent answers here. I was looking for at least a couple of the following
points.
• The initial state is not known exactly (so you have to search over the space of sets of
states).
• There is a direct connection between actions and effects that can be heuristically
exploited.
• The goal is described conjunctively, so solving the subgoals individually might lead to
a plan more efficiently.
• The domain is described using a factored, logical representation (this really contributes
to all of the above points).

• There is not a big concern with minimizing cost (because planning methods usually
seek any path to the goal).
Saying “when the state space is too big” was not sufficient.

(j) (2 points) Should this problem be treated as a planning problem? Explain why or why
not.
Solution:
I accepted a lot of answers here, mostly dependent on how cogently they were argued
in the context of the answer to the previous problem.
I actually think that treating this as a planning problem won’t be much help because:
the initial state is known exactly, the subgoals aren’t particularly separable (you can’t
solve the problem of going to location A without knowing where you’re going to be
when you start), and we have a strong desire for minimizing path cost.
