
3 MDP (12 points)

1. (3 pts) In MDPs, the values of states are related by the Bellman equation:

U(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) U(s')

where R(s) is the reward associated with being in state s. Suppose now we wish the reward to
depend on actions; i.e. R(a, s) is the reward for doing a in state s. How should the Bellman
equation be rewritten to use R(a, s) instead of R(s)?

U(s) = max_a [ R(a, s) + γ Σ_{s'} P(s'|s, a) U(s') ]

2. (9 pts) Can any search problem with a finite number of states be translated into a Markov
decision problem, such that an optimal solution of the latter is also an optimal solution of
the former? If so, explain precisely how to translate the problem AND how to translate the
solution back; illustrate your answer on a 3 state search problem of your own choosing. If not
give a counterexample.
Yes, a finite search problem is a deterministic MDP:

• states in the search problem are states in the MDP;
• if Sj is a successor of Si in the search problem, call this action a_{i,j} and set P(Sj | Si, a_{i,j}) = 1; all other transition probabilities are 0;
• R(a_{i,j}, Si) = −cost(Si, Sj);
• goal nodes in the search problem become terminal states with no available actions.

To translate an optimal policy back to a search path, simply start at the start state and follow
the optimal action until the goal is reached.
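To illustrate on a small 3-state search problem of our own choosing (the state names, costs, and goal below are assumptions made for this example, not given in the original), here is a minimal sketch of the translation:

```python
# A minimal sketch of the search-problem -> deterministic-MDP translation,
# illustrated on a made-up 3-state search problem (states, costs, and goal
# are assumptions chosen for illustration only).

# Search problem: S1 is the start, S3 is the goal.
successors = {          # state -> list of (successor, step cost)
    "S1": [("S2", 2), ("S3", 5)],
    "S2": [("S3", 1)],
    "S3": [],            # goal: terminal, no successors
}

# Translation to a deterministic MDP.
P = {}      # P[(s, a)] = {s': prob}
R = {}      # R[(a, s)] = reward for doing a in s
for s, succs in successors.items():
    for s2, cost in succs:
        a = ("move", s, s2)          # action a_{i,j}
        P[(s, a)] = {s2: 1.0}        # deterministic transition
        R[(a, s)] = -cost            # reward = negative step cost

# Translating an optimal policy back into a search path:
def policy_to_path(policy, start="S1", goal="S3"):
    path, s = [start], start
    while s != goal:
        a = policy[s]                # follow the optimal action
        s = a[2]                     # deterministic next state
        path.append(s)
    return path

# e.g. the optimal policy here is S1 -> S2 -> S3 (total cost 3 < 5):
print(policy_to_path({"S1": ("move", "S1", "S2"), "S2": ("move", "S2", "S3")}))
```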

4 Reinforcement Learning (13 points)
Consider an MDP with three states, called A, B and C, arranged in a loop.
[Figure: three states A, B, C arranged in a loop (A → B → C → A). Each Move transition to the next state has probability 0.8, with a 0.2 self-loop at each state; R(C) = 1.]
There are two actions available in each state:

• Move: with probability 0.8, moves to the next state in the loop; with probability 0.2,
stays in the same state.
• Stay: with probability 1.0, stays in the same state.

There is a reward of 1 in state C and zero reward elsewhere. The agent starts in state A.
Assume that the discount factor is 0.9, that is, γ = 0.9.

1. (6 pts) Show the values of Q(a, s) for 3 iterations of the TD Q-learning algorithm (equation
21.8 in Russell & Norvig):

Q(a, s) ← Q(a, s) + α ( R(s) + γ max_{a'} Q(a', s') − Q(a, s) )
Let α = 1; note the simplification that follows: the update becomes Q(a, s) ← R(s) + γ max_{a'} Q(a', s'). Assume we always pick the Move
action and end up moving to the adjacent state. That is, we see a state-action sequence: A,
Move, B, Move, C, Move, A. The Q values start out as 0.

This wording is a bit ambiguous. If we display the values after each action within a single trial, we
get:
iter=0 iter=1 iter=2 iter=3
Q(Move,A) 0 0 0 0
Q(Stay,A) 0 0 0 0
Q(Move,B) 0 0 0 0
Q(Stay,B) 0 0 0 0
Q(Move,C) 0 0 0 1
Q(Stay,C) 0 0 0 0
If we think about the 3 actions as a trial and show what happens after each full trial, we get:
iter=0 iter=1 iter=2 iter=3
Q(Move,A) 0 0 0 0.81
Q(Stay,A) 0 0 0 0
Q(Move,B) 0 0 0.9 0.9
Q(Stay,B) 0 0 0 0
Q(Move,C) 0 1 1 1
Q(Stay,C) 0 0 0 0
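As a sanity check, here is a small Python sketch (ours, not part of the original solution) that replays the first two trials of the trial-based reading with α = 1 and reproduces the corresponding columns of the second table:

```python
# Replays TD Q-learning with alpha = 1 on the A-B-C loop: the agent always
# chooses Move and always lands in the next state. With alpha = 1 the update
# simplifies to  Q(Move, s) <- R(s) + gamma * max_a' Q(a', s').
gamma = 0.9
R = {"A": 0.0, "B": 0.0, "C": 1.0}
Q = {(a, s): 0.0 for a in ("Move", "Stay") for s in "ABC"}
nxt = {"A": "B", "B": "C", "C": "A"}

for trial in (1, 2):
    for s in "ABC":          # observed sequence A, Move, B, Move, C, Move, A
        s2 = nxt[s]
        Q[("Move", s)] = R[s] + gamma * max(Q[(a, s2)] for a in ("Move", "Stay"))
    print(f"after trial {trial}:", {s: round(Q[("Move", s)], 2) for s in "ABC"})
# after trial 1: {'A': 0.0, 'B': 0.0, 'C': 1.0}
# after trial 2: {'A': 0.0, 'B': 0.9, 'C': 1.0}
# A third trial backs the value up one more step: Q(Move, A) becomes
# 0.9 * Q(Move, B) = 0.81, as in the last column of the table above.
```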

2. (2 pts) Characterize the weakness of Q-learning demonstrated by this example. Hint. Imagine
that the chain were 100 states long.
The Q-learning updates are very local: each update propagates value back only one step per
trial. In a chain of 100 states we would have to go around roughly 100 times before every
state had a non-zero Q-value.

3. (3 pts) Why might a solution based on ADP (adaptive dynamic programming) be better than
Q-learning?
ADP updates all the states to make optimal use of whatever information we have discovered.
So, for example, all states leading to a good state would have their values increased.
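As an illustration (a minimal sketch of the ADP idea, ours rather than part of the original solution): once the agent has estimated a model of the loop MDP above, it can re-solve that model after each new observation, e.g. by value iteration, so the reward discovered at C raises the values of A and B in a single planning step rather than one trial at a time. For brevity the "learned" model below is simply the true Move/Stay model, assuming enough transitions have been observed.

```python
# ADP-style step for the A-B-C loop: solve the (estimated) model with value
# iteration. Here the "learned" model is just the true model of the Move and
# Stay actions, i.e. we assume every transition has been observed.
gamma = 0.9
R = {"A": 0.0, "B": 0.0, "C": 1.0}
move = {"A": {"B": 0.8, "A": 0.2},      # estimated P(s' | s, Move)
        "B": {"C": 0.8, "B": 0.2},
        "C": {"A": 0.8, "C": 0.2}}

U = {s: 0.0 for s in "ABC"}
for _ in range(200):                    # value iteration on the learned model
    U = {s: R[s] + gamma * max(sum(p * U[t] for t, p in move[s].items()),  # Move
                               U[s])                                       # Stay
         for s in "ABC"}
print({s: round(u, 2) for s, u in U.items()})
# All three states end up with non-zero utilities after solving the model,
# instead of waiting for reward to trickle back one state per trial.
```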

4. (2 pts) On the other hand, what are the disadvantages of ADP based approaches (compared
to Q-learning)?

• ADP is global. If there are lots of states then the cost of these updates will be very large.
• If we use ADP to learn the utility/value function directly (instead of the Q-function),
we must also learn the transition model so that we can compute the action with the best
expected utility.

11 Variable Elimination
a. 5
b. B, C, D, E, F, A, G

12 Parameter Estimation
a. 2/3
b. 3/5

13 Decision Theory
a. ((ski (.08 win not-broken 100) (.02 win broken 50) (.72 not-win not-broken 0) (.18 not-win broken -50))
    (not-ski (.2 broken -10) (.8 not-broken 0)))
b. U(ski) = 8 + 1 + 0 + -9 = 0; U(not-ski) = -2; so we ski!
c. Given perfect info about my leg, we have the tree
   ((0.2 broken ((ski (.1 win 50) (.9 not-win -50)) (not-ski -10)))
    (0.8 not-broken ((ski (.1 win 100) (.9 not-win 0)) (not-ski 0))))
   which evaluates to ((0.2 broken ((ski -40) (not-ski -10))) (0.8 not-broken ((ski 10) (not-ski 0))))
   and then to ((0.2 broken -10) (0.8 not-broken 10)).
   With perfect information I have expected utility -2 + 8 = 6, so the expected value of
   perfect info is 6 - 0 = 6.
d. Given perfect info about winning the race, we have the tree
   ((0.1 win ((ski (.2 broken 50) (.8 not-broken 100)) (not-ski (.2 broken -10) (.8 not-broken 0))))
    (0.9 not-win ((ski (.2 broken -50) (.8 not-broken 0)) (not-ski (.2 broken -10) (.8 not-broken 0)))))
   which evaluates to ((0.1 win ((ski 90) (not-ski -2))) (0.9 not-win ((ski -10) (not-ski -2))))
   and then to ((0.1 win 90) (0.9 not-win -2)) = 9 - 1.8 = 7.2, so the expected value of
   perfect info is 7.2 - 0 = 7.2.
e. Yes. You just put the win branch after the broken branch, and use the conditional
   probabilities for win given broken.
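As a quick numerical check of (b)-(d), a small sketch (ours) that reads the probabilities and utilities off the trees above, treating win and broken as independent as the trees imply:

```python
# Numerical check of the ski decision (probabilities and utilities read off
# the trees above; win and broken are treated as independent).
p_win, p_broken = 0.1, 0.2

U_SKI = {(True, False): 100, (True, True): 50,     # (win, broken) -> utility
         (False, False): 0,  (False, True): -50}

def eu(p_w, p_b):
    """Expected utility of (ski, not-ski) given P(win) and P(broken)."""
    ski = sum(pw * pb * U_SKI[(w, b)]
              for w, pw in ((True, p_w), (False, 1 - p_w))
              for b, pb in ((True, p_b), (False, 1 - p_b)))
    not_ski = p_b * (-10) + (1 - p_b) * 0
    return ski, not_ski

print(eu(p_win, p_broken))        # ~(0, -2): ski, as in (b)

# (c) perfect information about the leg: observe broken, then act optimally.
eu_leg = (p_broken * max(eu(p_win, 1.0)) +
          (1 - p_broken) * max(eu(p_win, 0.0)))
print(eu_leg, eu_leg - 0)         # ~6.0, so EVPI = 6 as in (c)

# (d) perfect information about winning: observe win, then act optimally.
eu_win = (p_win * max(eu(1.0, p_broken)) +
          (1 - p_win) * max(eu(0.0, p_broken)))
print(eu_win, eu_win - 0)         # ~7.2, so EVPI = 7.2 as in (d)
```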

14 Markov Decision Processes


a. V(s1) = .9 * .9 * 5.5 = 4.455
b. V(s2) = 5.5
c. V(s3) = 4.5
d. V(s4) = 0
e. V(s5) = 10

Problem 3 Reinforcement Learning

Deterministic world

Part A

You were to fill in a table as follows:

           Initial        Action: West   Action: East   Action: East   Action: West
           (state: MILD)  (new: HOT)     (new: MILD)    (new: COLD)    (new: MILD)
           East  West     East  West     East  West     East  West     East  West
HOT         0     0        0     0        5     0        5     0        5     0
MILD        0     0        0    10        0    10        0    10        0    10
COLD        0     0        0     0        0     0        0     0        0    -5

Part B

You were to determine the number of possible policies. There are 2^3 = 8 possible policies,
because there are 2 actions available in each of the 3 states.

Part C

You were to explain why the policy π(s) = West, for all states, is better than the policy
π(s) = East, for all states.

The policy π(s) = West, for all states, is better than the policy π(s) = East, for all states, because the value
of at least one state, in particular the state HOT, is higher for that policy.

Let π1(s) = West, for all states, and γ = 0.5. Then:

• V^π1(HOT) = 10 + γ V^π1(HOT) = 20.

Let π2(s) = East, for all states, and γ = 0.5. Then:

• V^π2(COLD) = −10 + γ V^π2(COLD) = −20,
• V^π2(MILD) = 0 + γ V^π2(COLD) = −10,
• V^π2(HOT) = 0 + γ V^π2(MILD) = −5.
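These recursions can be checked directly; a minimal sketch using only the equations above (the full transition and reward tables of the original problem are not reproduced here):

```python
# Solving the one-line recursions from Part C with gamma = 0.5.
gamma = 0.5

# pi1 = West everywhere: V(HOT) = 10 + gamma * V(HOT)
v1_hot = 10 / (1 - gamma)
print(v1_hot)                         # 20.0

# pi2 = East everywhere:
v2_cold = -10 / (1 - gamma)           # V(COLD) = -10 + gamma * V(COLD)
v2_mild = 0 + gamma * v2_cold         # V(MILD) = 0 + gamma * V(COLD)
v2_hot  = 0 + gamma * v2_mild         # V(HOT)  = 0 + gamma * V(MILD)
print(v2_cold, v2_mild, v2_hot)       # -20.0 -10.0 -5.0
```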

Nondeterministic world

Part D

D.1

You were to compute the optimal values of each state, namely V*(S1), V*(S2), V*(S3), V*(S4),
according to a given policy.

Remember that γ = 0.9.

V*(S2) = r(S2, D) + 0.9 (1.0 · V*(S2))
V*(S2) = 100 + 0.9 V*(S2)
V*(S2) = 100 / (1 − 0.9)
V*(S2) = 1000.

V*(S1) = r(S1, D) + 0.9 (1.0 · V*(S2))
V*(S1) = 0 + 0.9 × 1000
V*(S1) = 900.

V*(S3) = r(S3, D) + 0.9 (0.9 V*(S2) + 0.1 V*(S3))
V*(S3) = 0 + 0.9 (0.9 × 1000 + 0.1 V*(S3))
V*(S3) = 810 + 0.09 V*(S3)
V*(S3) = 810 / (1 − 0.09)
V*(S3) = 81000/91 ≈ 890.1.

V*(S4) = r(S4, D) + 0.9 (0.9 V*(S2) + 0.1 V*(S4))
V*(S4) = 10 + 0.9 (0.9 × 1000 + 0.1 V*(S4))
V*(S4) = 82000/91 ≈ 901.1.

D.2

You were to determine the Q-value, Q(S2,R).

Q(S2, R) = r(S2, R) + 0.9 (0.9 V*(S1) + 0.1 V*(S2))
Q(S2, R) = 100 + 0.9 (0.9 × 900 + 0.1 × 1000)
Q(S2, R) = 100 + 0.9 (810 + 100)
Q(S2, R) = 100 + 0.9 × 910
Q(S2, R) = 919.
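A small numerical check (ours) of the D.1 fixed points and the D.2 Q-value, iterating the equations above:

```python
# Iterating the D.1 equations (gamma = 0.9) to their fixed points, then the
# D.2 Q-value.
v1 = v2 = v3 = v4 = 0.0
for _ in range(500):
    v2 = 100 + 0.9 * (1.0 * v2)               # V*(S2)
    v1 = 0   + 0.9 * (1.0 * v2)               # V*(S1)
    v3 = 0   + 0.9 * (0.9 * v2 + 0.1 * v3)    # V*(S3)
    v4 = 10  + 0.9 * (0.9 * v2 + 0.1 * v4)    # V*(S4)
print(round(v2), round(v1), round(v3, 1), round(v4, 1))
# 1000 900 890.1 901.1   (81000/91 ≈ 890.1, 82000/91 ≈ 901.1)

q_s2_r = 100 + 0.9 * (0.9 * v1 + 0.1 * v2)    # Q(S2, R)
print(round(q_s2_r))                          # 919
```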

Part E

E.1

You were to determine the problem with a Q-learning agent that always takes the action whose estimated
Q-value is currently the highest.

The agent will probably not learn the optimal policy: without exploring, it commits to the first
policy it happens to find.

E.2

You were to determine the problem with a Q-learning agent that ignores its current estimates of Q in order
to explore everywhere.

The agent will take a very long time to converge to the optimal policy and, because it never
exploits what it has learned, it will not improve its performance while it is actively learning.

Part F

You were to consider the Q-learning training rule for a nondeterministic Markov decision process:

Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ],

where α_n = 1 / (1 + visits_n(s, a)), and visits_n(s, a) is the number of times the state–action
pair (s, a) has been visited up to and including iteration n.

Then you were to answer with True (T) and False (F):

• αn decreases as the number of times the learner visits the transition s, a increases.
ANSWER: True.

• The weighted sum through αn makes the Q-values oscillate as a function of the nondeterministic
transition and therefore not converge.
ANSWER: False.

• If the world is deterministic, the Q-learning training rule given above converges to the same
result as the specific rule for deterministic worlds.
ANSWER: True.
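For concreteness, a sketch (ours) of this training rule on a made-up nondeterministic transition, showing α_n shrinking as visits accumulate; the state names and probabilities here are illustrative assumptions only:

```python
import random

# Sketch of the Part F training rule with the decaying learning rate
# alpha_n = 1 / (1 + visits_n(s, a)). The tiny environment below (one state
# "X" whose action "go" succeeds with probability 0.8) is a made-up stand-in,
# not the MDP from the earlier parts.
gamma = 0.9
Q, visits = {}, {}

def update(s, a, r, s2, actions):
    visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return alpha

random.seed(0)
for n in range(1, 6):
    if random.random() < 0.8:                     # nondeterministic outcome
        alpha = update("X", "go", 1.0, "Y", ["go"])
    else:
        alpha = update("X", "go", 0.0, "X", ["go"])
    print(n, round(alpha, 3), round(Q[("X", "go")], 3))
# alpha falls as 1/2, 1/3, 1/4, ... with each visit to (X, go), so the noisy
# targets are averaged rather than oscillating, which is why the rule can
# converge even though individual transitions are random.
```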

Problem 4 Odds and Ends

Circle T or F, for True or False. It will pay to guess if you are not sure. Roughly, you get no points until
you have about half right; you do not have to get them all right to get full credit.

Parents(Speed(Camaro1)) = {EngineSize(Camaro1), Mood(John)}
Parents(Speed(Pinto2)) = {EngineSize(Pinto2), Mood(Mary)}

Parents(Mood(John)) = {BankBalance(John)}
Parents(Mood(Mary)) = {BankBalance(Mary)}

Parents(BankBalance(John)) = {Employer(John)}
Parents(BankBalance(Mary)) = {Employer(Mary)}

(c) (2 pts) Say what would have to change in your model if the mood of the car’s owner
also depended on how comfortable the car’s seats were.

Parents(Mood(person)) = {BankBalance(person), Seats(CarOf(person))}

(d) (1 pt) Are the speeds of Camaro1 and Pinto2 independent, assuming we don’t know who
John and Mary work for?

Yes

(e) (1 pt) Are the speeds of Camaro1 and Pinto2 independent given that John and Mary
both work for Yoyodyne?

No

4. (10 points)
Consider a house-cleaning robot. It can be either in the living room or at its charging station.
The living room can be clean or dirty. So there are four states: LD (in the living room, dirty),
LC (in the living room, clean), CD (at the charger, dirty), and CC (at the charger, clean).
The robot can either choose to suck up dirt or return to its charger. Reward for being in the
charging station when the living room is clean is 0; reward for being in the charging station
when the living room is dirty is -10; reward for other states is -1. Assume also that after the
robot has gotten a -10 penalty for entering the charging station when the living room is still
dirty, it will get rewards of 0 thereafter, no matter what it does.
Assume that if the robot decides to suck up dirt while it is in the living room, then the
probability of going from a dirty to a clean floor is 0.5. The return action always takes the
robot to the charging station, leaving the dirtiness of the room unchanged. The discount
factor is 0.8.

(a) (1 pt) What is V ∗ (CC) (the value of being in the CC state)?

0
(b) (1 pt) What is V ∗ (CD) ?

-10

(c) (2 pts) Write the Bellman equation for V ∗ (LC).

V*(LC) = −1 + 0.8 max_a Σ_s T(LC, a, s) V(s)

(d) (2 pts) What is the value of V ∗ (LC)?

-1

(e) (2 pts) Write the Bellman equation for V ∗ (LD) and simplify it as much as possible.

V*(LD) = −1 + 0.8 max_a Σ_s T(LD, a, s) V(s)
V*(LD) = −1 + 0.8 max{ goBack, suckUp }
V*(LD) = −1 + 0.8 max{ V(CD), 0.5 V(LC) + 0.5 V(LD) }
V*(LD) = −1 + 0.8 max{ −10, 0.5 · (−1) + 0.5 V(LD) }
V*(LD) = −1 + 0.8 max{ −10, −0.5 + 0.5 V(LD) }

(f) (2 pts) If V0 (LD) = 0 (that is, the initial value assigned to this state is 0), what is
V1 (LD), the value of LD with one step to go (computed via one iteration of value
iteration)?

V0(LD) = 0
V1(LD) = −1 + 0.8 max{ −10, −0.5 + 0.5 V0(LD) }
V1(LD) = −1 + 0.8 max{ −10, −0.5 }
V1(LD) = −1 + 0.8 · (−0.5)
V1(LD) = −1 − 0.4
V1(LD) = −1.4
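A small value-iteration sketch (ours) over the four states confirms these numbers, treating CD and CC as absorbing per the problem statement:

```python
# Value iteration for the cleaning-robot MDP (gamma = 0.8).
# States: LD, LC, CD, CC. CD and CC are treated as absorbing: the -10 for
# entering CD is paid once and reward is 0 thereafter, so V(CD) = -10 and
# V(CC) = 0 are held fixed.
gamma = 0.8
V = {"LD": 0.0, "LC": 0.0, "CD": -10.0, "CC": 0.0}

for i in range(100):
    V["LC"] = -1 + gamma * max(V["CC"],                        # return to charger
                               V["LC"])                        # suck (room already clean)
    V["LD"] = -1 + gamma * max(V["CD"],                        # return while dirty
                               0.5 * V["LC"] + 0.5 * V["LD"])  # suck: cleans w.p. 0.5
    if i == 0:
        print("after one sweep:", round(V["LD"], 2))           # -1.4, matching part (f)
print({s: round(v, 2) for s, v in V.items()})
# {'LD': -2.33, 'LC': -1.0, 'CD': -10.0, 'CC': 0.0}
```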

3. (8 points) True or False?

(a) The resolution/refutation proof strategy for propositional logic runs in polynomial
time in the size of the input.
Solution:
False. The result of converting to clausal form alone can be exponentially larger than the input.

(b) GraphPlan runs in time polynomial in plan depth.


Solution:
False. While creating the plan graph takes time polynomial in its size, searching the graph for a plan can still take exponential time in the worst case.

(c) Iterative deepening requires less space than breadth-first search.


Solution:
True. Iterative deepening uses only as much space as depth-first search.

(d) In first-order logic, if a formula is entailed by a theory, it can always be proven using
the resolution/refutation proof strategy.
Solution:
True, resolution-refutation is a complete proof procedure for first-order logic.

4. (30 points) A robot has to deliver identical packages to locations A, B, and C, in an office
environment. Assume it starts off holding all three packages. The environment is represented
as a grid of squares, some of which are free (so the robot can move into them) and some of
which are occupied (by walls, doors, etc.). The robot can move into neighboring squares,
and can pick up and drop packages if they are in the same square as the robot.

(a) (4 points) Formulate this problem as a search problem, specifying the state space, action
space, goal test, and cost function.
Solution:
The state space needs to include enough information so that, by looking at the current values of
the state features, the robot knows what it needs to do. For this task, the robot needs to be able to
determine its position in the grid and which packages it has already delivered.
• State: { (x,y), deliveredA, deliveredB, deliveredC }
• Action space: MoveN, MoveE, MoveS, MoveW, DropA, DropB, DropC.
• Goal test: { (x,y), delA, delB, delC } = { (any x, any y), 1, 1, 1 }
• Cost function: cost of 1 for each action taken.

Common Mistakes:
• Including only the number of packages in the state space: this is insufficient because,
if we know we are holding 2 packages, we don’t know if, for example, we’ve already
been to A, or if we should try to go there next.
• Specifying a goal test that couldn’t be derived from the state space that was described.
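A minimal sketch of this formulation in Python (ours; the obstacle set and drop-off coordinates are placeholder assumptions, not given in the problem):

```python
# Sketch of the delivery search problem from part (a).
# State: (x, y, deliveredA, deliveredB, deliveredC); every action costs 1.
from typing import Iterator, Tuple

State = Tuple[int, int, bool, bool, bool]
OCCUPIED = {(1, 1)}                                  # placeholder obstacle set
DROPOFF = {"A": (0, 2), "B": (3, 0), "C": (4, 4)}    # placeholder coordinates

def goal_test(s: State) -> bool:
    return s[2] and s[3] and s[4]        # delivered to A, B, and C

def successors(s: State) -> Iterator[Tuple[str, State, int]]:
    x, y, dA, dB, dC = s
    for name, (dx, dy) in {"MoveN": (0, 1), "MoveS": (0, -1),
                           "MoveE": (1, 0), "MoveW": (-1, 0)}.items():
        if (x + dx, y + dy) not in OCCUPIED:
            yield name, (x + dx, y + dy, dA, dB, dC), 1
    for pkg, delivered in (("A", dA), ("B", dB), ("C", dC)):
        if not delivered and (x, y) == DROPOFF[pkg]:
            new = {"A": (True, dB, dC), "B": (dA, True, dC), "C": (dA, dB, True)}[pkg]
            yield "Drop" + pkg, (x, y, *new), 1
```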

(b) (3 points) Give a non-trivial, polynomial-time, admissible heuristic for this domain, or
argue that there isn’t one.
Solution:
There were many acceptable solutions. Some examples:
• Straight-line or Manhattan distance to the farthest unvisited drop-off location.
• Straight-line or Manhattan distance to the nearest unvisited drop-off location (a slightly
looser underestimate than the previous one).
• Number of packages left to deliver. (Not as “good” as the above two, but still
acceptable.)
Common Mistakes:
• Trying to solve the Travelling Salesman Problem: in the general case, this is not
polynomial in the number of drop-off locations.
• Summing the distances between the current robot location and all unvisited drop-off
locations: this can overestimate the distance.
• Summing up the perimeter of the convex hull of the robot location and drop-off
locations: this can overestimate the distance.
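As a concrete sketch of the first acceptable heuristic above (Manhattan distance to the farthest unvisited drop-off location), reusing the placeholder State and DROPOFF from the earlier sketch:

```python
def h_farthest(s: State) -> int:
    """Manhattan distance to the farthest still-undelivered drop-off location.

    Admissible: the robot must eventually reach that square, and every move
    covers at most one unit of Manhattan distance.
    """
    x, y, dA, dB, dC = s
    remaining = [DROPOFF[p] for p, done in (("A", dA), ("B", dB), ("C", dC)) if not done]
    return max((abs(x - px) + abs(y - py) for px, py in remaining), default=0)
```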

(c) (3 points) Package A has to be delivered to the boss, and it’s important that it be done
quickly. But we should deliver the other packages promptly as well. How could we
encode this in the problem?
Solution:
The idea of putting “pressure” on delivering to the boss is correctly handled in the cost
function. We want to emphasize delivery to A, but we don’t want to ignore B if we happen to
pass by it. Please note that to encode something into the “problem,” it must be encoded into
one of the problem components: the state space, action space, goal test, or cost function. For
example: while we haven’t delivered to A, each action costs 2; after that, each action costs 1.
Common Mistakes:
• Decreasing the cost instead of increasing the cost.
• Constraining the order of the problem; requiring delivery to A before considering
the other two locations at all.
• Modifying the search heuristic instead of the cost function.

(d) (3 points) Now, consider the case where the robot doesn’t start with the packages, but
it has to pick up a package from location 1 to deliver to location A, a package from
location 2 to deliver to B, and from 3 to deliver to C. What is an appropriate state space
for this problem?
Solution:
Again, multiple solutions were possible. One example:
• State: { (x,y), deliveredA, deliveredB, deliveredC, holdingA, holdingB, holdingC }
Common Mistakes:
• Again, just the number of packages and the number of visited locations are insufficient.

(e) (3 points) What is a good admissible heuristic?


Solution:
• The same heuristic as before. (Not so “good,” but acceptable.)
• The maximum, over undelivered packages i, of (the distance from the robot’s location to
the unvisited pickup location s_i, plus the distance from s_i to the corresponding delivery location).

(f) (4 points) This problem can also be treated as a planning problem.


Describe the effects of the action of moving north using situation calculus axioms. Assume
that we are representing the location of the robot using integer coordinates for its x, y
location in the grid, and that we can name the next integer coordinate greater than x as
x + 1. Additionally, assume that we have a predicate occupied that applies to x, y pairs,
indicating which locations the robot may not pass through.
Solution:

∀x, y, s. atrobot(x, y, s) ∧ ¬occupied(x, y + 1) → atrobot(x, y + 1, result(moveNorth, s))

A little variability in this answer was acceptable, but the idea of result as a function from
actions and situations to situations was crucial.

(g) (3 points) Describe what frame axioms would be necessary for this domain (but don’t
write them down).
Solution:
• When the robot moves, the positions of the (non-held) packages and the delivery
locations don’t change; which packages are held doesn’t change.
• When the robot picks up or drops off a package, the robot’s location doesn’t change;
the locations of packages and delivery locations don’t change.

Some people mentioned that which squares are occupied doesn’t change. Because these
facts never change (unless we have moving obstacles), it would probably be best to
model them without a situation argument, thereby obviating the need for frame axioms
for occupied.

(h) (3 points) Either provide a description of the move operation in the STRIPS language,
or say why it is difficult to do so.
Solution:
There were a couple of different acceptable answers here. The most common one was
some variation on :
move(x,y):
Pre: at(x), ¬ occupied(y), neighbor(x,y)
Effect: ¬ at(x), at(y)
This requires some mention of the fact that it might be a big pain to specify the neighbor
relation.
Another answer was
moveNorth(x,y):
Pre: at(x,y), ¬ occupied(x,y+1)
Effect: ¬ at(x,y), at(x,y+1)
This requires some mention, at least, of the fact that you’d need 4 operators. I gave
full credit for this, but, in general, it’s not legal to use functions in STRIPS, so you can’t
write “y+1” in the effects.
One more answer, saying that it was too hard, because you’d need to have a specific
operator for each possible location (due to the fact that you can’t have functions and/or
that the “neighbor” relation would be hard to specify) was also okay.

(i) (2 points) When does it make sense to treat a problem as a planning problem rather
than as a basic state-space search problem?
Solution:
There are a lot of decent answers here. I was looking for at least a couple of the following
points.
• The initial state is not known exactly (so you have to search over the space of sets of
states).
• There is a direct connection between actions and effects that can be heuristically
exploited.
• The goal is described conjunctively, so solving the subgoals individually might lead to
a plan more efficiently.
• The domain is described using a factored, logical representation (this really contributes
to all of the above points).

• There is not a big concern with minimizing cost (because planning methods usually
seek any path to the goal).
Saying “when the state space is too big” was not sufficient.

(j) (2 points) Should this problem be treated as a planning problem? Explain why or why
not.
Solution:
I accepted a lot of answers here, mostly dependent on how cogently they were argued
in the context of the answer to the previous problem.
I actually think that treating this as a planning problem won’t be much help because:
the initial state is known exactly, the subgoals aren’t particularly separable (you can’t
solve the problem of going to location A without knowing where you’re going to be
when you start), and we have a strong desire for minimizing path cost.
