Solutions To Reinforcement Learning by Sutton, Chapter 3
Chapter 3
Yifan Wang
April 2019
Exercise 3.1
We start by reviewing the definition of an MDP. An MDP consists of an agent, states, actions, rewards, and returns, but its most important feature is the Markov property: each state must provide all information relevant to the past agent-environment interaction. Under this assumption we can define the dynamics function p(s', r | s, a) as follows, where s ∈ S is the current state, s' ∈ S is the next state, a ∈ A(s) is the action, and r ∈ R is the reward.
\[
p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}
\]
Exercise 3.2
One clear exception is when we do not have enough computational power to enumerate every state s and reward r; the game of Go, for example, has far too many states and is instead attacked with deep learning methods. Another exception is when the state itself cannot be defined clearly. When an agent plays an online game for the first time, how could it build a model of states it knows nothing about? Even if the goal is simply to win, the state is hard to define, and even after many plays, simple rules can generate an astronomically large number of states. Finally, a task may violate the Markov property outright, meaning earlier actions leave no observable trace in the current state. In a shooting game, for instance, the agent has no direct information about the other players because of limited sight, yet the state is influenced by teammates and opponents; the agent cannot determine the effect of its former actions on the current situation, so the task is not an MDP.
Exercise 3.3
This problem asks where the proper line between the agent and the environment should be drawn. In my understanding, the line should be drawn so that the effect of the agent's action a on the state s can be observed in some way. For instance, in autonomous driving, if we draw the line so that we only consider where to go, how would our actions affect the state in any clear way? The effect is not observable and remains too abstract unless we have a collection of agents that decompose the job; indeed, modern self-driving systems consist of many sub-systems, for example one for detecting trees. In general, I think the line should be drawn according to what we can naturally control and what sub-agents we are able to build.
Exercise 3.4
The resulting table is the following:

s     a         s'    r          p(s', r | s, a)
high  search    high  r_search   α
high  search    low   r_search   1 − α
low   search    high  −3         1 − β
low   search    low   r_search   β
high  wait      high  r_wait     1
high  wait      low   -          0
low   wait      high  -          0
low   wait      low   r_wait     1
low   recharge  high  0          1
low   recharge  low   -          0
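To make the dynamics concrete, here is a minimal Python sketch (not part of the original solution) that encodes the table as a dictionary. The numeric values for α, β, r_search, and r_wait are assumed placeholders; the final loop checks that the outcome probabilities for every state-action pair sum to 1, anticipating Exercise 3.5.

    alpha, beta = 0.8, 0.6        # assumed placeholder values
    r_search, r_wait = 4.0, 1.0   # assumed placeholder values

    # p[(s, a)] is a list of (next_state, reward, probability) triples from the table
    p = {
        ('high', 'search'):   [('high', r_search, alpha), ('low', r_search, 1 - alpha)],
        ('low',  'search'):   [('high', -3.0, 1 - beta),  ('low', r_search, beta)],
        ('high', 'wait'):     [('high', r_wait, 1.0)],
        ('low',  'wait'):     [('low', r_wait, 1.0)],
        ('low',  'recharge'): [('high', 0.0, 1.0)],
    }

    # Exercise 3.5: the outcome probabilities for each (s, a) pair must sum to 1
    for (s, a), outcomes in p.items():
        total = sum(prob for _, _, prob in outcomes)
        assert abs(total - 1.0) < 1e-12, (s, a, total)
    print("all (s, a) pairs sum to 1")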
Exercise 3.5
\[
\text{(original)} \quad \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s)
\]
\[
\text{(modified)} \quad \sum_{s' \in \mathcal{S}^{+}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s),
\]
where \mathcal{S} is the set of non-terminal states and \mathcal{S}^{+} is the set of all states, including the terminal state.
Exercise 3.6
The reward for success is set to 0, and the reward for failure at the terminal step T is set to −1. The return then becomes
\[
G_t = -\gamma^{T-t-1}.
\]
This is essentially the same return as in the continuing setting, where the return is −γ^K with K the number of time steps before failure.
Exercise 3.7
If you do not use γ to discount, the return for every episode that ends in escaping the maze is +1, regardless of how long the agent takes, so the agent receives no signal to escape quickly. The correct way to communicate the goal to the agent is either to give a reward of −1 on every time step before the escape or to add discounting.
Exercise 3.8
With γ = 0.5 and the rewards R_1 = −1, R_2 = 2, R_3 = 6, R_4 = 3, R_5 = 2, applying G_t = R_{t+1} + γG_{t+1} backwards from the terminal step gives:

G_5 = 0 (terminal)
G_4 = R_5 = 2
G_3 = R_4 + γ G_4 = 3 + 0.5 · 2 = 4
G_2 = R_3 + γ G_3 = 6 + 0.5 · 4 = 8
G_1 = R_2 + γ G_2 = 2 + 0.5 · 8 = 6
G_0 = R_1 + γ G_1 = −1 + 0.5 · 6 = 2
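As a check, here is a minimal Python sketch of the same backward recursion, with the reward sequence taken from the exercise statement:

    rewards = [-1, 2, 6, 3, 2]   # R_1 ... R_5 from the exercise
    gamma = 0.5

    G = 0.0                      # G_T = 0 at the terminal step
    returns = []
    for r in reversed(rewards):  # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns.reverse()            # returns[t] == G_t
    print(returns)               # [2.0, 6.0, 8.0, 4.0, 2.0]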
Exercise 3.9
\[
G_0 = R_1 + \gamma G_1 = 2 + \gamma \sum_{k=0}^{\infty} 7\gamma^{k} = 2 + \frac{0.9 \cdot 7}{1 - 0.9} = 65
\]
(from which G_1 can be seen to be 70).
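A quick numerical sanity check in Python, truncating the infinite sum at a large horizon:

    gamma = 0.9
    G1 = sum(7 * gamma**k for k in range(1000))  # reward of 7 forever, discounted
    G0 = 2 + gamma * G1                          # G_0 = R_1 + gamma * G_1
    print(round(G1, 6), round(G0, 6))            # approximately 70.0 and 65.0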
Exercise 3.10
Proof:
\[
\Big(\sum_{k=0}^{\infty} \gamma^{k}\Big)(1 - \gamma) = \sum_{k=0}^{\infty} \gamma^{k}(1 - \gamma) = \sum_{k=0}^{\infty} \big(\gamma^{k} - \gamma^{k+1}\big) = 1 - \lim_{N \to \infty} \gamma^{N+1} = 1 - 0 = 1,
\]
since the partial sums telescope to 1 − γ^{N+1} and γ < 1. Thus
\[
\sum_{k=0}^{\infty} \gamma^{k} = \frac{1}{1 - \gamma}.
\]
Exercise 3.11
\[
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
\]
Exercise 3.12
\[
v_{\pi}(s) \doteq \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)
\]
Exercise 3.13
\[
q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi}(s')\big]
\]
Exercise 3.14
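No solution was written out here in the original. As a quick sketch, assuming the center state is valued +0.7, its four neighbours are valued +2.3, +0.4, −0.4, +0.7, γ = 0.9, rewards on these transitions are 0, and the policy is the equiprobable random one (the setting of the book's gridworld example), the Bellman equation holds to one decimal place:
\[
v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi}(s')\big]
= \tfrac{1}{4} \cdot 0.9 \cdot (2.3 + 0.4 - 0.4 + 0.7) = 0.675 \approx 0.7.
\]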
Exercise 3.15
\[
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
\]
Adding a constant C to every reward gives
\[
G'_t \doteq G_t + \sum_{k=0}^{\infty} \gamma^{k} C = G_t + \frac{C}{1 - \gamma},
\]
\[
v'_{\pi}(s) \doteq \mathbb{E}\big[G'_t \mid S_t = s\big] = \mathbb{E}\Big[G_t + \frac{C}{1 - \gamma} \,\Big|\, S_t = s\Big] = \mathbb{E}\big[G_t \mid S_t = s\big] + \frac{C}{1 - \gamma} = v_{\pi}(s) + \frac{C}{1 - \gamma}.
\]
The last step uses the linearity of expectation. Every state value is shifted by the same constant v_c = C/(1 − γ), so the relative differences among states are unaffected.
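A small Python sketch (not in the original; it builds a random 4-state Markov reward process with numpy) illustrating that adding C to every reward shifts all state values by C/(1 − γ) while leaving their differences unchanged:

    import numpy as np

    np.random.seed(0)
    n, gamma, C = 4, 0.9, 5.0

    # Random transition matrix P[s, s'] and expected rewards r[s] under some fixed policy
    P = np.random.rand(n, n)
    P /= P.sum(axis=1, keepdims=True)
    r = np.random.rand(n)

    # Solve the Bellman equation v = r + gamma * P v directly
    solve = lambda rew: np.linalg.solve(np.eye(n) - gamma * P, rew)
    v, v_shifted = solve(r), solve(r + C)

    print(v_shifted - v)                 # each entry is approximately C / (1 - gamma) = 50
    print(np.ptp(v), np.ptp(v_shifted))  # the spread between state values is unchanged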
Exercise 3.16
Exercise 3.17
\[
\begin{aligned}
q_{\pi}(s, a) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_{\pi}(s', a')\Big]
\end{aligned}
\]
Exercise 3.18
Exercise 3.19
Exercise 3.20
Exercise 3.21
Almost the same as v_putt, except that the sand region must be avoided? This question seems vague to me because the rules of golf are not clearly stated in the example.
Exercise 3.22
\[
G_{\pi_{\text{left}}} = \sum_{i=0}^{\infty} \gamma^{2i} = \frac{1}{1 - \gamma^{2}},
\qquad
G_{\pi_{\text{right}}} = \sum_{i=0}^{\infty} 2\gamma^{1+2i} = \frac{2\gamma}{1 - \gamma^{2}}.
\]
Based on these returns, γ = 0.5 is the borderline: if γ > 0.5 the right policy is optimal, if γ < 0.5 the left policy is optimal, and if γ = 0.5 both are optimal, as the small numerical check below confirms.
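A quick Python check of the crossover, using the two closed-form returns above:

    for gamma in (0.1, 0.5, 0.9):
        v_left  = 1 / (1 - gamma**2)          # always choosing left from the top state
        v_right = 2 * gamma / (1 - gamma**2)  # always choosing right from the top state
        print(gamma, round(v_left, 3), round(v_right, 3))
    # gamma = 0.1: left is better; gamma = 0.5: tie; gamma = 0.9: right is better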
Exercise 3.23
For s = high the two cases of interest are a = wait and a = search; both can be written out with the dynamics from Exercise 3.4, as sketched below.
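A sketch of these two cases of the Bellman optimality equation for q_*, using the transition table from Exercise 3.4 (the low-state cases follow the same pattern):
\[
q_*(\text{high}, \text{wait}) = r_{\text{wait}} + \gamma \max_{a'} q_*(\text{high}, a')
\]
\[
q_*(\text{high}, \text{search}) = \alpha\big[r_{\text{search}} + \gamma \max_{a'} q_*(\text{high}, a')\big] + (1 - \alpha)\big[r_{\text{search}} + \gamma \max_{a'} q_*(\text{low}, a')\big]
\]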
Exercise 3.24
From A, every action earns +10 and teleports the agent to A'. The best behaviour is then to move back to A as quickly as possible, so the +10 reward repeats every five time steps. Hence
\[
v_*(A) = \sum_{t=0}^{\infty} 10\,\gamma^{5t} = \frac{10}{1 - \gamma^{5}} \approx 24.419 \quad \text{for } \gamma = 0.9,
\]
which the little Python function sketched below confirms.
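A minimal version of that little Python function (γ = 0.9 assumed, with the infinite sum truncated at a large horizon):

    def v_star_A(gamma=0.9, horizon=1000):
        """Approximate v*(A) by summing 10 * gamma^(5t) over a long horizon."""
        return sum(10 * gamma ** (5 * t) for t in range(horizon))

    print(round(v_star_A(), 3))           # 24.419
    print(round(10 / (1 - 0.9 ** 5), 3))  # closed form, also 24.419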
Exercise 3.25
Exercise 3.26
\[
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big]
\]
Exercise 3.27
\[
a_* \doteq \arg\max_{a} q_*(s, a), \qquad \pi_*(a \mid s) = 1 \text{ if } a = a_* \text{ and } 0 \text{ otherwise.}
\]
Exercise 3.28
\[
a_* \doteq \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big], \qquad \pi_*(a \mid s) = 1 \text{ if } a = a_* \text{ and } 0 \text{ otherwise.}
\]
Exercise 3.29
\[
\begin{aligned}
v_{\pi}(s) &\doteq \mathbb{E}_{\pi}\big[G_t \mid S_t = s\big]
= \sum_{a} \pi(a \mid s)\Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi}(s')\Big] \\
v_*(s) &\doteq \mathbb{E}_{\pi_*}\big[G_t \mid S_t = s\big]
= \sum_{a} \pi_*(a \mid s)\Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_*(s')\Big] \\
q_{\pi}(s, a) &\doteq \mathbb{E}_{\pi}\big[G_t \mid S_t = s, A_t = a\big]
= \mathbb{E}_{\pi}\big[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\big]
= r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, q_{\pi}(s', a') \\
q_*(s, a) &\doteq \mathbb{E}_{\pi_*}\big[G_t \mid S_t = s, A_t = a\big]
= r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi_*(a' \mid s')\, q_*(s', a')
\end{aligned}
\]
Since π_* is greedy with respect to q_*, the sums weighted by π_* in the second and fourth equations can equivalently be written as max over a (respectively a'), recovering the usual form of the Bellman optimality equations.