
Solutions to Reinforcement Learning by Sutton

Chapter 3
Yifan Wang

April 2019

Exercise 3.1
We start by reviewing the definition of an MDP. An MDP consists
of an agent, states, actions, rewards, and returns, but its most
important feature is the Markov property: each state must provide
all relevant information about the past agent-environment
interaction. Under this assumption, we can define the dynamics
function p(s', r | s, a), where s ∈ S is the current state and
s' ∈ S is the next state resulting from action a ∈ A(s):

p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}

From p we can derive many other quantities used throughout the
chapter. Returning to the question, here are three examples. First,
an agent learning to escape a grid maze. Second, a robot learning
which arm to use to grasp a glass. Third, a controller deciding how
hot its heater should be to bring water to 60 degrees.
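
As a concrete illustration (not part of the original solution), the dynamics function p can be stored as a table mapping each (s, a) pair to a list of (s', r, probability) triples. The states, actions, rewards, and probabilities below are made-up placeholders for the maze example, a minimal sketch only.

```python
# Minimal sketch of a tabular dynamics function p(s', r | s, a).
# All states, actions, rewards, and probabilities here are illustrative only.
dynamics = {
    # (state, action): [(next_state, reward, probability), ...]
    ("start", "right"): [("corridor", -1.0, 0.9), ("start", -1.0, 0.1)],
    ("corridor", "right"): [("exit", 0.0, 1.0)],
}

def p(next_state, reward, state, action):
    """Return p(s', r | s, a) from the table, or 0.0 if the triple is absent."""
    return sum(prob for s2, r, prob in dynamics.get((state, action), [])
               if s2 == next_state and r == reward)

print(p("corridor", -1.0, "start", "right"))  # 0.9
```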

Exercise 3.2
One clear exception is when we do not have enough computational
power to enumerate every s and r, for example the game of Go, which
in practice is tackled with function approximation such as deep
learning. Another clear exception is when the state itself is hard
to pin down. For example, when playing an online game for the first
time, how could an agent build a model while knowing nothing about
the game? Even if the goal is simply to win, the state is hard to
define, and even after many plays, simple rules can generate an
astronomically large number of states. Finally, the framework breaks
down when the Markov property is violated, that is, when the
observed state does not summarize everything relevant from the past.
Think of a shooting game: the agent has no direct information about
other players because of limited sight, yet the state is influenced
by teammates and opponents, so the agent cannot determine the effect
of its former actions on the current situation, which means the task
is not an MDP.

Exercise 3.3
This problem asks where the line between the agent and the
environment should be drawn. In my understanding, the line should be
drawn so that the effect of the agent's action a on the state s can
be observed in some way. For instance, in the autonomous-driving
problem, if we draw the line so that the agent only chooses where to
go, how would its actions affect the state in any clear way? The
effect is barely observable and rather abstract unless we have a
collection of agents that decompose the job. That is indeed how
modern autonomous-driving systems work: they are built from many
sub-systems, for example obstacle (tree) detection. In general, I
think the line should be drawn according to what the agent can
actually control and what sub-agents we are able to build.

Exercise 3.4
The resulting table is the following:

s      a         s'     r           p(s', r | s, a)
high   search    high   r_search    α
high   search    low    r_search    1 − α
low    search    high   −3          1 − β
low    search    low    r_search    β
high   wait      high   r_wait      1
high   wait      low    -           0
low    wait      high   -           0
low    wait      low    r_wait      1
low    recharge  high   0           1
low    recharge  low    -           0

Exercise 3.5

(original) \sum_{s' \in S} \sum_{r \in R} p(s', r \mid s, a) = 1, \quad \text{for all } s \in S,\ a \in A(s)

(modified) \sum_{s' \in S^{+}} \sum_{r \in R} p(s', r \mid s, a) = 1, \quad \text{for all } s \in S,\ a \in A(s), \quad \text{where } S \doteq \{\text{non-terminal states}\},\ S^{+} \doteq \{\text{all states}\}
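
A quick way to sanity-check the table from Exercise 3.4 against this identity is to encode it and verify that the probabilities sum to one for every state-action pair. The sketch below does this; the numeric values of α, β, r_search, and r_wait are placeholders I chose for illustration, not values from the book.

```python
# Recycling-robot dynamics from Exercise 3.4, with symbolic rewards replaced
# by placeholder numbers (alpha, beta, and the reward values are assumptions).
alpha, beta = 0.8, 0.6
r_search, r_wait = 1.0, 0.5

dynamics = {
    # (s, a): [(s', r, p(s', r | s, a)), ...]
    ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("low", "search"):    [("high", -3.0, 1 - beta), ("low", r_search, beta)],
    ("high", "wait"):     [("high", r_wait, 1.0)],
    ("low", "wait"):      [("low", r_wait, 1.0)],
    ("low", "recharge"):  [("high", 0.0, 1.0)],
}

# Exercise 3.5: probabilities must sum to 1 for every state-action pair.
for (s, a), transitions in dynamics.items():
    total = sum(p for _, _, p in transitions)
    assert abs(total - 1.0) < 1e-12, (s, a, total)
print("all state-action pairs sum to 1")
```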

Exercise 3.6

First, recall the return G_t for an episodic task:

G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T

If we use discounting, this becomes:

G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}

The reward is 0 on every step except the failure step, where R_T = −1.
The return is therefore

G_t = -\gamma^{T-t-1}

This closely matches the continuing formulation, where the return is
−γ^K with K the number of time steps before the next failure; the
difference is that in the continuing case later failures also
contribute to the return, whereas the episodic return stops at T.

Exercise 3.7
If no discount is used, the return for escaping is always 1 no
matter how much time the agent takes, so the agent gets no signal
that escaping quickly is better. The right way to communicate the
goal is either to give a reward of −1 for every time step before the
escape or to add discounting.

Exercise 3.8
G_5 = 0 (terminal)
G_4 = R_5 = 2
G_3 = R_4 + 0.5 G_4 = 3 + 1 = 4
G_2 = R_3 + 0.5 G_3 = 6 + 2 = 8
G_1 = R_2 + 0.5 G_2 = 2 + 4 = 6
G_0 = R_1 + 0.5 G_1 = −1 + 3 = 2
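
The backward recursion G_t = R_{t+1} + γ G_{t+1} is easy to check mechanically; a small sketch using the reward sequence from this exercise:

```python
# Compute returns backward from the terminal step: G_t = R_{t+1} + gamma * G_{t+1}.
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]  # R_1 ... R_5 from Exercise 3.8

G = 0.0  # G_T = 0 at the terminal step
returns = []
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()  # returns[t] == G_t
print(returns)  # [2.0, 6.0, 8.0, 4.0, 2.0]
```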


Exercise 3.9

G_0 = R_1 + \gamma G_1 = 2 + \gamma \sum_{k=0}^{\infty} 7\gamma^k = 2 + \frac{0.9 \times 7}{1 - 0.9} = 65

(from which one can read off G_1 = 7/(1 − 0.9) = 70)
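
A quick numerical check of this geometric series, truncating the infinite sum after enough terms for the partial sum to converge:

```python
# Approximate G_1 = sum_{k>=0} 7 * 0.9**k by a long partial sum, then G_0.
gamma = 0.9
G1 = sum(7 * gamma**k for k in range(1000))   # ~70, the geometric series 7 / (1 - 0.9)
G0 = 2 + gamma * G1                           # ~65
print(round(G1, 6), round(G0, 6))
```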


Exercise 3.10

Proof (a telescoping sum, assuming 0 ≤ γ < 1):

\Big(\sum_{k=0}^{\infty} \gamma^k\Big)(1 - \gamma) = \sum_{k=0}^{\infty} \gamma^k (1 - \gamma) = \sum_{k=0}^{\infty} (\gamma^k - \gamma^{k+1}) = 1 - \lim_{k \to \infty} \gamma^{k+1} = 1 - 0 = 1

Thus:

\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1 - \gamma}


Exercise 3.11

E_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r


Exercise 3.12

v_\pi(s) \doteq \sum_a \pi(a \mid s)\, q_\pi(s, a)


Exercise 3.13

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big]


Exercise 3.14

The Bellman equation is the following:

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big], \quad \text{for all } s \in S

Thus we have

v_{\text{center}} = \frac{0.9 \times (2.3 + 0.7 - 0.4 + 0.4)}{4} = 0.675 \approx 0.7
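
A one-line numerical check of this backup, taking the four neighboring state values quoted above as given, with reward 0 for every move from the center state and the equiprobable random policy:

```python
# Bellman backup for the gridworld center state: equiprobable policy,
# reward 0 for all four moves, gamma = 0.9, neighbor values as quoted above.
gamma = 0.9
neighbor_values = [2.3, 0.7, -0.4, 0.4]
v_center = sum(0.25 * (0.0 + gamma * v) for v in neighbor_values)
print(round(v_center, 3))  # 0.675, which rounds to 0.7
```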



Exercise 3.15


G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

Adding a constant C to every reward gives

G^*_t \doteq G_t + \sum_{k=0}^{\infty} \gamma^k C = G_t + \frac{C}{1 - \gamma}

v^*_\pi(s) \doteq E\big[G^*_t \mid S_t = s\big] = E\Big[G_t + \frac{C}{1 - \gamma} \,\Big|\, S_t = s\Big] = E\big[G_t \mid S_t = s\big] + \frac{C}{1 - \gamma}

The last step uses the linearity of expectation. Every state value
is shifted by the same constant v_c = C/(1 − γ), so the relative
differences among states are unaffected.
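
A small sketch of this shift on an arbitrary, made-up reward sequence, truncating the infinite sums at a long horizon:

```python
# Adding a constant C to every reward shifts the discounted return by C / (1 - gamma).
gamma, C = 0.9, 5.0
rewards = [1.0, -2.0, 0.5] * 400  # arbitrary reward sequence, long enough to approximate infinity

G = sum(gamma**k * r for k, r in enumerate(rewards))
G_shifted = sum(gamma**k * (r + C) for k, r in enumerate(rewards))
print(round(G_shifted - G, 6), round(C / (1 - gamma), 6))  # both ~50.0
```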

Exercise 3.16

This question is similar to the previous one, but here the constant
does matter. The sign of the reward is critical in an episodic task
because episodic tasks often use negative rewards to push the agent
to finish quickly; adding a constant C that flips the sign would
change how the agent behaves. Even if a negative reward stays
negative but shrinks too much in magnitude, it gives the agent the
wrong signal that finishing the job quickly is not important. More
fundamentally, in an episodic task the added constant contributes C
times the number of remaining steps to the return, so the shift
depends on episode length and can change which policies are
preferred (a large positive C even rewards prolonging the episode).

Exercise 3.17
q_\pi(s, a) \doteq E_\pi\big[G_t \mid S_t = s, A_t = a\big]
          = E_\pi\big[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\big]
          = \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Big]

Exercise 3.18

v_\pi(s) = E_\pi\big[q_\pi(s, A_t) \mid S_t = s\big] = \sum_a \pi(a \mid s)\, q_\pi(s, a)

Exercise 3.19

q_\pi(s, a) = E\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big] = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big]

Exercise 3.20

The answer is a combination of v_putt and q_*(s, driver): outside
the green, including the sand (since we cannot escape the sand with
a putt), the optimal value follows q_*(s, driver), while on the
green itself it follows v_putt.


Exercise 3.21

Almost the same as v_putt, except for the sand, which a putt cannot
escape. This question feels vague to me because the relevant golf
rules are not clearly stated.


Exercise 3.22


G_{\pi_{\text{left}}} = \sum_{i=0}^{\infty} \gamma^{2i} = \frac{1}{1 - \gamma^2}

G_{\pi_{\text{right}}} = \sum_{i=0}^{\infty} 2\gamma^{1+2i} = \frac{2\gamma}{1 - \gamma^2}

Comparing the two returns, left is better when 1 > 2γ, i.e. γ < 0.5,
and right is better when γ > 0.5. So for γ = 0 the left policy is
optimal, for γ = 0.9 the right policy is optimal, and for γ = 0.5
both are optimal.
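
A short sketch comparing the two closed-form returns across values of γ confirms the 0.5 crossover:

```python
# Compare the returns of pi_left and pi_right using the closed forms above.
def G_left(gamma):
    return 1.0 / (1.0 - gamma**2)

def G_right(gamma):
    return 2.0 * gamma / (1.0 - gamma**2)

for gamma in (0.0, 0.5, 0.9):
    better = "left" if G_left(gamma) > G_right(gamma) else (
        "right" if G_right(gamma) > G_left(gamma) else "tie")
    print(gamma, round(G_left(gamma), 3), round(G_right(gamma), 3), better)
# 0.0 -> left, 0.5 -> tie, 0.9 -> right
```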


Exercise 3.23

The Bellman optimality equation for q_* is:

q_*(s, a) = E\Big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Big]
          = \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \max_{a'} q_*(s', a')\Big]

If s = high, a = wait:

q_*(\text{high}, \text{wait}) = r_{\text{wait}} + \gamma \max\big(q_*(\text{high}, \text{wait}),\ q_*(\text{high}, \text{search})\big)

If s = high, a = search:

q_*(\text{high}, \text{search}) = \alpha\big[r_{\text{search}} + \gamma \max\big(q_*(\text{high}, \text{wait}),\ q_*(\text{high}, \text{search})\big)\big]
    + (1 - \alpha)\big[r_{\text{search}} + \gamma \max\big(q_*(\text{low}, \text{recharge}),\ q_*(\text{low}, \text{wait}),\ q_*(\text{low}, \text{search})\big)\big]

The equations for the remaining state-action pairs follow the same
pattern.
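
For a concrete check, these optimality equations can be solved by iterating the q_* backup to a fixed point. The sketch below does this for the recycling robot using the dynamics from Exercise 3.4; the numeric values of α, β, γ, and the rewards are made-up placeholders, not values from the book.

```python
# Q-value iteration on the recycling robot, using the dynamics from Exercise 3.4.
# alpha, beta, gamma, and the reward values are illustrative assumptions.
alpha, beta, gamma = 0.8, 0.6, 0.9
r_search, r_wait = 1.0, 0.5

dynamics = {
    ("high", "search"):  [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("low", "search"):   [("high", -3.0, 1 - beta), ("low", r_search, beta)],
    ("high", "wait"):    [("high", r_wait, 1.0)],
    ("low", "wait"):     [("low", r_wait, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}
actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

q = {(s, a): 0.0 for s in actions for a in actions[s]}
for _ in range(1000):  # iterate the Bellman optimality backup to convergence
    q = {(s, a): sum(p * (r + gamma * max(q[(s2, a2)] for a2 in actions[s2]))
                     for s2, r, p in dynamics[(s, a)])
         for s in actions for a in actions[s]}

for (s, a), value in sorted(q.items()):
    print(s, a, round(value, 3))
```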


Exercise 3.24

Under the optimal policy the agent, after being moved from A to A',
returns to A as quickly as possible; each cycle takes 5 time steps
and yields a reward of +10. So we have

v_*(A) = \sum_{t=0}^{\infty} 10\, \gamma^{5t} = \frac{10}{1 - \gamma^5}

With γ = 0.9, writing a little function in Python (looping 100 terms
is enough), or using a calculator, gives 24.419428096993954, which
to three decimal places is 24.419.
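
The little Python function mentioned above might look like this:

```python
# v_*(A) for the gridworld: a reward of +10 every 5 steps under the optimal policy.
gamma = 0.9

def v_star_A(terms=100):
    """Partial sum of 10 * gamma**(5*t); 100 terms is plenty for convergence."""
    return sum(10 * gamma**(5 * t) for t in range(terms))

print(v_star_A())           # ~24.419428096993954
print(10 / (1 - gamma**5))  # closed form, same value
```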


Exercise 3.25

v_*(s) = \max_{a \in A(s)} q_*(s, a)

Exercise 3.26
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_*(s')\big]

Exercise 3.27

a_* \doteq \arg\max_{a} q_*(s, a)

Any policy π_* that assigns positive probability only to these
maximizing actions a_* (distributing that probability among them
arbitrarily) is an optimal policy.


Exercise 3.28

a_* \doteq \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_*(s')\big]

As in the previous exercise, π_* puts positive probability only on
these maximizing actions.

Exercise 3.29

v_\pi(s) \doteq E_\pi\big[G_t \mid S_t = s\big]
         = \sum_a \pi(a \mid s)\Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')\Big]

v_*(s) \doteq E_{\pi_*}\big[G_t \mid S_t = s\big]
       = \sum_a \pi_*(a \mid s)\Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_*(s')\Big]
       = \max_a \Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_*(s')\Big]

q_\pi(s, a) \doteq E_\pi\big[G_t \mid S_t = s, A_t = a\big]
            = E_\pi\big[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\big]
            = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')

q_*(s, a) \doteq E_{\pi_*}\big[G_t \mid S_t = s, A_t = a\big]
          = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi_*(a' \mid s')\, q_*(s', a')
          = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} q_*(s', a')
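
To make this alternative notation concrete, here is a small iterative policy-evaluation sketch that uses exactly these two quantities, r(s, a) and p(s' | s, a), on a made-up two-state MDP; every number below is an illustrative assumption.

```python
import numpy as np

# Tiny made-up MDP in the r(s, a), p(s' | s, a) notation of Exercise 3.29.
# 2 states, 2 actions; all numbers are illustrative assumptions.
gamma = 0.9
r = np.array([[1.0, 0.0],        # r[s, a] = expected immediate reward
              [0.0, 2.0]])
p = np.array([[[0.9, 0.1],       # p[s, a, s'] = transition probabilities
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
pi = np.array([[0.5, 0.5],       # pi[s, a] = equiprobable policy
               [0.5, 0.5]])

# Iterative policy evaluation:
# v(s) = sum_a pi(a|s) [ r(s, a) + gamma * sum_s' p(s'|s, a) v(s') ]
v = np.zeros(2)
for _ in range(1000):
    v = np.sum(pi * (r + gamma * p @ v), axis=1)
print(v)
```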
