Solutions To Reinforcement Learning by Sutton, Chapter 3
Chapter 3
Yifan Wang
April 2019
Exercise 3.1
We start by reviewing the definition of an MDP. An MDP consists of an agent, states, actions, rewards, and returns, but its most important feature is the Markov property: each state must provide all information relevant to the past agent-environment interaction. Under this assumption we can define the dynamics function p(s', r | s, a) as follows, where s ∈ S is the current state, s' ∈ S is the next state, a ∈ A(s) is the action, and r ∈ R is the reward.
\[
p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}
\]
Exercise 3.2
One clear exception is when we do not have enough computational power to enumerate every state s and reward r; the game of Go, for example, has far too many states and is instead attacked with deep learning methods. Another exception is when the state itself cannot be defined clearly. When an agent plays an online game for the first time, how could it build a model of states it knows nothing about? Even if the goal is simply to win, the state is hard to define, and even after many plays, simple rules can generate an astronomically large number of states. Finally, a task may violate the Markov property outright, meaning earlier actions leave no observable trace in the current state. In a shooting game, for instance, the agent has no direct information about the other players because of limited sight, yet the state is influenced by teammates and opponents; the agent cannot determine the effect of its former actions on the current situation, so the task is not an MDP.
Exercise 3.3
This problem asks where the proper line between the agent and the environment should be drawn. In my understanding, the line should be drawn so that the effect of the agent's action a on the state s can be observed in some way. For instance, in autonomous driving, if we draw the line so that we only consider where to go, how would our actions affect the state in any clear way? The effect is not observable and remains too abstract unless we have a collection of agents that decompose the job; indeed, modern self-driving systems consist of many sub-systems, for example one for detecting trees. In general, I think the line should be drawn according to what we can naturally control and what sub-agents we are able to build.
Exercise 3.4
The resulting table is the following:

s     a         s'    r          p(s', r | s, a)
high  search    high  r_search   α
high  search    low   r_search   1 − α
low   search    high  −3         1 − β
low   search    low   r_search   β
high  wait      high  r_wait     1
high  wait      low   -          0
low   wait      high  -          0
low   wait      low   r_wait     1
low   recharge  high  0          1
low   recharge  low   -          0
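To make the dynamics concrete, here is a minimal Python sketch (not part of the original solution) that encodes the table as a dictionary. The numeric values for α, β, r_search, and r_wait are assumed placeholders; the final loop checks that the outcome probabilities for every state-action pair sum to 1, anticipating Exercise 3.5.

    alpha, beta = 0.8, 0.6        # assumed placeholder values
    r_search, r_wait = 4.0, 1.0   # assumed placeholder values

    # p[(s, a)] is a list of (next_state, reward, probability) triples from the table
    p = {
        ('high', 'search'):   [('high', r_search, alpha), ('low', r_search, 1 - alpha)],
        ('low',  'search'):   [('high', -3.0, 1 - beta),  ('low', r_search, beta)],
        ('high', 'wait'):     [('high', r_wait, 1.0)],
        ('low',  'wait'):     [('low', r_wait, 1.0)],
        ('low',  'recharge'): [('high', 0.0, 1.0)],
    }

    # Exercise 3.5: the outcome probabilities for each (s, a) pair must sum to 1
    for (s, a), outcomes in p.items():
        total = sum(prob for _, _, prob in outcomes)
        assert abs(total - 1.0) < 1e-12, (s, a, total)
    print("all (s, a) pairs sum to 1")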
Exercise 3.5
\[
\text{(original)} \quad \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s)
\]
\[
\text{(modified)} \quad \sum_{s' \in \mathcal{S}^{+}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s),
\]
where \mathcal{S} is the set of non-terminal states and \mathcal{S}^{+} is the set of all states, including the terminal state.
Exercise 3.6
The reward for success is set to 0, and the reward for failure at the terminal step T is set to −1. The return then becomes
\[
G_t = -\gamma^{T-t-1}.
\]
This is essentially the same return as in the continuing setting, where the return is −γ^K with K the number of time steps before failure.
Exercise 3.7
If you do not use γ to discount, the return for every episode that ends in escaping the maze is +1, regardless of how long the agent takes, so the agent receives no signal to escape quickly. The correct way to communicate the goal to the agent is either to give a reward of −1 on every time step before the escape or to add discounting.
Exercise 3.8
With γ = 0.5 and the rewards R_1 = −1, R_2 = 2, R_3 = 6, R_4 = 3, R_5 = 2, applying G_t = R_{t+1} + γG_{t+1} backwards from the terminal step gives:

G_5 = 0 (terminal)
G_4 = R_5 = 2
G_3 = R_4 + γ G_4 = 3 + 0.5 · 2 = 4
G_2 = R_3 + γ G_3 = 6 + 0.5 · 4 = 8
G_1 = R_2 + γ G_2 = 2 + 0.5 · 8 = 6
G_0 = R_1 + γ G_1 = −1 + 0.5 · 6 = 2
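As a check, here is a minimal Python sketch of the same backward recursion, with the reward sequence taken from the exercise statement:

    rewards = [-1, 2, 6, 3, 2]   # R_1 ... R_5 from the exercise
    gamma = 0.5

    G = 0.0                      # G_T = 0 at the terminal step
    returns = []
    for r in reversed(rewards):  # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns.reverse()            # returns[t] == G_t
    print(returns)               # [2.0, 6.0, 8.0, 4.0, 2.0]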
Exercise 3.9
\[
G_0 = R_1 + \gamma G_1 = 2 + \gamma \sum_{k=0}^{\infty} 7\gamma^{k} = 2 + \frac{0.9 \cdot 7}{1 - 0.9} = 65
\]
(from which G_1 can be seen to be 70).
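A quick numerical sanity check in Python, truncating the infinite sum at a large horizon:

    gamma = 0.9
    G1 = sum(7 * gamma**k for k in range(1000))  # reward of 7 forever, discounted
    G0 = 2 + gamma * G1                          # G_0 = R_1 + gamma * G_1
    print(round(G1, 6), round(G0, 6))            # approximately 70.0 and 65.0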
Exercise 3.10
Proof:
\[
\Big(\sum_{k=0}^{\infty} \gamma^{k}\Big)(1 - \gamma) = \sum_{k=0}^{\infty} \gamma^{k}(1 - \gamma) = \sum_{k=0}^{\infty} \big(\gamma^{k} - \gamma^{k+1}\big) = 1 - \lim_{N \to \infty} \gamma^{N+1} = 1 - 0 = 1,
\]
since the partial sums telescope to 1 − γ^{N+1} and γ < 1. Thus
\[
\sum_{k=0}^{\infty} \gamma^{k} = \frac{1}{1 - \gamma}.
\]
Exercise 3.11
\[
\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
\]
Exercise 3.12
\[
v_{\pi}(s) \doteq \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)
\]
Exercise 3.13
\[
q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi}(s')\big]
\]
Exercise 3.14
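No solution was written out here in the original. As a quick sketch, assuming the center state is valued +0.7, its four neighbours are valued +2.3, +0.4, −0.4, +0.7, γ = 0.9, rewards on these transitions are 0, and the policy is the equiprobable random one (the setting of the book's gridworld example), the Bellman equation holds to one decimal place:
\[
v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi}(s')\big]
= \tfrac{1}{4} \cdot 0.9 \cdot (2.3 + 0.4 - 0.4 + 0.7) = 0.675 \approx 0.7.
\]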
Exercise 3.15
\[
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
\]
Adding a constant C to every reward gives
\[
G'_t \doteq G_t + \sum_{k=0}^{\infty} \gamma^{k} C = G_t + \frac{C}{1 - \gamma},
\]
\[
v'_{\pi}(s) \doteq \mathbb{E}\big[G'_t \mid S_t = s\big] = \mathbb{E}\Big[G_t + \frac{C}{1 - \gamma} \,\Big|\, S_t = s\Big] = \mathbb{E}\big[G_t \mid S_t = s\big] + \frac{C}{1 - \gamma} = v_{\pi}(s) + \frac{C}{1 - \gamma}.
\]
The last step uses the linearity of expectation. Every state value is shifted by the same constant v_c = C/(1 − γ), so the relative differences among states are unaffected.
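A small Python sketch (not in the original; it builds a random 4-state Markov reward process with numpy) illustrating that adding C to every reward shifts all state values by C/(1 − γ) while leaving their differences unchanged:

    import numpy as np

    np.random.seed(0)
    n, gamma, C = 4, 0.9, 5.0

    # Random transition matrix P[s, s'] and expected rewards r[s] under some fixed policy
    P = np.random.rand(n, n)
    P /= P.sum(axis=1, keepdims=True)
    r = np.random.rand(n)

    # Solve the Bellman equation v = r + gamma * P v directly
    solve = lambda rew: np.linalg.solve(np.eye(n) - gamma * P, rew)
    v, v_shifted = solve(r), solve(r + C)

    print(v_shifted - v)                 # each entry is approximately C / (1 - gamma) = 50
    print(np.ptp(v), np.ptp(v_shifted))  # the spread between state values is unchanged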
Exercise 3.16
Exercise 3.17
\[
\begin{aligned}
q_{\pi}(s, a) &\doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_{\pi}(s', a')\Big]
\end{aligned}
\]
Exercise 3.18
Exercise 3.19
Exercise 3.20
Exercise 3.21
Almost the same as v_putt, except that the sand region must be avoided? This question seems vague to me because the rules of golf are not clearly stated in the example.
Exercise 3.22
\[
G_{\pi_{\text{left}}} = \sum_{i=0}^{\infty} \gamma^{2i} = \frac{1}{1 - \gamma^{2}},
\qquad
G_{\pi_{\text{right}}} = \sum_{i=0}^{\infty} 2\gamma^{1+2i} = \frac{2\gamma}{1 - \gamma^{2}}.
\]
Based on these returns, γ = 0.5 is the borderline: if γ > 0.5 the right policy is optimal, if γ < 0.5 the left policy is optimal, and if γ = 0.5 both are optimal, as the small numerical check below confirms.
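A quick Python check of the crossover, using the two closed-form returns above:

    for gamma in (0.1, 0.5, 0.9):
        v_left  = 1 / (1 - gamma**2)          # always choosing left from the top state
        v_right = 2 * gamma / (1 - gamma**2)  # always choosing right from the top state
        print(gamma, round(v_left, 3), round(v_right, 3))
    # gamma = 0.1: left is better; gamma = 0.5: tie; gamma = 0.9: right is better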
Exercise 3.23
For s = high the two cases of interest are a = wait and a = search; both can be written out with the dynamics from Exercise 3.4, as sketched below.
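A sketch of these two cases of the Bellman optimality equation for q_*, using the transition table from Exercise 3.4 (the low-state cases follow the same pattern):
\[
q_*(\text{high}, \text{wait}) = r_{\text{wait}} + \gamma \max_{a'} q_*(\text{high}, a')
\]
\[
q_*(\text{high}, \text{search}) = \alpha\big[r_{\text{search}} + \gamma \max_{a'} q_*(\text{high}, a')\big] + (1 - \alpha)\big[r_{\text{search}} + \gamma \max_{a'} q_*(\text{low}, a')\big]
\]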
Exercise 3.24
From A, every action earns +10 and teleports the agent to A'. The best behaviour is then to move back to A as quickly as possible, so the +10 reward repeats every five time steps. Hence
\[
v_*(A) = \sum_{t=0}^{\infty} 10\,\gamma^{5t} = \frac{10}{1 - \gamma^{5}} \approx 24.419 \quad \text{for } \gamma = 0.9,
\]
which the little Python function sketched below confirms.
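A minimal version of that little Python function (γ = 0.9 assumed, with the infinite sum truncated at a large horizon):

    def v_star_A(gamma=0.9, horizon=1000):
        """Approximate v*(A) by summing 10 * gamma^(5t) over a long horizon."""
        return sum(10 * gamma ** (5 * t) for t in range(horizon))

    print(round(v_star_A(), 3))           # 24.419
    print(round(10 / (1 - 0.9 ** 5), 3))  # closed form, also 24.419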
Exercise 3.25
Exercise 3.26
\[
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big]
\]
Exercise 3.27
\[
a_* \doteq \arg\max_{a} q_*(s, a), \qquad \pi_*(a \mid s) = 1 \text{ if } a = a_* \text{ and } 0 \text{ otherwise.}
\]
Exercise 3.28
\[
a_* \doteq \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big], \qquad \pi_*(a \mid s) = 1 \text{ if } a = a_* \text{ and } 0 \text{ otherwise.}
\]
Exercise 3.29
\[
\begin{aligned}
v_{\pi}(s) &\doteq \mathbb{E}_{\pi}\big[G_t \mid S_t = s\big]
= \sum_{a} \pi(a \mid s)\Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_{\pi}(s')\Big] \\
v_*(s) &\doteq \mathbb{E}_{\pi_*}\big[G_t \mid S_t = s\big]
= \sum_{a} \pi_*(a \mid s)\Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_*(s')\Big] \\
q_{\pi}(s, a) &\doteq \mathbb{E}_{\pi}\big[G_t \mid S_t = s, A_t = a\big]
= \mathbb{E}_{\pi}\big[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a\big]
= r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, q_{\pi}(s', a') \\
q_*(s, a) &\doteq \mathbb{E}_{\pi_*}\big[G_t \mid S_t = s, A_t = a\big]
= r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \sum_{a'} \pi_*(a' \mid s')\, q_*(s', a')
\end{aligned}
\]
Since π_* is greedy with respect to q_*, the sums weighted by π_* in the second and fourth equations can equivalently be written as max over a (respectively a'), recovering the usual form of the Bellman optimality equations.