Solutions To Reinforcement Learning by Sutton Chapter 4 r5
Chapter 4
Yifan Wang
May 2019
Exercise 4.1
q_π(11, down) = −1 + v_π(T) = −1 + 0 = −1
Exercise 4.2
Changing the dynamics does not require recalculating the whole
gridworld: the successor set S′ of state 15 is exactly the same as
that of state 13, so the two states must share the same value, −20.
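This can be checked numerically. A minimal sketch, assuming the Figure 4.1 values v(12) = −22, v(13) = −20, v(14) = −14 remain unchanged: iterate the Bellman equation for state 15 (equiprobable random policy, reward −1 per step) to its fixed point.

```python
# Bellman equation for the added state 15, whose four actions lead to
# states 12, 13, and 14, or leave it in 15 (assumed values from Figure 4.1):
neighbors = {12: -22.0, 13: -20.0, 14: -14.0}

v15 = 0.0  # initial guess
for _ in range(1000):  # iterate to the fixed point
    v15 = -1.0 + 0.25 * (sum(neighbors.values()) + v15)

print(round(v15, 6))  # -20.0, matching the value of state 13
```

Solving the fixed-point equation by hand gives the same answer: v = −1 + (−56 + v)/4, hence (3/4)v = −15 and v = −20.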
Exercise 4.3
\begin{align*}
q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi\Big[R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\, q_\pi(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Big] \\
&= \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Big]
\end{align*}

The corresponding iterative update for successive approximations is

\begin{align*}
q_{k+1}(s, a) &= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_k(s', a')\Big]
\end{align*}
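The q_{k+1} update above can be run as an iterative policy-evaluation routine over action values. A minimal sketch, assuming a hypothetical tabular MDP format where p[s][a] is a list of (probability, next state, reward) triples and pi[s] is a vector of action probabilities:

```python
import numpy as np

def q_policy_evaluation(p, pi, gamma=1.0, theta=1e-8):
    """Iterate q_{k+1}(s,a) = sum_{s',r} p(s',r|s,a)[r + gamma * sum_{a'} pi(a'|s') q_k(s',a')].

    p[s][a] -- list of (probability, next_state, reward) triples (assumed format)
    pi[s]   -- vector of action probabilities for state s
    """
    n_states, n_actions = len(p), len(p[0])
    Q = np.zeros((n_states, n_actions))
    while True:
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                q_old = Q[s][a]
                # expected reward plus discounted expected next action value
                Q[s][a] = sum(prob * (r + gamma * np.dot(pi[s2], Q[s2]))
                              for prob, s2, r in p[s][a])
                delta = max(delta, abs(q_old - Q[s][a]))
        if delta < theta:
            return Q
```

Updates are done in place (Gauss–Seidel style), which typically converges faster than keeping two arrays.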
Exercise 4.4
Change the stability check: if old-action ∉ {a_i}, the set of all
equi-best actions from π(s) (rather than comparing against a single
arg max), then set policy-stable ← false. Checking membership in the
full set of maximizing actions prevents the algorithm from switching
forever between equally good policies.
Exercise 4.5
1. Initialization
   Q(s, a) ∈ ℝ and π(s) ∈ A(s) arbitrarily for all s ∈ S, a ∈ A(s)

2. Policy Evaluation
   Loop:
      Δ ← 0
      Loop for each s ∈ S and a ∈ A(s):
         q ← Q(s, a)
         Q(s, a) ← Σ_{s′,r} p(s′, r | s, a) [ r + γ Σ_{a′} π(a′ | s′) Q(s′, a′) ]
         Δ ← max(Δ, |q − Q(s, a)|)
   until Δ < θ (a small positive number determining the accuracy of estimation)

3. Policy Improvement
   policy-stable ← true
   For each s ∈ S:
      old-action ← π(s)
      π(s) ← argmax_a Q(s, a)
      If old-action ∉ {a_i}, the set of equi-best actions from π(s),
      then policy-stable ← false
   If policy-stable, then stop and return Q ≈ q∗ and π ≈ π∗; else go to 2
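The improvement step with the equi-best stability check can be sketched as follows (the function name and array layout are illustrative, not from the book):

```python
import numpy as np

def improve(Q, pi, tol=1e-9):
    """One policy-improvement sweep with the equi-best stability check.

    Q  -- |S| x |A| array of action values
    pi -- current deterministic policy, pi[s] = chosen action index
    Returns (new_policy, policy_stable).
    """
    policy_stable = True
    new_pi = pi.copy()
    for s in range(Q.shape[0]):
        old_action = pi[s]
        # the set {a_i} of equi-best (maximizing) actions at s
        equi_best = np.flatnonzero(Q[s] >= Q[s].max() - tol)
        new_pi[s] = equi_best[0]  # any maximizer is fine
        if old_action not in equi_best:
            policy_stable = False  # a strictly better action was found
    return new_pi, policy_stable
```

Because stability is judged against the whole set of maximizers, two policies that tie on value never flip the stable flag back and forth.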
Exercise 4.6
Step 2 changes: θ should not be set above the limit attainable by any
ε-soft method.
Exercise 4.7
Exercise 4.8
The gambler’s problem has such a curious optimal policy because at a
capital of 50 the gambler can win outright with probability p_h by
staking everything. Thus the optimal policy bets everything at
Capital = 50, and likewise at points such as 25 from which a single
all-in win reaches 50.
Exercise 4.9
Program here.
[Plots A–D: results for the two values of p_h in the exercise; images omitted.]
With a little thought you can easily recognize which plot is which.
(Think of how a human would play!)
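For reference, a minimal value-iteration sketch of the gambler's problem (the function name and defaults are my own; the exercise itself asks for p_h = 0.25 and p_h = 0.55):

```python
import numpy as np

def gambler_value_iteration(p_h=0.4, goal=100, theta=1e-10):
    """Value iteration for the gambler's problem.

    Capital runs 0..goal; a stake of a wins with probability p_h.
    V[goal] = 1 (success) and V[0] = 0 are terminal.
    Returns the value function and a greedy policy.
    """
    V = np.zeros(goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            stakes = range(1, min(s, goal - s) + 1)
            returns = [p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes]
            best = max(returns)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # extract a greedy policy (ties broken toward the smallest stake)
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        stakes = list(range(1, min(s, goal - s) + 1))
        returns = [p_h * V[s + a] + (1 - p_h) * V[s - a] for a in stakes]
        policy[s] = stakes[int(np.argmax(returns))]
    return V, policy
```

For a subfair coin such as p_h = 0.25, the value at Capital = 50 converges to p_h itself, reflecting the single all-in bet discussed in Exercise 4.8.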
Exercise 4.10
\begin{align*}
q_{k+1}(s, a) &= \mathbb{E}\Big[R_{t+1} + \gamma \max_{a'} q_k(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Big] \\
&= \sum_{s', r} p(s', r \mid s, a)\Big[r + \gamma \max_{a'} q_k(s', a')\Big]
\end{align*}
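This q-value-iteration update can be sketched directly, assuming a hypothetical tabular format where p[s][a] is a list of (probability, next state, reward) triples:

```python
import numpy as np

def q_value_iteration(p, gamma=1.0, theta=1e-8):
    """Iterate q_{k+1}(s,a) = sum_{s',r} p(s',r|s,a)[r + gamma * max_{a'} q_k(s',a')].

    p[s][a] -- list of (probability, next_state, reward) triples (assumed format)
    """
    n_states, n_actions = len(p), len(p[0])
    Q = np.zeros((n_states, n_actions))
    while True:
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                q_old = Q[s][a]
                # backup uses the max over next actions instead of the
                # policy expectation used in policy evaluation
                Q[s][a] = sum(prob * (r + gamma * Q[s2].max())
                              for prob, s2, r in p[s][a])
                delta = max(delta, abs(q_old - Q[s][a]))
        if delta < theta:
            return Q
```

The only difference from policy evaluation over action values is the max over a′ in the backup, which makes the iteration converge to q∗ directly.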