
Solutions to Reinforcement Learning by Sutton

Chapter 4
Yifan Wang

May 2019

Exercise 4.1

If π is the equiprobable random policy:

qπ(11, down) = −1 + vπ(T) = −1 + 0 = −1

qπ(7, down) = −1 + vπ(11) = −1 + (−14) = −15

Exercise 4.2

Adding state 15 gives:

vπ(15) = −1 + 0.25(−20 − 22 − 14 + vπ(15)) = −15 + 0.25 vπ(15)

vπ(15) = −15/0.75 = −20

Changing the dynamics does not require recalculating the whole game: the successor set S′ of state 15 is exactly the same as that of state 13, so they must share the same state value, −20.

Here is my script implementation of this game; feel free to add a state 15 to it.
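For reference, here is a minimal sketch of iterative policy evaluation for this 4×4 gridworld (not the linked script; the (row, col) state encoding and the helper names are assumptions of the sketch):

```python
# Minimal sketch of iterative policy evaluation for Example 4.1's gridworld
# under the equiprobable random policy (undiscounted).  Not the linked script;
# the (row, col) state encoding and helper names are assumptions of this sketch.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TERMINAL = {(0, 0), (3, 3)}                    # the two shaded terminal states

def step(state, action):
    """Deterministic move: reward -1, stay in place when moving off the grid."""
    if state in TERMINAL:
        return state, 0
    row, col = state
    nr, nc = row + action[0], col + action[1]
    if not (0 <= nr < 4 and 0 <= nc < 4):
        nr, nc = row, col
    return (nr, nc), -1

def policy_evaluation(theta=1e-6):
    V = {(r, c): 0.0 for r in range(4) for c in range(4)}
    while True:
        delta = 0.0
        for s in V:
            v_new = 0.0
            for a in ACTIONS:                  # pi(a|s) = 0.25 for every action
                s2, reward = step(s, a)
                v_new += 0.25 * (reward + V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return V
```

Exercise 4.2's extra state 15 can be added by extending `step` (and `V`) with the new transitions.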


Exercise 4.3

qπ(s, a) ≐ Eπ[Gt | St = s, At = a]
         = Eπ[Rt+1 + γGt+1 | St = s, At = a]
         = Eπ[Rt+1 + γ Σ_{a′} π(a′|St+1) qπ(St+1, a′) | St = s, At = a]
         = Σ_{s′,r} p(s′, r | s, a) [ r + γ Σ_{a′} π(a′|s′) qπ(s′, a′) ]

qk+1(s, a) ≐ Eπ[Rt+1 + γGt+1 | St = s, At = a]
           = Σ_{s′,r} p(s′, r | s, a) [ r + γ Σ_{a′} π(a′|s′) qk(s′, a′) ]

Exercise 4.4

In step 3, Policy Improvement, the algorithm says:

If old-action ≠ π(s), then ......

This is a bug: when two or more actions are equally good, the arg max can keep switching among them, so the policy never stabilizes and the algorithm never terminates. One way to fix it is to say the following instead:

If old-action ∉ {ai}, which is the set of all equi-best actions for π(s), ......

Exercise 4.5

1. Initialization
Q(s, a) ∈ R and π(s) ∈ A(s) arbitrarily for all s ∈ S, a ∈ A(s)

2. Policy Evaluation
Loop:
    ∆ ← 0
    Loop for each s ∈ S and a ∈ A(s):
        q ← Q(s, a)
        Q(s, a) ← Σ_{s′,r} p(s′, r | s, a) [ r + γ Σ_{a′} π(a′|s′) Q(s′, a′) ]
        ∆ ← max(∆, |q − Q(s, a)|)
until ∆ < θ (a small positive number determining the accuracy of estimation)

3. Policy Improvement
policy-stable ← true
For each s ∈ S:
    old-action ← π(s)
    π(s) ← arg max_a Q(s, a)
    If old-action ∉ {ai}, the set of equi-best actions arg max_a Q(s, a), then policy-stable ← false
If policy-stable, then stop and return Q ≈ q∗ and π ≈ π∗; else go to 2
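Below is a minimal runnable sketch of this algorithm, assuming the MDP is supplied as a table p[s][a] of (probability, next state, reward) triples; that data layout and every name are assumptions of the sketch, not the book's notation:

```python
# Hedged sketch of the action-value policy iteration above.  The MDP is
# assumed to be supplied as p[s][a] = [(prob, next_state, reward), ...];
# terminal states should appear in p with a single zero-reward self-loop.
def policy_iteration_q(p, gamma=1.0, theta=1e-8):
    states = list(p)
    Q = {s: {a: 0.0 for a in p[s]} for s in states}
    pi = {s: next(iter(p[s])) for s in states}          # arbitrary initial policy

    def backup(s, a):
        # Q(s,a) <- sum_{s',r} p(s',r|s,a) [ r + gamma * Q(s', pi(s')) ]
        # (pi is deterministic, so the inner sum over a' collapses)
        return sum(prob * (r + gamma * Q[s2][pi[s2]]) for prob, s2, r in p[s][a])

    while True:
        # 2. Policy Evaluation
        while True:
            delta = 0.0
            for s in states:
                for a in p[s]:
                    q_old = Q[s][a]
                    Q[s][a] = backup(s, a)
                    delta = max(delta, abs(q_old - Q[s][a]))
            if delta < theta:
                break
        # 3. Policy Improvement, with the tie fix from Exercise 4.4
        policy_stable = True
        for s in states:
            old_action = pi[s]
            best = max(Q[s].values())
            equi_best = [a for a in p[s] if Q[s][a] == best]
            if old_action in equi_best:
                pi[s] = old_action                       # keep the old choice among ties
            else:
                pi[s] = equi_best[0]
                policy_stable = False
        if policy_stable:
            return Q, pi
```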

Exercise 4.6

Step 3 changes: we decide that policy-stable is false only when the change in the policy is not due to exploration, i.e. only when the greedy part of the ε-soft policy changes (see the sketch below).

Step 2 changes: θ should not be set above the limit of the ε-soft method.

Step 1 changes: π should be well defined as an ε-soft policy, and ε should be given.
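As a concrete illustration of the step-3 change, here is a hedged sketch of an ε-soft policy-improvement step; the names Q, policy, and epsilon are assumptions of the sketch, not the book's notation:

```python
# Hedged sketch of the modified policy-improvement step for epsilon-soft
# policy iteration.  Q[s][a] are action values, policy[s][a] are action
# probabilities; these names are assumptions of the sketch.
def eps_soft_improvement(Q, policy, epsilon):
    policy_stable = True
    for s, q_s in Q.items():
        n_actions = len(q_s)
        old_greedy = max(policy[s], key=policy[s].get)   # previously favoured action
        greedy = max(q_s, key=q_s.get)                   # ties handled as in Exercise 4.4
        # every action keeps at least epsilon / |A(s)| probability (exploration)
        for a in q_s:
            policy[s][a] = epsilon / n_actions
        policy[s][greedy] += 1.0 - epsilon
        # exploration alone never sets policy-stable to false;
        # only a change of the greedy choice does
        if greedy != old_greedy:
            policy_stable = False
    return policy, policy_stable
```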


Exercise 4.7

Partial answer here. The programming implementation of DP is extremely time consuming. I did not get exactly the same answer as the book, for some unknown reason (algorithm differences, floating-point precision, etc.). Still, feel free to check it out and complete the whole picture! (Warning: it takes about 30 minutes to train.)
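To give a sense of why this DP is so slow, here is a hedged sketch of the expected update for a single (state, action) pair in the original car-rental problem of Example 4.2 (not the modified version of this exercise); the truncation cutoff and every name are assumptions of the sketch:

```python
# Hedged sketch of the expected return of one (state, action) pair in
# Jack's Car Rental (Example 4.2).  Constants follow the book; the
# truncation cutoff and all names are assumptions of this sketch.
from math import exp, factorial

GAMMA = 0.9
MAX_CARS = 20
RENT_REWARD = 10
MOVE_COST = 2
CUTOFF = 11          # truncate the Poisson tails here

def poisson(n, lam):
    return exp(-lam) * lam ** n / factorial(n)

def expected_return(state, action, V):
    """Expected return for moving `action` cars overnight from lot 1 to lot 2.

    `state` is (cars at lot 1, cars at lot 2); `V` is a (MAX_CARS+1) x (MAX_CARS+1)
    table of state values; `action` is assumed feasible for `state`.
    """
    total = -MOVE_COST * abs(action)
    cars1 = min(state[0] - action, MAX_CARS)
    cars2 = min(state[1] + action, MAX_CARS)
    # Four nested loops over rental requests and returns at both lots --
    # this expectation is what makes straightforward DP here so slow.
    for req1 in range(CUTOFF):
        for req2 in range(CUTOFF):
            for back1 in range(CUTOFF):
                for back2 in range(CUTOFF):
                    p = (poisson(req1, 3) * poisson(req2, 4) *
                         poisson(back1, 3) * poisson(back2, 2))
                    rented1 = min(cars1, req1)
                    rented2 = min(cars2, req2)
                    reward = (rented1 + rented2) * RENT_REWARD
                    next1 = min(cars1 - rented1 + back1, MAX_CARS)
                    next2 = min(cars2 - rented2 + back2, MAX_CARS)
                    total += p * (reward + GAMMA * V[next1][next2])
    return total
```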


Exercise 4.8

The gambler's problem has such a curious optimal policy because at a capital of 50 you can suddenly win with probability ph. Thus, the best policy bets everything when Capital = 50, and similarly at the capitals that can reach 50 with a single all-in bet, such as 25.

Think of a capital of 51 as 50 plus 1. Of course we could bet everything at 51, but the better policy is to see whether we can earn more from the extra 1 dollar. If this return g is positive, we have an extra g dollars and can keep betting it until we reach 75, where the next sudden winning chance appears. On the contrary, if we first bet 50 out of the 51, our chance of winning is only ph, and if we lose we give up the chance to reach 75; instead we have to struggle back to 25 with a single dollar, a much worse position.

Conclusion: the indicated optimal policy creates more chances to win and guarantees that the gambler is better off when he loses.

Exercise 4.9

Program here.

Plot A, Plot B, Plot C, Plot D (the four figures are not reproduced in this text version)

With proper thinking, you could easily recognize which plot is which.
(Think of human playing technique!)
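Not necessarily the linked program, but here is a minimal sketch of value iteration for the gambler's problem that produces plots like these; ph, theta, and the function name are assumptions of the sketch:

```python
# Minimal sketch of value iteration for the gambler's problem (Exercise 4.9).
# The +1 reward for reaching the goal is folded into V[goal] = 1 (undiscounted).
import numpy as np

def gambler_value_iteration(ph=0.4, goal=100, theta=1e-9):
    V = np.zeros(goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):                       # capitals 1..goal-1
            stakes = range(1, min(s, goal - s) + 1)
            best = max(ph * V[s + a] + (1 - ph) * V[s - a] for a in stakes)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy with ties broken toward the smallest stake
    policy = np.zeros(goal + 1, dtype=int)
    for s in range(1, goal):
        stakes = list(range(1, min(s, goal - s) + 1))
        values = [ph * V[s + a] + (1 - ph) * V[s - a] for a in stakes]
        policy[s] = stakes[int(np.argmax(values))]
    return V, policy
```

Running it with ph = 0.25 and ph = 0.55, as the exercise asks, gives the value functions and final policies behind the plots above.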


Exercise 4.10

qk+1(s, a) ≐ E[Rt+1 + γ max_{a′} qk(St+1, a′) | St = s, At = a]
           = Σ_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} qk(s′, a′) ]
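A minimal sketch of one synchronous sweep of this update, again assuming the MDP is given as p[s][a] = [(probability, next state, reward), ...]; the layout and names are my own:

```python
# Hedged sketch of one synchronous sweep of the action-value analogue of
# value iteration (Exercise 4.10).  p[s][a] is a list of (prob, next_state,
# reward) triples; terminal states should appear in Q with all-zero values.
def q_value_iteration_sweep(Q, p, gamma):
    new_Q = {s: dict(actions) for s, actions in Q.items()}
    for s in Q:
        for a in Q[s]:
            # q_{k+1}(s,a) = sum_{s',r} p(s',r|s,a) [ r + gamma * max_{a'} q_k(s',a') ]
            new_Q[s][a] = sum(prob * (r + gamma * max(Q[s2].values()))
                              for prob, s2, r in p[s][a])
    return new_Q
```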
