Reinforcement Learning
Reinforcement Learning
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary
Example Class
Reinforcement Learning:
Situation → Reward → Situation → Reward → …
Ping-pong:
Reward on each point scored
Animals:
Hunger and pain – negative reward
Food intake – positive reward
Framework: Agent in State Space
Example: XYZ-World (remark: no terminal states)
[Figure: XYZ-World, a state space of 10 states linked by deterministic actions (n, s, e, w, ne, nw, sw) plus a stochastic action x with outcome probabilities 0.3/0.7; rewards: R=+5 in state 3, R=+3 in state 5, R=-9 in state 6, R=+4 in state 8, R=-6 in state 9]
Problem: What actions should an agent choose to maximize its rewards?
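As a rough illustration of this framework, here is a minimal agent-environment loop sketch; the tiny reward table, transition table, and random policy below are hypothetical stand-ins, not the full XYZ-World.

import random

rewards = {1: 0, 2: 0, 3: +5, 5: +3, 6: -9}            # R(s); state 4 defaults to 0 below
transitions = {                                        # deterministic moves (illustrative only)
    1: {"e": 2}, 2: {"e": 3, "s": 5}, 3: {"s": 6},
    4: {"n": 1}, 5: {"w": 4, "ne": 6}, 6: {"w": 5},
}

def run_episode(start_state, policy, steps=20):
    # Follow a policy for a fixed number of steps and sum the collected rewards.
    s, total = start_state, 0
    for _ in range(steps):
        a = policy(s)                                  # agent chooses an action in state s
        s = transitions[s][a]                          # environment moves to the successor state
        total += rewards.get(s, 0)                     # ... and hands out that state's reward
    return total

random_policy = lambda s: random.choice(list(transitions[s].keys()))
print(run_episode(1, random_policy))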
XYZ-World: Discussion Problem 12 (TD for P vs. Bellman)
[Figure: XYZ-World annotated with value pairs (legend "Bellman TD P") at state 3: (3.3, 0.5), state 8: (3.2, -0.5), state 10: (0.6, -0.2)]
I tried hard, but: any better explanations?
Explanation of discrepancies between TD for P and Bellman (a TD sketch follows this list):
• Most significant discrepancies in states 3 and 8; minor in state 10
• P chooses the worst successor of state 8; it should apply operator x instead
• P should apply w in state 6, but does so in only 2/3 of the cases, which affects the utility of state 3
• The low utility value of state 8 in TD seems to lower the utility value of state 10; only a minor discrepancy
P: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1
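A minimal sketch of TD policy evaluation for a fixed trajectory such as the one P generates above, using the update convention U(s) ← U(s) + α(R(s) + g*U(s') - U(s)); the learning rate, replay scheme, and number of sweeps are illustrative assumptions.

from collections import defaultdict

rewards = {3: +5, 5: +3, 6: -9, 8: +4, 9: -6}   # R(s) for the rewarding XYZ-World states
trajectory = [1, 2, 3, 6, 5, 8, 6, 9, 10, 8, 6, 5, 7, 4, 1, 2, 5, 7, 4, 1]  # policy P above

def td_policy_evaluation(traj, gamma=0.2, alpha=0.1, sweeps=500):
    U = defaultdict(float)
    for _ in range(sweeps):                      # replay the sampled trajectory repeatedly
        for s, s_next in zip(traj, traj[1:]):
            U[s] += alpha * (rewards.get(s, 0) + gamma * U[s_next] - U[s])
    return dict(U)

print(td_policy_evaluation(trajectory))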
XYZ-World: Discussion Problem 12 --- Bellman Update, g=0.2
[Figure: XYZ-World with the Bellman-update utilities for g=0.2: U(1)=0.145, U(2)=0.72, U(3)=0.58, U(4)=0.03, U(5)=3.63, U(6)=-8.27, U(7)=0.001, U(8)=3.17, U(9)=-5.98, U(10)=0.63]
Discussion on using the Bellman Update for Problem 12 (a sketch of the update follows the list):
• No convergence for g=1.0; utility values seem to run away!
• State 3 has utility 0.58 although it gives a reward of +5, due to the immediate penalty that follows; we were able to detect that.
• Did anybody run the algorithm for other values of g, e.g. 0.4 or 0.6? If yes, did it converge to the same values?
• Speed of convergence seems to depend on the value of g.
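A minimal sketch of the Bellman update with discount g, iterated until the utilities stop changing; the three-state transition model at the bottom is hypothetical, since the full XYZ-World transition table is not reproduced on these slides.

def bellman_update(states, rewards, model, gamma=0.2, eps=1e-6):
    # U(s) <- R(s) + g * max_a sum_s' P(s'|s,a) * U(s'), swept over all states until convergence.
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * U[s2] for s2, p in outcomes) for outcomes in model[s].values())
            new_u = rewards.get(s, 0) + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < eps:        # converged; for g=1.0 this point may never be reached
            return U

# Hypothetical 3-state chain: action "go" moves right with probability 0.7, "stay" loops.
states = [1, 2, 3]
rewards = {3: +5}
model = {
    1: {"stay": [(1, 1.0)], "go": [(2, 0.7), (1, 0.3)]},
    2: {"stay": [(2, 1.0)], "go": [(3, 0.7), (2, 0.3)]},
    3: {"stay": [(3, 1.0)]},
}
print(bellman_update(states, rewards, model))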
XYZ-World: Discussion Problem 12 (TD vs. TD with inverse R)
[Figure: XYZ-World annotated with value pairs (legend "TD, TD inverse R") at state 3: (0.57, -0.65), state 5: (2.98, -2.99), state 8: (-0.50, 0.47), state 10: (-0.18, -0.12)]
Other observations:
• The Bellman update did not converge for g=1.
• The Bellman update converged very fast for g=0.2.
• Did anybody try other values for g (e.g. 0.6)?
• The Bellman update suggests a utility value of 3.6 for state 5; what does this tell us about the optimal policy? E.g., is 1-2-5-7-4-1 optimal?
• TD reversed the utility values quite neatly when the rewards were inverted; x became -x+u with u in [-0.08, 0.08].
• P: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1
XYZ-World --- Other Considerations
• R(s) might be known in advance or might have to be learned.
• R(s) might be probabilistic or deterministic.
• R(s) might change over time --- the agent has to adapt.
• The results of actions might be known in advance or might have to be learned; the results of actions can be fixed, or may change over time.
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary
[Figure: grid world with learned state utilities, e.g. (1,1) U = 0.72, (2,1) U = 0.68, …]
Learn to map states to utilities.
Bellman Equation (written out below)
[Figure: neighboring grid cells (1,3) and (2,3) with U(1,3) = 0.84 and U(2,3) = 0.92]
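A standard form of the Bellman equation named above, as a sketch in the notation used elsewhere in these slides (R(s) the reward of state s, γ the discount factor called g in the discussion, P(s'|s,a) the transition model):

U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a) \, U(s')

For neighboring cells such as (1,3) and (2,3), the equation ties U(1,3) = 0.84 to the utilities of the states reachable from (1,3), such as U(2,3) = 0.92.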
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary
Q-Learning update:
Q(a,s) ← Q(a,s) + α [ R(s) + γ * max_a' Q(a',s') - Q(a,s) ]
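A minimal Q-learning sketch that implements this update; the environment interface (ChainEnv with actions/reset/step) and the ε-greedy exploration scheme are illustrative assumptions, not part of the slides.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.2, epsilon=0.1):
    Q = defaultdict(float)                           # Q[(a, s)], initialized to 0
    for _ in range(episodes):
        s, r = env.reset()                           # current state s and its reward R(s)
        for _ in range(50):                          # bounded episode length (no terminal states)
            if random.random() < epsilon:            # explore occasionally ...
                a = random.choice(env.actions(s))
            else:                                    # ... otherwise act greedily w.r.t. Q
                a = max(env.actions(s), key=lambda act: Q[(act, s)])
            s2, r2 = env.step(s, a)                  # successor state s' and its reward
            best_next = max(Q[(a2, s2)] for a2 in env.actions(s2))
            Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])   # the update above
            s, r = s2, r2
    return Q

class ChainEnv:
    # Hypothetical 4-state chain: "right" moves toward state 4 (reward +5), "left" moves back.
    def actions(self, s):
        return ["left", "right"]
    def reset(self):
        return 1, 0
    def step(self, s, a):
        s2 = min(s + 1, 4) if a == "right" else max(s - 1, 1)
        return s2, (+5 if s2 == 4 else 0)

Q = q_learning(ChainEnv())
print(Q[("right", 3)], Q[("left", 3)])               # "right" should look better near the reward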
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary
• Introduction
• Passive Reinforcement Learning
• Temporal Difference Learning
• Active Reinforcement Learning
• Applications
• Summary