
Temporal Difference Methods

CS60077: Reinforcement Learning

Abir Das

IIT Kharagpur

Sept 24, 30, Oct 01, 07, 2021


Agenda

§ Understand the incremental computation of Monte Carlo methods.
§ From incremental Monte Carlo methods, the journey will take us to different Temporal Difference (TD) based methods.
Resources

§ Reinforcement Learning by Udacity [Link]
§ Reinforcement Learning by Balaraman Ravindran [Link]
§ Reinforcement Learning by David Silver [Link]
§ SB: Chapter 6
MRP Evaluation - Model Based

§ Like the previous approaches, here also we are going to first look at the evaluation problem using TD methods and then, later, we will do TD control.
§ Let us take an MRP. Why an MRP?

[Figure: an MRP with states S1, S2, S3, S4, S5 and terminal state SF. S1 → S3 with reward +1; S2 → S3 with reward +2; from S3 the reward is +0 and the transition goes to S4 with probability 0.9 and to S5 with probability 0.1; S4 → SF with reward +1; S5 → SF with reward +10.]

§ Find V(S3), given γ = 1
§ V(SF) = 0
§ Then V(S4) = 1 + 1 × 0 = 1, V(S5) = 10 + 1 × 0 = 10
§ Then V(S3) = 0 + 1 × (0.9 × 1 + 0.1 × 10) = 1.9
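To make the backup above concrete, here is a minimal Python sketch of the same one-step Bellman evaluation, assuming the transition structure read off the figure; the state names and the `model` dictionary encoding are illustrative choices, not given in the slides.

```python
# Model-based evaluation of the example MRP (gamma = 1).
# Each entry: state -> list of (probability, reward, next_state).
model = {
    "S1": [(1.0, 1, "S3")],
    "S2": [(1.0, 2, "S3")],
    "S3": [(0.9, 0, "S4"), (0.1, 0, "S5")],
    "S4": [(1.0, 1, "SF")],
    "S5": [(1.0, 10, "SF")],
}

gamma = 1.0
V = {"SF": 0.0}

# Back up from the terminal state: V(s) = sum_i p_i * (r_i + gamma * V(s'_i)).
for s in ["S5", "S4", "S3", "S2", "S1"]:
    V[s] = sum(p * (r + gamma * V[s_next]) for p, r, s_next in model[s])

print(V["S3"])  # 1.9
print(V["S2"])  # 3.9, the 'true value of s2' referred to later in these slides
```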
MRP Evaluation - Monte Carlo

§ Now let us think about how to get the values from 'experience', without knowing the model.
§ Let's say we have the following samples/episodes.

[Figure: the same MRP, shown alongside five sampled episodes:]
Episode 1: S1 --(+1)--> S3 --(+0)--> S4 --(+1)--> SF
Episode 2: S1 --(+1)--> S3 --(+0)--> S5 --(+10)--> SF
Episode 3: S1 --(+1)--> S3 --(+0)--> S4 --(+1)--> SF
Episode 4: S1 --(+1)--> S3 --(+0)--> S4 --(+1)--> SF
Episode 5: S2 --(+2)--> S3 --(+0)--> S5 --(+10)--> SF

§ What is the estimated value of V(S1) after 3 episodes? After 4 episodes?
§ After 3 episodes: ((1+0+1) + (1+0+10) + (1+0+1)) / 3 = 5.0
§ After 4 episodes: ((1+0+1) + (1+0+10) + (1+0+1) + (1+0+1)) / 4 = 4.25
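A small sketch of the same first-visit Monte Carlo average, with the five episodes written as (state, reward-on-leaving-that-state) pairs; this list encoding is my own and only mirrors the episodes listed above.

```python
# First-visit Monte Carlo estimate of V(S1) from the sampled episodes above
# (gamma = 1, so the return is just the sum of rewards after the first visit).
episodes = [
    [("S1", 1), ("S3", 0), ("S4", 1)],   # episode 1: (state, reward on leaving it)
    [("S1", 1), ("S3", 0), ("S5", 10)],  # episode 2
    [("S1", 1), ("S3", 0), ("S4", 1)],   # episode 3
    [("S1", 1), ("S3", 0), ("S4", 1)],   # episode 4
    [("S2", 2), ("S3", 0), ("S5", 10)],  # episode 5
]

def mc_estimate(state, n_episodes):
    """Average return observed from `state` over the first n_episodes."""
    returns = []
    for ep in episodes[:n_episodes]:
        states = [s for s, _ in ep]
        if state in states:
            i = states.index(state)                  # first visit
            returns.append(sum(r for _, r in ep[i:]))
    return sum(returns) / len(returns)

print(mc_estimate("S1", 3))  # 5.0
print(mc_estimate("S1", 4))  # 4.25
```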
Incremental Monte Carlo

§ Next we are going to see how we can 'incrementally' compute an estimate for the value of a state given the previous estimate, i.e., given the estimate after 3 episodes, how do we get the estimate after 4 episodes, and so on.
§ Let V_{T−1}(S_1) be the estimate of the value function at state S_1 after the (T−1)th episode.
§ Let the return (or total discounted reward) of the Tth episode be G_T(S_1).
§ Then,

$$V_T(S_1) = \frac{V_{T-1}(S_1)\,(T-1) + G_T(S_1)}{T} = \frac{T-1}{T}\,V_{T-1}(S_1) + \frac{1}{T}\,G_T(S_1) = V_{T-1}(S_1) + \alpha_T\bigl(G_T(S_1) - V_{T-1}(S_1)\bigr), \qquad \alpha_T = \frac{1}{T}$$
Incremental Monte Carlo

$$V_T(S_1) = V_{T-1}(S_1) + \alpha_T\bigl(G_T(S_1) - V_{T-1}(S_1)\bigr), \qquad \alpha_T = \frac{1}{T}$$

§ Think of T as time, i.e., you are drawing sample trajectories and getting the (T−1)th episode at time T−1, the Tth episode at time T, and so on.
§ Then we are looking at a 'temporal difference': the 'update' to the value of S_1 is the difference between the return G_T(S_1) at step T and the estimate V_{T−1}(S_1) at the previous time step T−1.
§ As we get more and more episodes, the learning rate α_T gets smaller and smaller, so we make smaller and smaller changes.
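A tiny sketch of this incremental update with α_T = 1/T; it reproduces the running Monte Carlo averages of V(S1) computed earlier (returns 2, 11, 2, 2 for the four episodes that start in S1).

```python
# Incremental Monte Carlo update: V <- V + alpha_T * (G_T - V), with alpha_T = 1/T.
returns_S1 = [2, 11, 2, 2]   # returns of the four episodes that start in S1 (gamma = 1)

V = 0.0
for T, G in enumerate(returns_S1, start=1):
    alpha = 1.0 / T
    V = V + alpha * (G - V)
    print(T, V)   # 1 2.0, 2 6.5, 3 5.0, 4 4.25
```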
Properties of Learning Rate

§ This learning falls under a general learning rule: the value at time T = the value at time T−1 + some learning rate × (the difference between what you get and what you expected to get).

$$V_T(S_1) = V_{T-1}(S_1) + \alpha_T\bigl(G_T(S_1) - V_{T-1}(S_1)\bigr)$$

§ In the limit, the estimate converges to the true value, i.e., $\lim_{T \to \infty} V_T(S) = V(S)$, given two conditions that the learning rate sequence has to obey:
  I. $\sum_T \alpha_T = \infty$
  II. $\sum_T \alpha_T^2 < \infty$
Properties of Learning Rate

§ Let us see what $\sum_{T=1}^{\infty} \frac{1}{T}$ is.
§ It is $1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots$ What is it known as? The harmonic series.
§ Does it converge? No.

$$1 + \tfrac{1}{2} + \tfrac{1}{3} + \tfrac{1}{4} + \tfrac{1}{5} + \tfrac{1}{6} + \tfrac{1}{7} + \tfrac{1}{8} + \tfrac{1}{9} + \cdots \;>\; 1 + \tfrac{1}{2} + \underbrace{\tfrac{1}{4} + \tfrac{1}{4}}_{1/2} + \underbrace{\tfrac{1}{8} + \tfrac{1}{8} + \tfrac{1}{8} + \tfrac{1}{8}}_{1/2} + \tfrac{1}{16} + \cdots \;=\; 1 + \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} + \cdots \;=\; \infty$$
Properties of Learning Rate

§ A generalization of the harmonic series is the p-series (or hyperharmonic series), defined as $\sum_{n=1}^{\infty} \frac{1}{n^p}$, for any positive real number p.
§ The p-series converges for all p > 1 (in which case it is called the over-harmonic series) and diverges for all p ≤ 1.
§ So, according to these rules, let's see if the following α_T's result in a converging algorithm.

α_T          Σ_T α_T    Σ_T α_T²    Algo Converges
1/T²         < ∞        < ∞         No
1/T          ∞          < ∞         Yes
1/T^(2/3)    ∞          < ∞         Yes
1/T^(1/2)    ∞          ∞           No
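Using the stated p-series rule, a small helper can classify learning rates of the form α_T = 1/T^p; the function is only a restatement of that rule in code (its name and the use of `Fraction` are my own), and it reproduces the table above.

```python
from fractions import Fraction

def converges(p):
    """For alpha_T = 1/T**p, apply the p-series rule from the slide:
    sum alpha_T diverges iff p <= 1, and sum alpha_T**2 converges iff 2*p > 1."""
    sum_alpha_infinite = p <= 1          # condition I:  sum_T alpha_T = infinity
    sum_alpha_sq_finite = 2 * p > 1      # condition II: sum_T alpha_T^2 < infinity
    return sum_alpha_infinite and sum_alpha_sq_finite

for p in [Fraction(2), Fraction(1), Fraction(2, 3), Fraction(1, 2)]:
    print(f"alpha_T = 1/T^({p}): converges -> {converges(p)}")
# 1/T^2 -> False, 1/T -> True, 1/T^(2/3) -> True, 1/T^(1/2) -> False
```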
TD(1)

Algorithm 1: TD(1)
initialization: Episode No. T ← 1
repeat
    foreach s ∈ S do
        initialize e(s) = 0            // e(s) is called the 'eligibility' of state s
        V_T(s) = V_{T−1}(s)            // same as in the previous episode
    t ← 1
    repeat
        After state transition s_{t−1} --R_t--> s_t
        e(s_{t−1}) = e(s_{t−1}) + 1    // update state eligibility
        foreach s ∈ S do
            V_T(s) ← V_{T−1}(s) + α_T (R_t + γ V_{T−1}(s_t) − V_{T−1}(s_{t−1})) e(s)
            e(s) = γ e(s)
        t ← t + 1
    until this episode terminates
    T ← T + 1
until all episodes are done
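For concreteness, a minimal Python sketch of this eligibility-trace version of TD(1); the function signature and the way an episode is passed in are my own choices, not from the slides. Note that within an episode the per-step increments accumulate into the new estimate, as the worked example on the following slides shows.

```python
def td1_episode(V, episode, alpha, gamma=1.0):
    """One episode of TD(1) with eligibility traces.

    V       : dict state -> current value estimate (V_{T-1}); updated in place to V_T
    episode : list of (s_prev, reward, s_next) transitions
    alpha   : learning rate alpha_T for this episode
    """
    V_prev = dict(V)                       # freeze V_{T-1}; TD errors use the old values
    e = {s: 0.0 for s in V}                # eligibility of every state
    for s_prev, R, s_next in episode:
        e[s_prev] += 1.0
        delta = R + gamma * V_prev.get(s_next, 0.0) - V_prev[s_prev]
        for s in V:
            V[s] += alpha * delta * e[s]   # increments accumulate over the episode
            e[s] *= gamma                  # TD(1): traces decay by gamma only
    return V

# Example: the fifth episode of the MRP above, with alpha = 1 and gamma = 1.
V = {"S1": 0.0, "S2": 0.0, "S3": 0.0, "S4": 0.0, "S5": 0.0, "SF": 0.0}
td1_episode(V, [("S2", 2, "S3"), ("S3", 0, "S5"), ("S5", 10, "SF")], alpha=1.0)
print(V["S2"])   # 12.0, matching the V(s2) computation on the TD(1) example slide
```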
TD(1) Example

§ Let us try to walk through the pseudocode with the help of a very small example.

[Figure: a chain s0 --R1--> s1 --R2--> s2 --R3--> sF, with eligibilities e(s0) = 0, e(s1) = 0, e(s2) = 0.]

§ Now, as a result of the transition from s0 to s1, the eligibilities change to e(s0) = 1, e(s1) = 0, e(s2) = 0.
§ Now we loop through all the states and apply the TD update (R1 + γV_{T−1}(s1) − V_{T−1}(s0)), in proportion to the eligibility of each state and the learning rate, to all the states.
  ▸ V_T(s0) = α_T (R1 + γV_{T−1}(s1) − V_{T−1}(s0))
  ▸ V_T(s1) = 0
  ▸ V_T(s2) = 0
TD(1) Example

§ Now the transition from s1 to s2 happens and the eligibilities become e(s0) = γ, e(s1) = 1, e(s2) = 0.
§ The temporal difference is R2 + γV_{T−1}(s2) − V_{T−1}(s1).
  ▸ V_T(s0) = α_T (R1 + γV_{T−1}(s1) − V_{T−1}(s0)) + γ α_T (R2 + γV_{T−1}(s2) − V_{T−1}(s1)) = α_T (R1 + γR2 + γ² V_{T−1}(s2) − V_{T−1}(s0))
  ▸ V_T(s1) = α_T (R2 + γV_{T−1}(s2) − V_{T−1}(s1))
  ▸ V_T(s2) = 0
TD(1) Example

§ Now the transition from s2 to sF happens and the eligibilities become e(s0) = γ², e(s1) = γ, e(s2) = 1.
§ The temporal difference is R3 + γV_{T−1}(sF) − V_{T−1}(s2).
  ▸ V_T(s0) = α_T (R1 + γR2 + γ² V_{T−1}(s2) − V_{T−1}(s0)) + α_T γ² (R3 + γV_{T−1}(sF) − V_{T−1}(s2)) = α_T (R1 + γR2 + γ² R3 + γ³ V_{T−1}(sF) − V_{T−1}(s0))
  ▸ V_T(s1) = α_T (R2 + γV_{T−1}(s2) − V_{T−1}(s1)) + α_T γ (R3 + γV_{T−1}(sF) − V_{T−1}(s2)) = α_T (R2 + γR3 + γ² V_{T−1}(sF) − V_{T−1}(s1))
  ▸ V_T(s2) = α_T (R3 + γV_{T−1}(sF) − V_{T−1}(s2))
  ▸ So, some pattern is emerging: each state's accumulated update is α_T times the full discounted return observed from that state minus its old estimate.
TD(1) Example

§ Let us try to apply TD(1) to our starting MRP.

[Figure: the MRP and the five sampled episodes from before; the fifth episode is S2 --(+2)--> S3 --(+0)--> S5 --(+10)--> SF.]

§ s2 is seen only once. So, V(s2) will be computed from this episode only: V(s2) = α_T (2 + γ·0 + γ²·10 + γ³·V(sF) − V(s2)) = 1 × 12 = 12, using V(sF) = 0 and the initial V(s2) = 0.
§ γ is taken to be 1 for easy computation (and α_T = 1 here).
TD(1) Example

§ What is the maximum likelihood estimate?

[Figure: the MRP and the five sampled episodes from before.]

§ Estimated state transition probabilities:
  ▸ s3 → s4: 3/5 = 0.6
  ▸ s3 → s5: 2/5 = 0.4
§ So,
  ▸ V(SF) = 0
  ▸ Then V(S4) = 1 + 1 × 0 = 1, V(S5) = 10 + 1 × 0 = 10
  ▸ Then V(S3) = 0 + 1 × (0.6 × 1 + 0.4 × 10) = 4.6
  ▸ and V(S2) = 2 + 1 × 4.6 = 6.6
§ The true value of state s2, found when the true transition probabilities are known, is 3.9.
TD(1) Analysis

§ One reason why the TD(1) estimate is far off is that we only used one of the five trajectories to propagate information, whereas the maximum likelihood estimate used information from all 5 trajectories.
§ So, TD(1) suffers when a rare event occurs in a run (s3 → s5 → sF); then the estimate can be far off.
§ We will try to shore up some of these issues next.
TD(0)

§ Let us look at the TD(1) update rule more carefully.

$$V_T(s) \leftarrow V_{T-1}(s) + \alpha_T\bigl(R_t + \gamma V_{T-1}(s_t) - V_{T-1}(s_{t-1})\bigr)\,e(s)$$

§ Let us change only a few terms in the above rule.

$$V_T(s_{t-1}) \leftarrow V_{T-1}(s_{t-1}) + \alpha_T\bigl(R_t + \gamma V_{T-1}(s_t) - V_{T-1}(s_{t-1})\bigr)$$

§ What would we expect this outcome to be on average?
§ The random thing here is the state s_t. We are in some state s_{t−1} and we make a transition; we don't really know where we are going to end up. There is some probability involved in that.
§ So, ignoring α_T for the time being, the expected value of the above modified rule is E_{s_t}[R_t + γV_{T−1}(s_t)], which is basically averaging after sampling different possible s_t values.
§ This is what the maximum likelihood estimate is also doing.
TD(1) and TD(0)

Algorithm 2: TD(1)
initialization: Episode No. T ← 1
repeat
    foreach s ∈ S do
        initialize e(s) = 0
        V_T(s) = V_{T−1}(s)
    t ← 1
    repeat
        After state transition s_{t−1} --R_t--> s_t
        e(s_{t−1}) = e(s_{t−1}) + 1
        foreach s ∈ S do
            V_T(s) ← V_{T−1}(s) + α_T (R_t + γ V_{T−1}(s_t) − V_{T−1}(s_{t−1})) e(s)
            e(s) = γ e(s)
        t ← t + 1
    until this episode terminates
    T ← T + 1
until all episodes are done

Algorithm 3: TD(0)
initialization: Episode No. T ← 1
repeat
    foreach s ∈ S do
        V_T(s) = V_{T−1}(s)
    t ← 1
    repeat
        After s_{t−1} --R_t--> s_t
        for s = s_{t−1} do
            V_T(s) ← V_{T−1}(s) + α_T (R_t + γ V_{T−1}(s_t) − V_{T−1}(s_{t−1}))
        t ← t + 1
    until this episode terminates
    T ← T + 1
until all episodes are done
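A matching per-episode TD(0) sketch in the same style as the TD(1) function above; for simplicity it bootstraps from the running estimate rather than freezing V_{T−1} for the whole episode, which is how TD(0) is usually implemented in practice.

```python
def td0_episode(V, episode, alpha, gamma=1.0):
    """One episode of TD(0): only the state just left is updated at each step.

    V       : dict state -> value estimate, updated in place
    episode : list of (s_prev, reward, s_next) transitions
    """
    for s_prev, R, s_next in episode:
        td_error = R + gamma * V.get(s_next, 0.0) - V[s_prev]
        V[s_prev] += alpha * td_error   # bootstrap from the current estimate of s_next
    return V
```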
TD(λ)

Algorithm 4: TD(λ)
initialization: Episode No. T ← 1
repeat
    foreach s ∈ S do
        initialize e(s) = 0
        V_T(s) = V_{T−1}(s)
    t ← 1
    repeat
        After state transition s_{t−1} --R_t--> s_t
        e(s_{t−1}) = e(s_{t−1}) + 1
        foreach s ∈ S do
            V_T(s) ← V_{T−1}(s) + α_T (R_t + γ V_{T−1}(s_t) − V_{T−1}(s_{t−1})) e(s)
            e(s) = λ γ e(s)
        t ← t + 1
    until this episode terminates
    T ← T + 1
until all episodes are done
K-Step Estimators

§ For some convenience in later analysis, let us change the time index by adding 1 everywhere. Thus, the TD(0) update rule becomes,

$$V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(R_{t+1} + \gamma V(s_{t+1}) - V(s_t)\bigr)$$

§ The interpretation remains the same, i.e., we estimate the value of the state (s_t) that we are just leaving by moving a little bit (α_T) in the direction of the immediate reward (R_{t+1}) plus the discounted estimated value of the state (V(s_{t+1})) we just landed in, minus the value of the state (V(s_t)) we just left.
§ This is a one-step look-ahead, or a one-step estimator. Let's call it E_1.
§ Similarly, a two-step estimator (E_2) is,

$$V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(R_{t+1} + \gamma R_{t+2} + \gamma^2 V(s_{t+2}) - V(s_t)\bigr)$$
K-Step Estimators

$$\begin{aligned}
E_1 &: V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(R_{t+1} + \gamma V(s_{t+1}) - V(s_t)\bigr)\\
E_2 &: V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(R_{t+1} + \gamma R_{t+2} + \gamma^2 V(s_{t+2}) - V(s_t)\bigr)\\
E_3 &: V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(s_{t+3}) - V(s_t)\bigr)\\
&\;\;\vdots\\
E_k &: V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(R_{t+1} + \cdots + \gamma^{k-1} R_{t+k} + \gamma^k V(s_{t+k}) - V(s_t)\bigr)\\
E_\infty &: V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(R_{t+1} + \cdots + \gamma^{k-1} R_{t+k} + \cdots - V(s_t)\bigr)
\end{aligned}$$

§ E_1 is basically TD(0) and E_∞ is TD(1).
§ Next we will relate these estimators to TD(λ), which will be a weighted combination of all these infinitely many estimators.
K-Step Estimators and TD(λ)

Estimator    Weight           λ = 0    λ = 1
E_1          1 − λ            1        0
E_2          λ(1 − λ)         0        0
E_3          λ²(1 − λ)        0        0
E_k          λ^(k−1)(1 − λ)   0        0
E_∞          λ^∞              0        1

§ The idea is that when we are updating the value of a state V(s) using any of the TD(λ) methods, all the estimators give their preferences to what the value update should be.
§ Checking that the sum of weights is 1 (for 0 ≤ λ < 1):

$$\sum_{k=1}^{\infty} \lambda^{k-1}(1-\lambda) = (1-\lambda)\sum_{k=1}^{\infty}\lambda^{k-1} = (1-\lambda)\,\frac{1}{1-\lambda} = 1$$
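A quick numeric sanity check (truncated at a large k) that the geometric weights above sum to 1; purely illustrative.

```python
# Sanity check: the TD(lambda) weights (1 - lam) * lam**(k-1) sum to 1 for 0 <= lam < 1.
def weight(k, lam):
    return (1 - lam) * lam ** (k - 1)

for lam in [0.0, 0.5, 0.9]:
    total = sum(weight(k, lam) for k in range(1, 10_000))   # truncated sum
    print(lam, round(total, 6))   # ~1.0 in each case; with lam = 0 only E_1 gets weight 1
```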
Good Value of λ

[Figure]
Unified View: Temporal-Difference Backup

$$V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(R_{t+1} + \gamma V(s_{t+1}) - V(s_t)\bigr)$$

[Figure: TD backup diagram. Figure credit: David Silver, DeepMind]

§ Use of 'sample backups' and 'bootstrapping'.
Unified View: Dynamic Programming Backup

$$v_\pi \;\dot{=}\; v^{(k+1)}(s) \leftarrow \sum_{a\in\mathcal{A}} \pi(a|s)\Bigl\{ r(s,a) + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\,v^{(k)}(s') \Bigr\}$$

[Figure: DP backup diagram. Figure credit: David Silver, DeepMind]

§ Use of 'full backups' and 'bootstrapping'.
Unified View: Monte-Carlo Backup

$$V(s_t) \leftarrow V(s_t) + \alpha_T\bigl(G_t - V(s_t)\bigr)$$

[Figure: Monte-Carlo backup diagram. Figure credit: David Silver, DeepMind]

§ Use of 'sample backups' and no 'bootstrapping'.
TD Control

§ We will now see how TD estimation can be used in control.
§ This is mostly like generalized policy iteration (GPI), where one maintains both an approximate policy and an approximate value function.
§ Policy evaluation is done as TD evaluation.
§ Then, we can do greedy policy improvement.
§ What is the problem? Remember the MC lectures!

$$\pi'(s) \doteq \arg\max_{a\in\mathcal{A}} \Bigl\{ r(s,a) + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\,v_\pi(s') \Bigr\}$$
TD Control

§ Greedy policy improvement over v(s) requires a model of the MDP:

$$\pi'(s) \doteq \arg\max_{a\in\mathcal{A}} \Bigl\{ r(s,a) + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\,v_\pi(s') \Bigr\}$$

§ Greedy policy improvement over Q(s, a) is model-free:

$$\pi'(s) \doteq \arg\max_{a\in\mathcal{A}} Q(s,a)$$

§ How can we do TD policy evaluation for Q(s, a)?
§ The TD(0) update rule for V(s) is,

$$V_T(s_t) \leftarrow V_{T-1}(s_t) + \alpha_T\bigl(R_{t+1} + \gamma V_{T-1}(s_{t+1}) - V_{T-1}(s_t)\bigr)$$

§ The TD(0) update rule for Q(s, a) is similar,

$$Q_T(s_t, a_t) \leftarrow Q_{T-1}(s_t, a_t) + \alpha_T\bigl(R_{t+1} + \gamma Q_{T-1}(s_{t+1}, a_{t+1}) - Q_{T-1}(s_t, a_t)\bigr)$$
TD Control

§ Let us spend some time on the update equation.

$$Q_T(s_t, a_t) \leftarrow Q_{T-1}(s_t, a_t) + \alpha_T\bigl(R_{t+1} + \gamma Q_{T-1}(s_{t+1}, a_{t+1}) - Q_{T-1}(s_t, a_t)\bigr)$$

§ What we really want in place of Q_{T−1}(s_{t+1}, a_{t+1}) is V_{T−1}(s_{t+1}).
§ So, why is using Q_{T−1}(s_{t+1}, a_{t+1}) in place of V_{T−1}(s_{t+1}) fine?
§ Remember that V(s) = E_a[Q(s, a)] = Σ_{a∈A} π(a|s) Q(s, a).
§ So instead of taking the expectation, we replace it with one sample. If we take enough samples, this will eventually converge to V(s).
§ But think carefully again: could we not have taken the expectation instead?
TD Control

§ Like the MC control algorithms, we would use ε-soft policies, e.g., ε-greedy policies, for exploration here.

Algorithm 5: On-policy TD Control
Parameters: learning rate α ∈ (0, 1], small ε > 0
Initialization: Q(s, a), ∀s ∈ S, a ∈ A, arbitrarily, except Q(terminal, ·) = 0
repeat
    t ← 0, choose s_t, i.e., s_0
    Pick a_t according to Q(s_t, ·) (e.g., ε-greedy)
    repeat
        Apply action a_t from s_t, observe R_{t+1} and s_{t+1}
        Pick a_{t+1} according to Q(s_{t+1}, ·) (e.g., ε-greedy)
        Q(s_t, a_t) ← Q(s_t, a_t) + α (R_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
        t ← t + 1
    until this episode terminates
until all episodes are done

§ Any guess for the name of this algorithm? (It is SARSA: the update uses the tuple (s_t, a_t, R_{t+1}, s_{t+1}, a_{t+1}).)
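A compact Python sketch of this on-policy control loop with an ε-greedy policy; the environment interface (`env.reset()` and `env.step(a)` returning `(next_state, reward, done)`) is an assumed, gym-like convention, not something defined in the slides.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps):
    """Pick a random action with probability eps, otherwise a greedy one w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)                      # Q(terminal, .) stays 0: it is never updated
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)  # action chosen BEFORE the update (on-policy)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next               # a_next is the action actually taken next
    return Q
```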
SARSA Example

§ The windy-gridworld example is taken from SB [Chapter 6].
§ A standard gridworld with start and goal states, but with an upward wind through the middle of the grid. The strength of the wind is given below each column.
§ Actions are the standard four: left, right, up, down. It is an undiscounted episodic task, with a constant reward of −1 until the goal state is reached.
SARSA Variants

§ Coming back to the question of taking the expectation over Q values: this gives what is called Expected SARSA.

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\Bigl(R_{t+1} + \gamma \sum_{a\in\mathcal{A}} \pi(a|s_{t+1})\,Q(s_{t+1}, a) - Q(s_t, a_t)\Bigr)$$

§ Also, can we think of sample backups but no bootstrapping? This will be more like MC control. The TD error term is,

$$R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{k-1} R_{t+k} + \cdots - Q(s_t, a_t)$$

§ Can we also, in the same way, think of a spectrum of algorithms in between TD(0) and TD(1), a.k.a. MC?
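A small sketch contrasting the SARSA target with the Expected SARSA target for a single transition; `policy_probs(state)`, returning a dictionary of action probabilities (e.g., the ε-greedy distribution), is an assumed helper, not from the slides.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """SARSA: bootstrap from the single sampled next action."""
    return r + gamma * Q[(s_next, a_next)]

def expected_sarsa_target(Q, r, s_next, policy_probs, gamma):
    """Expected SARSA: bootstrap from the expectation of Q over the policy at s_next."""
    expected_q = sum(p * Q[(s_next, a)] for a, p in policy_probs(s_next).items())
    return r + gamma * expected_q

# Either target is then used in the same update:
#   Q[(s, a)] += alpha * (target - Q[(s, a)])
```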
k-step SARSA

§ Let us define the k-step Q-return as,

$$Q_t^{(k)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{k-1} R_{t+k} + \gamma^k Q(s_{t+k}, a_{t+k})$$

§ Consider the following k-step returns for k = 1, 2, ⋯, ∞:

$$\begin{aligned}
k=1 &: Q_t^{(1)} = R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) \quad \text{(SARSA)}\\
k=2 &: Q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(s_{t+2}, a_{t+2})\\
k=3 &: Q_t^{(3)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 Q(s_{t+3}, a_{t+3})\\
&\;\;\vdots\\
k=k &: Q_t^{(k)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{k-1} R_{t+k} + \gamma^k Q(s_{t+k}, a_{t+k})\\
k=\infty &: Q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{k-1} R_{t+k} + \cdots
\end{aligned}$$

§ k-step SARSA updates Q(s, a) towards the k-step Q-return:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\bigl(Q_t^{(k)} - Q(s_t, a_t)\bigr)$$
SARSA(λ)

§ The Q^λ return combines all k-step Q-returns Q_t^(k).
§ Using weight (1 − λ)λ^(k−1),

$$Q_t^{\lambda} = (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} Q_t^{(k)}$$

§ The update equation for SARSA(λ) is,

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\bigl(Q_t^{\lambda} - Q(s_t, a_t)\bigr)$$

[Figure credit: David Silver, DeepMind]
SARSA(λ)

§ Just like TD(λ) evaluation, SARSA(λ) control uses the concept of eligibility in the implementation.
§ In TD(λ) evaluation we had an eligibility trace for each state; for SARSA(λ) control we will have an eligibility trace for each state-action pair.
§ Let's say we get a reward at the end of some step. What the eligibility trace says is that the credit for the reward should trickle back, in proportion, all the way to the first state. The credit should be more for the state-action pairs that were close to the rewarding step, and also for those state-action pairs that were visited frequently along the way.
§ Q(s, a) is updated for every state and action in proportion to the TD error and the eligibility of the state-action pair.
SARSA(λ) Algorithm

[Figure. Figure credit: David Silver, DeepMind]
SARSA(λ) Gridworld Example

[Figure. Figure credit: David Silver, DeepMind]
TD Control

§ The SARSA update rule is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\Bigl(\underbrace{R_{t+1} + \gamma Q(s_{t+1}, a_{t+1})}_{\text{TD Target}} - Q(s_t, a_t)\Bigr)$$

§ The TD target gives a one-step estimate of the Q function. The optimal Q function gives the long-term expected reward for taking action a_t at state s_t and then behaving optimally thereafter.
§ Going back to the MDP slides:

[Figure: backup diagrams for q∗(s, a).]

$$q_*(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\,v_*(s') \qquad\qquad q_*(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\,\max_{a'\in\mathcal{A}} q_*(s',a')$$
Revisiting Bellman equations

§ SARSA:

$$q_\pi(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a) \Bigl\{ \sum_{a'\in\mathcal{A}} \pi(a'|s')\,q_\pi(s',a') \Bigr\}$$
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\bigl(R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\bigr)$$

§ Q-learning:

$$q_*(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} p(s'|s,a)\,\max_{a'\in\mathcal{A}} q_*(s',a')$$
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\bigl(R_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\bigr)$$
Q-learning

Algorithm 6: Off-policy TD Control
Parameters: learning rate α ∈ (0, 1], small ε > 0
Initialization: Q(s, a), ∀s ∈ S, a ∈ A, arbitrarily, except Q(terminal, ·) = 0
repeat
    t ← 0, choose s_t, i.e., s_0
    repeat
        Pick a_t according to Q(s_t, ·) (e.g., ε-greedy)
        Apply action a_t from s_t, observe R_{t+1} and s_{t+1}
        Q(s_t, a_t) ← Q(s_t, a_t) + α (R_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t))
        t ← t + 1
    until this episode terminates
until all episodes are done

§ Note the differences from SARSA. Why is it off-policy?
§ The next action is picked after the update here; in SARSA the next action was picked before the update. The update target uses max_{a'} Q(s_{t+1}, a') regardless of which action the (ε-greedy) behaviour policy actually takes next.
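A Python sketch of this off-policy loop, using the same assumed gym-like environment interface as the SARSA sketch above; note the max over next actions in the target, independent of the ε-greedy action actually taken next.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes, alpha=0.1, gamma=1.0, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. the current Q.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Target policy: greedy (max over actions), independent of the next action taken.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```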
Q-learning

§ In essence, SARSA picks actions from old Q's and Q-learning picks actions from new Q's.
§ Since Q-learning updates the Q values by maximizing over all possible actions, getting the states from a trajectory is not necessary.
§ Advantage? Asynchronous update.
§ Disadvantage of arbitrarily choosing states for update? Like we saw in RTDP, making updates along a trajectory makes sure that the state-action pairs that are visited frequently, i.e., the state-action pairs that are important, get to their optimal values quickly.
§ Q-learning generally learns faster than SARSA. This may be because Q-learning updates only when it finds a better move, whereas SARSA uses the estimate of the next action's value in its target; that value therefore changes every time an exploratory action is taken.
§ There are also some undesirable situations for Q-learning.
Q-learning

[Figure. Figure credit: SB, Chapter 6]
