06 TD Methods
Abir Das
IIT Kharagpur
Agenda
§ Introduction
§ TD Evaluation
§ TD Control
Resources
[Figure: an example MDP with states S1, ..., S5 and a terminal state SF. S1 → S3 with reward +1 and S2 → S3 with reward +2; from S3 the process moves to S4 with probability 0.9 and to S5 with probability 0.1 (reward 0); S4 → SF with reward +1 and S5 → SF with reward +10.]
§ Find V(S3), given γ = 1
§ V(SF) = 0
§ Then V(S4) = 1 + 1 × 0 = 1, V(S5) = 10 + 1 × 0 = 10
§ Then V(S3) = 0 + 1 × (0.9 × 1 + 0.1 × 10) = 1.9
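A quick numerical check of these one-step backups, as a minimal Python sketch (the state names follow the figure above):

# One-step backups for the example MDP, gamma = 1.
gamma = 1.0
V = {"SF": 0.0}
V["S4"] = 1 + gamma * V["SF"]                           # reward +1 on S4 -> SF
V["S5"] = 10 + gamma * V["SF"]                          # reward +10 on S5 -> SF
V["S3"] = 0 + gamma * (0.9 * V["S4"] + 0.1 * V["S5"])   # expectation over S3's successors
print(V["S3"])                                          # 1.9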
§ Now let us think about how to get the values from ‘experience’
without knowing the model.
§ Let’s say we have the following samples/episodes.
[Figure: the same MDP as above, shown next to the five sample episodes.]
Episode 1: S1 →(+1)→ S3 →(+0)→ S4 →(+1)→ SF
Episode 2: S1 →(+1)→ S3 →(+0)→ S5 →(+10)→ SF
Episode 3: S1 →(+1)→ S3 →(+0)→ S4 →(+1)→ SF
Episode 4: S1 →(+1)→ S3 →(+0)→ S4 →(+1)→ SF
Episode 5: S2 →(+2)→ S3 →(+0)→ S5 →(+10)→ SF
§ Suppose we estimate V(S1) as the average of the returns G(S1) observed from S1 over the episodes seen so far. After T episodes,

V_T(S1) = ( V_{T−1}(S1) · (T − 1) + G_T(S1) ) / T
        = ((T − 1)/T) · V_{T−1}(S1) + (1/T) · G_T(S1)
        = V_{T−1}(S1) + α_T ( G_T(S1) − V_{T−1}(S1) ),   with α_T = 1/T
§ This learning falls under a general learning rule: the value at time T = the value at time T − 1 + a learning rate × (the difference between what you actually get and what you expected to get).
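A minimal Python sketch of this incremental-averaging rule for a single state; the per-episode returns below are hypothetical, not from the slides:

# V_T = V_{T-1} + alpha_T * (G_T - V_{T-1}), with alpha_T = 1/T.
returns = [2.0, 11.0, 2.0, 2.0]        # hypothetical returns G_T observed for one state
V = 0.0
for T, G in enumerate(returns, start=1):
    alpha = 1.0 / T
    V += alpha * (G - V)               # identical to averaging the first T returns
print(V)                               # 4.25 == sum(returns) / len(returns)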
= 1 + 1/2 + 1/2 + 1/2 + ··· = ∞
Agenda Introduction TD Evaluation TD Control
TD(1)
Algorithm 1: TD(1)
1 initialization: Episode No. T ← 1;
2 repeat
3 foreach s ∈ S do
4 initialize e(s) = 0 // e(s) is called ‘eligibility’ of state s.
5 VT (s) = V(T −1) (s)// same as the previous episode.
6 t ← 1;
7 repeat Rt
8 After state transition, st−1 −−→ st
9 e(st−1 ) = e(st−1 ) + 1// updating state eligibility.
10 foreach s ∈ S do
11 VT (s) ← VT −1 (s) + αT (Rt + γVT −1 (st ) − VT −1 (st−1 )) e(s);
12 e(s) = γe(s)
13 t←t+1
14 until this episode terminates;
15 T ←T +1
16 until all episodes are done;
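A minimal Python sketch of the loop above, under assumptions that are mine rather than the slides': episodes are given as lists of (s, reward, s') transitions, initial values are zero, and each per-step update accumulates into V_T while the TD error is computed from the frozen previous-episode values V_{T−1}:

# TD(1) evaluation with eligibility traces, following the pseudocode above.
GAMMA = 1.0
states = ["S1", "S2", "S3", "S4", "S5", "SF"]
episodes = [  # hypothetical episodes, each a list of (s, reward, s') transitions
    [("S1", 1, "S3"), ("S3", 0, "S4"), ("S4", 1, "SF")],
    [("S2", 2, "S3"), ("S3", 0, "S5"), ("S5", 10, "SF")],
]

V = {s: 0.0 for s in states}              # plays the role of V_T (V_0 = 0 here)
for T, episode in enumerate(episodes, start=1):
    alpha = 1.0 / T
    e = {s: 0.0 for s in states}          # eligibilities reset at the start of each episode
    V_prev = dict(V)                      # frozen copy of V_{T-1} for the TD errors
    for (s_prev, r, s_next) in episode:
        e[s_prev] += 1.0                  # bump eligibility of the state just left
        delta = r + GAMMA * V_prev[s_next] - V_prev[s_prev]
        for s in states:
            V[s] += alpha * delta * e[s]  # accumulate the weighted TD error
            e[s] *= GAMMA                 # trace decay (lambda = 1 for TD(1))
print(V)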
TD(1) Example
§ Let us try to walk through the pseudocode with the help of a small example.
[Figure: a three-transition chain S1 --R1--> S2 --R2--> S3 --R3--> S4, with eligibilities e(S1) = e(S2) = e(S3) = 0 initially.]
TD(1) Example
§ What is the maximum likelihood estimate?
[Figure: the same MDP and the same five sample episodes as before.]
TD(1) Analysis
§ One reason why the TD(1) estimate is far off is that we only used one of the five trajectories to propagate information; the maximum likelihood estimate, in contrast, used information from all five trajectories (a concrete sketch follows below).
§ So TD(1) suffers when a rare event occurs in a run (S3 → S5 → SF); the estimate can then be far off.
§ We will try to shore up some of these issues next.
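To make the contrast concrete, here is a minimal sketch of the maximum-likelihood (certainty-equivalence) estimate: fit empirical transition probabilities and expected rewards from the episodes, then back values up through the fitted model (γ = 1). The episode list is the one read off the figure earlier, and the helper structure is an assumption of this sketch:

# Certainty-equivalence evaluation from the five sample episodes.
from collections import defaultdict

episodes = [  # each step is (s, reward, s'), as read off the slide figure
    [("S1", 1, "S3"), ("S3", 0, "S4"), ("S4", 1, "SF")],
    [("S1", 1, "S3"), ("S3", 0, "S5"), ("S5", 10, "SF")],
    [("S1", 1, "S3"), ("S3", 0, "S4"), ("S4", 1, "SF")],
    [("S1", 1, "S3"), ("S3", 0, "S4"), ("S4", 1, "SF")],
    [("S2", 2, "S3"), ("S3", 0, "S5"), ("S5", 10, "SF")],
]

counts = defaultdict(lambda: defaultdict(int))   # counts[s][s'] = number of s -> s' transitions
rewards = defaultdict(list)                      # rewards observed when leaving s
for ep in episodes:
    for s, r, s_next in ep:
        counts[s][s_next] += 1
        rewards[s].append(r)

V = defaultdict(float)                           # V[SF] stays 0
for s in ["S4", "S5", "S3", "S2", "S1"]:         # back up from the states closest to SF
    n = sum(counts[s].values())
    r_hat = sum(rewards[s]) / n                  # empirical expected immediate reward
    V[s] = r_hat + sum((c / n) * V[s2] for s2, c in counts[s].items())
print({s: V[s] for s in ["S1", "S2", "S3", "S4", "S5"]})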
TD(0)
§ Let us look at the TD(1) update rule more carefully.
V_T(s) ← V_{T−1}(s) + α_T (R_t + γ V_{T−1}(s_t) − V_{T−1}(s_{t−1})) e(s)
TD(λ)
Algorithm 2: TD(λ)
initialization: Episode No. T ← 1;
repeat
    foreach s ∈ S do
        initialize e(s) = 0;
        V_T(s) = V_{T−1}(s)
    t ← 1;
    repeat
        After state transition s_{t−1} --R_t--> s_t
        e(s_{t−1}) = e(s_{t−1}) + 1;
        foreach s ∈ S do
            V_T(s) ← V_{T−1}(s) + α_T (R_t + γ V_{T−1}(s_t) − V_{T−1}(s_{t−1})) e(s);
            e(s) = λ γ e(s)
        t ← t + 1
    until this episode terminates;
    T ← T + 1
until all episodes are done;
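Relative to the TD(1) pseudocode, the only change is that every eligibility now decays by λγ instead of γ after each step. A tiny standalone illustration (the trace values and λ below are hypothetical):

def decay_traces(e, lam, gamma):
    """TD(lambda) trace decay; lam = 1 recovers TD(1), lam = 0 recovers TD(0)."""
    for s in e:
        e[s] *= lam * gamma
    return e

print(decay_traces({"S1": 1.0, "S2": 0.5}, lam=0.7, gamma=1.0))   # {'S1': 0.7, 'S2': 0.35}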
K-Step Estimators
§ For some convenience in later analysis, let us change the time index
by adding 1 everywhere. Thus, the TD(0) update rule becomes,
V(s_t) ← V(s_t) + α_T (R_{t+1} + γ V(s_{t+1}) − V(s_t))
⋮
E_k : V(s_t) ← V(s_t) + α_T ( R_{t+1} + ··· + γ^{k−1} R_{t+k} + γ^k V(s_{t+k}) − V(s_t) )
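A small sketch of the k-step target appearing in E_k; the helper name, the value table, and the numbers are hypothetical, not from the slides:

def k_step_target(rewards, V, s_k, gamma, k):
    """R_{t+1} + gamma*R_{t+2} + ... + gamma^(k-1)*R_{t+k} + gamma^k * V(s_{t+k}).
    `rewards` holds R_{t+1}, ..., R_{t+k} in order."""
    assert len(rewards) == k
    discounted_sum = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted_sum + gamma ** k * V[s_k]

V = {"B": 0.5}                                            # hypothetical value estimate
print(k_step_target([1.0, 0.0], V, "B", gamma=0.9, k=2))  # 1 + 0 + 0.81 * 0.5 = 1.405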
Good Value of λ
[Figure; credit: David Silver, DeepMind.]
TD Control
§ We will now see how TD estimation can be used in control.
§ This is mostly like generalized policy iteration (GPI), where one maintains both an approximate policy and an approximate value function.
TD Control
§ Greedy policy improvement over v(s) requires a model of the MDP:
π′(s) = argmax_{a∈A} [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) v_π(s′) ]
§ Greedy policy improvement over Q(s, a) is model-free (a short sketch follows below):
π′(s) = argmax_{a∈A} Q(s, a)
§ How can we do TD policy evaluation for Q(s, a)?
§ Analogous to the TD(0) update rule for V(s), the update for Q(s, a) is
Q_T(s_t, a_t) ← Q_{T−1}(s_t, a_t) + α_T ( R_{t+1} + γ Q_{T−1}(s_{t+1}, a_{t+1}) − Q_{T−1}(s_t, a_t) )
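A small illustration of the model-free improvement step (the ε-greedy variant is what the control algorithms below use); the Q table and the state/action names are hypothetical:

import random

def greedy_action(Q, s, actions):
    """Model-free greedy improvement: pi'(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def epsilon_greedy_action(Q, s, actions, eps=0.1):
    """Exploratory version: random action with probability eps, greedy otherwise."""
    if random.random() < eps:
        return random.choice(actions)
    return greedy_action(Q, s, actions)

Q = {("s0", "left"): 1.0, ("s0", "right"): 2.0}    # hypothetical Q table
print(greedy_action(Q, "s0", ["left", "right"]))   # 'right'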
TD Control
§ Like the MC control algorithms, we would use ε-soft policies, such as ε-greedy policies, for exploration here.
Algorithm 3: On-policy TD Control (SARSA)
Parameters: Learning rate α ∈ (0, 1], small ε > 0;
Initialization: Q(s, a), ∀s ∈ S, a ∈ A arbitrarily, except Q(terminal, ·) = 0;
repeat
    t ← 0, choose s_t, i.e., s_0;
    Pick a_t according to Q(s_t, ·) (e.g., ε-greedy);
    repeat
        Apply action a_t from s_t, observe R_{t+1} and s_{t+1};
        Pick a_{t+1} according to Q(s_{t+1}, ·) (e.g., ε-greedy);
        Q(s_t, a_t) ← Q(s_t, a_t) + α (R_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t));
        t ← t + 1
    until this episode terminates;
until all episodes are done;
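A minimal Python sketch of this on-policy control loop; the environment interface (env.reset() returning a state, env.step(a) returning (reward, next_state, done)) and the hyperparameter values are assumptions of this sketch, not from the slides:

import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    """On-policy TD control (SARSA) with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                       # Q[(s, a)]; terminal pairs stay at 0

    def policy(s):                               # epsilon-greedy w.r.t. the current Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)                            # next action is picked BEFORE the update
        done = False
        while not done:
            r, s2, done = env.step(a)
            a2 = policy(s2)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q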
SARSA Example
§ The windy-gridworld example is taken from SB [Chapter 6].
§ It is a standard gridworld with start and goal states, but with an upward wind through the middle of the grid; the strength of the wind is given below each column.
§ The actions are the standard four: left, right, up, down. It is an undiscounted episodic task, with a constant reward of −1 until the goal state is reached.
SARSA Variants
§ Coming back to the question of taking an expectation over the Q values: this gives what is called expected SARSA.
Q(s_t, a_t) ← Q(s_t, a_t) + α ( R_{t+1} + γ Σ_{a∈A} π(a|s_{t+1}) Q(s_{t+1}, a) − Q(s_t, a_t) )
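A small sketch of the expected-SARSA target; taking π to be ε-greedy with respect to Q is an assumption of this sketch (any policy π could be used), and the Q table below is hypothetical:

def expected_sarsa_target(r, Q, s_next, actions, gamma, eps=0.1):
    """r + gamma * sum_a pi(a|s') * Q(s', a), with pi epsilon-greedy w.r.t. Q."""
    q = {a: Q.get((s_next, a), 0.0) for a in actions}
    best = max(q, key=q.get)
    pi = {a: eps / len(actions) + ((1 - eps) if a == best else 0.0) for a in actions}
    return r + gamma * sum(pi[a] * q[a] for a in actions)

Q = {("s1", "left"): 0.0, ("s1", "right"): 1.0}                            # hypothetical Q table
print(expected_sarsa_target(0.0, Q, "s1", ["left", "right"], gamma=1.0))   # 0.95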
k-step SARSA
§ Let us define the k-step Q-return as
Q_t^{(k)} = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ··· + γ^{k−1} R_{t+k} + γ^k Q(s_{t+k}, a_{t+k})
SARSA(λ)
SARSA(λ) Algorithm
TD Control
§ The SARSA update rule is
Q(s_t, a_t) ← Q(s_t, a_t) + α ( R_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) )
[Backup diagrams: on the left, q∗(s, a) backed up through the successor-state values v∗(s′), v∗(s′′); on the right, q∗(s, a) backed up through the successor state-action values q∗(s′, a′).]
q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) v∗(s′)
q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} q∗(s′, a′)
§ SARSA:
q_π(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) { Σ_{a′∈A} π(a′|s′) q_π(s′, a′) }
Q(s_t, a_t) ← Q(s_t, a_t) + α ( R_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) )
§ Q-learning:
q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} q∗(s′, a′)
Q(s_t, a_t) ← Q(s_t, a_t) + α ( R_{t+1} + γ max_{a′∈A} Q(s_{t+1}, a′) − Q(s_t, a_t) )
Q-learning
Algorithm 4: Off-policy TD Control (Q-learning)
Parameters: Learning rate α ∈ (0, 1], small ε > 0;
Initialization: Q(s, a), ∀s ∈ S, a ∈ A arbitrarily, except Q(terminal, ·) = 0;
repeat
    t ← 0, choose s_t, i.e., s_0;
    repeat
        Pick a_t according to Q(s_t, ·) (e.g., ε-greedy);
        Apply action a_t from s_t, observe R_{t+1} and s_{t+1};
        Q(s_t, a_t) ← Q(s_t, a_t) + α (R_{t+1} + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t));
        t ← t + 1
    until this episode terminates;
until all episodes are done;
§ Note the differences with SARSA. Why is it off-policy?
§ The next action is picked after the update here; in SARSA, the next action was picked before the update.
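A minimal Python sketch of the off-policy loop, with the same assumed environment interface as the SARSA sketch above (reset()/step() and the hyperparameters are assumptions, not from the slides):

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    """Off-policy TD control (Q-learning): behave epsilon-greedily, bootstrap from max_a Q."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:            # epsilon-greedy behaviour policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s2, done = env.step(a)
            best_next = max(Q[(s2, x)] for x in actions)
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # greedy (max) target: off-policy
            s = s2
    return Q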
Q-learning
§ In essence, SARSA picks actions from the old Q's and Q-learning picks actions from the new Q's.
§ Since Q-learning updates the Q values by maximizing over all possible actions, getting the states from a trajectory is not necessary.
§ Advantage?? – Asynchronous updates.
§ Disadvantage of arbitrarily choosing states for update?? – As we saw in RTDP, making updates along a trajectory ensures that the state-action pairs that are visited frequently, i.e., the state-action pairs that matter, reach their optimal values quickly.
§ Q-learning generally learns faster than SARSA. This may be because Q-learning updates only when it finds a better move; SARSA, in contrast, uses the estimate of the next action's value in its target, and that value changes every time an exploratory action is taken.
§ There are also some situations that are undesirable for Q-learning.