EE675A Lecture 16


EE675A - IITK, 2022-23-II

Lecture 16: TD(λ) Backward View & Off-Policy Prediction


20th March 2023
Lecturer: Subrahmanya Swamy Peruru    Scribes: Suhas, Akansh Agrawal

1 Recap and Overview


In the previous lecture [1], we discussed different variants of Temporal Difference (TD) updates.

TD Update:
$$V_{\text{new}}(S_t) = V_{\text{old}}(S_t) + \alpha\left[G_t - V_{\text{old}}(S_t)\right] \qquad (1)$$
$$G_t = R_{t+1} + \gamma V_{\text{old}}(S_{t+1})$$
n-Step TD:
$$V_{\text{new}}(S_t) = V_{\text{old}}(S_t) + \alpha\left[G_t^{(n)} - V_{\text{old}}(S_t)\right] \qquad (2)$$
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{\text{old}}(S_{t+n})$$

Notice how in Eq. 2, substituting n = 1 gives us the TD update and n = ∞ gives us the Monte
Carlo update. In practice, it is observed that intermediate values of n work well.
For example, the 3-step and 2-step returns are
$$G_t^{(3)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V_{\text{old}}(S_{t+3})$$
$$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{\text{old}}(S_{t+2})$$
and they can be averaged into a composite return
$$G_t^{(2,3)} = \tfrac{1}{2}\, G_t^{(2)} + \tfrac{1}{2}\, G_t^{(3)}$$

The composite return possesses an error reduction property similar to that of individual n-step returns and can thus be used to construct updates with guaranteed convergence properties. Any set of n-step returns can be averaged in this way, even an infinite set, as long as the weights on the component returns are positive and sum to 1 [2].
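To make the recap concrete, here is a minimal Python sketch (not code from the lecture) that computes an n-step return from a recorded trajectory and averages two such returns into a composite return; the array layout and the helper name `n_step_return` are assumptions made for illustration.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma):
    """Compute G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards[k] holds R_{k+1} (the reward received on leaving state S_k),
    values[k] holds the current estimate V(S_k).
    """
    T = len(rewards)                   # episode length
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):  # accumulate up to n rewards
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                      # bootstrap only if the episode has not ended
        G += discount * values[t + n]
    return G

# Composite return: average the 2-step and 3-step returns with weights 1/2 + 1/2 = 1.
rewards = [1.0, 0.0, 2.0, 1.0]
values = np.zeros(5)
G_23 = 0.5 * n_step_return(rewards, values, 0, 2, 0.9) \
     + 0.5 * n_step_return(rewards, values, 0, 3, 0.9)
print(G_23)
```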

1.1 TD(λ)

The forward view of TD(λ) uses the λ-return, a geometrically weighted average of all the n-step returns:
$$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}$$
$$G_t^{(n)} = G_t \quad \forall\, n \geq T - t$$
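Since every n-step return that looks past the end of the episode equals the full return $G_t$, for a finite episode the λ-return can equivalently be written (a standard rearrangement, stated here for completeness) as
$$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1}\, G_t^{(n)} + \lambda^{T-t-1}\, G_t$$
The weights $(1-\lambda)\lambda^{n-1}$ together with $\lambda^{T-t-1}$ are positive and sum to 1, so $G_t^{\lambda}$ is a valid composite return; $\lambda = 0$ recovers the one-step TD target and $\lambda \to 1$ recovers the Monte Carlo return.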

1.2 Issues
• Computing $G_t^{\lambda}$ requires the episode to be completed first
• Online (step-by-step) updates are therefore not possible

To alleviate these issues, we make use of eligibility traces in the TD update.

2 Eligibility Traces
Consider an experiment where a rat is given food at some time instant (say $t = 5$). Suppose a bulb was lit at $t = 4$, and bells were rung at times $t = 1, 2, 3$. To which event should the rat attribute the arrival of food? Should it be the bulb, because it happened most recently, or the bell, because it occurred more often? The first line of thought represents a recency bias, whereas the second represents a frequency bias.
This idea can be modelled using an eligibility trace function $E_t(s)$ that simulates short-term memory. $E_t(s)$ tells us how informative a state $s$ is at time $t$. Whenever we visit a state $s$, we increase its eligibility value, and we exponentially decay the eligibility values of all states at every step to simulate the passage of time. More formally, we have

$$E_0(s) = 0, \quad \forall s$$
$$E_t(s) = \lambda E_{t-1}(s) + \mathbb{1}\{S_t = s\}$$

TD Error:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

3 Backward view of TD(λ)


Normal TD Update:
$$V(S_t) \leftarrow V(S_t) + \alpha\, \delta_t$$
TD(λ):
$$V(s) \leftarrow V(s) + \alpha\, E_t(s)\, \delta_t \quad \forall s$$
The important difference is that we now update all states, each weighted by how informative it currently is: the higher the eligibility value of a state, the larger the share of the current TD error $\delta_t$ applied to it.

Consider the case when $\lambda = 0$:
$$E_t(s) = \mathbb{1}\{S_t = s\}$$
$$V(s) \leftarrow V(s) + \alpha\, \mathbb{1}\{S_t = s\}\, \delta_t \quad \forall s$$
so that
$$V(s) \leftarrow V(s) \quad \forall s \neq S_t$$
$$V(S_t) \leftarrow V(S_t) + \alpha\, \delta_t$$
Thus, we see how this case boils down to the normal TD update.
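The following is a minimal Python sketch of one episode of backward-view TD(λ) as described above (not code from the lecture); the environment interface `env.reset()` / `env.step(action)` and the `policy` callable are assumptions made for illustration.

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode of backward-view TD(lambda) prediction, updating V in place.

    V is a 1-D float array of value estimates indexed by state.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done).
    """
    E = np.zeros_like(V)          # eligibility traces, E_0(s) = 0 for all s
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)

        # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]

        # Trace update as in the lecture: decay all traces, then bump the visited state
        E *= lam
        E[s] += 1.0

        # Backward-view update: every state moves in proportion to its eligibility
        V += alpha * delta * E
        s = s_next
    return V
```

Setting `lam = 0` zeroes out the traces of all states except the one just visited, reproducing the normal TD update, while larger values of `lam` spread each TD error further back over recently visited states.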

3.1 Theoretical guarantees

Consider an episode
$$S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \ldots, S_T$$
For the backward-view TD(λ) algorithm discussed above, it can be shown that the offline update (applied at the end of the episode) is equivalent to the update performed by forward-view TD(λ). More recent methods that modify the eligibility trace function are able to prove equivalence of the updates even in the online case.

4 Off-Policy Prediction

Off-policy prediction is a problem in RL where an agent tries to learn the value of a policy different from the one it is currently following. This problem arises when the agent is exploring the environment and collecting data (e.g., through random actions or by following a different policy), but needs to estimate the value of a target policy in order to make decisions. Formally, given a policy $\Pi$, the objective is to estimate $V_\Pi$ while following a different policy $\mu \neq \Pi$:
$$S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \ldots \quad \text{with } A_t \sim \mu(\cdot \mid S_t) \qquad (3)$$
The policy $\Pi$ is called the target policy and the policy $\mu$ the behaviour policy. One approach to estimating the value of the target policy is importance sampling, which we cover shortly in this lecture. Informally, importance sampling re-weights the observed returns according to how likely the corresponding actions are under the target policy: actions and rewards that are more likely to occur under the target policy are given more weight than those that are less likely.
Consider the following example. One can approximately estimate $E_{x\sim p}[f(x)]$ using a Monte Carlo approximation as
$$E_{x\sim p}[f(x)] = \sum_x f(x)\, p(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) \qquad (4)$$
where $x^{(i)} \sim p$. Alternatively, we can rewrite the expectation under a different distribution $q$ and apply the Monte Carlo approximation there:
$$E_{x\sim p}[f(x)] = \sum_x f(x)\, p(x)\, \frac{q(x)}{q(x)} = \sum_x \left[ f(x)\, \frac{p(x)}{q(x)} \right] q(x) = E_{x\sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] \qquad (5)$$
$$E_{x\sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})\, \frac{p(x^{(i)})}{q(x^{(i)})} \qquad (6)$$
where $x^{(i)} \sim q$. We use the same idea to estimate the value of the target policy from data generated by a different policy, via off-policy Monte Carlo prediction.
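As a quick illustration (not from the lecture), the sketch below compares the direct Monte Carlo estimate (4) with the importance sampling estimate (6) on a small discrete distribution; the particular choices of `p`, `q`, and `f` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution p, proposal distribution q, and a test function f
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
f = lambda x: x ** 2

N = 100_000
exact = np.sum(f(xs) * p)                  # E_{x~p}[f(x)]

# Eq. (4): sample from p directly
samples_p = rng.choice(xs, size=N, p=p)
mc_estimate = f(samples_p).mean()

# Eq. (6): sample from q and re-weight each sample by p(x)/q(x)
samples_q = rng.choice(xs, size=N, p=q)
weights = p[samples_q] / q[samples_q]      # importance sampling ratios
is_estimate = (f(samples_q) * weights).mean()

print(exact, mc_estimate, is_estimate)     # all three should be close
```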

4.1 Off-Policy Monte Carlo Prediction
Consider a sample episode $x$:
$$x = (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, \ldots) \qquad (7)$$
Denoting the return $G_t$ of this episode by $f(x)$, we have
$$G_t = f(x) = \sum_{k \geq 1} \gamma^{k-1} R_{t+k} \qquad (8)$$
$$V_\Pi(s) = E_{x\sim \Pi}\left[ f(x) \mid S_t = s \right] \qquad (9)$$
or, applying the importance sampling identity (5),
$$V_\Pi(s) = E_{x\sim \mu}\!\left[ f(x)\, \frac{p_\Pi(x)}{p_\mu(x)} \,\Big|\, S_t = s \right]$$
where $p_\Pi(x)$ denotes the probability of observing the trajectory $x$ while following policy $\Pi$:
$$p_\Pi(x) = \Pi(A_t \mid S_t = s)\, R_{S_t}^{A_t}\, P_{S_t, S_{t+1}}^{A_t}\, \Pi(A_{t+1} \mid S_{t+1}) \cdots$$
Here $R_{S_t}^{A_t}$ and $P_{S_t, S_{t+1}}^{A_t}$ are part of the dynamics of the MDP and are unknown to us. Similarly, while following policy $\mu$,
$$p_\mu(x) = \mu(A_t \mid S_t = s)\, R_{S_t}^{A_t}\, P_{S_t, S_{t+1}}^{A_t}\, \mu(A_{t+1} \mid S_{t+1}) \cdots$$
Taking the ratio of these two expressions, the dynamics terms cancel:
$$\rho_{\Pi/\mu} = \frac{\Pi(A_t \mid S_t = s)\, \Pi(A_{t+1} \mid S_{t+1}) \cdots}{\mu(A_t \mid S_t = s)\, \mu(A_{t+1} \mid S_{t+1}) \cdots} \qquad (10)$$
This ratio is known as the importance sampling ratio. Since it contains no explicit terms involving $R_{S_t}^{A_t}$ or $P_{S_t, S_{t+1}}^{A_t}$, we can use it directly to obtain the value function:
$$V_\Pi(s) = E_{x\sim \mu}\left[ G_t\, \rho_{\Pi/\mu} \mid S_t = s \right] \qquad (11)$$


Thus, off-policy MC prediction averages over multiple episodes obtained by following the behaviour policy $\mu$, weighting each return $G_t$ by its importance sampling ratio. The ordinary importance sampling ratio seen above can take very large values, so to mitigate this effect we can use another variant, weighted importance sampling, which normalizes by the sum of the ratios. A key point is that ordinary importance sampling is unbiased whereas weighted importance sampling is biased (though the bias converges to zero asymptotically). On the other hand, the variance of ordinary importance sampling is in general unbounded, because the ratios themselves can be arbitrarily large, whereas for the weighted estimator the largest weight on any single return is one (because of the normalization). For more details, refer to the Sutton and Barto book [2].
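Below is a minimal sketch (not from the lecture) of both estimators for the first-visit value of a single state, given episodes generated by the behaviour policy; each episode is assumed to be a list of `(state, action, reward)` tuples, and `target_prob` / `behaviour_prob` stand in for $\Pi(a \mid s)$ and $\mu(a \mid s)$.

```python
def off_policy_mc_value(episodes, s, target_prob, behaviour_prob, gamma=0.99):
    """Estimate V_Pi(s) from episodes generated by the behaviour policy mu.

    episodes: list of episodes, each a list of (state, action, reward) tuples,
              where `reward` is the reward received after taking `action`.
    Returns (ordinary_is_estimate, weighted_is_estimate).
    """
    returns, rhos = [], []
    for episode in episodes:
        # First visit to s in this episode (skip episodes that never visit s)
        t0 = next((t for t, (st, _, _) in enumerate(episode) if st == s), None)
        if t0 is None:
            continue
        G, rho, discount = 0.0, 1.0, 1.0
        for st, at, rt in episode[t0:]:
            G += discount * rt                                   # return G_t, eq. (8)
            rho *= target_prob(at, st) / behaviour_prob(at, st)  # ratio, eq. (10)
            discount *= gamma
        returns.append(G)
        rhos.append(rho)
    if not returns:
        return 0.0, 0.0
    ordinary = sum(r * g for r, g in zip(rhos, returns)) / len(returns)          # eq. (11)
    weighted = sum(r * g for r, g in zip(rhos, returns)) / max(sum(rhos), 1e-12)
    return ordinary, weighted
```

The ordinary estimate is unbiased but can have very high variance when some ratios are large; the weighted estimate is biased, but normalization keeps the weight on any single return at most one, matching the discussion above.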

4.2 Advantages
This approach lets us estimate the value function of the target policy by re-using data collected under older policies. Moreover, if we wish to learn about a state that the target policy rarely explores, we can follow a more exploratory behaviour policy (one with a high chance of visiting that rare state) and still evaluate the target policy. Finally, the same technique underlies Q-learning (to be discussed in future lectures), where we learn about the optimal policy while following a behaviour policy.

References
[1] S. S. Peruru. Lecture notes of Introduction to Reinforcement Learning, January 2023.
[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
