EE675A Lecture 16
TD Update:
V_{new}(S_t) = V_{old}(S_t) + α[G_t − V_{old}(S_t)]   (1)
G_t = R_{t+1} + γV_{old}(S_{t+1})
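As a quick illustration (not part of the notes), here is a minimal tabular sketch of this update in Python; the array V, the step size alpha, and the discount gamma are hypothetical names chosen for the example:

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update as in Eq. (1)."""
    target = r + gamma * V[s_next]      # G_t = R_{t+1} + gamma * V_old(S_{t+1})
    V[s] += alpha * (target - V[s])     # V_new(S_t) = V_old(S_t) + alpha * [G_t - V_old(S_t)]
    return V

# Usage sketch: V = np.zeros(10); td0_update(V, s=3, r=1.0, s_next=4)
```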
n-Step TD:
V_{new}(S_t) = V_{old}(S_t) + α[G_t^{(n)} − V_{old}(S_t)]   (2)
G_t^{(n)} = R_{t+1} + γR_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n V_{old}(S_{t+n})
Notice how in Eq. 2, substituting n = 1 gives us the TD update and n = ∞ gives us the Monte
Carlo update. In practice, it is observed that intermediate values of n work well.
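A minimal sketch of computing the n-step return from a stored slice of experience follows; the function and argument names are illustrative assumptions, not the notes' own code:

```python
def n_step_return(rewards, V, s_boot, n, gamma=0.99):
    """G_t^{(n)} from rewards = [R_{t+1}, ..., R_{t+n}] and the bootstrap state S_{t+n}."""
    G = sum(gamma**k * rewards[k] for k in range(n))   # R_{t+1} + ... + gamma^{n-1} R_{t+n}
    return G + gamma**n * V[s_boot]                    # + gamma^n * V_old(S_{t+n})
```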
For example, we can average the 2-step and 3-step returns with equal weights:
G_t^{(3)} = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + γ^3 V_{old}(S_{t+3})
G_t^{(2)} = R_{t+1} + γR_{t+2} + γ^2 V_{old}(S_{t+2})
G_t^{(2,3)} = (1/2) G_t^{(2)} + (1/2) G_t^{(3)}
The composite return possesses an error reduction property similar to that of individual n-step returns and thus can be used to construct updates with guaranteed convergence properties. Any set of n-step returns can be averaged in this way, even an infinite set, as long as the weights on the component returns are positive and sum to 1 [2].
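Reusing the n_step_return sketch above, such a weighted average could be formed as follows; again a hypothetical sketch, with states[n] standing for S_{t+n}:

```python
def composite_return(rewards, V, states, steps, weights, gamma=0.99):
    """Convex combination of n-step returns; weights must be positive and sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-8 and all(w > 0 for w in weights)
    return sum(w * n_step_return(rewards, V, states[n], n, gamma)
               for n, w in zip(steps, weights))

# G_t^{(2,3)}: composite_return(rewards, V, states, steps=[2, 3], weights=[0.5, 0.5])
```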
1.1 TD(λ)
G_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_t^{(n)}
G_t^{(n)} = G_t   ∀ n ≥ T − t,
where T is the time at which the episode terminates.
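The following sketch computes the λ-return for the start of a finished episode directly from the definition; the episode layout (rewards[k] = R_{k+1}, states[k] = S_k) and the names are assumptions made for illustration:

```python
def lambda_return(rewards, states, V, lam, gamma=0.99):
    """G_0^lambda for a terminated episode with T = len(rewards) steps."""
    T = len(rewards)
    G_lam, tail_weight = 0.0, 1.0
    for n in range(1, T):                                   # truncated n-step returns, n < T
        G_n = sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * V[states[n]]
        G_lam += (1 - lam) * lam**(n - 1) * G_n
        tail_weight -= (1 - lam) * lam**(n - 1)
    G_T = sum(gamma**k * rewards[k] for k in range(T))      # full Monte Carlo return G_0
    return G_lam + tail_weight * G_T                        # remaining weight lam^{T-1} goes to G_0
```

Note that the sketch cannot produce anything until the episode has finished, which is precisely the issue raised next.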
1.2 Issues
• Computing G_t^λ requires completion of an episode
• Online updates are not possible
In order to alleviate these issues, we make use of Eligibility Traces to make the TD update.
2 Eligibility Traces
Consider an experiment where a rat is given food at some time instant (say t = 5). Now suppose a bulb was lit at t = 4, and bells were rung at times t = 1, 2, 3. To which event should the rat attribute getting the food? Should it be the bulb, because it happened most recently, or should it be the bell, because it happened more often? The first line of thought represents a Recency bias, whereas the second represents a Frequency bias.
This idea can be modelled using an eligibility trace function E_t(s) that simulates short-term memory. E_t(s) tells us how informative a state s is at a point in time t. Whenever we land in a state s, we increase its eligibility value. We then exponentially decay the eligibility values of all the states to simulate the passage of time. More formally, we have
E_0(s) = 0, ∀ s
E_t(s) = γλ E_{t−1}(s) + 1(S_t = s)
TD Error:
δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t)
The TD(λ) update then moves every state's value in proportion to its eligibility:
V(s) = V(s) + α δ_t E_t(s)   ∀ s
In particular, when λ = 0 the trace is non-zero only for the current state, so
V(s) = V(s)   ∀ s ≠ S_t
V(S_t) = V(S_t) + α δ_t
Thus, we see how this case (λ = 0) boils down to the normal TD update.
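Putting the trace and the TD error together, a backward-view tabular TD(λ) episode might look like the sketch below; the env.reset()/env.step() interface and the policy callable are assumptions made only for illustration:

```python
import numpy as np

def td_lambda_episode(env, V, policy, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of tabular TD(lambda) with accumulating eligibility traces."""
    E = np.zeros_like(V)                          # E_0(s) = 0 for all s
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]   # TD error delta_t
        E[s] += 1.0                               # bump the eligibility of the visited state
        V += alpha * delta * E                    # every state moves in proportion to its trace
        E *= gamma * lam                          # exponential decay of all traces
        s = s_next
    return V
```

With lam = 0 the trace vector is zero everywhere except the current state, recovering the normal TD update above.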
4.1 Off-Policy Monte Carlo Prediction
Consider a sample episode x as follows:
x ≜ (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, ...)   (7)
Say we denote the return G_t of the above episode by f(x); therefore:
G_t = f(x) ≜ Σ_{k≥1} γ^{k−1} R_{t+k}   (8)
V_Π(s) = E_{x∼q}[ f(x) · p(x while following policy Π) / p(x while following policy µ) | S_t = s ],
where q denotes the distribution of episodes x generated by the behaviour policy µ.
In the above equation,
p(x while following policy Π) = Π(A_t | S_t = s) R_{S_t}^{A_t} P_{S_t,S_{t+1}}^{A_t} Π(A_{t+1} | S_{t+1}) ...,
where R_{S_t}^{A_t} and P_{S_t,S_{t+1}}^{A_t} are part of the dynamics of the MDP and are unknown to us.
Similarly, one can write
p(x while following policy µ) = µ(A_t | S_t = s) R_{S_t}^{A_t} P_{S_t,S_{t+1}}^{A_t} µ(A_{t+1} | S_{t+1}) ....
Taking the ratio of the above two expressions, the unknown dynamics terms cancel, and we are left with
p(x while following policy Π) / p(x while following policy µ) = ∏_{k≥0} [ Π(A_{t+k} | S_{t+k}) / µ(A_{t+k} | S_{t+k}) ],
which depends only on the two policies.
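Since the dynamics cancel in the ratio, an ordinary importance-sampling estimate can be written purely in terms of the two policies. The sketch below is a hypothetical illustration; the episode layout, the pi/mu callables, and the simple sample mean are assumptions, not the notes' prescribed method:

```python
def off_policy_mc_estimate(episodes, pi, mu, gamma=0.99):
    """Estimate V_Pi(s) from episodes generated by the behaviour policy mu.

    Each episode is a list of (state, action, reward) tuples starting from the
    state s of interest; pi(a, s) and mu(a, s) return action probabilities.
    """
    estimates = []
    for episode in episodes:
        G, rho, discount = 0.0, 1.0, 1.0
        for (s, a, r) in episode:
            rho *= pi(a, s) / mu(a, s)        # per-step policy ratio; dynamics cancel
            G += discount * r                 # G_t = sum_k gamma^{k-1} R_{t+k}
            discount *= gamma
        estimates.append(rho * G)             # f(x) * p(x | Pi) / p(x | mu)
    return sum(estimates) / len(estimates)    # sample mean approximates E_{x~q}[...]
```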
4.2 Advantages
This approach can estimate the value function of the target policy by re-using data collected under older policies. Also, if we wish to learn about a state that is rarely visited under the target policy, we can efficiently use a more exploratory behaviour policy (one that has a high chance of visiting the rare state) for this purpose. Finally, this technique is utilised in Q-learning (which will be discussed in future lectures), where we learn the optimal policy while following a behaviour policy.
References
[1] S. S. Peruru. Lecture notes of Introduction to Reinforcement Learning, January 2023.
[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.