EE675A Lecture 16


EE675A - IITK, 2022-23-II

Lecture 16: TD(λ) Backward View & Off-Policy Prediction


20th March 2023
Lecturer: Subrahmanya Swamy Peruru    Scribes: Suhas, Akansh Agrawal

1 Recap and Overview


In the previous lecture [1], we discussed different variants of Temporal Difference (TD) updates.

TD Update:
$$V_{\text{new}}(S_t) = V_{\text{old}}(S_t) + \alpha\left[G_t - V_{\text{old}}(S_t)\right] \qquad (1)$$
$$G_t = R_{t+1} + \gamma V_{\text{old}}(S_{t+1})$$
n-Step TD:
$$V_{\text{new}}(S_t) = V_{\text{old}}(S_t) + \alpha\left[G_t^{(n)} - V_{\text{old}}(S_t)\right] \qquad (2)$$
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{\text{old}}(S_{t+n})$$

Notice how in Eq. 2, substituting n = 1 gives us the TD update and n = ∞ gives us the Monte
Carlo update. In practice, it is observed that intermediate values of n work well.
For example, the 3-step and 2-step returns are
$$G_t^{(3)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V_{\text{old}}(S_{t+3})$$
$$G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{\text{old}}(S_{t+2})$$
and they can be averaged into a composite return
$$G_t^{(2,3)} = \tfrac{1}{2}\, G_t^{(2)} + \tfrac{1}{2}\, G_t^{(3)}$$

The composite return possesses an error reduction property similar to that of individual n-step returns and can thus be used to construct updates with guaranteed convergence properties. Any set of n-step returns can be averaged in this way, even an infinite set, as long as the weights on the component returns are positive and sum to 1 [2].
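To make the recap concrete, here is a minimal Python sketch (not code from the lecture) that computes an n-step return from a recorded trajectory and averages two such returns into a composite return; the array layout and the helper name `n_step_return` are assumptions made for illustration.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma):
    """Compute G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards[k] holds R_{k+1} (the reward received on leaving state S_k),
    values[k] holds the current estimate V(S_k).
    """
    T = len(rewards)                   # episode length
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):  # accumulate up to n rewards
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                      # bootstrap only if the episode has not ended
        G += discount * values[t + n]
    return G

# Composite return: average the 2-step and 3-step returns with weights 1/2 + 1/2 = 1.
rewards = [1.0, 0.0, 2.0, 1.0]
values = np.zeros(5)
G_23 = 0.5 * n_step_return(rewards, values, 0, 2, 0.9) \
     + 0.5 * n_step_return(rewards, values, 0, 3, 0.9)
print(G_23)
```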

1.1 TD(λ)

The forward view of TD(λ) uses the λ-return, a geometrically weighted average of all the n-step returns:
$$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)}$$
$$G_t^{(n)} = G_t \quad \forall\, n \geq T - t$$
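Since every n-step return that looks past the end of the episode equals the full return $G_t$, for a finite episode the λ-return can equivalently be written (a standard rearrangement, stated here for completeness) as
$$G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1}\, G_t^{(n)} + \lambda^{T-t-1}\, G_t$$
The weights $(1-\lambda)\lambda^{n-1}$ together with $\lambda^{T-t-1}$ are positive and sum to 1, so $G_t^{\lambda}$ is a valid composite return; $\lambda = 0$ recovers the one-step TD target and $\lambda \to 1$ recovers the Monte Carlo return.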

1.2 Issues
• Computing $G_t^{\lambda}$ requires the episode to be completed first
• Online (step-by-step) updates are therefore not possible

To alleviate these issues, we make use of eligibility traces in the TD update.

2 Eligibility Traces
Consider an experiment where a rat is given food at some time instant (say $t = 5$). Suppose a bulb was lit at $t = 4$, and bells were rung at times $t = 1, 2, 3$. To which event should the rat attribute the arrival of food? Should it be the bulb, because it happened most recently, or the bell, because it occurred more often? The first line of thought represents a recency bias, whereas the second represents a frequency bias.
This idea can be modelled using an eligibility trace function $E_t(s)$ that simulates short-term memory. $E_t(s)$ tells us how informative a state $s$ is at time $t$. Whenever we visit a state $s$, we increase its eligibility value, and we exponentially decay the eligibility values of all states at every step to simulate the passage of time. More formally, we have

$$E_0(s) = 0, \quad \forall s$$
$$E_t(s) = \lambda E_{t-1}(s) + \mathbb{1}\{S_t = s\}$$

TD Error:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

3 Backward view of TD(λ)


Normal TD Update:
$$V(S_t) \leftarrow V(S_t) + \alpha\, \delta_t$$
TD(λ):
$$V(s) \leftarrow V(s) + \alpha\, E_t(s)\, \delta_t \quad \forall s$$
The important difference is that we now update all states, each weighted by how informative it currently is: the higher the eligibility value of a state, the larger the share of the current TD error $\delta_t$ applied to it.

Consider the case when $\lambda = 0$:
$$E_t(s) = \mathbb{1}\{S_t = s\}$$
$$V(s) \leftarrow V(s) + \alpha\, \mathbb{1}\{S_t = s\}\, \delta_t \quad \forall s$$
so that
$$V(s) \leftarrow V(s) \quad \forall s \neq S_t$$
$$V(S_t) \leftarrow V(S_t) + \alpha\, \delta_t$$
Thus, we see how this case boils down to the normal TD update.
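The following is a minimal Python sketch of one episode of backward-view TD(λ) as described above (not code from the lecture); the environment interface `env.reset()` / `env.step(action)` and the `policy` callable are assumptions made for illustration.

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode of backward-view TD(lambda) prediction, updating V in place.

    V is a 1-D float array of value estimates indexed by state.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done).
    """
    E = np.zeros_like(V)          # eligibility traces, E_0(s) = 0 for all s
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)

        # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]

        # Trace update as in the lecture: decay all traces, then bump the visited state
        E *= lam
        E[s] += 1.0

        # Backward-view update: every state moves in proportion to its eligibility
        V += alpha * delta * E
        s = s_next
    return V
```

Setting `lam = 0` zeroes out the traces of all states except the one just visited, reproducing the normal TD update, while larger values of `lam` spread each TD error further back over recently visited states.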

3.1 Theoretical guarantees

Consider an episode
$$S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \ldots, S_T$$
For the backward-view TD(λ) algorithm discussed above, it can be shown that the offline update (applied at the end of the episode) is equivalent to the update performed by forward-view TD(λ). More recent methods that modify the eligibility trace function are able to prove equivalence of the updates even in the online case.

4 Off-Policy Prediction

Off-policy prediction is a problem in RL where an agent tries to learn the value of a policy different from the one it is currently following. This problem arises when the agent is exploring the environment and collecting data (e.g., through random actions or by following a different policy), but needs to estimate the value of a target policy in order to make decisions. Formally, given a policy $\Pi$, the objective is to estimate $V_\Pi$ while following a different policy $\mu \neq \Pi$:
$$S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \ldots \quad \text{with } A_t \sim \mu(\cdot \mid S_t) \qquad (3)$$
The policy $\Pi$ is called the target policy and the policy $\mu$ the behaviour policy. One approach to estimating the value of the target policy is importance sampling, which we cover shortly in this lecture. Informally, importance sampling re-weights the observed returns according to how likely the corresponding actions are under the target policy: actions and rewards that are more likely to occur under the target policy are given more weight than those that are less likely.
Consider the following example. One can approximately estimate $E_{x\sim p}[f(x)]$ using a Monte Carlo approximation as
$$E_{x\sim p}[f(x)] = \sum_x f(x)\, p(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) \qquad (4)$$
where $x^{(i)} \sim p$. Alternatively, we can rewrite the expectation under a different distribution $q$ and apply the Monte Carlo approximation there:
$$E_{x\sim p}[f(x)] = \sum_x f(x)\, p(x)\, \frac{q(x)}{q(x)} = \sum_x \left[ f(x)\, \frac{p(x)}{q(x)} \right] q(x) = E_{x\sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] \qquad (5)$$
$$E_{x\sim q}\!\left[ f(x)\, \frac{p(x)}{q(x)} \right] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})\, \frac{p(x^{(i)})}{q(x^{(i)})} \qquad (6)$$
where $x^{(i)} \sim q$. We use the same idea to estimate the value of the target policy from data generated by a different policy, via off-policy Monte Carlo prediction.
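As a quick illustration (not from the lecture), the sketch below compares the direct Monte Carlo estimate (4) with the importance sampling estimate (6) on a small discrete distribution; the particular choices of `p`, `q`, and `f` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution p, proposal distribution q, and a test function f
xs = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
f = lambda x: x ** 2

N = 100_000
exact = np.sum(f(xs) * p)                  # E_{x~p}[f(x)]

# Eq. (4): sample from p directly
samples_p = rng.choice(xs, size=N, p=p)
mc_estimate = f(samples_p).mean()

# Eq. (6): sample from q and re-weight each sample by p(x)/q(x)
samples_q = rng.choice(xs, size=N, p=q)
weights = p[samples_q] / q[samples_q]      # importance sampling ratios
is_estimate = (f(samples_q) * weights).mean()

print(exact, mc_estimate, is_estimate)     # all three should be close
```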

4.1 Off-Policy Monte Carlo Prediction
Consider a sample episode $x$:
$$x = (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, \ldots) \qquad (7)$$
Denoting the return $G_t$ of this episode by $f(x)$, we have
$$G_t = f(x) = \sum_{k \geq 1} \gamma^{k-1} R_{t+k} \qquad (8)$$
$$V_\Pi(s) = E_{x\sim \Pi}\left[ f(x) \mid S_t = s \right] \qquad (9)$$
or, applying the importance sampling identity (5),
$$V_\Pi(s) = E_{x\sim \mu}\!\left[ f(x)\, \frac{p_\Pi(x)}{p_\mu(x)} \,\Big|\, S_t = s \right]$$
where $p_\Pi(x)$ denotes the probability of observing the trajectory $x$ while following policy $\Pi$:
$$p_\Pi(x) = \Pi(A_t \mid S_t = s)\, R_{S_t}^{A_t}\, P_{S_t, S_{t+1}}^{A_t}\, \Pi(A_{t+1} \mid S_{t+1}) \cdots$$
Here $R_{S_t}^{A_t}$ and $P_{S_t, S_{t+1}}^{A_t}$ are part of the dynamics of the MDP and are unknown to us. Similarly, while following policy $\mu$,
$$p_\mu(x) = \mu(A_t \mid S_t = s)\, R_{S_t}^{A_t}\, P_{S_t, S_{t+1}}^{A_t}\, \mu(A_{t+1} \mid S_{t+1}) \cdots$$
Taking the ratio of these two expressions, the dynamics terms cancel:
$$\rho_{\Pi/\mu} = \frac{\Pi(A_t \mid S_t = s)\, \Pi(A_{t+1} \mid S_{t+1}) \cdots}{\mu(A_t \mid S_t = s)\, \mu(A_{t+1} \mid S_{t+1}) \cdots} \qquad (10)$$
This ratio is known as the importance sampling ratio. Since it contains no explicit terms involving $R_{S_t}^{A_t}$ or $P_{S_t, S_{t+1}}^{A_t}$, we can use it directly to obtain the value function:
$$V_\Pi(s) = E_{x\sim \mu}\left[ G_t\, \rho_{\Pi/\mu} \mid S_t = s \right] \qquad (11)$$


Thus, off-policy MC prediction averages over multiple episodes obtained by following the behaviour policy $\mu$, weighting each return $G_t$ by its importance sampling ratio. The ordinary importance sampling ratio seen above can take very large values, so to mitigate this effect we can use another variant, weighted importance sampling, which normalizes by the sum of the ratios. A key point is that ordinary importance sampling is unbiased whereas weighted importance sampling is biased (though the bias converges to zero asymptotically). On the other hand, the variance of ordinary importance sampling is in general unbounded, because the ratios themselves can be arbitrarily large, whereas for the weighted estimator the largest weight on any single return is one (because of the normalization). For more details, refer to the Sutton and Barto book [2].
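Below is a minimal sketch (not from the lecture) of both estimators for the first-visit value of a single state, given episodes generated by the behaviour policy; each episode is assumed to be a list of `(state, action, reward)` tuples, and `target_prob` / `behaviour_prob` stand in for $\Pi(a \mid s)$ and $\mu(a \mid s)$.

```python
def off_policy_mc_value(episodes, s, target_prob, behaviour_prob, gamma=0.99):
    """Estimate V_Pi(s) from episodes generated by the behaviour policy mu.

    episodes: list of episodes, each a list of (state, action, reward) tuples,
              where `reward` is the reward received after taking `action`.
    Returns (ordinary_is_estimate, weighted_is_estimate).
    """
    returns, rhos = [], []
    for episode in episodes:
        # First visit to s in this episode (skip episodes that never visit s)
        t0 = next((t for t, (st, _, _) in enumerate(episode) if st == s), None)
        if t0 is None:
            continue
        G, rho, discount = 0.0, 1.0, 1.0
        for st, at, rt in episode[t0:]:
            G += discount * rt                                   # return G_t, eq. (8)
            rho *= target_prob(at, st) / behaviour_prob(at, st)  # ratio, eq. (10)
            discount *= gamma
        returns.append(G)
        rhos.append(rho)
    if not returns:
        return 0.0, 0.0
    ordinary = sum(r * g for r, g in zip(rhos, returns)) / len(returns)          # eq. (11)
    weighted = sum(r * g for r, g in zip(rhos, returns)) / max(sum(rhos), 1e-12)
    return ordinary, weighted
```

The ordinary estimate is unbiased but can have very high variance when some ratios are large; the weighted estimate is biased, but normalization keeps the weight on any single return at most one, matching the discussion above.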

4.2 Advantages
This approach lets us estimate the value function of the target policy by re-using data collected under older policies. Moreover, if we wish to learn about a state that the target policy rarely explores, we can follow a more exploratory behaviour policy (one with a high chance of visiting that rare state) and still evaluate the target policy. Finally, the same technique underlies Q-learning (to be discussed in future lectures), where we learn about the optimal policy while following a behaviour policy.

References
[1] S. S. Peruru. Lecture notes of Introduction to Reinforcement Learning, January 2023.
[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
