5 Temporal-Difference Learning
Temporal-Difference Learning
Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB
2023.01.24
Recap (i)
\[
\begin{aligned}
v_\pi(s) &:= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr],
\end{aligned}
\]
\[
\begin{aligned}
q_\pi(s, a) &:= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \Bigr],
\end{aligned}
\]
for all s ∈ S, where it is implicit that the actions are taken from the set A(s), that the next
states are taken from the set S, and that the rewards are taken from the set R.
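To make the recap concrete, here is a minimal sketch (my own, not from the lecture) that evaluates vπ by repeatedly applying the Bellman expectation equation to a small hypothetical MDP; the states, actions, dynamics p(s', r | s, a) and policy π are invented purely for illustration.

```python
# Minimal sketch: iterative policy evaluation via the Bellman expectation equation.
# The two-state MDP (states, actions, dynamics, policy) below is entirely made up.
GAMMA = 0.9

# p[(s, a)] is a list of (probability, next_state, reward) triples.
p = {
    ("s1", "a"): [(0.8, "s1", 1.0), (0.2, "s2", 0.0)],
    ("s1", "b"): [(1.0, "s2", 2.0)],
    ("s2", "a"): [(1.0, "s1", 0.0)],
    ("s2", "b"): [(0.5, "s1", 1.0), (0.5, "s2", -1.0)],
}
pi = {("s1", "a"): 0.5, ("s1", "b"): 0.5, ("s2", "a"): 0.9, ("s2", "b"): 0.1}
states, actions = ["s1", "s2"], ["a", "b"]

v = {s: 0.0 for s in states}
for _ in range(1000):  # repeated sweeps: v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma v_k(s')]
    v = {
        s: sum(
            pi[(s, a)] * prob * (r + GAMMA * v[s2])
            for a in actions
            for prob, s2, r in p[(s, a)]
        )
        for s in states
    }
print(v)  # approximates v_pi for this toy MDP
```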
Recap (ii)
\[
v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_*(s') \bigr],
\]
\[
\begin{aligned}
q_*(s, a) &= \mathbb{E}\Bigl[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \Bigm| S_t = s, A_t = a \Bigr] \\
&= \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \max_{a'} q_*(s', a') \Bigr],
\end{aligned}
\]
for all s ∈ S and a ∈ A(s).
Temporal-Difference Learning
I According to [SB18], “If one had to identify one idea as central and novel to reinforcement
learning [RL], it would undoubtedly be temporal-difference (TD) learning.”
I TD is a combination of dynamic programming (DP) ideas and Monte Carlo (MC) ideas.
I Like MC methods, TD methods can learn directly from raw experience without a model
of the environment’s dynamics.
I Like DP, TD methods update estimates based in part on other learned estimates,
without waiting for a final outcome (they bootstrap).
I The relationship between DP, MC, and TD methods is a recurring theme in RL.
I We shall again focus on the prediction (policy evaluation) problem.
TD prediction
I Both TD and MC methods use experience to solve the prediction problem.
I Given some experience following a policy π , both methods update their estimate V of
vπ for the nonterminal states St occurring in that sequence.
I MC methods wait until the return following the visit is known, then use that return as a
target for V (St ).
I A simple every-visit MC method suitable for nonstationary environments is
\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_t - V(S_t) \bigr],
\]
where Gt is the actual return following time t, and α is a constant step-size parameter.
I Let us call this method a constant-α MC.
I Whereas Monte Carlo methods must wait until the end of the episode to determine the
increment to V (St ) (only then is Gt known), TD methods need to wait only until the
next time step. At time t + 1 they immediately form a target and make a useful update
using the observed reward Rt +1 and the estimate V (St +1 ).
I The simplest TD method makes the update
\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
\]
immediately on transition to St +1 and receipt of Rt +1 . This method is known as TD (0), or
one-step TD; a minimal code sketch of both updates follows below.
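As a minimal sketch of these two update rules (my own code, with invented names, not from the lecture), the functions below apply the constant-α MC update and the TD (0) update to a tabular estimate V stored as a dict:

```python
ALPHA, GAMMA = 0.1, 0.9  # illustrative constant step size and discount factor


def mc_update(V, state, G):
    """Constant-alpha MC: move V(S_t) towards the observed return G_t."""
    V[state] += ALPHA * (G - V[state])


def td0_update(V, state, reward, next_state):
    """TD(0): move V(S_t) towards the bootstrapped target R_{t+1} + gamma V(S_{t+1})."""
    target = reward + GAMMA * V.get(next_state, 0.0)  # V of a terminal state is taken as 0
    V[state] += ALPHA * (target - V[state])


# Example on a made-up transition: TD(0) can update immediately at time t + 1,
# whereas MC must wait until the end of the episode for the return G.
V = {"s1": 0.0, "s2": 0.5}
td0_update(V, "s1", reward=1.0, next_state="s2")
mc_update(V, "s2", G=2.3)
```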
Bootstrapping
I Both DP and TD (0) bootstrap: their targets are built from existing value estimates.
I DP (iterative policy evaluation):
\[
v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_k(s') \bigr].
\]
I TD (0):
\[
V(S) \leftarrow V(S) + \alpha \bigl[ R + \gamma V(S') - V(S) \bigr].
\]
Sampling
\[
\begin{aligned}
v_\pi(s) &:= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s].
\end{aligned}
\]
I DP assumes that the expected values are completely provided by a model of the
environment, so it does not sample.¹
I MC samples the expectation in the first equation above.
I TD (0) samples the expectation in the third equation above.
¹ However, the solution of the Bellman equation or, more generally, the Hamilton–Jacobi–Bellman IPDE, is linked to
sampling via the Feynman–Kac representation [KP15].
Sample updates
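As noted on the previous slide, DP performs expected updates using the model p, whereas MC and TD (0) perform sample updates from experience. A minimal sketch of the contrast (my own, on a made-up two-state Markov reward process):

```python
import random

GAMMA, ALPHA = 0.9, 0.1

# Hypothetical Markov reward process: p[s] lists (probability, next_state, reward) triples.
p = {"s1": [(0.7, "s2", 1.0), (0.3, "s1", 0.0)],
     "s2": [(1.0, "s1", 2.0)]}
V = {"s1": 0.0, "s2": 0.0}


def expected_update(s):
    """DP-style expected update: averages over every possible outcome, so it needs the model p."""
    V[s] = sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in p[s])


def sample_update(s):
    """TD(0)-style sample update: uses one sampled transition, so it needs no model."""
    outcomes = p[s]  # here we sample from p ourselves; in practice the environment does this
    prob, s2, r = random.choices(outcomes, weights=[o[0] for o in outcomes], k=1)[0]
    V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])


expected_update("s1")  # full backup over both successor states
sample_update("s2")    # backup along a single sampled transition
```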
TD error
I Note that the quantity in brackets in the TD (0) update is a sort of error, measuring the
difference between the estimated value of St and the better estimate
Rt +1 + γV (St +1 ).
I This quantity, called the TD error, arises in various forms throughout reinforcement
learning:
\[
\delta_t := R_{t+1} + \gamma V(S_{t+1}) - V(S_t).
\]
I Notice that the TD error at each time is the error in the estimate made at that time.
I Because the TD error depends on the next state and next reward, it is not actually
available until one time step later. That is, δt is the error in V (St ), available at time
t + 1.
I Also note that if the array V does not change during the episode (as it does not in MC
methods), then the MC error can be written as a sum of TD errors:
\[
G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k.
\]
I This identity is not exact if V is updated during the episode (as it is in TD (0)), but if the
step size is small then it may still hold approximately.
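This identity is easy to check numerically. The sketch below (my own, with made-up states, rewards and values) computes both sides for a short episode while holding V fixed:

```python
GAMMA = 0.9

# Made-up episode: states visited and rewards received (rewards[k] is R_{k+1}, after states[k]).
states = ["s0", "s1", "s2", "s3"]   # s3 is terminal
rewards = [1.0, 0.0, 2.0]           # R_1, R_2, R_3
V = {"s0": 0.5, "s1": 1.0, "s2": 0.2, "s3": 0.0}  # held fixed during the episode

# Left-hand side: the MC error at t = 0, i.e. G_0 - V(S_0).
G0 = sum(GAMMA ** k * r for k, r in enumerate(rewards))
lhs = G0 - V["s0"]

# Right-hand side: sum_k gamma^k * delta_k, with delta_k = R_{k+1} + gamma V(S_{k+1}) - V(S_k).
deltas = [rewards[k] + GAMMA * V[states[k + 1]] - V[states[k]] for k in range(len(rewards))]
rhs = sum(GAMMA ** k * d for k, d in enumerate(deltas))

print(lhs, rhs)  # the two sides agree (up to floating-point rounding)
```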
Convergence
I For any fixed policy π , TD (0) has been proved to converge to vπ , in the mean for a
constant step-size parameter if it is sufficiently small, and with probability 1 if the
step-size parameter decreases according to the usual stochastic approximation
conditions
\[
\sum_{n=1}^{\infty} \alpha_n = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2 < \infty.
\]
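For instance (a standard example, not stated on the slide), the sample-average step size α_n = 1/n satisfies both conditions, whereas a constant step size satisfies only the first, which is why constant-α methods converge only in the mean:
\[
\sum_{n=1}^{\infty} \frac{1}{n} = \infty
\qquad\text{and}\qquad
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,
\qquad\text{while for } \alpha_n \equiv \alpha > 0, \ \sum_{n=1}^{\infty} \alpha^2 = \infty.
\]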
Rate of convergence
I Whether TD or MC methods converge faster remains an open theoretical question, but in
practice TD methods have usually been found to converge faster than constant-α MC
methods on stochastic tasks [SB18].
Batch updating
I Suppose there is available only a finite amount of experience, say 10 episodes or 100
time steps.
I In this case, a common approach with incremental learning methods is to present the
experience repeatedly until the method converges upon an answer.
I Given an approximate value function, V, the increments (MC or TD (0)) are computed
for every time step t at which a nonterminal state is visited, but the value function is
changed only once, by the sum of all the increments.
I Then all the available experience is processed again with the new value function to
produce a new overall increment, and so on until the value function converges.
I We call this batch updating because updates are made only after processing each
complete batch of training data.
I Under batch updating, TD (0) converges deterministically to a single answer
independent of the step-size parameter, α, as long as α is chosen to be sufficiently
small.
I The constant-α MC method also converges deterministically under the same
conditions, but to a different answer (a minimal batch-updating sketch follows below).
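Here is a minimal sketch of batch TD (0) as described above (my own code, with an invented episode format): increments are accumulated across the whole batch and applied only once per sweep, and the sweeps are repeated.

```python
GAMMA, ALPHA = 1.0, 0.01  # undiscounted, with a small constant step size


def batch_td0(episodes, n_sweeps=10_000):
    """Batch TD(0). Each episode is a list of (state, reward, next_state) transitions,
    with next_state = None at termination."""
    V = {}
    for ep in episodes:                       # initialise every state seen in the batch to 0
        for s, _, s2 in ep:
            V.setdefault(s, 0.0)
            if s2 is not None:
                V.setdefault(s2, 0.0)
    for _ in range(n_sweeps):                 # repeat until V (approximately) stops changing
        increments = {s: 0.0 for s in V}
        for ep in episodes:                   # accumulate increments over the whole batch...
            for s, r, s2 in ep:
                target = r + GAMMA * (V[s2] if s2 is not None else 0.0)
                increments[s] += ALPHA * (target - V[s])
        for s in V:                           # ...and only then apply them all at once
            V[s] += increments[s]
    return V


# Example with the eight episodes of the example below, encoded as transitions:
episodes = [[("A", 0, "B"), ("B", 0, None)]] + [[("B", 1, None)]] * 6 + [[("B", 0, None)]]
print(batch_td0(episodes))  # V("A") and V("B") both converge to approximately 0.75
```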
Example: you are the predictor
I Place yourself now in the role of the predictor of returns for an unknown Markov
reward process.
I Suppose you observe the following eight episodes:
I A, 0, B, 0
I B, 1
I B, 1
I B, 1
I B, 1
I B, 1
I B, 1
I B, 0
I This means that the first episode started in state A , transitioned to B with a reward of 0,
and then terminated from B with a reward of 0. The other seven episodes were even
shorter, starting from B and terminating immediately.
I Given this batch of data, what would you say are the optimal predictions, the best
values for the estimates V (A ) and V (B )?
Answers
I Everyone would probably agree that the optimal value for V (B ) is 3/4, because six out
of eight times in state B the process terminated immediately with a return of 1, and the
other two times in B the process terminated immediately with a return of 0.
I But what is the optimal value for the estimate V (A ) given this data? There are two
reasonable answers.
I One is to observe that 100% of the times the process was in state A it traversed immediately
to B (with a reward of zero); and because we have already decided that B has value 3/4, A
must have value 3/4 as well. One way of viewing this answer is that it is based on first
modelling the Markov process, and then computing the correct estimates given the model,
which indeed in this case gives V (A ) = 3/4. This is also the answer that batch TD (0) gives.
I The other reasonable answer is simply to observe that we have seen A once and the return
that followed it was 0; we therefore estimate V (A ) as 0. This is the answer that batch MC
methods give. Notice that it is also the answer that gives minimum squared error on the
training data. In fact, it gives zero error on the data.
I But still we expect the first answer to be better. If the process is Markov, we expect that
the first answer will produce lower error on future data, even though the Monte Carlo
answer is better on the existing data.
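To confirm the two answers, here is a small sketch (my own) that computes both estimates directly from the eight episodes: batch MC averages the observed returns from each state, while batch TD (0) agrees with first fitting a maximum-likelihood model of the process and then solving it.

```python
# The eight episodes, each a list of (state, reward) steps (undiscounted, gamma = 1).
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch MC estimate: average of the returns observed after each visit to a state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for t, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[t:]))   # return following this visit
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)   # {'A': 0.0, 'B': 0.75}

# Certainty-equivalence estimate (what batch TD(0) converges to): model the process,
# then solve it. From the data, A always goes to B with reward 0, so V(A) = 0 + V(B),
# and B terminates with expected reward 6/8.
V_ce = {"B": 6 / 8}
V_ce["A"] = 0 + V_ce["B"]
print(V_ce)   # {'B': 0.75, 'A': 0.75}
```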
Certainty-equivalence estimate
I In the batch setting the two answers correspond exactly to what the two methods compute:
batch MC converges to the values that minimise the mean-squared error on the training data,
whereas batch TD (0) converges to the certainty-equivalence estimate: the value function that
would be exactly correct if the maximum-likelihood model of the Markov process estimated
from the data were exactly right.
On-policy TD control (Sarsa)
I We turn now to the use of TD prediction methods for the control problem.
I As usual, we follow the pattern of generalised policy iteration (GPI), only this time
using TD methods for the evaluation or prediction part.
I The first step is to learn an action-value function rather than a state-value function.
I For an on-policy method we must estimate qπ (s , a ) for the current behaviour policy π
and for all states s and actions a.
I This can be done using essentially the same TD method described above for learning
vπ .
I Recall that an episode consists of an alternating sequence of states and state–action
pairs:
\[
\ldots, S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, S_{t+3}, A_{t+3}, \ldots
\]
I Previously we considered transitions from state to state and learned the values of
states.
I Now we consider transitions from state–action pair to state–action pair, and learn the
values of state–action pairs.
I Formally these cases are identical: they are both Markov chains with a reward
process. The theorems assuring the convergence of state values under TD (0) also
apply to the corresponding algorithm for action values (known as Sarsa):
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr].
\]
I This update is made after every transition from a nonterminal St ; if St +1 is terminal,
Q (St +1 , At +1 ) is defined to be zero. A minimal implementation sketch follows below.
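A minimal sketch of the resulting on-policy control method (my own code, not from the lecture), using an ε-greedy policy derived from Q; the environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and all names are assumptions made for illustration:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1


def epsilon_greedy(Q, state, actions):
    """Behave greedily w.r.t. Q with probability 1 - epsilon, otherwise explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa(env, actions, n_episodes=1000):
    """Tabular Sarsa: Q(S,A) <- Q(S,A) + alpha [R + gamma Q(S',A') - Q(S,A)]."""
    Q = defaultdict(float)                      # Q(s, a), implicitly 0 for unseen pairs
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions)
        done = False
        while not done:
            s2, r, done = env.step(a)           # assumed environment interface
            a2 = epsilon_greedy(Q, s2, actions) # choose A' from S' using the same policy
            target = r + GAMMA * (0.0 if done else Q[(s2, a2)])
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```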
Q-learning
I An off-policy TD control algorithm, Q-learning, is defined by the update
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Bigr].
\]
I In this case, the learned action-value function, Q, directly approximates q∗ , the optimal
action-value function, independent of the policy being followed.
I This dramatically simplifies the analysis of the algorithm and enabled early
convergence proofs.
I The policy still has an effect in that it determines which state–action pairs are visited
and updated.
I However, all that is required for correct convergence is that all pairs continue to be
updated.
I This is a minimal requirement in the sense that any method guaranteed to find optimal
behaviour in the general case must require it.
I Under this assumption and a variant of the usual stochastic approximation conditions
on the sequence of step-size parameters, Q has been shown to converge with
probability 1 to q∗ .
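For comparison with the Sarsa sketch above, here is a minimal sketch of tabular Q-learning (my own code, using the same assumed environment interface); the behaviour policy is ε-greedy, but the target uses max_a Q(S', a) rather than Q(S', A'):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1


def q_learning(env, actions, n_episodes=1000):
    """Tabular Q-learning: Q(S,A) <- Q(S,A) + alpha [R + gamma max_a Q(S',a) - Q(S,A)]."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. Q (it only decides which pairs are visited).
            if random.random() < EPSILON:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)           # assumed environment interface
            best_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q
```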