5 Temporal-Difference Learning
Temporal-Difference Learning
Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB
2023.01.24
Recap (i)
\[
\begin{aligned}
v_\pi(s) &:= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr],
\end{aligned}
\]
\[
\begin{aligned}
q_\pi(s, a) &:= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \Bigr],
\end{aligned}
\]
for all s ∈ S, where it is implicit that the actions are taken from the set A(s), that the next
states are taken from the set S, and that the rewards are taken from the set R.
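To make the recap concrete, here is a minimal sketch (my own, not from the lecture) that evaluates vπ by repeatedly applying the Bellman expectation equation to a small hypothetical MDP; the states, actions, dynamics p(s', r | s, a) and policy π are invented purely for illustration.

```python
# Minimal sketch: iterative policy evaluation via the Bellman expectation equation.
# The two-state MDP (states, actions, dynamics, policy) below is entirely made up.
GAMMA = 0.9

# p[(s, a)] is a list of (probability, next_state, reward) triples.
p = {
    ("s1", "a"): [(0.8, "s1", 1.0), (0.2, "s2", 0.0)],
    ("s1", "b"): [(1.0, "s2", 2.0)],
    ("s2", "a"): [(1.0, "s1", 0.0)],
    ("s2", "b"): [(0.5, "s1", 1.0), (0.5, "s2", -1.0)],
}
pi = {("s1", "a"): 0.5, ("s1", "b"): 0.5, ("s2", "a"): 0.9, ("s2", "b"): 0.1}
states, actions = ["s1", "s2"], ["a", "b"]

v = {s: 0.0 for s in states}
for _ in range(1000):  # repeated sweeps: v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma v_k(s')]
    v = {
        s: sum(
            pi[(s, a)] * prob * (r + GAMMA * v[s2])
            for a in actions
            for prob, s2, r in p[(s, a)]
        )
        for s in states
    }
print(v)  # approximates v_pi for this toy MDP
```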
Recap (ii)
\[
v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_*(s') \bigr],
\]
\[
\begin{aligned}
q_*(s, a) &= \mathbb{E}\Bigl[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \Bigm| S_t = s, A_t = a \Bigr] \\
&= \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \max_{a'} q_*(s', a') \Bigr],
\end{aligned}
\]
for all s ∈ S and a ∈ A(s).
Temporal-Difference Learning
I According to [SB18], “If one had to identify one idea as central and novel to reinforcement
learning [RL], it would undoubtedly be temporal-difference (TD) learning.”
I TD is a combination of dynamic programming (DP) ideas and Monte Carlo (MC) ideas.
I Like MC methods, TD methods can learn directly from raw experience without a model
of the environment’s dynamics.
I Like DP, TD methods update estimates based in part on other learned estimates,
without waiting for a final outcome (they bootstrap).
I The relationship between DP, MC, and TD methods is a recurring theme in RL.
I We shall again focus on the prediction (policy evaluation) problem.
TD prediction
I Both TD and MC methods use experience to solve the prediction problem.
I Given some experience following a policy π , both methods update their estimate V of
vπ for the nonterminal states St occurring in that sequence.
I MC methods wait until the return following the visit is known, then use that return as a
target for V (St ).
I A simple every-visit MC method suitable for nonstationary environments is
\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_t - V(S_t) \bigr],
\]
where Gt is the actual return following time t, and α is a constant step-size parameter.
I Let us call this method a constant-α MC.
I Whereas Monte Carlo methods must wait until the end of the episode to determine the
increment to V (St ) (only then is Gt known), TD methods need to wait only until the
next time step. At time t + 1 they immediately form a target and make a useful update
using the observed reward Rt +1 and the estimate V (St +1 ).
I The simplest TD method makes the update
\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
\]
immediately on transition to St +1 and receipt of Rt +1 . This method is known as TD (0), or
one-step TD; a minimal code sketch of both updates follows below.
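As a minimal sketch of these two update rules (my own code, with invented names, not from the lecture), the functions below apply the constant-α MC update and the TD (0) update to a tabular estimate V stored as a dict:

```python
ALPHA, GAMMA = 0.1, 0.9  # illustrative constant step size and discount factor


def mc_update(V, state, G):
    """Constant-alpha MC: move V(S_t) towards the observed return G_t."""
    V[state] += ALPHA * (G - V[state])


def td0_update(V, state, reward, next_state):
    """TD(0): move V(S_t) towards the bootstrapped target R_{t+1} + gamma V(S_{t+1})."""
    target = reward + GAMMA * V.get(next_state, 0.0)  # V of a terminal state is taken as 0
    V[state] += ALPHA * (target - V[state])


# Example on a made-up transition: TD(0) can update immediately at time t + 1,
# whereas MC must wait until the end of the episode for the return G.
V = {"s1": 0.0, "s2": 0.5}
td0_update(V, "s1", reward=1.0, next_state="s2")
mc_update(V, "s2", G=2.3)
```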
Bootstrapping
I Both DP and TD (0) bootstrap: their targets are built from existing value estimates.
I DP (iterative policy evaluation):
\[
v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_k(s') \bigr].
\]
I TD (0):
\[
V(S) \leftarrow V(S) + \alpha \bigl[ R + \gamma V(S') - V(S) \bigr].
\]
Sampling
\[
\begin{aligned}
v_\pi(s) &:= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s].
\end{aligned}
\]
I DP assumes that the expected values are completely provided by a model of the
environment, so it does not sample.¹
I MC samples the expectation in the first equation above.
I TD (0) samples the expectation in the third equation above.
¹ However, the solution of the Bellman equation or, more generally, the Hamilton–Jacobi–Bellman IPDE, is linked to
sampling via the Feynman–Kac representation [KP15].
Sample updates
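As noted on the previous slide, DP performs expected updates using the model p, whereas MC and TD (0) perform sample updates from experience. A minimal sketch of the contrast (my own, on a made-up two-state Markov reward process):

```python
import random

GAMMA, ALPHA = 0.9, 0.1

# Hypothetical Markov reward process: p[s] lists (probability, next_state, reward) triples.
p = {"s1": [(0.7, "s2", 1.0), (0.3, "s1", 0.0)],
     "s2": [(1.0, "s1", 2.0)]}
V = {"s1": 0.0, "s2": 0.0}


def expected_update(s):
    """DP-style expected update: averages over every possible outcome, so it needs the model p."""
    V[s] = sum(prob * (r + GAMMA * V[s2]) for prob, s2, r in p[s])


def sample_update(s):
    """TD(0)-style sample update: uses one sampled transition, so it needs no model."""
    outcomes = p[s]  # here we sample from p ourselves; in practice the environment does this
    prob, s2, r = random.choices(outcomes, weights=[o[0] for o in outcomes], k=1)[0]
    V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])


expected_update("s1")  # full backup over both successor states
sample_update("s2")    # backup along a single sampled transition
```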
TD error
I Note that the quantity in brackets in the TD (0) update is a sort of error, measuring the
difference between the estimated value of St and the better estimate
Rt +1 + γV (St +1 ).
I This quantity, called the TD error, arises in various forms throughout reinforcement
learning:
\[
\delta_t := R_{t+1} + \gamma V(S_{t+1}) - V(S_t).
\]
I Notice that the TD error at each time is the error in the estimate made at that time.
I Because the TD error depends on the next state and next reward, it is not actually
available until one time step later. That is, δt is the error in V (St ), available at time
t + 1.
I Also note that if the array V does not change during the episode (as it does not in MC
methods), then the MC error can be written as a sum of TD errors:
\[
G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k.
\]
I This identity is not exact if V is updated during the episode (as it is in TD (0)), but if the
step size is small then it may still hold approximately.
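This identity is easy to check numerically. The sketch below (my own, with made-up states, rewards and values) computes both sides for a short episode while holding V fixed:

```python
GAMMA = 0.9

# Made-up episode: states visited and rewards received (rewards[k] is R_{k+1}, after states[k]).
states = ["s0", "s1", "s2", "s3"]   # s3 is terminal
rewards = [1.0, 0.0, 2.0]           # R_1, R_2, R_3
V = {"s0": 0.5, "s1": 1.0, "s2": 0.2, "s3": 0.0}  # held fixed during the episode

# Left-hand side: the MC error at t = 0, i.e. G_0 - V(S_0).
G0 = sum(GAMMA ** k * r for k, r in enumerate(rewards))
lhs = G0 - V["s0"]

# Right-hand side: sum_k gamma^k * delta_k, with delta_k = R_{k+1} + gamma V(S_{k+1}) - V(S_k).
deltas = [rewards[k] + GAMMA * V[states[k + 1]] - V[states[k]] for k in range(len(rewards))]
rhs = sum(GAMMA ** k * d for k, d in enumerate(deltas))

print(lhs, rhs)  # the two sides agree (up to floating-point rounding)
```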
Convergence
I For any fixed policy π , TD (0) has been proved to converge to vπ , in the mean for a
constant step-size parameter if it is sufficiently small, and with probability 1 if the
step-size parameter decreases according to the usual stochastic approximation
conditions
\[
\sum_{n=1}^{\infty} \alpha_n = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2 < \infty.
\]
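For instance (a standard example, not stated on the slide), the sample-average step size α_n = 1/n satisfies both conditions, whereas a constant step size satisfies only the first, which is why constant-α methods converge only in the mean:
\[
\sum_{n=1}^{\infty} \frac{1}{n} = \infty
\qquad\text{and}\qquad
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,
\qquad\text{while for } \alpha_n \equiv \alpha > 0, \ \sum_{n=1}^{\infty} \alpha^2 = \infty.
\]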
Rate of convergence
I Whether TD or MC methods converge faster remains an open theoretical question, but in
practice TD methods have usually been found to converge faster than constant-α MC
methods on stochastic tasks [SB18].
Batch updating
I Suppose there is available only a finite amount of experience, say 10 episodes or 100
time steps.
I In this case, a common approach with incremental learning methods is to present the
experience repeatedly until the method converges upon an answer.
I Given an approximate value function, V, the increments (MC or TD (0)) are computed
for every time step t at which a nonterminal state is visited, but the value function is
changed only once, by the sum of all the increments.
I Then all the available experience is processed again with the new value function to
produce a new overall increment, and so on until the value function converges.
I We call this batch updating because updates are made only after processing each
complete batch of training data.
I Under batch updating, TD (0) converges deterministically to a single answer
independent of the step-size parameter, α, as long as α is chosen to be sufficiently
small.
I The constant-α MC method also converges deterministically under the same
conditions, but to a different answer (a minimal batch-updating sketch follows below).
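Here is a minimal sketch of batch TD (0) as described above (my own code, with an invented episode format): increments are accumulated across the whole batch and applied only once per sweep, and the sweeps are repeated.

```python
GAMMA, ALPHA = 1.0, 0.01  # undiscounted, with a small constant step size


def batch_td0(episodes, n_sweeps=10_000):
    """Batch TD(0). Each episode is a list of (state, reward, next_state) transitions,
    with next_state = None at termination."""
    V = {}
    for ep in episodes:                       # initialise every state seen in the batch to 0
        for s, _, s2 in ep:
            V.setdefault(s, 0.0)
            if s2 is not None:
                V.setdefault(s2, 0.0)
    for _ in range(n_sweeps):                 # repeat until V (approximately) stops changing
        increments = {s: 0.0 for s in V}
        for ep in episodes:                   # accumulate increments over the whole batch...
            for s, r, s2 in ep:
                target = r + GAMMA * (V[s2] if s2 is not None else 0.0)
                increments[s] += ALPHA * (target - V[s])
        for s in V:                           # ...and only then apply them all at once
            V[s] += increments[s]
    return V


# Example with the eight episodes of the example below, encoded as transitions:
episodes = [[("A", 0, "B"), ("B", 0, None)]] + [[("B", 1, None)]] * 6 + [[("B", 0, None)]]
print(batch_td0(episodes))  # V("A") and V("B") both converge to approximately 0.75
```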
Example: you are the predictor
I Place yourself now in the role of the predictor of returns for an unknown Markov
reward process.
I Suppose you observe the following eight episodes:
I A, 0, B, 0
I B, 1
I B, 1
I B, 1
I B, 1
I B, 1
I B, 1
I B, 0
I This means that the first episode started in state A , transitioned to B with a reward of 0,
and then terminated from B with a reward of 0. The other seven episodes were even
shorter, starting from B and terminating immediately.
I Given this batch of data, what would you say are the optimal predictions, the best
values for the estimates V (A ) and V (B )?
Answers
I Everyone would probably agree that the optimal value for V (B ) is 3/4, because six out
of eight times in state B the process terminated immediately with a return of 1, and the
other two times in B the process terminated immediately with a return of 0.
I But what is the optimal value for the estimate V (A ) given this data? There are two
reasonable answers.
I One is to observe that 100% of the times the process was in state A it traversed immediately
to B (with a reward of zero); and because we have already decided that B has value 3/4, A
must have value 3/4 as well. One way of viewing this answer is that it is based on first
modelling the Markov process, and then computing the correct estimates given the model,
which indeed in this case gives V (A ) = 3/4. This is also the answer that batch TD (0) gives.
I The other reasonable answer is simply to observe that we have seen A once and the return
that followed it was 0; we therefore estimate V (A ) as 0. This is the answer that batch MC
methods give. Notice that it is also the answer that gives minimum squared error on the
training data. In fact, it gives zero error on the data.
I But still we expect the first answer to be better. If the process is Markov, we expect that
the first answer will produce lower error on future data, even though the Monte Carlo
answer is better on the existing data.
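To confirm the two answers, here is a small sketch (my own) that computes both estimates directly from the eight episodes: batch MC averages the observed returns from each state, while batch TD (0) agrees with first fitting a maximum-likelihood model of the process and then solving it.

```python
# The eight episodes, each a list of (state, reward) steps (undiscounted, gamma = 1).
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch MC estimate: average of the returns observed after each visit to a state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for t, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[t:]))   # return following this visit
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)   # {'A': 0.0, 'B': 0.75}

# Certainty-equivalence estimate (what batch TD(0) converges to): model the process,
# then solve it. From the data, A always goes to B with reward 0, so V(A) = 0 + V(B),
# and B terminates with expected reward 6/8.
V_ce = {"B": 6 / 8}
V_ce["A"] = 0 + V_ce["B"]
print(V_ce)   # {'B': 0.75, 'A': 0.75}
```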
Certainty-equivalence estimate
I In the batch setting the two answers correspond exactly to what the two methods compute:
batch MC converges to the values that minimise the mean-squared error on the training data,
whereas batch TD (0) converges to the certainty-equivalence estimate: the value function that
would be exactly correct if the maximum-likelihood model of the Markov process estimated
from the data were exactly right.
On-policy TD control (Sarsa)
I We turn now to the use of TD prediction methods for the control problem.
I As usual, we follow the pattern of generalised policy iteration (GPI), only this time
using TD methods for the evaluation or prediction part.
I The first step is to learn an action-value function rather than a state-value function.
I For an on-policy method we must estimate qπ (s , a ) for the current behaviour policy π
and for all states s and actions a.
I This can be done using essentially the same TD method described above for learning
vπ .
I Recall that an episode consists of an alternating sequence of states and state–action
pairs:
\[
\ldots, S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, S_{t+3}, A_{t+3}, \ldots
\]
I Previously we considered transitions from state to state and learned the values of
states.
I Now we consider transitions from state–action pair to state–action pair, and learn the
values of state–action pairs.
I Formally these cases are identical: they are both Markov chains with a reward
process. The theorems assuring the convergence of state values under TD (0) also
apply to the corresponding algorithm for action values (known as Sarsa):
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr].
\]
I This update is made after every transition from a nonterminal St ; if St +1 is terminal,
Q (St +1 , At +1 ) is defined to be zero. A minimal implementation sketch follows below.
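A minimal sketch of the resulting on-policy control method (my own code, not from the lecture), using an ε-greedy policy derived from Q; the environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and all names are assumptions made for illustration:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1


def epsilon_greedy(Q, state, actions):
    """Behave greedily w.r.t. Q with probability 1 - epsilon, otherwise explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa(env, actions, n_episodes=1000):
    """Tabular Sarsa: Q(S,A) <- Q(S,A) + alpha [R + gamma Q(S',A') - Q(S,A)]."""
    Q = defaultdict(float)                      # Q(s, a), implicitly 0 for unseen pairs
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions)
        done = False
        while not done:
            s2, r, done = env.step(a)           # assumed environment interface
            a2 = epsilon_greedy(Q, s2, actions) # choose A' from S' using the same policy
            target = r + GAMMA * (0.0 if done else Q[(s2, a2)])
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```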
Q-learning
I An off-policy TD control algorithm, Q-learning, is defined by the update
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Bigr].
\]
I In this case, the learned action-value function, Q, directly approximates q∗ , the optimal
action-value function, independent of the policy being followed.
I This dramatically simplifies the analysis of the algorithm and enabled early
convergence proofs.
I The policy still has an effect in that it determines which state–action pairs are visited
and updated.
I However, all that is required for correct convergence is that all pairs continue to be
updated.
I This is a minimal requirement in the sense that any method guaranteed to find optimal
behaviour in the general case must require it.
I Under this assumption and a variant of the usual stochastic approximation conditions
on the sequence of step-size parameters, Q has been shown to converge with
probability 1 to q∗ .
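For comparison with the Sarsa sketch above, here is a minimal sketch of tabular Q-learning (my own code, using the same assumed environment interface); the behaviour policy is ε-greedy, but the target uses max_a Q(S', a) rather than Q(S', A'):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1


def q_learning(env, actions, n_episodes=1000):
    """Tabular Q-learning: Q(S,A) <- Q(S,A) + alpha [R + gamma max_a Q(S',a) - Q(S,A)]."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. Q (it only decides which pairs are visited).
            if random.random() < EPSILON:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)           # assumed environment interface
            best_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q
```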