Module 5
Dr. D. Sathian
SCOPE
Temporal Difference (TD) Learning
• Temporal difference (TD) learning is one of RL’s central model-free methods.
• TD learning is a combination of Monte Carlo ideas and dynamic programming (DP)
ideas.
• Like Monte Carlo methods, TD methods can learn directly from raw experience without
a model of the environment’s dynamics.
• Like DP, TD methods update estimates based in part on other learned estimates,
without waiting for a final outcome (they bootstrap).
• A simple every-visit Monte Carlo method suited to nonstationary problems makes the update
V(St) ← V(St) + 𝛼[Gt − V(St)],
where Gt is the actual return following time t, and 𝛼 is a constant step-size parameter (constant-𝛼 MC).
• The simplest TD method, TD(0), makes the update
V(St) ← V(St) + 𝛼[Rt+1 + 𝛾V(St+1) − V(St)]
immediately on transition to St+1 and on receiving Rt+1 (sketched in code below).
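• A minimal tabular sketch of TD(0) prediction under this update, assuming a hypothetical environment interface with reset() and step(action) and a given policy function (none of these names come from the slides):

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(St) <- V(St) + alpha*[Rt+1 + gamma*V(St+1) - V(St)].

    Assumes env.reset() -> state, env.step(action) -> (next_state, reward, done),
    and policy(state) -> action; these interfaces are illustrative only.
    """
    V = defaultdict(float)                     # value estimates, 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: immediate reward plus discounted estimate of
            # the next state's value (terminal states are treated as value 0).
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])   # update immediately, no waiting for Gt
            state = next_state
    return V
```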
Example: Driving Home
• The rewards in this example are the elapsed times on each leg of the journey.
• We are not discounting (𝛾= 1), and thus the return for each state is the actual time to go from
that state.
• The value of each state is the expected time to go.
• The second column of numbers gives the current estimated value for each state encountered.
Example: Driving Home
• The red arrows show the changes in predictions recommended by the constant-𝛼 MC
method, for 𝛼 = 1 (sketched in code below).
• These are exactly the errors between the estimated value (predicted time to go) in each
state and the actual return (actual time to go).
• Must you wait until you get home before increasing your estimate for the initial state?
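• A sketch of the constant-𝛼 MC update behind those red arrows, assuming an episode is represented as a list of (state, reward) pairs (my own representation, not from the slides):

```python
def constant_alpha_mc_update(V, episode, alpha=1.0, gamma=1.0):
    """Constant-alpha every-visit MC: V(St) <- V(St) + alpha*[Gt - V(St)].

    `episode` is assumed to be a list of (state, reward) pairs, where `reward`
    is received on leaving that state. Updates can only be made once the
    episode is over, because Gt is the actual return that followed St.
    """
    G = 0.0
    for state, reward in reversed(episode):    # walk backwards so Gt accumulates later rewards
        G = reward + gamma * G
        old = V.get(state, 0.0)
        V[state] = old + alpha * (G - old)
    return V
```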
Example: Random Walk
• All episodes start in the center state, C, and proceed either left or right by one state on
each step, with equal probability.
• This behavior is presumably due to the combined effect of a fixed policy and the
environment's state-transition probabilities, but we do not care which; we are
concerned only with predicting returns, however they are generated.
• Episodes terminate either on the extreme left or the extreme right. When an episode
terminates on the right a reward of +1 occurs; all other rewards are zero.
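• The walk itself is easy to simulate; below is a small sketch of an episode generator for this five-state random walk (the state names and the (state, reward) episode format are my choices):

```python
import random

STATES = ["A", "B", "C", "D", "E"]             # nonterminal states, left to right

def random_walk_episode():
    """Generate one episode as a list of (state, reward) pairs, where the reward
    is received on leaving that state: +1 only when stepping off the right end."""
    i = STATES.index("C")                      # every episode starts in the center state C
    episode = []
    while True:
        step = random.choice([-1, +1])         # left or right with equal probability
        if i + step < 0:                       # terminated on the extreme left: reward 0
            episode.append((STATES[i], 0))
            return episode
        if i + step >= len(STATES):            # terminated on the extreme right: reward +1
            episode.append((STATES[i], 1))
            return episode
        episode.append((STATES[i], 0))         # ordinary step: reward 0
        i += step
```

• One possible output is [('C', 0), ('B', 0), ('C', 0), ('D', 0), ('E', 1)], matching the sample sequence below.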
Example: Random Walk
• For example, a typical walk might consist of the following state-and-reward sequence:
C, 0, B, 0, C, 0, D, 0, E, 1.
• Because this task is undiscounted and episodic, the true value of each state is the
probability of terminating on the right if starting from that state.
• Thus, the true value of the center state is v𝜋(C) = 0.5.
• The true values of all the states, A through E, are 1/6, 2/6, 3/6, 4/6, and 5/6.
• The values learned by TD(0) approach the true values as more episodes are experienced.
• With a constant step-size parameter (𝛼 = 0.1 in this example), however, the values fluctuate
indefinitely in response to the outcomes of the most recent episodes.
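• A self-contained sketch that runs TD(0) with 𝛼 = 0.1 on this random walk, for comparison with the true values 1/6, ..., 5/6 (the initial values of 0.5 and the printed episode counts are assumptions made for illustration):

```python
import random

STATES = ["A", "B", "C", "D", "E"]
TRUE_V = {s: (i + 1) / 6 for i, s in enumerate(STATES)}    # 1/6, 2/6, ..., 5/6

def run_td0(num_episodes, alpha=0.1, gamma=1.0, seed=0):
    random.seed(seed)
    V = {s: 0.5 for s in STATES}               # intermediate initial guess (an assumed choice)
    for _ in range(num_episodes):
        i = STATES.index("C")                  # start in the center state
        while True:
            step = random.choice([-1, +1])
            if i + step < 0:                   # off the left end: reward 0, terminal value 0
                V[STATES[i]] += alpha * (0.0 - V[STATES[i]])
                break
            if i + step >= len(STATES):        # off the right end: reward +1, terminal value 0
                V[STATES[i]] += alpha * (1.0 - V[STATES[i]])
                break
            nxt = i + step                     # reward 0, bootstrap on the next state's estimate
            V[STATES[i]] += alpha * (gamma * V[STATES[nxt]] - V[STATES[i]])
            i = nxt
    return V

for n in (0, 1, 10, 100, 1000):
    print(n, {s: round(run_td0(n)[s], 2) for s in STATES})
print("true", {s: round(v, 2) for s, v in TRUE_V.items()})
```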
Example: Random Walk
• This graph shows learning curves for the two methods for various values of 𝛼.
• The performance measure shown is the root mean-squared (RMS) error between the
value function learned and the true value function, averaged over the five states, then
averaged over 100 runs.
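• A sketch of that performance measure, assuming one run is summarized as a sequence of value tables, one per episode; the generator interface run_fn(seed) is invented for this sketch:

```python
import math

TRUE_V = {"A": 1/6, "B": 2/6, "C": 3/6, "D": 4/6, "E": 5/6}

def rms_error(V, true_v=TRUE_V):
    """Root mean-squared error between estimated and true values,
    averaged over the five nonterminal states."""
    return math.sqrt(sum((V[s] - true_v[s]) ** 2 for s in true_v) / len(true_v))

def averaged_learning_curve(run_fn, num_runs=100, num_episodes=100):
    """Average the per-episode RMS error over independent runs.

    run_fn(seed) is assumed to yield the current value estimates after each
    episode of one run; this interface is an assumption of the sketch."""
    totals = [0.0] * num_episodes
    for seed in range(num_runs):
        for ep, V in enumerate(run_fn(seed)):
            if ep >= num_episodes:
                break
            totals[ep] += rms_error(V)
    return [t / num_runs for t in totals]
```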
Optimality of TD(0)
• Suppose there is available only a finite amount of experience, say 10 episodes or 100
time steps.
• In this case, a common approach with incremental learning methods is to present the
experience repeatedly until the method converges upon an answer.
• Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10
episodes until convergence.
• Compute updates according to TD(0), but only update estimates after each complete
pass through the data.
• For any finite Markov prediction task, under batch updating, TD(0) converges deterministically
to a single answer for any sufficiently small 𝛼 (a sketch of batch TD(0) follows this list).
• Constant-𝛼 MC also converges deterministically under these conditions, but to a different answer!
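• A sketch of batch TD(0) as just described: sweep the fixed set of episodes, accumulate the TD(0) increments, apply them only after each complete pass, and repeat until the values stop changing (the episode representation and convergence threshold are assumptions):

```python
def batch_td0(episodes, alpha=0.001, gamma=1.0, tol=1e-6):
    """Batch TD(0) on a fixed data set.

    Each episode is assumed to be a list of (state, reward, next_state) triples,
    with next_state = None at termination. Increments are summed over a full
    pass through all episodes and applied together; passes repeat until the
    largest accumulated change falls below `tol`.
    """
    V = {}
    for ep in episodes:                        # initialize every state appearing in the data
        for state, _, next_state in ep:
            V.setdefault(state, 0.0)
            if next_state is not None:
                V.setdefault(next_state, 0.0)
    while True:
        delta = {s: 0.0 for s in V}
        for ep in episodes:
            for state, reward, next_state in ep:
                target = reward + (0.0 if next_state is None else gamma * V[next_state])
                delta[state] += alpha * (target - V[state])
        for s in V:                            # apply all increments only after the full pass
            V[s] += delta[s]
        if max(abs(d) for d in delta.values()) < tol:
            return V
```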
Example: Random Walk under Batch Updating
• Batch-updating versions of TD(0) and constant-𝛼 MC were applied as follows to the
random walk prediction example.
• Everyone would probably agree that the optimal value for V(B) = 3/4 , because six out of
the eight times in state B the process terminated immediately with a return of 1, and the
other two times in B the process terminated immediately with a return of 0.
• V(A) = ?
• One reasonable answer is to note that the process always went from A, with a reward of 0, to B,
so V(A) should equal the estimate for V(B), namely 3/4. This is the answer that batch TD(0) gives.
• The other reasonable answer is simply to observe that we have seen A once and the return that
followed it was 0; we therefore estimate V(A) as 0.
• This is the answer that batch Monte Carlo methods give.
• Notice that it is also the answer that gives minimum squared error on the training data. In fact, it
gives zero error on the data.
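• A tiny worked check of these numbers, assuming the batch implied by the bullets above: one episode A,0,B,0, six episodes B,1, and one episode B,0 (this reconstruction of the data set is my assumption). Batch MC averages the observed returns, while repeated small-step TD sweeps converge to the batch TD(0) answer, which lets A bootstrap through B:

```python
# Assumed batch of eight episodes: each is a list of (state, reward, next_state)
# triples, with next_state = None at termination.
episodes = ([[("A", 0, "B"), ("B", 0, None)]]          # A, 0, B, 0
            + [[("B", 1, None)] for _ in range(6)]     # B, 1  (six times)
            + [[("B", 0, None)]])                      # B, 0

# Batch Monte Carlo: average the actual (undiscounted) returns observed from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    G = 0
    for state, reward, _ in reversed(ep):
        G += reward
        returns[state].append(G)
print("batch MC :", {s: sum(g) / len(g) for s, g in returns.items()})   # V(B)=0.75, V(A)=0.0

# Batch TD(0) fixed point via repeated small-step sweeps: A bootstraps on V(B).
V = {"A": 0.0, "B": 0.0}
for _ in range(20000):
    for ep in episodes:
        for state, reward, nxt in ep:
            target = reward + (V[nxt] if nxt is not None else 0.0)
            V[state] += 0.001 * (target - V[state])
print("batch TD :", {s: round(v, 2) for s, v in V.items()})             # V(B)~0.75, V(A)~0.75
```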
• The use of the max over the available actions in the update target,
Q(St, At) ← Q(St, At) + 𝛼[Rt+1 + 𝛾 max_a Q(St+1, a) − Q(St, At)],
makes the Q-learning algorithm an off-policy approach: it learns about the greedy policy
while following a different (e.g., 𝜀-greedy) behavior policy.
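• A sketch of a single Q-learning step, assuming Q is stored as a dict mapping each state to a dict of action values (the storage layout and step size are assumptions):

```python
def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.5, gamma=1.0):
    """One Q-learning step:
    Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)].

    The max over the next state's actions means the target evaluates the greedy
    policy, no matter which behavior policy chose the actions -- this is what
    makes the method off-policy. Q is assumed to map state -> {action: value}.
    """
    best_next = 0.0 if done else max(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
```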
Example: Cliff Walking
• The on-policy SARSA agent views the cliff edge as riskier because it both chooses and
updates actions according to its stochastic (𝜀-greedy) policy.
• That is, it has learned that under this policy it has a high likelihood of stepping off
the cliff and receiving a large negative reward, so it prefers the longer but safer path.
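• For contrast, a sketch of the on-policy Sarsa step and the 𝜀-greedy selection it relies on; occasional exploratory moves near the edge drag down the values of cliff-side states, which is why Sarsa drifts toward the safer path (𝜀, 𝛼 and the Q layout are assumptions):

```python
import random

def epsilon_greedy(Q, state, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    actions = list(Q[state].keys())
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def sarsa_update(Q, state, action, reward, next_state, next_action,
                 done, alpha=0.5, gamma=1.0):
    """One Sarsa step: Q(S,A) <- Q(S,A) + alpha*[R + gamma*Q(S',A') - Q(S,A)].

    The target uses Q(S', A') for the action the epsilon-greedy policy will
    actually take, so exploratory steps off the cliff lower the values of
    states along the edge.
    """
    target = reward if done else reward + gamma * Q[next_state][next_action]
    Q[state][action] += alpha * (target - Q[state][action])
```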
• Expected Sarsa is computationally more complex than Sarsa but, in return, it eliminates the
variance due to the random selection of At+1.
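• A sketch of the Expected Sarsa target: rather than sampling At+1, it averages Q(St+1, ·) under the policy's action probabilities, here taken to be 𝜀-greedy with respect to Q (an assumption); this removes the variance from that random selection:

```python
def expected_sarsa_target(Q, next_state, reward, done, epsilon=0.1, gamma=1.0):
    """Expected Sarsa target: R + gamma * sum_a pi(a|S') * Q(S', a),
    with pi taken to be epsilon-greedy with respect to Q (an assumption here)."""
    if done:
        return reward
    actions = list(Q[next_state].keys())
    greedy = max(actions, key=lambda a: Q[next_state][a])
    expected_q = 0.0
    for a in actions:
        # epsilon-greedy probabilities: epsilon/|A| for every action,
        # plus an extra (1 - epsilon) for the greedy action.
        prob = epsilon / len(actions) + (1.0 - epsilon if a == greedy else 0.0)
        expected_q += prob * Q[next_state][a]
    return reward + gamma * expected_q
```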
Expected SARSA
• Episodes always start in A with a choice between two actions, left and right.
• The right action transitions immediately to the terminal state with a reward and return
of 0.
• The left action transitions to B, also with a reward of zero, from which there are many
possible actions, all with rewards sampled from 𝓝(−0.1, 1), a normal distribution with
mean −0.1 and variance 1.
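• A small simulation of why this MDP exposes maximization bias: every action from B has true value −0.1, yet the maximum of a handful of noisy sample means is usually well above zero, so a learner whose target takes a max over the estimated action values will at first prefer left (the action and sample counts below are arbitrary choices for this sketch):

```python
import random
import statistics

def max_of_sample_means(num_actions=10, samples_per_action=5):
    """Estimate each action's value in B from a few N(-0.1, 1) reward samples
    and return the maximum estimate. True values are all -0.1, so any positive
    result is pure maximization bias."""
    estimates = []
    for _ in range(num_actions):
        rewards = [random.gauss(-0.1, 1.0) for _ in range(samples_per_action)]
        estimates.append(statistics.mean(rewards))
    return max(estimates)

random.seed(0)
trials = [max_of_sample_means() for _ in range(1000)]
print("average of max_a Qhat(B, a):", round(statistics.mean(trials), 3))  # well above -0.1

# Double learning avoids this bias by using one set of estimates to pick the
# maximizing action and an independent second set to evaluate it.
```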
Maximization Bias & Double Learning