Reinforcement Learning

Module 5
Dr. D. Sathian
SCOPE
Temporal Difference (TD) Learning
• Temporal difference (TD) learning is one of RL’s model-free methods.
• TD learning is a combination of Monte Carlo ideas and dynamic programming (DP)
ideas.
• Like Monte Carlo methods, TD methods can learn directly from raw experience without
a model of the environment’s dynamics.
• Like DP, TD methods update estimates based in part on other learned estimates,
without waiting for a final outcome (they bootstrap).

• Temporal Difference Learning is an unsupervised learning technique that is very commonly used in reinforcement learning to predict the total reward expected over the future.
Temporal Difference Prediction
• Both TD and Monte Carlo methods use experience to solve the prediction problem.
• Monte Carlo methods wait until the return following the visit is known, which is only after the episode ends, before updating the value of the state; TD methods, in contrast, update the value at the very next time step: at time t+1 they immediately form a target and make a useful update using the observed reward.
• A simple every-visit Monte Carlo method suitable for nonstationary environments is

  V(St) ← V(St) + 𝛼[Gt - V(St)]

• where Gt is the actual return following time t, and 𝛼 is a constant step-size parameter.
• The simplest TD method makes the update immediately on transition to St+1 and on receiving Rt+1:

  V(St) ← V(St) + 𝛼[Rt+1 + 𝛾V(St+1) - V(St)]

• The bracketed quantity Rt+1 + 𝛾V(St+1) is an estimate of the return.
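As a concrete illustration (not from the slides), the two update rules can be written as small tabular routines; V is assumed to be a dictionary of state-value estimates, and alpha and gamma are illustrative constants.

```python
# Minimal sketch of the two tabular update rules described above.
# Assumptions: V is a dict mapping states to value estimates; alpha and gamma
# are hypothetical constants chosen for illustration.

def mc_update(V, state, G, alpha=0.1):
    """Constant-alpha every-visit MC: move V(s) toward the observed return G."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """TD(0): move V(s) toward the bootstrapped target R + gamma * V(s')."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
```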


Temporal Difference Prediction
• Updating the state value after just one time step is called one-step TD or TD(0); it is a special case of the TD(𝜆) and n-step TD methods.



Temporal Difference Prediction

[Figure: update diagrams for Simple Monte Carlo (left) and Simple TD (right)]



Temporal Difference Prediction

[Figure: update diagram for DP]



Temporal Difference Prediction
• Because TD(0) bases its update in part on an existing estimate, we say that it is a
bootstrapping method, like DP.
• In the TD update equation, the quantity in brackets is a measure of error, measuring the difference between the estimated value of St and the better estimate Rt+1 + 𝛾V(St+1) available at the next time step.
• This quantity is called the TD error, which has a widespread presence throughout reinforcement learning:

  𝛿t = Rt+1 + 𝛾V(St+1) - V(St)



Temporal Difference Prediction
• The TD error at each time is the error in the estimate made at that time.
• Because the TD error depends on the next state and next reward, it is not actually
available until one time step later.
• That is, 𝛿t is the error in V (St), available at time t + 1.
• Also note that if the array V does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:

  Gt - V(St) = 𝛿t + 𝛾𝛿t+1 + 𝛾²𝛿t+2 + ... + 𝛾^(T-1-t)𝛿T-1 = Σ(k=t to T-1) 𝛾^(k-t) 𝛿k
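A tiny numerical check of this identity, with made-up states, rewards, and value estimates (none of these numbers come from the slides):

```python
# Verify that, for a fixed V, the MC error G_t - V(S_t) equals the discounted
# sum of TD errors along the remainder of the episode (illustrative numbers).

gamma = 1.0
V = {"A": 0.5, "B": 0.2, "C": 0.8, "terminal": 0.0}
states = ["A", "B", "C", "terminal"]   # S_0 .. S_T
rewards = [0.0, 1.0, 2.0]              # R_1 .. R_T

# Monte Carlo error at t = 0
G0 = sum(gamma**k * r for k, r in enumerate(rewards))
mc_error = G0 - V[states[0]]

# Discounted sum of TD errors delta_k = R_{k+1} + gamma*V(S_{k+1}) - V(S_k)
td_errors = [rewards[k] + gamma * V[states[k + 1]] - V[states[k]]
             for k in range(len(rewards))]
sum_td = sum(gamma**k * d for k, d in enumerate(td_errors))

assert abs(mc_error - sum_td) < 1e-12  # the two decompositions agree
```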



Example: Driving Home
• Each day as you drive home from work, you try to predict how long it will take to get home.
• When you leave your office, you note the time, the day of week, the weather, and anything
else that might be relevant.
• Say on this Friday you are leaving at exactly 6 o’clock, and you estimate that it will take 30
minutes to get home. As you reach your car it is 6:05, and you notice it is starting to rain.
• Traffic is often slower in the rain, so you re-estimate that it will take 35 minutes from then, or a
total of 40 minutes.
• Fifteen minutes later you have completed the highway portion of your journey in good time.
• As you exit onto a secondary road you cut your estimate of total travel time to 35 minutes.
• Unfortunately, at this point you get stuck behind a slow truck, and the road is too narrow to
pass. You end up having to follow the truck until you turn onto the side street where you live at
6:40.
• Three minutes later you are home.
Example: Driving Home
• The sequence of states, times, and predictions is thus as follows:

  State                           Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
  leaving office, Friday at 6              0                    30                    30
  reach car, raining                       5                    35                    40
  exiting highway                         20                    15                    35
  secondary road, behind truck            30                    10                    40
  entering home street                    40                     3                    43
  arrive home                             43                     0                    43
• The rewards in this example are the elapsed times on each leg of the journey.
• We are not discounting (𝛾= 1), and thus the return for each state is the actual time to go from
that state.
• The value of each state is the expected time to go.
• The second column of numbers gives the current estimated value for each state encountered.
Example: Driving Home
• The red arrows show the changes in predictions recommended by the constant-𝛼 MC
method, for 𝛼 = 1.
• These are exactly the errors between the estimated value (predicted time to go) in each
state and the actual return (actual time to go).

• For example, when you exited the highway you thought it would take only 15 minutes more to get home, but in fact it took 23 minutes.
• The error, Gt - V(St), at this time is 8 minutes.

[Figure: changes recommended in the driving-home example by MC (left) and TD (right)]
Example: Driving Home
• Suppose the step-size parameter 𝛼 is 1/2. Then the predicted time to go after exiting the highway would be revised upward by four minutes as a result of this experience.
• In any event, the change can only be made offline, that is, after you have reached home.
Only at this point do you know any of the actual returns.
• Suppose on another day you again estimate when leaving your office that it will take 30
minutes to drive home, but then you become stuck in a massive traffic jam. 25 minutes
after leaving the office you are still bumper-to-bumper on the highway. You now
estimate that it will take another 25 minutes to get home, for a total of 50 minutes. As
you wait in traffic, you already know that your initial estimate of 30 minutes was too
optimistic.

• Must you wait until you get home before increasing your estimate for the initial state?



Example: Driving Home
• According to the Monte Carlo approach you must wait until you get home, because you
don’t yet know the true return.
• On the other hand, according to a TD approach, you would learn immediately, shifting
your initial estimate from 30 minutes toward 50.
• In fact, each estimate would be shifted toward the estimate that immediately follows it.
• Returning to our first day of driving, Figure (right) shows the changes in the predictions
recommended by the TD rule (these are the changes made by the rule if 𝛼 = 1).
• Each error is proportional to the change over time of the prediction, that is, to the
temporal differences in predictions.
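To make the TD changes concrete, here is a small sketch that recomputes them from the driving-home table reconstructed earlier (𝛼 = 1, 𝛾 = 1); the elapsed-time and prediction numbers are those table values, not new data.

```python
# TD(0) changes for the driving-home example with alpha = 1, gamma = 1.
# Each tuple is (state, elapsed minutes, predicted time to go), taken from
# the table shown earlier in this section.

alpha, gamma = 1.0, 1.0
legs = [("leaving office", 0, 30),
        ("reach car",      5, 35),
        ("exit highway",  20, 15),
        ("behind truck",  30, 10),
        ("home street",   40,  3),
        ("arrive home",   43,  0)]

for (s, t, v), (_, t2, v2) in zip(legs, legs[1:]):
    reward = t2 - t                      # elapsed time on this leg
    td_error = reward + gamma * v2 - v   # R + gamma*V(next) - V(current)
    print(f"{s:15s} TD change: {alpha * td_error:+.0f} min")
# Prints +10, -5, +5, +3, +0: each prediction is shifted toward the one
# that immediately follows it.
```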



Advantages of TD Prediction Method
• TD methods do not require a model of the environment, only experience.
• TD, but not MC, methods can be fully incremental:
  • You can learn before knowing the final outcome
  • Less memory
  • Less peak computation
  • You can learn without the final outcome
  • From incomplete sequences
• Both MC and TD converge (under certain assumptions to be detailed later), but which is
faster?



Example: Random Walk
• In this example we empirically compare the prediction abilities of TD(0) and constant-𝛼
MC when applied to the following Markov reward process:

[Figure: five-state random walk with nonterminal states A, B, C, D, E and a terminal state at each end]
• All episodes start in the center state, C, and proceed either left or right by one state on
each step, with equal probability.
• This behavior is presumably due to the combined effect of a fixed policy and an
environment's state-transition probabilities, but we do not care which; we are
concerned only with predicting returns however they are generated.
• Episodes terminate either on the extreme left or the extreme right. When an episode
terminates on the right a reward of +1 occurs; all other rewards are zero.
Example: Random Walk
• For example, a typical walk might consist of the following state-and-reward sequence:
C, 0, B, 0, C, 0, D, 0, E, 1.

• Because this task is undiscounted and episodic, the true value of each state is the
probability of terminating on the right if starting from that state.
• Thus, the true value of the center state is v𝜋(C) = 0.5.
• The true values of all the states, A through E, are 1/6, 2/6, 3/6, 4/6, and 5/6.
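A rough simulation sketch of this experiment (values initialised to 0.5 and 𝛼 = 0.1 follow the example; everything else, such as the episode count, is an arbitrary choice):

```python
import random

# TD(0) prediction on the five-state random walk described above.
states = ["A", "B", "C", "D", "E"]
V = {s: 0.5 for s in states}           # intermediate values initialised to 0.5
V["left"] = V["right"] = 0.0           # terminal states
alpha = 0.1

def step(s):
    """Move left or right with equal probability; reward +1 only on the right exit."""
    i = states.index(s)
    if random.random() < 0.5:
        nxt = states[i - 1] if i > 0 else "left"
    else:
        nxt = states[i + 1] if i < 4 else "right"
    return nxt, (1.0 if nxt == "right" else 0.0)

for _ in range(1000):                  # episodes
    s = "C"
    while s in states:
        nxt, r = step(s)
        V[s] += alpha * (r + V[nxt] - V[s])   # TD(0) update, gamma = 1
        s = nxt

print({s: round(V[s], 2) for s in states})    # hovers around 1/6 .. 5/6
```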



Example: Random Walk

• The values learned by TD(0) approach the true values as more episodes are experienced.
• With a constant step-size parameter (𝛼 = 0.1 in this example), the values fluctuate indefinitely in response to the outcomes of the most recent episodes.
Example: Random Walk

• This graph shows learning curves for the two methods for various values of 𝛼.
• The performance measure shown is the root mean-squared (RMS) error between the
value function learned and the true value function, averaged over the five states, then
averaged over 100 runs.
Optimality of TD(0)
• Suppose there is available only a finite amount of experience, say 10 episodes or 100
time steps.
• In this case, a common approach with incremental learning methods is to present the
experience repeatedly until the method converges upon an answer.
• Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10
episodes until convergence.
• Compute updates according to TD(0), but only update estimates after each complete
pass through the data.
• For any finite Markov prediction task, under batch updating, TD(0) converges deterministically to a single answer for sufficiently small 𝛼.
• Constant-𝛼 MC also converges deterministically under the same conditions, but to a different answer!
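A minimal sketch of batch TD(0) under assumed interfaces: `episodes` is a list of episodes, each a list of (state, reward, next_state) transitions, and V is a dict of estimates; the step size and tolerance are illustrative.

```python
# Batch TD(0): sweep repeatedly over all stored episodes, accumulating the
# TD(0) increments and applying them only once per complete pass, until the
# value function stops changing.

def batch_td0(V, episodes, alpha=0.01, gamma=1.0, tol=1e-6):
    while True:
        delta = {s: 0.0 for s in V}
        for episode in episodes:
            for s, r, s_next in episode:
                delta[s] += alpha * (r + gamma * V.get(s_next, 0.0) - V[s])
        for s in V:
            V[s] += delta[s]
        if max(abs(d) for d in delta.values()) < tol:
            return V
```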
Example: Random Walk under Batch Updating
• Batch-updating versions of TD(0) and constant-𝛼 MC were applied as follows to the
random walk prediction example

• After each new episode, all episodes seen so far were treated as a batch.
• They were repeatedly presented to the algorithm, either TD(0) or constant-𝛼 MC, with 𝛼 sufficiently small that the value function converged.
• The batch TD method was consistently better than the batch Monte Carlo method.
• MC is optimal only in a limited way, and TD is optimal in a way that is more relevant to predicting returns.

[Figure: RMS error of batch TD(0) vs. batch constant-𝛼 MC on the random-walk task]
Example: You are the Predictor
• Suppose you observe the following eight episodes:
  • A, 0, B, 0
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 0
• V(A) = ?   V(B) = ?



Example: You are the Predictor

• Everyone would probably agree that the optimal value is V(B) = 3/4, because six out of the eight times in state B the process terminated immediately with a return of 1, and the other two times in B the process terminated immediately with a return of 0.

• V(A) =?



Example: You are the Predictor

• There are two reasonable answers.


• One way of viewing it is based on first modeling the Markov process and then computing the correct estimates given the model, which in this case gives V(A) = 3/4.
• 100% of the times the process was in state A, it traversed immediately to B (with a reward of 0);
and because we have already decided that B has value 3/4 , therefore A must have value 3/4 as
well.
• This is also the answer that batch TD(0) gives.



Example: You are the Predictor

• The other reasonable answer is simply to observe that we have seen A once and the return that
followed it was 0; we therefore estimate V (A) as 0.
• This is the answer that batch Monte Carlo methods give.
• Notice that it is also the answer that gives minimum squared error on the training data. In fact, it
gives zero error on the data.



Example: You are the Predictor

• The prediction that best matches the training data is V(A) = 0
  • This minimizes the mean-squared error on the training set
  • This is what a batch Monte Carlo method gets
• If we consider the sequentiality of the problem, then we would set V(A) = 0.75
  • This is correct for the maximum-likelihood estimate of a Markov model generating the data
  • i.e., if we fit the best Markov model to the data, assume it is exactly correct, and then compute what it predicts (how?)
  • This is called the certainty-equivalence estimate
  • This is what batch TD(0) gets
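A tiny check of the two answers, coded directly from the eight episodes listed above (the encoding of each episode as (state, reward, next_state) triples is an assumption of this sketch):

```python
# Batch-MC vs. certainty-equivalence estimates for the eight episodes above.
episodes = [[("A", 0, "B"), ("B", 0, None)]] + \
           [[("B", 1, None)]] * 6 + [[("B", 0, None)]]

# Batch Monte Carlo: average the (undiscounted) returns observed from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    for i, (s, _, _) in enumerate(ep):
        returns[s].append(sum(r for _, r, _ in ep[i:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)                               # {'A': 0.0, 'B': 0.75}

# Certainty equivalence: A always moves to B with reward 0, so V(A) = V(B).
print({"A": V_mc["B"], "B": V_mc["B"]})   # {'A': 0.75, 'B': 0.75}
```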
SARSA: On Policy TD Control
• SARSA is an on-policy learning method.
• It uses an ε-greedy strategy for all the steps.
• It updates the Q-value of an action based on the reward obtained from taking that action and the estimated value of the next state-action pair, assuming the agent keeps following the policy.
• This means we need to know the next action our policy takes in order to perform an
update step.



SARSA: On Policy TD Control
• Learning Action-value function

• Estimate Q𝜋 for the current behaviour policy 𝜋.


• After every transition from a nonterminal state St, do this:

  Q(St, At) ← Q(St, At) + 𝛼[Rt+1 + 𝛾Q(St+1, At+1) - Q(St, At)]

• If St+1 is terminal, then Q(St+1, At+1) is defined as 0.
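A sketch of one SARSA episode under assumed interfaces (env.reset()/env.step() returning (next_state, reward, done), and Q stored as a dict of dicts, are hypothetical placeholders):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])

def sarsa_episode(env, Q, actions, alpha=0.5, gamma=1.0, eps=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, eps)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = None if done else epsilon_greedy(Q, s2, actions, eps)
        target = r + (0.0 if done else gamma * Q[s2][a2])  # terminal Q is 0
        Q[s][a] += alpha * (target - Q[s][a])              # SARSA update
        s, a = s2, a2
```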



SARSA: On Policy TD Control
• It is straightforward to design an on-policy control algorithm based on the Sarsa
prediction method.
• As in all on-policy methods, we continually estimate q𝜋 for the behavior policy 𝜋, and at
the same time change 𝜋 toward greediness with respect to q𝜋.



Example: Windy Gridworld
• It is a standard gridworld, with start and goal states, but with one difference: there is a
crosswind running upward through the middle of the grid.

• Actions: up, down, left & right


• The strength of the wind is given below each column, in number of cells shifted upward.
• For example, if you are one cell to the right of the goal, then the action left takes you to
the cell just above the goal.
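A rough sketch of the windy-gridworld transition; the 7×10 grid size and the wind-strength row used here are the standard textbook values and are assumptions rather than values taken from the slide.

```python
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]        # upward shift per column (assumed)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def windy_step(state, action):
    """Apply the chosen move, then push the agent up by the wind in the column it left."""
    row, col = state
    d_row, d_col = MOVES[action]
    row = row + d_row - WIND[col]            # wind of the current column pushes up
    col = col + d_col
    row = max(0, min(ROWS - 1, row))         # clip to the grid
    col = max(0, min(COLS - 1, col))
    return (row, col)
```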
Example: Windy Gridworld
• This is an undiscounted episodic task, with a constant reward of -1 on every step until the goal state is reached.
• This graph shows the results of applying 𝜀-greedy Sarsa to this task, with 𝜀 = 0.1, 𝛼 = 0.5, and initial values Q(s, a) = 0 for all s, a.



Q-learning: Off-policy TD Control
• Q-Learning is an off-policy learning method.
• It updates the Q-value of an action based on the reward obtained on the transition and the maximum Q-value over the actions available in the next state.
• It is off-policy because it follows an ε-greedy strategy when selecting the action it actually takes, but uses greedy action selection in the update target.



Q-learning: Off-policy TD Control
• One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989), defined by

  Q(St, At) ← Q(St, At) + 𝛼[Rt+1 + 𝛾 maxa Q(St+1, a) - Q(St, At)]

• The use of the max over the available actions makes the Q-learning algorithm an off-policy approach.
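A minimal sketch of the update, using the same assumed dict-of-dicts Q-table as the earlier SARSA sketch:

```python
def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.5, gamma=1.0):
    """One Q-learning step: the target uses the greedy value of the next state."""
    best_next = 0.0 if done else max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```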
Example: Cliff Walking
• The on-policy SARSA agent views the cliff edge as riskier because it chooses and updates actions according to its stochastic (ε-greedy) policy.
• It has therefore learned that, near the edge, it has a high likelihood of stepping off the cliff and receiving a large negative reward.

• The Q-learning agent, by contrast, has learned its values with respect to the greedy (optimal) policy, which always chooses the action with the highest Q-value.
• It is therefore more confident in its ability to walk along the cliff edge without falling off.
Expected SARSA
• Expected Sarsa is very similar to Sarsa.
• However, instead of stochastically sampling the next state-action value using our current policy, it computes the expected value over all actions in the next state, weighting each by how likely it is under the current policy.
• The update step is now defined as:

  Q(St, At) ← Q(St, At) + 𝛼[Rt+1 + 𝛾 Σa 𝜋(a|St+1) Q(St+1, a) - Q(St, At)]

• Expected Sarsa is computationally more complex than Sarsa but, in return, it eliminates the variance due to the random selection of At+1.
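A sketch of the Expected Sarsa update for an ε-greedy policy (the ε-greedy action probabilities written out below are the usual ones and are assumed here, as is the dict-of-dicts Q-table):

```python
def expected_sarsa_update(Q, s, a, r, s_next, actions, done,
                          alpha=0.5, gamma=1.0, eps=0.1):
    """One Expected Sarsa step: the target averages Q(s', .) under the policy."""
    if done:
        expected = 0.0
    else:
        greedy = max(actions, key=lambda a2: Q[s_next][a2])
        probs = {a2: eps / len(actions) + (1 - eps) * (a2 == greedy)
                 for a2 in actions}
        expected = sum(probs[a2] * Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * expected - Q[s][a])
```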
Expected SARSA



Maximum Bias & Double Learning
• Maximization bias is a technical way of saying that the Q-learning algorithm overestimates its value function estimates (V) and action-value estimates (Q).
• A and B are nonterminal states.

• Episodes always start in A with a choice between two actions, left and right.
• The right action transitions immediately to the terminal state with a reward and return
of 0.
• Left action transitions to B, also with a reward of zero, from which there are a large
number of actions, all with rewards sampled from 𝓝(-0.1,1), which is a normal
distribution with mean -0.1 and variance 1.
Maximum Bias & Double Learning



Maximum Bias & Double Learning
• Given the large variance in rewards, it is quite possible that some of the initial estimates of these actions are positive, even though their true values are negative.
• The problem with Q-learning is that the same samples are being used to decide which action is the best (highest expected reward) and also to estimate that action's value.
• If an action's value is overestimated, it will be chosen as the best action, and its overestimated value will be used as the target.
• Instead, we can have two different action-value estimates, and the “best” action can
be chosen based on the values of the first action-value estimate and the target can be
decided by the second action-value estimate.
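A sketch of the double Q-learning update under these assumptions (two independent tables Q1 and Q2 stored as dicts of dicts; which table is updated is chosen by a coin flip):

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, done,
                    alpha=0.1, gamma=1.0):
    """One double Q-learning step: one table picks the argmax, the other evaluates it."""
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1                       # update the other table half the time
    if done:
        target = r
    else:
        best = max(actions, key=lambda a2: Q1[s_next][a2])  # argmax from Q1 ...
        target = r + gamma * Q2[s_next][best]               # ... value from Q2
    Q1[s][a] += alpha * (target - Q1[s][a])
```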



Maximum Bias & Double Learning
• Double Q-learning very quickly adjusts and is not affected by any lucky draws from
actions taken in the Gamble state.
• Double Q-learning performs much better in this game, and empirically this has been
the recurring theme across many games.



Maximum Bias & Double Learning
• Even though both methods converge to Q∗, the policies in the meantime tend to be better in most games when being pessimistic.
• This is particularly remarkable since, when Double Q-learning splits the samples into two groups, the variance of the estimators grows.
• This means it has lower sample efficiency, but despite this, the benefit of removing the optimism from Q-learning usually outweighs the disadvantage.

