Module3 TD Methods

The document discusses Temporal Difference (TD) learning, which combines Monte Carlo and dynamic programming methods to learn from raw experiences without requiring a model of the environment. It highlights the concepts of on-policy and off-policy learning, specifically through Sarsa and Q-learning, and addresses the issue of maximization bias in reinforcement learning, proposing solutions like Double Q-learning and Expected Sarsa to mitigate this bias. The document provides examples and explanations of how these methods work to improve learning stability and action selection accuracy.

Temporal Difference Learning
Dr. D. John Pradeep
Associate Professor
VIT-AP University
TD-Learning
• TD learning combines Monte Carlo (MC) ideas with dynamic programming (DP) ideas
• Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics
• Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap)
How the three method families back up value estimates (from the backup-diagram figure):
• Dynamic programming: full backup, bootstrapping
• Monte Carlo methods: sample backup, no bootstrapping
• Temporal Difference (TD(λ)) methods: sample backup, bootstrapping

The TD(0) backup looks only one step ahead, from St to St+1, and the TD(0) error measures how far the current estimate V(St) is from the one-step bootstrapped target.
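The equation image on this slide was not recovered; the standard TD(0) update and TD error (as given in Sutton and Barto) are:

```latex
\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
\qquad
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
\]
```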
TD Methods
Sarsa – On policy TD control
• An episode consists of an alternating sequence of states and state–action pairs: St, At, Rt+1, St+1, At+1, Rt+2, St+2, ...
• State–action value update (shown below)
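The update-rule image did not survive extraction; the standard Sarsa update uses the quintuple (St, At, Rt+1, St+1, At+1), which gives the algorithm its name:

```latex
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr]
\]
```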


Q-Learning: Off policy learning
• In Q-learning, the learned action-value function Q directly approximates q*, the optimal action-value function, independent of the policy being followed; the one-step update rule is shown below
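The update equation on this slide was an image that did not come through; the standard one-step Q-learning update is:

```latex
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \bigr]
\]
```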
Maximization bias in RL
• Maximization bias in reinforcement learning happens when the
action-value function Q(s, a) overestimates the true value of an
action because we always pick the action with the highest estimated
value — even when those estimates might be inaccurate. This bias
arises from the interaction of maximization and estimation
errors.
Let’s take a simple example:
Imagine you’re playing a slot machine game with two slot machines:
• Machine A gives an average reward of 5, but individual pulls pay 0, 5, or 10, so its payouts are noisy.
• Machine B gives an average reward of 4, with more consistent payouts of 3, 4, or 5.
• You start by estimating the rewards for each machine. After a few
trials, your current estimates of the Q-values might look like this:
Q(A) = 7 (because you happened to hit a couple of 10s early on)
Q(B) = 4
• Now, because we always choose the action with the highest estimated
value, you’ll keep choosing Machine A — even though its true average
is 5 and you’ve overestimated it.
• This overestimation is an example of maximization bias —
because the "max" operation picks the highest estimate, not
necessarily the most accurate one.
Why is this a problem?
• It can lead to suboptimal action selection, favoring actions
whose value estimates are inflated by randomness.
• It makes learning slower and less stable because the algorithm
struggles with overestimated values.
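To make the slot-machine example concrete, here is a small, self-contained simulation (not part of the original slides; the payout values follow the example above). It estimates each machine's value from only a few pulls, takes the max of the two estimates, and averages that max over many runs; the result comes out above the true best value of 5, which is exactly the maximization bias being described.

```python
import random

# Machine A pays 0, 5, or 10 (true mean 5); machine B pays 3, 4, or 5 (true mean 4).
def pull(machine):
    return random.choice([0, 5, 10] if machine == "A" else [3, 4, 5])

def estimate(machine, n_pulls=3):
    # Q estimate from only a few pulls, as in the Q(A) = 7 example above.
    return sum(pull(machine) for _ in range(n_pulls)) / n_pulls

n_runs = 100_000
avg_max_estimate = sum(max(estimate("A"), estimate("B")) for _ in range(n_runs)) / n_runs

print("true value of the best action:", 5.0)
print("average of max(Q_hat_A, Q_hat_B): %.2f" % avg_max_estimate)
# The second number is noticeably above 5.0: taking the max of noisy
# estimates systematically overestimates the value of the best action.
```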
How to reduce maximization bias?
• Double Q-Learning: It reduces this bias by maintaining two
separate Q-value estimates and using one to choose actions and
the other to evaluate them.
• Averaging techniques: methods like Expected SARSA replace the max in the target with a probability-weighted average of the action values under the current policy, so the target does not bootstrap exclusively from the single highest (possibly inflated) estimate.
Double Q-learning
Double Q-learning is a technique designed to reduce
maximization bias in standard Q-learning by using two
separate Q-value functions.
The idea: Instead of maintaining one Q-function Q(s, a), Double
Q-learning keeps two: Q1(s,a) and Q2(s,a)
• These two estimates are trained independently on different
subsets of experiences, which helps control the overestimation
problem caused by the maximization operation in standard Q-
learning.
How it works: When we take an action and observe a reward,
we update one of the two Q-functions randomly:
1. Choose which Q-function to update: pick Q1 or Q2, each with probability 0.5.
2. Select action using one function, evaluate with the
other: Let’s say we’re updating Q1:
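The corresponding equation image is missing; in the standard formulation, the greedy action is selected with Q1 and evaluated with Q2:

```latex
\[
Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma\, Q_2\bigl(S_{t+1}, \arg\max_{a} Q_1(S_{t+1}, a)\bigr) - Q_1(S_t, A_t) \Bigr]
\]
```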
If we were updating Q2, we’d reverse the roles:
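Again reconstructing the missing equation in the standard form, selection now uses Q2 and evaluation uses Q1:

```latex
\[
Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma\, Q_1\bigl(S_{t+1}, \arg\max_{a} Q_2(S_{t+1}, a)\bigr) - Q_2(S_t, A_t) \Bigr]
\]
```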

• Action selection during training or execution: when selecting an action (for example with an ε-greedy strategy), you can combine the two estimates:
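The exact combination shown on the original slide is not recoverable; a common choice is to act greedily (or ε-greedily) with respect to the sum of the two estimates:

```latex
\[
A_t = \arg\max_{a} \bigl[ Q_1(S_t, a) + Q_2(S_t, a) \bigr]
\]
```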

Why this works:


• By decoupling action selection and evaluation, Double Q-
learning reduces the risk that an overestimated value for an action
keeps reinforcing itself.
• Since Q1 and Q2 are trained on different samples and only one updates
at a time, they tend to average out overestimation errors.
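As a rough illustration of the procedure described above (not from the original slides), the following Python sketch performs tabular Double Q-learning updates; the action set, hyperparameter values, and environment interface are illustrative placeholders.

```python
import random
from collections import defaultdict

# Illustrative placeholders: actions, alpha, gamma, epsilon.
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2, 3]
Q1 = defaultdict(float)   # Q1[(state, action)] -> value estimate
Q2 = defaultdict(float)

def behaviour_action(state):
    # Epsilon-greedy on the combined estimate Q1 + Q2.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])

def double_q_update(s, a, r, s_next, done):
    # With probability 0.5 update Q1 (select with Q1, evaluate with Q2);
    # otherwise update Q2 with the roles reversed.
    if random.random() < 0.5:
        select, evaluate = Q1, Q2
    else:
        select, evaluate = Q2, Q1
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda b: select[(s_next, b)])
        target = r + gamma * evaluate[(s_next, a_star)]
    select[(s, a)] += alpha * (target - select[(s, a)])
```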
A simple example: suppose in state s the two estimates are
Q1(s,a1) = 5, Q2(s,a1) = 4
Q1(s,a2) = 3, Q2(s,a2) = 6
• With a single Q-function like Q2, standard Q-learning would select a2 (its highest estimate, 6) and bootstrap from that same inflated value of 6.
With Double Q-learning:
• If we update Q1, we select the greedy action according to Q1, which is a1 (5 > 3), but evaluate it with Q2, so the target uses Q2(s,a1) = 4 rather than Q1's own estimate of 5.
• If we update Q2, we select a2 (6 > 4 according to Q2) but evaluate it with Q1, so the target uses Q1(s,a2) = 3, which pulls the inflated estimate of 6 back down.
• Result: we avoid persistently overestimating the value of a2 because selection and evaluation are split between two independently trained functions.
Expected Sarsa
• Expected Sarsa replaces the sampled next action value in the Sarsa target with its expected value under the current policy
• This eliminates the variance due to the random selection of At+1
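The update equation on this slide is missing from the extraction; the standard Expected Sarsa update is:

```latex
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Bigr]
\]
```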
