Module3 TD Methods

The document discusses Temporal Difference (TD) learning, which combines Monte Carlo and dynamic programming methods to learn from raw experiences without requiring a model of the environment. It highlights the concepts of on-policy and off-policy learning, specifically through Sarsa and Q-learning, and addresses the issue of maximization bias in reinforcement learning, proposing solutions like Double Q-learning and Expected Sarsa to mitigate this bias. The document provides examples and explanations of how these methods work to improve learning stability and action selection accuracy.

Temporal Difference Learning
Dr. D. John Pradeep
Associate Professor
VIT-AP University
TD-Learning
• TD learning combines Monte Carlo (MC) ideas with dynamic programming (DP) ideas
• Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics
• Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap)
How the three method families back up value estimates (from the backup-diagram figure):
• Dynamic programming: full backup, bootstrapping
• Monte Carlo methods: sample backup, no bootstrapping
• Temporal Difference (TD(λ)) methods: sample backup, bootstrapping

The TD(0) backup looks only one step ahead, from St to St+1, and the TD(0) error measures how far the current estimate V(St) is from the one-step bootstrapped target.
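The equation image on this slide was not recovered; the standard TD(0) update and TD error (as given in Sutton and Barto) are:

```latex
\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
\qquad
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
\]
```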
TD Methods
Sarsa – On policy TD control
• An episode consists of an alternating sequence of states and state–action pairs: St, At, Rt+1, St+1, At+1, Rt+2, St+2, ...
• State–action value update (shown below)
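The update-rule image did not survive extraction; the standard Sarsa update uses the quintuple (St, At, Rt+1, St+1, At+1), which gives the algorithm its name:

```latex
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \bigr]
\]
```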


Q-Learning: Off policy learning
• In Q-learning, the learned action-value function Q directly approximates q*, the optimal action-value function, independent of the policy being followed; the one-step update rule is shown below
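The update equation on this slide was an image that did not come through; the standard one-step Q-learning update is:

```latex
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \bigr]
\]
```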
Maximization bias in RL
• Maximization bias in reinforcement learning happens when the
action-value function Q(s, a) overestimates the true value of an
action because we always pick the action with the highest estimated
value — even when those estimates might be inaccurate. This bias
arises from the interaction of maximization and estimation
errors.
Let’s take a simple example:
Imagine you’re playing a slot machine game with two slot machines:
• Machine A gives an average reward of 5, but individual pulls pay 0, 5, or 10, so its payouts are noisy.
• Machine B gives an average reward of 4, with more consistent payouts of 3, 4, or 5.
• You start by estimating the rewards for each machine. After a few
trials, your current estimates of the Q-values might look like this:
Q(A) = 7 (because you happened to hit a couple of 10s early on)
Q(B) = 4
• Now, because we always choose the action with the highest estimated
value, you’ll keep choosing Machine A — even though its true average
is 5 and you’ve overestimated it.
• This overestimation is an example of maximization bias —
because the "max" operation picks the highest estimate, not
necessarily the most accurate one.
Why is this a problem?
• It can lead to suboptimal action selection, favoring actions
whose value estimates are inflated by randomness.
• It makes learning slower and less stable because the algorithm
struggles with overestimated values.
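To make the slot-machine example concrete, here is a small, self-contained simulation (not part of the original slides; the payout values follow the example above). It estimates each machine's value from only a few pulls, takes the max of the two estimates, and averages that max over many runs; the result comes out above the true best value of 5, which is exactly the maximization bias being described.

```python
import random

# Machine A pays 0, 5, or 10 (true mean 5); machine B pays 3, 4, or 5 (true mean 4).
def pull(machine):
    return random.choice([0, 5, 10] if machine == "A" else [3, 4, 5])

def estimate(machine, n_pulls=3):
    # Q estimate from only a few pulls, as in the Q(A) = 7 example above.
    return sum(pull(machine) for _ in range(n_pulls)) / n_pulls

n_runs = 100_000
avg_max_estimate = sum(max(estimate("A"), estimate("B")) for _ in range(n_runs)) / n_runs

print("true value of the best action:", 5.0)
print("average of max(Q_hat_A, Q_hat_B): %.2f" % avg_max_estimate)
# The second number is noticeably above 5.0: taking the max of noisy
# estimates systematically overestimates the value of the best action.
```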
How to reduce maximization bias?
• Double Q-Learning: It reduces this bias by maintaining two
separate Q-value estimates and using one to choose actions and
the other to evaluate them.
• Averaging techniques: methods like Expected SARSA replace the max in the target with a probability-weighted average of the action values under the current policy, so the target does not bootstrap exclusively from the single highest (possibly inflated) estimate.
Double Q-learning
Double Q-learning is a technique designed to reduce
maximization bias in standard Q-learning by using two
separate Q-value functions.
The idea: Instead of maintaining one Q-function Q(s, a), Double
Q-learning keeps two: Q1(s,a) and Q2(s,a)
• These two estimates are trained independently on different
subsets of experiences, which helps control the overestimation
problem caused by the maximization operation in standard Q-
learning.
How it works: When we take an action and observe a reward,
we update one of the two Q-functions randomly:
1. Choose which Q-function to update: pick Q1 or Q2, each with probability 0.5.
2. Select action using one function, evaluate with the
other: Let’s say we’re updating Q1:
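The corresponding equation image is missing; in the standard formulation, the greedy action is selected with Q1 and evaluated with Q2:

```latex
\[
Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma\, Q_2\bigl(S_{t+1}, \arg\max_{a} Q_1(S_{t+1}, a)\bigr) - Q_1(S_t, A_t) \Bigr]
\]
```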
If we were updating Q2, we’d reverse the roles:
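Again reconstructing the missing equation in the standard form, selection now uses Q2 and evaluation uses Q1:

```latex
\[
Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma\, Q_1\bigl(S_{t+1}, \arg\max_{a} Q_2(S_{t+1}, a)\bigr) - Q_2(S_t, A_t) \Bigr]
\]
```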

• Action selection during training or execution: when selecting an action (for example with an ε-greedy strategy), you can combine the two estimates:
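The exact combination shown on the original slide is not recoverable; a common choice is to act greedily (or ε-greedily) with respect to the sum of the two estimates:

```latex
\[
A_t = \arg\max_{a} \bigl[ Q_1(S_t, a) + Q_2(S_t, a) \bigr]
\]
```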

Why this works:


• By decoupling action selection and evaluation, Double Q-
learning reduces the risk that an overestimated value for an action
keeps reinforcing itself.
• Since Q1 and Q2 are trained on different samples and only one updates
at a time, they tend to average out overestimation errors.
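As a rough illustration of the procedure described above (not from the original slides), the following Python sketch performs tabular Double Q-learning updates; the action set, hyperparameter values, and environment interface are illustrative placeholders.

```python
import random
from collections import defaultdict

# Illustrative placeholders: actions, alpha, gamma, epsilon.
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2, 3]
Q1 = defaultdict(float)   # Q1[(state, action)] -> value estimate
Q2 = defaultdict(float)

def behaviour_action(state):
    # Epsilon-greedy on the combined estimate Q1 + Q2.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])

def double_q_update(s, a, r, s_next, done):
    # With probability 0.5 update Q1 (select with Q1, evaluate with Q2);
    # otherwise update Q2 with the roles reversed.
    if random.random() < 0.5:
        select, evaluate = Q1, Q2
    else:
        select, evaluate = Q2, Q1
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda b: select[(s_next, b)])
        target = r + gamma * evaluate[(s_next, a_star)]
    select[(s, a)] += alpha * (target - select[(s, a)])
```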
A simple example: suppose in state s the two estimates are
Q1(s,a1) = 5, Q2(s,a1) = 4
Q1(s,a2) = 3, Q2(s,a2) = 6
• With a single Q-function like Q2, standard Q-learning would select a2 (its highest estimate, 6) and bootstrap from that same inflated value of 6.
With Double Q-learning:
• If we update Q1, we select the greedy action according to Q1, which is a1 (5 > 3), but evaluate it with Q2, so the target uses Q2(s,a1) = 4 rather than Q1's own estimate of 5.
• If we update Q2, we select a2 (6 > 4 according to Q2) but evaluate it with Q1, so the target uses Q1(s,a2) = 3, which pulls the inflated estimate of 6 back down.
• Result: we avoid persistently overestimating the value of a2 because selection and evaluation are split between two independently trained functions.
Expected Sarsa
• Expected Sarsa replaces the sampled next action value in the Sarsa target with its expected value under the current policy
• This eliminates the variance due to the random selection of At+1
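The update equation on this slide is missing from the extraction; the standard Expected Sarsa update is:

```latex
\[
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Bigr]
\]
```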
