Reinforcement Learning

Module 5
Dr. D. Sathian
SCOPE
Temporal Difference (TD) Learning
• Temporal difference (TD) learning is one of RL’s model-free methods.
• TD learning is a combination of Monte Carlo ideas and dynamic programming (DP)
ideas.
• Like Monte Carlo methods, TD methods can learn directly from raw experience without
a model of the environment’s dynamics.
• Like DP, TD methods update estimates based in part on other learned estimates,
without waiting for a final outcome (they bootstrap).

• Temporal Difference Learning is an unsupervised learning technique that is very commonly used in reinforcement learning to predict the total reward expected over the future.
Temporal Difference Prediction
• Both TD and Monte Carlo methods use experience to solve the prediction problem.
• Monte Carlo methods wait until the return following the visit is known, which is only after the episode ends, before updating the value of the state; TD methods, in contrast, update the value at the very next time step: at time t+1 they immediately form a target and make a useful update using the observed reward.
• A simple every-visit Monte Carlo method suitable for nonstationary environments is

  V(St) ← V(St) + 𝛼[Gt - V(St)]

• where Gt is the actual return following time t, and 𝛼 is a constant step-size parameter.
• The simplest TD method makes the update immediately on transition to St+1 and on receiving Rt+1:

  V(St) ← V(St) + 𝛼[Rt+1 + 𝛾V(St+1) - V(St)]

• The bracketed quantity Rt+1 + 𝛾V(St+1) is an estimate of the return.
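As a concrete illustration (not from the slides), the two update rules can be written as small tabular routines; V is assumed to be a dictionary of state-value estimates, and alpha and gamma are illustrative constants.

```python
# Minimal sketch of the two tabular update rules described above.
# Assumptions: V is a dict mapping states to value estimates; alpha and gamma
# are hypothetical constants chosen for illustration.

def mc_update(V, state, G, alpha=0.1):
    """Constant-alpha every-visit MC: move V(s) toward the observed return G."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """TD(0): move V(s) toward the bootstrapped target R + gamma * V(s')."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
```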


Temporal Difference Prediction
• Updating the state value after just one time step is called one-step TD or TD(0); it is a special case of the TD(𝜆) and n-step TD methods.



Temporal Difference Prediction

[Figure: update diagrams for Simple Monte Carlo (left) and Simple TD (right)]



Temporal Difference Prediction

[Figure: update diagram for DP]



Temporal Difference Prediction
• Because TD(0) bases its update in part on an existing estimate, we say that it is a
bootstrapping method, like DP.
• In the TD update equation, the quantity in brackets is a measure of error, measuring the difference between the estimated value of St and the better estimate Rt+1 + 𝛾V(St+1) available at the next time step.
• This quantity is called the TD error, which has a widespread presence throughout reinforcement learning:

  𝛿t = Rt+1 + 𝛾V(St+1) - V(St)



Temporal Difference Prediction
• The TD error at each time is the error in the estimate made at that time.
• Because the TD error depends on the next state and next reward, it is not actually
available until one time step later.
• That is, 𝛿t is the error in V (St), available at time t + 1.
• Also note that if the array V does not change during the episode (as it does not in Monte Carlo methods), then the Monte Carlo error can be written as a sum of TD errors:

  Gt - V(St) = 𝛿t + 𝛾𝛿t+1 + 𝛾²𝛿t+2 + ... + 𝛾^(T-1-t)𝛿T-1 = Σ(k=t to T-1) 𝛾^(k-t) 𝛿k
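A tiny numerical check of this identity, with made-up states, rewards, and value estimates (none of these numbers come from the slides):

```python
# Verify that, for a fixed V, the MC error G_t - V(S_t) equals the discounted
# sum of TD errors along the remainder of the episode (illustrative numbers).

gamma = 1.0
V = {"A": 0.5, "B": 0.2, "C": 0.8, "terminal": 0.0}
states = ["A", "B", "C", "terminal"]   # S_0 .. S_T
rewards = [0.0, 1.0, 2.0]              # R_1 .. R_T

# Monte Carlo error at t = 0
G0 = sum(gamma**k * r for k, r in enumerate(rewards))
mc_error = G0 - V[states[0]]

# Discounted sum of TD errors delta_k = R_{k+1} + gamma*V(S_{k+1}) - V(S_k)
td_errors = [rewards[k] + gamma * V[states[k + 1]] - V[states[k]]
             for k in range(len(rewards))]
sum_td = sum(gamma**k * d for k, d in enumerate(td_errors))

assert abs(mc_error - sum_td) < 1e-12  # the two decompositions agree
```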



Example: Driving Home
• Each day as you drive home from work, you try to predict how long it will take to get home.
• When you leave your office, you note the time, the day of week, the weather, and anything
else that might be relevant.
• Say on this Friday you are leaving at exactly 6 o’clock, and you estimate that it will take 30
minutes to get home. As you reach your car it is 6:05, and you notice it is starting to rain.
• Traffic is often slower in the rain, so you re-estimate that it will take 35 minutes from then, or a
total of 40 minutes.
• Fifteen minutes later you have completed the highway portion of your journey in good time.
• As you exit onto a secondary road you cut your estimate of total travel time to 35 minutes.
• Unfortunately, at this point you get stuck behind a slow truck, and the road is too narrow to
pass. You end up having to follow the truck until you turn onto the side street where you live at
6:40.
• Three minutes later you are home.
Example: Driving Home
• The sequence of states, times, and predictions is thus as follows:

  State                           Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
  leaving office, Friday at 6              0                    30                    30
  reach car, raining                       5                    35                    40
  exiting highway                         20                    15                    35
  secondary road, behind truck            30                    10                    40
  entering home street                    40                     3                    43
  arrive home                             43                     0                    43
• The rewards in this example are the elapsed times on each leg of the journey.
• We are not discounting (𝛾= 1), and thus the return for each state is the actual time to go from
that state.
• The value of each state is the expected time to go.
• The second column of numbers gives the current estimated value for each state encountered.
Example: Driving Home
• The red arrows show the changes in predictions recommended by the constant-𝛼 MC
method, for 𝛼 = 1.
• These are exactly the errors between the estimated value (predicted time to go) in each
state and the actual return (actual time to go).

• For example, when you exited the highway you thought it would take only 15 minutes more to get home, but in fact it took 23 minutes.
• The error, Gt - V(St), at this time is 8 minutes.

[Figure: changes recommended in the driving-home example by MC (left) and TD (right)]
Example: Driving Home
• Suppose the step-size parameter 𝛼 is 1/2. Then the predicted time to go after exiting the highway would be revised upward by four minutes as a result of this experience.
• In any event, the change can only be made offline, that is, after you have reached home.
Only at this point do you know any of the actual returns.
• Suppose on another day you again estimate when leaving your office that it will take 30
minutes to drive home, but then you become stuck in a massive traffic jam. 25 minutes
after leaving the office you are still bumper-to-bumper on the highway. You now
estimate that it will take another 25 minutes to get home, for a total of 50 minutes. As
you wait in traffic, you already know that your initial estimate of 30 minutes was too
optimistic.

• Must you wait until you get home before increasing your estimate for the initial state?



Example: Driving Home
• According to the Monte Carlo approach you must wait until you get home, because you
don’t yet know the true return.
• On the other hand, according to a TD approach, you would learn immediately, shifting
your initial estimate from 30 minutes toward 50.
• In fact, each estimate would be shifted toward the estimate that immediately follows it.
• Returning to our first day of driving, Figure (right) shows the changes in the predictions
recommended by the TD rule (these are the changes made by the rule if 𝛼 = 1).
• Each error is proportional to the change over time of the prediction, that is, to the
temporal differences in predictions.
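To make the TD changes concrete, here is a small sketch that recomputes them from the driving-home table reconstructed earlier (𝛼 = 1, 𝛾 = 1); the elapsed-time and prediction numbers are those table values, not new data.

```python
# TD(0) changes for the driving-home example with alpha = 1, gamma = 1.
# Each tuple is (state, elapsed minutes, predicted time to go), taken from
# the table shown earlier in this section.

alpha, gamma = 1.0, 1.0
legs = [("leaving office", 0, 30),
        ("reach car",      5, 35),
        ("exit highway",  20, 15),
        ("behind truck",  30, 10),
        ("home street",   40,  3),
        ("arrive home",   43,  0)]

for (s, t, v), (_, t2, v2) in zip(legs, legs[1:]):
    reward = t2 - t                      # elapsed time on this leg
    td_error = reward + gamma * v2 - v   # R + gamma*V(next) - V(current)
    print(f"{s:15s} TD change: {alpha * td_error:+.0f} min")
# Prints +10, -5, +5, +3, +0: each prediction is shifted toward the one
# that immediately follows it.
```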



Advantages of TD Prediction Method
• TD methods do not require a model of the environment, only experience.
• TD, but not MC, methods can be fully incremental:
  • You can learn before knowing the final outcome
  • Less memory
  • Less peak computation
  • You can learn without the final outcome
  • From incomplete sequences
• Both MC and TD converge (under certain assumptions to be detailed later), but which is
faster?



Example: Random Walk
• In this example we empirically compare the prediction abilities of TD(0) and constant-𝛼
MC when applied to the following Markov reward process:

[Figure: five-state random walk with nonterminal states A, B, C, D, E and a terminal state at each end]
• All episodes start in the center state, C, and proceed either left or right by one state on
each step, with equal probability.
• This behavior is presumably due to the combined effect of a fixed policy and an
environment's state-transition probabilities, but we do not care which; we are
concerned only with predicting returns however they are generated.
• Episodes terminate either on the extreme left or the extreme right. When an episode
terminates on the right a reward of +1 occurs; all other rewards are zero.
Example: Random Walk
• For example, a typical walk might consist of the following state-and-reward sequence:
C, 0, B, 0, C, 0, D, 0, E, 1.

• Because this task is undiscounted and episodic, the true value of each state is the
probability of terminating on the right if starting from that state.
• Thus, the true value of the center state is v𝜋(C) = 0.5.
• The true values of all the states, A through E, are 1/6, 2/6, 3/6, 4/6, and 5/6.
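A rough simulation sketch of this experiment (values initialised to 0.5 and 𝛼 = 0.1 follow the example; everything else, such as the episode count, is an arbitrary choice):

```python
import random

# TD(0) prediction on the five-state random walk described above.
states = ["A", "B", "C", "D", "E"]
V = {s: 0.5 for s in states}           # intermediate values initialised to 0.5
V["left"] = V["right"] = 0.0           # terminal states
alpha = 0.1

def step(s):
    """Move left or right with equal probability; reward +1 only on the right exit."""
    i = states.index(s)
    if random.random() < 0.5:
        nxt = states[i - 1] if i > 0 else "left"
    else:
        nxt = states[i + 1] if i < 4 else "right"
    return nxt, (1.0 if nxt == "right" else 0.0)

for _ in range(1000):                  # episodes
    s = "C"
    while s in states:
        nxt, r = step(s)
        V[s] += alpha * (r + V[nxt] - V[s])   # TD(0) update, gamma = 1
        s = nxt

print({s: round(V[s], 2) for s in states})    # hovers around 1/6 .. 5/6
```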



Example: Random Walk

• The values learned by TD(0) approach the true values as more episodes are experienced.
• With a constant step-size parameter (𝛼 = 0.1 in this example), the values fluctuate indefinitely in response to the outcomes of the most recent episodes.
Example: Random Walk

• This graph shows learning curves for the two methods for various values of 𝛼.
• The performance measure shown is the root mean-squared (RMS) error between the
value function learned and the true value function, averaged over the five states, then
averaged over 100 runs.
Optimality of TD(0)
• Suppose there is available only a finite amount of experience, say 10 episodes or 100
time steps.
• In this case, a common approach with incremental learning methods is to present the
experience repeatedly until the method converges upon an answer.
• Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10
episodes until convergence.
• Compute updates according to TD(0), but only update estimates after each complete
pass through the data.
• For any finite Markov prediction task, under batch updating, TD(0) converges deterministically to a single answer for sufficiently small 𝛼.
• Constant-𝛼 MC also converges deterministically under the same conditions, but to a different answer!
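A minimal sketch of batch TD(0) under assumed interfaces: `episodes` is a list of episodes, each a list of (state, reward, next_state) transitions, and V is a dict of estimates; the step size and tolerance are illustrative.

```python
# Batch TD(0): sweep repeatedly over all stored episodes, accumulating the
# TD(0) increments and applying them only once per complete pass, until the
# value function stops changing.

def batch_td0(V, episodes, alpha=0.01, gamma=1.0, tol=1e-6):
    while True:
        delta = {s: 0.0 for s in V}
        for episode in episodes:
            for s, r, s_next in episode:
                delta[s] += alpha * (r + gamma * V.get(s_next, 0.0) - V[s])
        for s in V:
            V[s] += delta[s]
        if max(abs(d) for d in delta.values()) < tol:
            return V
```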
Example: Random Walk under Batch Updating
• Batch-updating versions of TD(0) and constant-𝛼 MC were applied as follows to the
random walk prediction example

• After each new episode, all episodes seen so far were treated as a batch.
• They were repeatedly presented to the algorithm, either TD(0) or constant-𝛼 MC, with 𝛼 sufficiently small that the value function converged.
• The batch TD method was consistently better than the batch Monte Carlo method.
• MC is optimal only in a limited way, and TD is optimal in a way that is more relevant to predicting returns.

[Figure: RMS error of batch TD(0) vs. batch constant-𝛼 MC on the random-walk task]
Example: You are the Predictor
• Suppose you observe the following eight episodes:
  • A, 0, B, 0
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 0
• V(A) = ?   V(B) = ?



Example: You are the Predictor

• Everyone would probably agree that the optimal value is V(B) = 3/4, because six out of the eight times in state B the process terminated immediately with a return of 1, and the other two times in B the process terminated immediately with a return of 0.

• V(A) =?



Example: You are the Predictor

• There are two reasonable answers.


• One way of viewing it is based on first modeling the Markov process and then computing the correct estimates given the model, which in this case gives V(A) = 3/4.
• 100% of the times the process was in state A, it traversed immediately to B (with a reward of 0);
and because we have already decided that B has value 3/4 , therefore A must have value 3/4 as
well.
• This is also the answer that batch TD(0) gives.



Example: You are the Predictor

• The other reasonable answer is simply to observe that we have seen A once and the return that
followed it was 0; we therefore estimate V (A) as 0.
• This is the answer that batch Monte Carlo methods give.
• Notice that it is also the answer that gives minimum squared error on the training data. In fact, it
gives zero error on the data.



Example: You are the Predictor

• The prediction that best matches the training data is V(A) = 0
  • This minimizes the mean-squared error on the training set
  • This is what a batch Monte Carlo method gets
• If we consider the sequentiality of the problem, then we would set V(A) = 0.75
  • This is correct for the maximum-likelihood estimate of a Markov model generating the data
  • i.e., if we fit the best Markov model to the data, assume it is exactly correct, and then compute what it predicts (how?)
  • This is called the certainty-equivalence estimate
  • This is what batch TD(0) gets
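A tiny check of the two answers, coded directly from the eight episodes listed above (the encoding of each episode as (state, reward, next_state) triples is an assumption of this sketch):

```python
# Batch-MC vs. certainty-equivalence estimates for the eight episodes above.
episodes = [[("A", 0, "B"), ("B", 0, None)]] + \
           [[("B", 1, None)]] * 6 + [[("B", 0, None)]]

# Batch Monte Carlo: average the (undiscounted) returns observed from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    for i, (s, _, _) in enumerate(ep):
        returns[s].append(sum(r for _, r, _ in ep[i:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)                               # {'A': 0.0, 'B': 0.75}

# Certainty equivalence: A always moves to B with reward 0, so V(A) = V(B).
print({"A": V_mc["B"], "B": V_mc["B"]})   # {'A': 0.75, 'B': 0.75}
```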
SARSA: On Policy TD Control
• SARSA is an on-policy learning method.
• It uses an ε-greedy strategy for all the steps.
• It updates the Q-value of an action based on the reward obtained from taking that action and the estimated value of the next state-action pair, assuming the agent keeps following the policy.
• This means we need to know the next action our policy takes in order to perform an
update step.



SARSA: On Policy TD Control
• Learning Action-value function

• Estimate Q𝜋 for the current behaviour policy 𝜋.


• After every transition from a nonterminal state St, do this:

  Q(St, At) ← Q(St, At) + 𝛼[Rt+1 + 𝛾Q(St+1, At+1) - Q(St, At)]

• If St+1 is terminal, then Q(St+1, At+1) is defined as 0.
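A sketch of one SARSA episode under assumed interfaces (env.reset()/env.step() returning (next_state, reward, done), and Q stored as a dict of dicts, are hypothetical placeholders):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])

def sarsa_episode(env, Q, actions, alpha=0.5, gamma=1.0, eps=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, eps)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = None if done else epsilon_greedy(Q, s2, actions, eps)
        target = r + (0.0 if done else gamma * Q[s2][a2])  # terminal Q is 0
        Q[s][a] += alpha * (target - Q[s][a])              # SARSA update
        s, a = s2, a2
```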



SARSA: On Policy TD Control
• It is straightforward to design an on-policy control algorithm based on the Sarsa
prediction method.
• As in all on-policy methods, we continually estimate q𝜋 for the behavior policy 𝜋, and at
the same time change 𝜋 toward greediness with respect to q𝜋.



Example: Windy Gridworld
• It is a standard gridworld, with start and goal states, but with one difference: there is a
crosswind running upward through the middle of the grid.

• Actions: up, down, left & right


• The strength of the wind is given below each column, in number of cells shifted upward.
• For example, if you are one cell to the right of the goal, then the action left takes you to
the cell just above the goal.
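A rough sketch of the windy-gridworld transition; the 7×10 grid size and the wind-strength row used here are the standard textbook values and are assumptions rather than values taken from the slide.

```python
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]        # upward shift per column (assumed)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def windy_step(state, action):
    """Apply the chosen move, then push the agent up by the wind in the column it left."""
    row, col = state
    d_row, d_col = MOVES[action]
    row = row + d_row - WIND[col]            # wind of the current column pushes up
    col = col + d_col
    row = max(0, min(ROWS - 1, row))         # clip to the grid
    col = max(0, min(COLS - 1, col))
    return (row, col)
```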
Example: Windy Gridworld
• This is an undiscounted episodic task, with a constant reward of -1 on every step until the goal state is reached.
• This graph shows the results of applying 𝜀-greedy Sarsa to this task, with 𝜀 = 0.1, 𝛼 = 0.5, and initial values Q(s, a) = 0 for all s, a.



Q-learning: Off-policy TD Control
• Q-Learning is an off-policy learning method.
• It updates the Q-value of an action based on the reward obtained on the transition and the maximum Q-value over the actions available in the next state.
• It is off-policy because it follows an ε-greedy strategy when selecting the action it actually takes, but uses greedy action selection in the update target.



Q-learning: Off-policy TD Control
• One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989), defined by

  Q(St, At) ← Q(St, At) + 𝛼[Rt+1 + 𝛾 maxa Q(St+1, a) - Q(St, At)]

• The use of the max over the available actions makes the Q-learning algorithm an off-policy approach.
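A minimal sketch of the update, using the same assumed dict-of-dicts Q-table as the earlier SARSA sketch:

```python
def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.5, gamma=1.0):
    """One Q-learning step: the target uses the greedy value of the next state."""
    best_next = 0.0 if done else max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```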
Example: Cliff Walking
• The on-policy SARSA agent views the cliff edge as riskier because it chooses and updates actions according to its stochastic (ε-greedy) policy.
• It has therefore learned that, near the edge, it has a high likelihood of stepping off the cliff and receiving a large negative reward.

• The Q-learning agent, by contrast, has learned its values with respect to the greedy (optimal) policy, which always chooses the action with the highest Q-value.
• It is therefore more confident in its ability to walk along the cliff edge without falling off.
Expected SARSA
• Expected Sarsa is very similar to Sarsa.
• However, instead of stochastically sampling the next state-action value using our current policy, it computes the expected value over all actions in the next state, weighting each by how likely it is under the current policy.
• The update step is now defined as:

  Q(St, At) ← Q(St, At) + 𝛼[Rt+1 + 𝛾 Σa 𝜋(a|St+1) Q(St+1, a) - Q(St, At)]

• Expected Sarsa is computationally more complex than Sarsa but, in return, it eliminates the variance due to the random selection of At+1.
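A sketch of the Expected Sarsa update for an ε-greedy policy (the ε-greedy action probabilities written out below are the usual ones and are assumed here, as is the dict-of-dicts Q-table):

```python
def expected_sarsa_update(Q, s, a, r, s_next, actions, done,
                          alpha=0.5, gamma=1.0, eps=0.1):
    """One Expected Sarsa step: the target averages Q(s', .) under the policy."""
    if done:
        expected = 0.0
    else:
        greedy = max(actions, key=lambda a2: Q[s_next][a2])
        probs = {a2: eps / len(actions) + (1 - eps) * (a2 == greedy)
                 for a2 in actions}
        expected = sum(probs[a2] * Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * expected - Q[s][a])
```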
Expected SARSA



Maximum Bias & Double Learning
• Maximization bias is a technical way of saying that the Q-learning algorithm overestimates its value function estimates (V) and action-value estimates (Q).
• A and B are nonterminal states.

• Episodes always start in A with a choice between two actions, left and right.
• The right action transitions immediately to the terminal state with a reward and return
of 0.
• Left action transitions to B, also with a reward of zero, from which there are a large
number of actions, all with rewards sampled from 𝓝(-0.1,1), which is a normal
distribution with mean -0.1 and variance 1.
Maximum Bias & Double Learning



Maximum Bias & Double Learning
• Given the large variance in rewards, it is quite possible that some of the initial estimates of these actions are positive, even though their true values are negative.
• The problem with Q-learning is that the same samples are being used to decide which action is the best (highest expected reward) and also to estimate that action's value.
• If an action's value is overestimated, it will be chosen as the best action, and its overestimated value will be used as the target.
• Instead, we can have two different action-value estimates, and the “best” action can
be chosen based on the values of the first action-value estimate and the target can be
decided by the second action-value estimate.
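A sketch of the double Q-learning update under these assumptions (two independent tables Q1 and Q2 stored as dicts of dicts; which table is updated is chosen by a coin flip):

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, done,
                    alpha=0.1, gamma=1.0):
    """One double Q-learning step: one table picks the argmax, the other evaluates it."""
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1                       # update the other table half the time
    if done:
        target = r
    else:
        best = max(actions, key=lambda a2: Q1[s_next][a2])  # argmax from Q1 ...
        target = r + gamma * Q2[s_next][best]               # ... value from Q2
    Q1[s][a] += alpha * (target - Q1[s][a])
```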



Maximum Bias & Double Learning
• Double Q-learning very quickly adjusts and is not affected by any lucky draws from
actions taken in the Gamble state.
• Double Q-learning performs much better in this game, and empirically this has been
the recurring theme across many games.



Maximum Bias & Double Learning
• Even though both methods converge to Q∗, the policies in the meantime tend to be better in most games when being pessimistic.
• This is particularly remarkable since, when Double Q-learning splits the samples into two groups, the variance of the estimators grows.
• This means it has lower sample efficiency, but despite this, the benefit of removing the optimism from Q-learning usually outweighs the disadvantage.

