
Module 5

Content
• Temporal Difference Learning
• What is Temporal Difference learning,
• Advantages of Temporal Difference methods over Monte Carlo and Dynamic
Programming methods,
• TD(0),
• On-policy vs off-policy,
• SARSA, Q learning.
• Eligibility traces
• N-step Temporal Difference methods,
• On-line vs Off-line updation,
• TD(𝜆) : forward view, backward view,
• Traces: Accumulating trace, Dutch trace, Replacing trace,
• Equivalence of forward and backward view,
• SARSA(𝜆)
Temporal Difference Learning
• Temporal difference learning (TD learning) refers to a class of model-free reinforcement learning methods.
• These methods learn the value of a state from experience (i.e., sequences of states, actions, and rewards), without requiring a full model of the environment.
• The technique predicts a reward or value based on estimates of the value function at future time steps.
• The name "temporal difference" comes from the use of differences in predictions over successive time steps to drive the learning procedure.
Temporal Difference Learning
• At any time step, the prediction is revised to drive it closer to the prediction of
the same quantity in the following steps.
• TD learning methods are also known as bootstrapping methods.
Advantages of Temporal Difference methods over Monte Carlo and Dynamic Programming
• Learning from episodes: TD learns from incomplete episodes; MC requires complete episodes; DP requires a full model, not episodes.
• Online/incremental: TD is naturally online; MC uses batch processing (end of episode); DP can be iterative but needs a full model.
• Variance: TD is lower; MC is higher; DP is N/A (deterministic updates based on the model).
• Non-terminating environments: TD is applicable; MC is not directly applicable; DP is applicable if the model is known.
• Convergence speed: TD is often faster; MC can be slower, especially when stochastic; DP depends on the problem and computation.
• Model requirement: TD is model-free; MC is model-free; DP requires a complete model of the environment.
• Applicability to large state spaces: TD is more scalable; MC is more scalable; DP is limited due to the "curse of dimensionality".
• Learning basis: TD uses actual experience; MC uses actual experience (end of episode); DP bootstraps from model-based estimates.
Temporal difference (TD) learning variants
• TD(0) - The simplest form of TD learning
• TD(0) learning is the simplest algorithm.
• It updates the value function with the value of the next state after
every step.
• The reward is obtained in the process.
• The obtained reward is the key factor that keeps learning grounded, and the algorithm converges after sufficient sampling.
On-policy methods (like SARSA and TD(λ)) evaluate and improve the same policy that is used to make decisions.
Off-policy methods (like Q-learning) evaluate a policy that is different from the one used to generate behavior; this allows learning about the optimal policy even while exploring sub-optimal actions.
On-policy TD learning & Off-policy TD learning

• On-policy TD learning algorithms learn the value of the policy used for decision-making.
• The value functions are modified according to the results of executing the actions given by that policy.
• Such policies are not strict; they are soft and non-deterministic, mostly selecting the action that gives the highest estimated reward while still exploring.

• Off-policy TD learning algorithms use separate policies for behavior and for estimation.
• The algorithm can update the estimated value functions using hypothetical actions (actions that have not actually been taken).
• Off-policy learning methods can separate exploration from control.
• Consequently, an agent trained with off-policy learning can learn tactics that it did not exhibit during the learning phase.
State-action-reward-state-action (SARSA)
• SARSA is used in reinforcement learning to learn a Markov decision process policy.
• A SARSA agent interacts with the environment and revises the policy based on the actions executed; it is therefore an on-policy learning algorithm.
• The main function for updating the Q-value in SARSA is based on the current state St of the agent, the action At chosen by the agent, the reward Rt+1 the agent receives for selecting the action, the state St+1 the agent enters after taking the action, and the next action At+1 the agent selects in the new state.
• The tuple representing the above, (St, At, Rt+1, St+1, At+1), gives the algorithm its name: SARSA.
• The SARSA update rule is:
Q(St, At) ← Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) − Q(St, At)]
State-action-reward-state-action (SARSA)
• Q(St, At) ← Q(St, At) + α [Rt+1 + γ Q(St+1, At+1) − Q(St, At)]

• The Q-value for the state-action pair is updated by the TD error, scaled by the learning rate α.
• The target Rt+1 + γ Q(St+1, At+1) combines the reward received at the next time step with the discounted value of the next state-action pair.
• SARSA learns the Q-values based on the policy that it follows itself.
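As an illustration (not from the slides), a minimal tabular sketch of this update in Python; the state/action encoding and the helper name are assumptions:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy SARSA step: move Q(s, a) toward r + gamma * Q(s', a')."""
    td_target = r + gamma * Q[s_next, a_next]   # target uses the action actually chosen in the next state
    td_error = td_target - Q[s, a]              # TD error for this state-action pair
    Q[s, a] += alpha * td_error
    return Q

# Example usage with an arbitrary 12-state, 4-action table.
Q = np.zeros((12, 4))
Q = sarsa_update(Q, s=0, a=1, r=0.0, s_next=1, a_next=2)
```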
Q-learning
• Q-learning is an off-policy learning algorithm in reinforcement learning.
• It strives to determine the best action to take in the current state.
• It maximizes the reward by learning from actions that may not be part of the current behavior policy.
• It does not require a model of the environment, and it can handle stochastic transitions and rewards without adaptations.
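For comparison, a sketch of the Q-learning update under the same assumed tabular setup as the SARSA sketch above; the only change is that the target uses the maximum over next actions:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy Q-learning step: the target uses the greedy (max) action in the next state."""
    td_target = r + gamma * np.max(Q[s_next])   # best possible next action, regardless of the behavior policy
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```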
Temporal difference learning in neuroscience
Temporal difference learning advantages

• Temporal difference updates can be made at every step, online or offline.
• Temporal difference methods can learn from incomplete sequences.
• Temporal difference methods can function in non-terminating environments.
• Temporal difference learning has lower variance than Monte Carlo.
• Temporal difference learning is more efficient than Monte Carlo learning.
• Temporal difference learning exploits the Markov property, so it is more effective in Markov environments.
Temporal difference learning disadvantages

• Though TD learning is more effective and efficient than many other learning algorithms, it has the following limitations:
• Temporal difference learning produces biased estimates.
• It is more sensitive to the initial values of the estimates.
Comparison of one-step TD methods

• TD(0): follows the current policy; learns the state-value function V(s); update rule: V(St) ← V(St) + α[Rt+1 + γ V(St+1) − V(St)]; key feature: the simplest TD method, it updates the value after one step.
• SARSA (TD(0), on-policy): on-policy; learns the action-value function Q(s,a); update rule: Q(St, At) ← Q(St, At) + α[Rt+1 + γ Q(St+1, At+1) − Q(St, At)]; key feature: updates Q-values based on the action actually taken in the next state, so it learns the value of the policy being followed.
• Q-learning (TD(0), off-policy): off-policy; learns the action-value function Q(s,a); update rule: Q(St, At) ← Q(St, At) + α[Rt+1 + γ max_a′ Q(St+1, a′) − Q(St, At)]; key feature: updates Q-values based on the best possible action in the next state, so it learns the optimal policy independently of the policy being followed.
Comparison of eligibility-trace (λ) methods

• TD(λ): follows the current policy; learns the state-value function V(s); uses eligibility traces to assign credit to past states based on a trace-decay parameter λ, generalizing TD(0) and Monte Carlo methods; key feature: it blends the benefits of TD(0) (low variance, online updates) and Monte Carlo.
• SARSA(λ): on-policy; learns the action-value function Q(s,a); extends SARSA by incorporating eligibility traces for state-action pairs and updates the Q-values of all eligible state-action pairs based on the TD error; key feature: more efficient learning by propagating rewards to preceding state-action pairs.
• Q(λ): off-policy; learns the action-value function Q(s,a); an off-policy extension of TD(λ) using eligibility traces, more complex and less common than SARSA(λ) due to potential instability; key feature: it aims to achieve the efficiency of eligibility traces in an off-policy setting.
Temporal difference (TD) learning variants
• Mathematically, the TD(0) update can be expressed as follows:

V(St) ← V(St) + α (Rt+1 + γ V(St+1) − V(St))

where:
• α is the learning rate (0 < α ≤ 1): it controls how big each update step is.
• γ is the discount factor (0 ≤ γ ≤ 1): it determines how much future rewards are valued.
• V(St) is the value estimate of the current state and V(St+1) is the value estimate of the next state.

• The value of state St is updated at the next time step (t + 1) based on the reward Rt+1 observed after time step t and the estimate available at time step t + 1.
• In other words, the value of St is bootstrapped at time step t from the estimate at time step t + 1, while Rt+1 is the observed reward.
V(St) ← V(St) + α (Rt+1 + γ V(St+1) − V(St))
• The rule says:
• Update the value of the current state St by pushing it toward the value given by the reward plus the next state's value.

• The part in parentheses is called the TD error:

δt = Rt+1 + γ V(St+1) − V(St)

• It represents the difference between what you expected and what you actually got plus what you expect in the future.

New Value = Old Value + Learning Rate × TD Error


Advantages
1. Doesn’t Need a Model (Model-Free): TD learning can learn from
experience alone, without knowing the dynamics of the
environment (i.e., transition probabilities or reward function).
It learns from actual interactions — like trial and error.
2. Online & Incremental: TD updates after every step, meaning it can
start learning immediately and doesn’t need to wait until the end
of an episode.
This is especially helpful for long or continuous tasks where
waiting for the final reward isn’t practical.
How TD(0) Works
• Making a prediction about the value of the current state.
• Taking an action and observing the immediate reward and the next
state.
• Using the actual reward and the predicted value of the next state to
create a "target" value for the current state.
• Updating the value estimate of the current state to be a little closer
to this "target”.
This process is repeated as the agent interacts with the environment,
gradually improving the accuracy of the value estimates for each state.
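A minimal TD(0) prediction loop reflecting these steps. It assumes a hypothetical environment whose reset() returns a state index and whose step(action) returns (next_state, reward, done), plus a fixed policy(state) function; none of these are defined in the slides.

```python
import numpy as np

def td0_prediction(env, policy, n_states, episodes=500, alpha=0.1, gamma=0.9):
    """Estimate V(s) for a fixed policy with one-step TD(0) updates."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                                  # act with the policy being evaluated
            s_next, r, done = env.step(a)                  # observe reward and next state
            target = r + gamma * V[s_next] * (not done)    # "target" = reward + discounted next-state value
            V[s] += alpha * (target - V[s])                # nudge V(s) a little closer to the target
            s = s_next
    return V
```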
Eligibility traces (ET)
• N-step Temporal Difference methods,
• On-line vs Off-line updating,
• TD(𝜆) : forward view, backward view,
• Traces: Accumulating trace,
• Dutch trace,
• Replacing trace,
• Equivalence of forward and backward view,
• SARSA(𝜆)
Eligibility traces (ET)
• Eligibility traces are a fundamental concept in Reinforcement Learning
that provide a mechanism for temporal credit assignment.
• They help answer questions like: "Which of the past states and actions were most responsible for the reward I just received?"
Example (ET)
• Imagine a robot navigating a maze.
• It takes a sequence of actions and finally reaches the goal, getting a
reward.
• How does it learn which of its past moves were good?
Without eligibility traces: In a simple one-step TD method such as TD(0) or SARSA(0), only the very last action and the state just before reaching the goal would receive significant credit for the reward.

With eligibility traces: The algorithm keeps a temporary record, or "trace," of the states and actions the agent has recently visited. When a reward is received (or a TD error is calculated), this error is propagated backward in time to the states and actions in the trace. The more recently and the more frequently a state or action was visited, the more "eligible" it is to have its value updated based on the current error.
N-step Temporal Difference methods
• N-step Temporal Difference methods are a generalization of one-step TD learning.
• Instead of looking only one step into the future to update value estimates, they look n steps ahead.
• This allows them to bridge the gap between one-step TD learning and Monte Carlo methods, using the strengths of both.
• The n-step SARSA update for the action-value function Q(St, At) is:

Q(St, At) ← Q(St, At) + α [Gt:t+n − Q(St, At)]

• The n-step return for SARSA is:

Gt:t+n = Rt+1 + γ Rt+2 + … + γ^(n−1) Rt+n + γ^n Q(St+n, At+n)
• All Q(s,a) = 0 for all states s ∈ {0,...,11} and actions a ∈ {Up, Down, Left, Right};
α = 0.1, γ = 0.9.
• Step 1 (t=0):
S0 = 0. The agent chooses A0 = Right (exploratory action).
R1 = 0 (moved to S1 = 1).
• Step 2 (t=1):
S1 = 1. The agent chooses A1 = Down (exploratory action).
R2 = 0 (moved to S2 = 5).
• Step 3 (t=2):
S2 = 5. The agent chooses A2 = Right (say the current ϵ-greedy policy favors Right, but Q(5, Right) is still 0).
R3 = 0 (moved to S3 = 6).

Update at t=0 (after observing two steps):
S0 = 0, A0 = Right, R1 = 0,
S1 = 1, A1 = Down, R2 = 0,
S2 = 5, A2 = Right.
The 2-step return is
G0:2 = R1 + γ R2 + γ^2 Q(S2, A2) = 0 + 0.9 × 0 + (0.9)^2 × Q(5, Right) = 0.81 × 0 = 0.

• The updates may remain zero until the agent gets closer to the reward.
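The arithmetic in this walkthrough can be checked with a few lines (values taken directly from the example above):

```python
# Reproduce the 2-step SARSA return from the walkthrough above.
alpha, gamma = 0.1, 0.9
R1, R2 = 0.0, 0.0              # rewards observed at t=1 and t=2
Q_S2_A2 = 0.0                  # Q(5, Right), still at its initial value

G_0_2 = R1 + gamma * R2 + gamma**2 * Q_S2_A2   # 0 + 0.9*0 + 0.81*0 = 0.0
Q_S0_A0 = 0.0 + alpha * (G_0_2 - 0.0)          # the update leaves Q(0, Right) at 0
print(G_0_2, Q_S0_A0)                          # 0.0 0.0
```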
On-line vs Off-line updating
• With on-line updating, the agent updates its value function (or policy) after each step of interaction with the environment. As soon as the agent takes an action, receives a reward, and transitions to a new state, it uses this new experience to update its estimates.
• With off-line updating, the agent collects a batch of experiences (e.g., multiple episodes or a fixed dataset) before performing updates to its value function or policy. The learning process is decoupled from the immediate interaction with the environment.
Characteristics of On-line Updating
• Real-time learning: Learning happens continuously as the agent
interacts with the environment.
• Sample efficiency: Can be less sample-efficient as each experience is
typically used only once for an immediate update.
• Responsiveness to change: Can adapt quickly to changes in the
environment or the agent's behavior policy.
• Examples: TD(0), SARSA are inherently on-line in their basic form.
Policy gradient methods also typically perform on-line updates after
collecting a trajectory (though the trajectory might consist of multiple
steps).
Characteristics of Off-line Updating
• Delayed learning: Updates happen after a significant amount of data
has been gathered.
• Sample efficiency: Can be more sample-efficient as the algorithm can
make multiple passes through the collected data.
• Stability: Batch updates can sometimes lead to more stable learning.
• Less responsive to immediate changes: May take longer to adapt to
sudden changes in the environment or the optimal policy.
• Examples: Monte Carlo methods (which update at the end of an
episode), and using experience replay in Deep Q-Networks (DQN)
where a batch of past experiences is sampled to perform updates.
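As a small illustration of the off-line / batch style mentioned above (a sketch, not the DQN implementation), a minimal experience replay buffer might look like:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions as they arrive; sample random batches later for off-line updates."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```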
TD(λ)
• TD(λ) is a powerful extension of Temporal Difference (TD) learning
that bridges the gap between one-step TD methods (like TD(0)) and
Monte Carlo (MC) methods.
• The parameter λ (lambda), ranging from 0 to 1, controls the balance
between these two extremes.
• There are two key perspectives on TD(λ):
• the forward view and
• the backward view
Forward View (The λ-return):
The forward view defines an update target for the value function of a state
based on a weighted average of all possible n-step returns, from n=1 up to the
end of the episode (or a horizon in continuing tasks).
• The n-step return Gt:t+n is the total discounted reward from time t+1 up to t+n, plus the discounted value of the state St+n:

Gt:t+n = Rt+1 + γ Rt+2 + … + γ^(n−1) Rt+n + γ^n V(St+n)

where:
• Ri is the reward at time i.
• γ is the discount factor (between 0 and 1).
• V(St+n) is the estimated value of the state n steps ahead.
Forward View (The λ-return):
• The λ-return is a weighted average of all n-step returns (n = 1, 2, …, T−t, where T is the terminal time), with weights (1−λ) λ^(n−1):

Gt^λ = (1−λ) Σ_{n=1}^{T−t−1} λ^(n−1) Gt:t+n + λ^(T−t−1) Gt:T
Forward View (The λ-return):
The forward view looks ahead from the current state to all possible future
rewards and states within an episode and uses a geometrically weighted
average of the returns obtained at each step to update the value of the
current state.
• When λ = 0, the λ-return reduces to the 1-step TD return (Gt:t+1 = Rt+1 + γ V(St+1)), making it equivalent to TD(0).
• When λ = 1, all the weight falls on the full return to the end of the episode (Gt:T), making it equivalent to the Monte Carlo method.
• For 0 < λ < 1, the λ-return is a compromise between TD and MC, considering a spectrum of future returns with exponentially decreasing weights.
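A short sketch of how these weights combine the n-step returns into the λ-return (it assumes the n-step returns Gt:t+1, ..., Gt:T for one state have already been computed and are passed in as a list):

```python
def lambda_return(n_step_returns, lam):
    """Forward-view lambda-return: weight G_{t:t+n} by (1-lam)*lam**(n-1),
    with the remaining weight lam**(T-t-1) on the final return G_{t:T}."""
    T = len(n_step_returns)                        # number of available n-step returns (n = 1 .. T-t)
    G_lam = 0.0
    for n in range(1, T):                          # every n-step return except the last
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_returns[n - 1]
    G_lam += lam ** (T - 1) * n_step_returns[-1]   # full (Monte Carlo) return
    return G_lam

# lam = 0 recovers the 1-step TD target; lam = 1 recovers the Monte Carlo return.
print(lambda_return([1.0, 2.0, 3.0], lam=0.0))     # 1.0
print(lambda_return([1.0, 2.0, 3.0], lam=1.0))     # 3.0
```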
Backward View
• The backward view of TD(λ) provides a mechanism for implementing
updates in an online and incremental manner using eligibility traces.
• An eligibility trace et(s) is associated with each state s and represents the degree to which that state has been visited recently.
When a state s is visited at time t, its eligibility trace is incremented; over time, the eligibility traces of all states decay:

et(s) = γλ et−1(s) + 1, if s = St
et(s) = γλ et−1(s), otherwise

where:
• γ is the discount factor.
• λ is the trace-decay parameter.

• The TD error δt at time t is calculated as the difference between the immediate reward plus the discounted value of the next state, and the current state's value:

δt = Rt+1 + γ V(St+1) − V(St)

• In the backward view, this TD error is used to update the value estimates of all visited states in proportion to their eligibility traces:

V(s) ← V(s) + α δt et(s), for all states s
Backward View
• The backward view offers a computationally efficient, online
algorithm. The eligibility traces allow credit for a reward to be
assigned to recently visited states, bridging the gap between
immediate and long-term consequences.
• When λ=0, only the eligibility trace of the immediately preceding state is non-
zero (after being visited), and the update rule effectively becomes the TD(0)
update.
• When λ=1, the eligibility traces accumulate without decay throughout the
episode (for undiscounted tasks, γ=1), resembling a Monte Carlo update
where all visited states contribute to the final reward.
• For 0<λ<1, the update considers the impact of the TD error on states visited in
the recent past, with the influence decaying exponentially.
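A minimal backward-view TD(λ) sketch with accumulating traces, under the same assumed environment/policy interface as the TD(0) sketch earlier:

```python
import numpy as np

def td_lambda(env, policy, n_states, episodes=500, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) with accumulating eligibility traces."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)                     # traces reset at the start of each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] * (not done) - V[s]   # TD error
            e[s] += 1.0                            # accumulating trace: increment the visited state
            V += alpha * delta * e                 # update every state in proportion to its trace
            e *= gamma * lam                       # all traces decay by gamma * lambda
            s = s_next
    return V
```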
Equivalence
For offline updating (updates performed only at the end of an
episode), the forward view (λ-return) and the backward view (TD(λ)
with eligibility traces) are mathematically equivalent, resulting in the
same overall update to the value function.
• For online updating (updates performed at each time step), the equivalence is only approximate, especially with function approximation. However, the backward view is generally preferred for its online nature and computational efficiency in most practical reinforcement learning applications.
Types of Traces

• Accumulating Trace
• Replacing Trace
• Dutch Trace
Accumulating Trace
This is the most common type of eligibility trace, and the one described in the backward view of TD(λ) earlier. When a state s is visited at time t, its eligibility trace et(s) is incremented by 1. The traces for all states then decay by a factor of γλ at each step.
• The update rule is:

et(s) = γλ et−1(s) + 1, if s = St
et(s) = γλ et−1(s), otherwise
Accumulating Trace
• With accumulating traces, the eligibility of a state accumulates over
multiple visits within an episode, weighted by the decay factor.
• This means that states visited frequently or recently have higher
eligibility and their value estimates are updated more significantly by
the TD error.
Replacing Trace
Replacing traces are a variation of eligibility traces primarily used in action-value methods like SARSA(λ) and Q(λ). The key difference lies in how the eligibility of the current state-action pair is handled. Instead of simply adding 1, the eligibility of the current state-action pair is set to 1, effectively replacing its previous value.
• For state-action pairs (s, a):

et(s, a) = 1, if (s, a) = (St, At)
et(s, a) = γλ et−1(s, a), otherwise
Replacing Traces
• The motivation for replacing traces is to address issues that can arise with accumulating traces in control settings, particularly to prevent oscillations or instabilities when the optimal policy changes.
• By setting the eligibility of the current action to 1, it ensures that the
most recent action taken in a state has a strong influence on the
learning process.
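The practical difference between the two trace types comes down to one line in the update of the visited entry; a tabular sketch:

```python
import numpy as np

def decay_traces(e, gamma=0.9, lam=0.8):
    """Every trace decays by gamma * lambda at each step (both trace types)."""
    return gamma * lam * e

def accumulating_trace(e, s, a):
    """Accumulating trace: the visited pair's eligibility is incremented by 1."""
    e[s, a] += 1.0
    return e

def replacing_trace(e, s, a):
    """Replacing trace: the visited pair's eligibility is reset to exactly 1."""
    e[s, a] = 1.0
    return e

# Example: visiting the same pair twice pushes an accumulating trace above 1,
# while a replacing trace stays capped at 1.
e_acc, e_rep = np.zeros((12, 4)), np.zeros((12, 4))
for _ in range(2):
    e_acc = accumulating_trace(decay_traces(e_acc), s=5, a=3)
    e_rep = replacing_trace(decay_traces(e_rep), s=5, a=3)
print(e_acc[5, 3], e_rep[5, 3])   # approximately 1.72 and 1.0
```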
Dutch Trace (DT)
• Standard eligibility traces (accumulating and replacing) have well-defined update rules.
• Dutch traces also appear in the literature, mainly in connection with true online TD(λ) and linear function approximation, but they are less commonly taught than the other two.
• These slides explore a looser, hypothetical interpretation based on the name, suggesting a potential mechanism for distributing credit more "evenly" or with a specific weighting scheme.
• Instead of accumulating or replacing, the eligibility might reflect the
average recency of visits.
• Eg. If states A, B, and C were visited recently, their Dutch trace might
be related to the average time since their last visit.
DT
• Instead of simply accumulating or replacing, a "Dutch trace" might
involve averaging the eligibility of recently visited states or state-
action pairs over a specific time window.
• This could aim to provide a more balanced view of which
states/actions contributed to recent rewards.
DT
• One windowed form consistent with these definitions (an assumed illustration, not an established rule) would be:

et(s) = (1/k) Σ_{i = t−k+1}^{t} (γλ)^(t−i) I(Si = s)

where:
• k is a window size.
• I(Si = s) is an indicator function (1 if the state at time i is s, 0 otherwise).
• γ and λ are the discount and trace-decay parameters.
DT
• Characteristics:
• The eligibility of a state would depend on how frequently it was
visited within the last k steps.
• It might prevent very frequently visited states from dominating the
eligibility values.
• The parameter k would introduce a new level of control over the
temporal credit assignment.
DT
• Benefits:
Could lead to more stable learning in certain environments.
Might improve exploration by giving more consistent credit to a wider
range of visited states.

Drawback:
Increased computational complexity due to the need to maintain a history of recent states. The choice of the window size k would be a critical hyperparameter.
SARSA(𝜆)
• SARSA(λ) is an on-policy temporal difference learning algorithm that
extends the basic SARSA algorithm by incorporating eligibility traces.
• It allows for learning to propagate back to multiple preceding state-
action pairs, speeding up learning compared to one-step SARSA
(SARSA(0)).
• The λ parameter, ranging from 0 to 1, controls the influence of past
rewards on the current value estimate.
• SARSA(λ) updates the action-value function Q(s,a) for state-action
pairs based on the TD error, weighted by the eligibility trace of each
state-action pair. The eligibility trace indicates the degree to which a
state-action pair has been recently visited.
Key Components
1. Action-Value Function (Q(s,a)): This function estimates the
expected return of taking action a in state s and following the
current policy thereafter.
2. Policy: SARSA is an on-policy algorithm, meaning the policy being
learned is also used to generate behavior. A common approach is an
ϵ-greedy policy, where the agent chooses the action with the
highest Q-value with probability 1−ϵ, and a random action with
probability ϵ.
3. Eligibility Traces (e(s,a)): For each state-action pair (s, a), an eligibility trace et(s, a) is maintained. It indicates how recently and how frequently that pair has been visited.
SARSA(𝜆)
4. TD Error (δt): The temporal difference error measures the difference between the predicted return and the estimate one step later:

δt = Rt+1 + γ Q(St+1, At+1) − Q(St, At)

Note that At+1 is the action actually taken in the next state St+1 according to the current policy.
SARSA(𝜆)
5. Value Function Update: The Q-values are updated for all state-action pairs based on the TD error and their eligibility traces:

Q(s, a) ← Q(s, a) + α δt et(s, a), for all state-action pairs (s, a)

where α is the learning rate.

The Role of λ:
• λ = 0: This reduces SARSA(λ) to one-step SARSA (SARSA(0)). Only the Q-value of the immediately preceding state-action pair is updated significantly.
• λ = 1: This approximates a Monte Carlo method. The eligibility traces accumulate throughout the episode, and the updates are based on the total reward received until the end of the episode.
• 0 < λ < 1: This provides a balance between TD(0) and Monte Carlo. Learning can propagate back n steps with a weight of (1−λ) λ^(n−1).
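Putting these components together, a minimal SARSA(λ) episode loop with accumulating traces; the environment interface and the epsilon_greedy helper are assumptions for illustration, not part of the slides:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    """Random action with probability eps, otherwise the greedy action."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.9, lam=0.8, eps=0.1):
    """On-policy SARSA(lambda) with accumulating eligibility traces."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)                       # traces reset per episode
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, eps)
            delta = r + gamma * Q[s_next, a_next] * (not done) - Q[s, a]   # TD error
            e[s, a] += 1.0                         # accumulating trace for the visited pair
            Q += alpha * delta * e                 # update all eligible state-action pairs
            e *= gamma * lam                       # decay every trace
            s, a = s_next, a_next
    return Q
```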
SARSA(𝜆)
• Advantages:
• Can learn faster than SARSA(0) by propagating rewards back multiple steps.
• Provides a smooth transition between one-step TD learning and Monte Carlo
methods through the λ parameter.
• Eligibility traces provide a computationally efficient way to handle temporal
credit assignment.
• Disadvantages:
• More complex to implement than SARSA(0) due to the need to maintain
eligibility traces.
• The choice of λ can significantly impact performance and may require tuning.