Module 5-rl
Content
• Temporal Difference Learning
• What is Temporal Difference learning,
• Advantages of Temporal Difference methods over Monte Carlo and Dynamic Programming methods,
• TD(0),
• On-policy vs off-policy,
• SARSA, Q-learning.
• Eligibility traces
• N-step Temporal Difference methods,
• On-line vs off-line updating,
• TD(𝜆) : forward view, backward view,
• Traces: Accumulating trace, Dutch trace, Replacing trace,
• Equivalence of forward and backward view,
• SARSA(𝜆)
Temporal Difference Learning
• Temporal difference learning (TD learning) refers to a class of model-free reinforcement learning methods.
• Learn the value of a state based on experience (i.e., sequences of states,
actions, and rewards), without requiring a full model of the environment.
• The technique updates its prediction of a quantity (such as the expected return) using estimates of that same quantity at later time steps.
• The name "temporal difference" comes from its use of differences between predictions at successive time steps to drive the learning procedure.
Temporal Difference Learning
• At each time step, the current prediction is revised to bring it closer to the prediction of the same quantity at the following step.
• TD learning methods are also known as bootstrapping methods, because they update estimates partly on the basis of other learned estimates.
Advantages of Temporal Difference methods over Monte Carlo and Dynamic Programming methods
| Feature | Temporal Difference (TD) | Monte Carlo (MC) | Dynamic Programming (DP) |
| --- | --- | --- | --- |
| Learning from Episodes | Incomplete episodes | Requires complete episodes | Requires a full model, not episodes |
| Online/Incremental | Naturally online | Batch processing (end of episode) | Can be iterative but needs full model |
| Variance | Lower | Higher | N/A (deterministic updates based on model) |
| Non-terminating Env. | Applicable | Not directly applicable | Applicable if model is known |
| Convergence Speed | Often faster | Can be slower, especially stochastic | Depends on the problem and computation |
| Model Requirement | Model-free | Model-free | Requires a complete model of the env. |
| Applicability (Large States) | More scalable | More scalable | Limited due to the "curse of dimensionality" |
| Learning Basis | Actual experience | Actual experience (end of episode) | Bootstrapping from model-based estimates |
Temporal difference (TD) learning variants
• TD(0) is the simplest form of TD learning.
• After every step, it updates the value of the current state using the reward obtained in that step and the estimated value of the next state.
• The observed reward is what keeps the learning grounded in real experience, and the algorithm converges after sufficient sampling.
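A minimal sketch of the TD(0) update in Python, assuming a tabular value function stored in a dictionary; the environment interface (env.reset, env.step), the policy callable, and the hyperparameters alpha and gamma are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation (a sketch; env/policy interfaces are assumed)."""
    V = defaultdict(float)                            # value estimate per state, initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                    # action from the policy being evaluated
            next_state, reward, done = env.step(action)  # assumed environment interface
            # TD error: reward plus discounted bootstrap estimate, minus the current estimate
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            V[state] += alpha * td_error              # move V(s) toward the TD target
            state = next_state
    return V
```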
On-policy TD learning & Off-policy learning
• On-policy methods (like SARSA and TD(λ)) evaluate and improve the same policy that is used to make decisions.
• Off-policy methods (like Q-learning) evaluate a policy that is different from the one used to generate behavior.
• This allows learning about the optimal policy even while exploring sub-optimal actions.
δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)
• The TD error δ_t is the difference between what you expected, V(S_t), and what you actually got: the reward R_{t+1} plus what you expect in the future, γ V(S_{t+1}).
• In tasks with sparse rewards, the updates may remain zero for many states until value estimates propagate back from states close to the reward.
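A small sketch contrasting the on-policy (SARSA) and off-policy (Q-learning) TD targets; the tabular Q stored as a defaultdict and the parameter names are illustrative assumptions.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses the action actually chosen by the behavior policy."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy action, regardless of what the agent does next."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```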
On-line vs Off-line Updating
• In on-line updating, the agent updates its value function (or policy) after each step of interaction with the environment: as soon as the agent takes an action, receives a reward, and transitions to a new state, it uses this new experience to update its estimates.
• In off-line updating, the agent collects a batch of experience (e.g., multiple episodes or a fixed dataset) before updating its value function or policy. The learning process is decoupled from the immediate interaction with the environment.
Characteristics of On-line Updating
• Real-time learning: Learning happens continuously as the agent
interacts with the environment.
• Sample efficiency: Can be less sample-efficient as each experience is
typically used only once for an immediate update.
• Responsiveness to change: Can adapt quickly to changes in the
environment or the agent's behavior policy.
• Examples: TD(0) and SARSA are inherently on-line in their basic form.
Policy gradient methods also typically perform on-line updates after
collecting a trajectory (though the trajectory might consist of multiple
steps).
Characteristics of Off-line Updating
• Delayed learning: Updates happen after a significant amount of data
has been gathered.
• Sample efficiency: Can be more sample-efficient as the algorithm can
make multiple passes through the collected data.
• Stability: Batch updates can sometimes lead to more stable learning.
• Less responsive to immediate changes: May take longer to adapt to
sudden changes in the environment or the optimal policy.
• Examples: Monte Carlo methods (which update at the end of an
episode), and using experience replay in Deep Q-Networks (DQN)
where a batch of past experiences is sampled to perform updates.
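A schematic contrast between per-step (on-line) and end-of-episode (off-line) updating, sketched in Python; update_fn, batch_update_fn, and the environment interface are hypothetical placeholders.

```python
def online_learning(env, policy, update_fn, num_episodes=100):
    """On-line: apply a learning update immediately after every single transition."""
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            update_fn(s, a, r, s_next)     # e.g., a TD(0) or SARSA step
            s = s_next

def offline_learning(env, policy, batch_update_fn, num_episodes=100):
    """Off-line: store whole episodes, then learn from the collected batch."""
    for _ in range(num_episodes):
        trajectory = []
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r, s_next))
            s = s_next
        batch_update_fn(trajectory)        # e.g., a Monte Carlo update at episode end
```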
TD(λ)
• TD(λ) is a powerful extension of Temporal Difference (TD) learning
that bridges the gap between one-step TD methods (like TD(0)) and
Monte Carlo (MC) methods.
• The parameter λ (lambda), ranging from 0 to 1, controls the balance
between these two extremes.
• There are two key perspectives on TD(λ):
• the forward view and
• the backward view
Forward View (The λ-return):
The forward view defines an update target for the value function of a state
based on a weighted average of all possible n-step returns, from n=1 up to the
end of the episode (or a horizon in continuing tasks).
• The n-step return G_{t:t+n} is the total discounted reward from time t+1 up to t+n, plus the discounted value of the state S_{t+n}:

G_{t:t+n} = R_{t+1} + γ R_{t+2} + ⋯ + γ^{n−1} R_{t+n} + γ^n V(S_{t+n})

where:
• R_i is the reward at time i,
• γ is the discount factor (between 0 and 1),
• V(S_{t+n}) is the estimated value of the state n steps ahead.
Forward View (The λ-return):
• The λ-return G_t^λ is then calculated as a weighted average of all n-step returns (n = 1, 2, …, T−t, where T is the terminal time), with weights (1−λ) λ^{n−1}:

G_t^λ = (1−λ) Σ_{n=1}^{T−t−1} λ^{n−1} G_{t:t+n} + λ^{T−t−1} G_{t:T}
Forward View (The λ-return):
The forward view looks ahead from the current state to all possible future
rewards and states within an episode and uses a geometrically weighted
average of the returns obtained at each step to update the value of the
current state.
• When λ = 0, the λ-return reduces to the 1-step TD return (G_{t:t+1} = R_{t+1} + γ V(S_{t+1})), making it equivalent to TD(0).
• When λ = 1, the weights become 0 for all n < T−t, and the λ-return becomes the actual return until the end of the episode (G_{t:T}), making it equivalent to the Monte Carlo method.
• For 0<λ<1, the λ-return is a compromise between TD and MC, considering a
spectrum of future returns with exponentially decreasing weights.
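A sketch of computing the λ-return for a single time step from a recorded episode, under the formulas above; the list-based episode representation (rewards[i] holding R_{i+1}) is an illustrative assumption.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: discounted rewards from t+1..t+n plus the bootstrapped value of S_{t+n}."""
    T = len(rewards)                          # terminal time; rewards[i] corresponds to R_{i+1}
    end = min(t + n, T)
    G = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
    if end < T:                               # bootstrap only if the episode has not ended
        G += gamma ** n * values[end]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λ^{n-1} G_{t:t+n} + λ^{T-t-1} G_{t:T}."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):                 # weighted sum of the intermediate n-step returns
        G_lambda += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lambda += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G_lambda
```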
Backward View
• The backward view of TD(λ) provides a mechanism for implementing
updates in an online and incremental manner using eligibility traces.
• An eligibility trace e_t(s) is associated with each state s and
represents the degree to which that state has been visited recently.
When a state s is visited at time t, its eligibility trace is incremented.
Over time, the eligibility traces of all states decay.
Backward View
• On each step, every state's eligibility trace decays by γλ, and the trace of the state just visited is incremented:

e_t(s) = γλ e_{t−1}(s) + 1   if s = S_t
e_t(s) = γλ e_{t−1}(s)       otherwise

where:
• γ is the discount factor,
• λ is the trace-decay parameter.
• The TD error δ_t at time t is calculated as the difference between the immediate reward plus the discounted value of the next state and the current state's value:

δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)

• Every state is then updated in proportion to its trace: V(s) ← V(s) + α δ_t e_t(s) for all s.
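A sketch of one run of backward-view TD(λ) with accumulating traces, following the update rules above; the environment/policy interfaces and hyperparameters are illustrative assumptions.

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes=500, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(λ) with accumulating eligibility traces (a sketch)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)                    # eligibility traces, reset each episode
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] * (not done) - V[s]   # TD error δ_t
            e[s] += 1.0                           # accumulating trace for the visited state
            for state in list(e.keys()):          # update all states in proportion to traces
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam           # decay every trace
            s = s_next
    return V
```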
• Accumulating Trace
• Replacing Trace
• Dutch Trace
Accumulating Trace
This is the most common type of eligibility trace, and the one used in the backward view of TD(λ) described earlier. When a state s is visited at time t, its eligibility trace e_t(s) is incremented by 1. The traces for all states then decay by a factor of γλ at each step.
• The update rule is:

e_t(s) = γλ e_{t−1}(s) + 1   if s = S_t
e_t(s) = γλ e_{t−1}(s)       otherwise
Accumulating Trace
• With accumulating traces, the eligibility of a state accumulates over
multiple visits within an episode, weighted by the decay factor.
• This means that states visited frequently or recently have higher
eligibility and their value estimates are updated more significantly by
the TD error.
Replacing Trace
Replacing traces are a variation of eligibility traces primarily used in
action-value methods like Sarsa(λ) and Q-learning(λ). The key
difference lies in how the eligibility of the current state-action pair is
handled. Instead of simply adding 1, the eligibility of the current
state-action pair is set to 1, effectively replacing its previous value.
• For state-action pairs (s,a):

e_t(s,a) = 1                 if (s,a) = (S_t, A_t)
e_t(s,a) = γλ e_{t−1}(s,a)   otherwise
Replacing Traces
• The motivation for replacing traces is to address issues that can arise with accumulating traces in control settings, particularly oscillations or instabilities when the optimal policy changes.
• By setting the eligibility of the current action to 1, it ensures that the
most recent action taken in a state has a strong influence on the
learning process.
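A small sketch showing the two trace updates side by side for state-action pairs; the plain-dictionary trace representation is an illustrative assumption.

```python
def decay_traces(e, gamma, lam):
    """Decay every eligibility trace by γλ (shared by both variants)."""
    for key in e:
        e[key] *= gamma * lam

def accumulating_trace_update(e, s, a):
    """Accumulating trace: add 1 to the trace of the pair just visited."""
    e[(s, a)] = e.get((s, a), 0.0) + 1.0

def replacing_trace_update(e, s, a):
    """Replacing trace: reset the trace of the pair just visited to exactly 1."""
    e[(s, a)] = 1.0
```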
Dutch Trace (DT)
• Standard eligibility traces (accumulating and replacing) have well-defined update rules.
• The Dutch trace is a third variant, introduced with true online TD(λ) (van Seijen & Sutton), that lies between the accumulating and the replacing trace.
• Unlike the other two traces, its update depends on the step-size parameter α.
• Instead of adding 1 (accumulating) or resetting to 1 (replacing), the trace of the visited state is decayed and then pushed toward 1 by an amount that depends on α.
• Eg. With α close to 0 the Dutch trace behaves like an accumulating trace, and with α = 1 it behaves like a replacing trace.
DT
• In the tabular case, the Dutch trace update on visiting state S_t is:

e_t(S_t) = (1 − α) γλ e_{t−1}(S_t) + 1
e_t(s) = γλ e_{t−1}(s)   for s ≠ S_t

• With linear function approximation and feature vector x_t, the trace vector is updated as
e_t = γλ e_{t−1} + (1 − α γλ e_{t−1}ᵀ x_t) x_t
DT
where:
• α is the step-size (learning-rate) parameter,
• γ and λ are the discount and trace-decay parameters.
DT
• Characteristics:
• For a frequently revisited state, the Dutch trace stays smaller than the accumulating trace (which can grow well above 1) but, unlike the replacing trace, still reflects repeated visits.
• The dependence on α couples the size of the trace to the size of the learning updates.
• No new hyperparameters are introduced beyond α, γ, and λ.
DT
• Benefits:
Dutch traces are the basis of true online TD(λ), whose incremental backward-view updates exactly reproduce the online forward view (the online λ-return) rather than only approximating it.
They often give more stable and accurate learning than accumulating traces, particularly with larger step sizes.
Drawback:
The update is slightly more involved; with function approximation it requires an extra inner product e_{t−1}ᵀ x_t per step, and the true online TD(λ) algorithm that uses it has a modified weight update.
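A sketch of the tabular Dutch-trace update described above; the dictionary representation and the decay-then-update ordering are illustrative assumptions.

```python
def dutch_trace_update(e, s_visited, alpha, gamma, lam):
    """Tabular Dutch trace: decay all traces by γλ, then move the visited state's
    trace toward 1 by an amount that depends on the step size α."""
    for s in list(e.keys()):
        e[s] *= gamma * lam                               # decay every trace
    # e_t(S_t) = (1 - α)·γλ·e_{t-1}(S_t) + 1
    e[s_visited] = (1 - alpha) * e.get(s_visited, 0.0) + 1.0
```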
SARSA(𝜆)
• SARSA(λ) is an on-policy temporal difference learning algorithm that
extends the basic SARSA algorithm by incorporating eligibility traces.
• It allows for learning to propagate back to multiple preceding state-
action pairs, speeding up learning compared to one-step SARSA
(SARSA(0)).
• The λ parameter, ranging from 0 to 1, controls how far the current TD error is propagated back to previously visited state-action pairs (via the decay of their eligibility traces).
• SARSA(λ) updates the action-value function Q(s,a) for state-action
pairs based on the TD error, weighted by the eligibility trace of each
state-action pair. The eligibility trace indicates the degree to which a
state-action pair has been recently visited.
Key Components
1. Action-Value Function (Q(s,a)): This function estimates the
expected return of taking action a in state s and following the
current policy thereafter.
2. Policy: SARSA is an on-policy algorithm, meaning the policy being
learned is also used to generate behavior. A common approach is an
ϵ-greedy policy, where the agent chooses the action with the
highest Q-value with probability 1−ϵ, and a random action with
probability ϵ.
3. Eligibility Traces (e(s,a)): For each state-action pair (s,a), an eligibility trace e_t(s,a) is maintained. It indicates how recently and how frequently that pair has been visited.
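A minimal sketch of the ε-greedy policy mentioned in component 2, assuming a tabular Q and a finite list of actions; the function name and defaults are illustrative.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability ε pick a random action (explore); otherwise pick the greedy one (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```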
SARSA(𝜆)
4. TD Error (δ_t): The temporal difference error measures the difference between the predicted return and the actual return one step later:

δ_t = R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)

Note that A_{t+1} is the action actually taken in the next state S_{t+1} according to the current policy.
SARSA(𝜆)
5. Value Function Update: The Q-values are updated for all state-action pairs based on the TD error and their eligibility traces:

Q(s,a) ← Q(s,a) + α δ_t e_t(s,a)   for all s, a
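A sketch tying the components above into a full SARSA(λ) episode loop with replacing traces; the environment interface, action list, and hyperparameters are illustrative assumptions rather than a definitive implementation.

```python
import random
from collections import defaultdict

def sarsa_lambda(env, actions, num_episodes=500, alpha=0.1, gamma=0.99,
                 lam=0.9, epsilon=0.1):
    """SARSA(λ) with replacing eligibility traces (a sketch; env interface is assumed)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)                 # eligibility traces, reset each episode
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            # On-policy TD error: uses the action actually selected in the next state
            delta = r + gamma * Q[(s_next, a_next)] * (not done) - Q[(s, a)]
            e[(s, a)] = 1.0                    # replacing trace for the current pair
            for key in list(e.keys()):         # update every eligible pair
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam          # decay all traces
            s, a = s_next, a_next
    return Q

def epsilon_greedy(Q, state, actions, epsilon):
    """ε-greedy action selection over the tabular Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```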