
Week 3: Model-free Prediction and Control

Bolei Zhou

UCLA

October 10, 2023



Announcement

1 Assignment 1 is out, due by the end of Week 4.


1 https://github.com/ucla-rlcourse/assignment-2023fall
2 Please start early!
2 RL examples: https://github.com/ucla-rlcourse/RLexample



This Week’s Plan

1 Last week
1 MDP, policy evaluation, policy iteration and value iteration for solving
a known MDP
2 This week
1 Model-free prediction: Estimate value function of an unknown MDP
2 Model-free control: Optimize value function of an unknown MDP



Review of control in a known MDP
1 When the MDP is known:
1 Both R and P are exposed to the agent
2 Therefore we can run policy iteration and value iteration
2 Policy iteration: Given a known MDP, compute the optimal policy
and the optimal value function
1 Policy evaluation: iteration on the Bellman expectation backup
v_t(s) = \sum_{a \in A} π(a|s) \Big( R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{t-1}(s') \Big)

2 Policy improvement: greedy on action-value function q


q_{π_t}(s, a) = R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{π_t}(s')
π_{t+1}(s) = \arg\max_a q_{π_t}(s, a)



Review of control in a known MDP (introduced last week)

1 Value iteration: Given a known MDP, compute the optimal value function
2 Iteration on the Bellman optimality backup
v_{t+1}(s) ← \max_{a \in A} \Big( R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_t(s') \Big)

3 To retrieve the optimal policy after the value iteration:


π^*(s) ← \arg\max_a \Big( R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{end}(s') \Big)    (1)
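
As a concrete illustration of the two reviewed steps (Bellman optimality backup, then greedy policy retrieval), here is a minimal value-iteration sketch in Python for a small tabular MDP. The toy transition tensor P, reward table R, and discount factor are made-up placeholders for illustration, not anything specified in the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Value iteration on a known tabular MDP.
    P: transition probabilities with shape (S, A, S); R: rewards with shape (S, A)."""
    v = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: v(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
        q = R + gamma * P @ v              # state-action values, shape (S, A)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    # Retrieve the optimal policy by acting greedily on the final value estimate (Eq. 1)
    policy = (R + gamma * P @ v).argmax(axis=1)
    return v, policy

# Toy 2-state, 2-action MDP (made-up numbers, for illustration only)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
v_star, pi_star = value_iteration(P, R)
print(v_star, pi_star)
```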



RL when we know how the world works

1 Both policy iteration and value iteration assume direct access to the dynamics and rewards of the environment

2 In many real-world problems, the MDP model is either unknown, or known but too big or too complex to use
1 Atari games, the game of Go, helicopter control, portfolio management, etc.



Model-free RL: Learning through interactions

1 Model-free RL solves these problems through interaction with the environment

2 No more direct access to the known transition dynamics and reward


function
3 Trajectories/episodes are collected by the agent’s interaction with the
environment
4 Each trajectory/episode contains {S1 , A1 , R2 , S2 , A2 , R3 , ..., ST }



Model-free prediction: policy evaluation without the access
to the model

1 Estimating the expected return of a particular policy if we don’t have


access to the MDP models
1 Monte Carlo policy evaluation
2 Temporal Difference (TD) learning



What is Monte-Carlo Method?

1 A broad class of computational algorithms that rely on repeated


random sampling to obtain numerical results
2 Example: Estimate the value of π



Monte-Carlo Policy Evaluation

1 Return: Gt = Rt+1 + γRt+2 + γ^2 Rt+3 + ... under policy π


2 v^π(s) = E_{τ∼π}[Gt | st = s], i.e., the expectation is over trajectories τ generated by following π
3 MC simulation: we can simply sample a lot of trajectories, compute
the actual returns for all the trajectories, then average them
4 MC policy evaluation uses empirical mean return instead of expected
return
5 MC does not require MDP dynamics/rewards, no bootstrapping, and
does not assume state is Markov.
6 Only applies to episodic MDPs (each episode terminates)



Example: Monte Carlo Algorithm for Computing the Value of an MRP
Algorithm 1 Monte Carlo simulation to calculate the MRP value function
1: i ← 0, G_t ← 0
2: while i ≠ N do
3:   generate an episode, starting from state s and time t
4:   using the generated episode, calculate the return g = \sum_{i=t}^{H-1} γ^{i−t} r_i
5:   G_t ← G_t + g, i ← i + 1
6: end while
7: V_t(s) ← G_t / N

1 For example: to calculate V(s4) we can generate a lot of trajectories
then take the average of the returns:
1 return for s4, s5, s6, s7: 0 + (1/2) × 0 + (1/4) × 10 = 2.5
2 return for s4, s3, s2, s1: 0 + (1/2) × 0 + (1/4) × 5 = 1.25
3 return for s4, s5, s6, s6: 0
4 more trajectories
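
A minimal Python sketch of Algorithm 1. The episode generator sample_episode is a hypothetical stand-in for the MRP simulator and is assumed to return the list of rewards observed from state s until termination; with γ = 1/2 it reproduces the returns in the example above.

```python
def mc_value(sample_episode, s, num_episodes=1000, gamma=0.5):
    """Monte Carlo estimate of V(s) for an MRP: average the sampled returns."""
    total_return = 0.0
    for _ in range(num_episodes):
        rewards = sample_episode(s)                           # [r_t, r_{t+1}, ...] until termination
        g = sum(gamma**k * r for k, r in enumerate(rewards))  # discounted return of this episode
        total_return += g
    return total_return / num_episodes

# A fake generator that always returns the s4 -> s5 -> s6 -> s7 trajectory from the example
v_s4 = mc_value(lambda s: [0, 0, 10], "s4", num_episodes=10, gamma=0.5)
print(v_s4)  # 2.5, matching the first return computed above
```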
Monte-Carlo Policy Evaluation

1 To evaluate state v (s)


1 Every time-step t that state s is visited in an episode,
2 Increment counter N(s) ← N(s) + 1
3 Increment total return S(s) ← S(s) + Gt
4 Value is estimated by mean return v (s) = S(s)/N(s)

2 By the law of large numbers, v(s) → v^π(s) as N(s) → ∞



Monte-Carlo Policy Evaluation

1 How to calculate G(s): compute the returns backward from the end of the episode



Incremental Mean

The running mean µ_t of samples x1, x2, ... can be computed incrementally:


µ_t = \frac{1}{t} \sum_{j=1}^{t} x_j
    = \frac{1}{t} \Big( x_t + \sum_{j=1}^{t-1} x_j \Big)
    = \frac{1}{t} \big( x_t + (t − 1) µ_{t-1} \big)
    = µ_{t-1} + \frac{1}{t} (x_t − µ_{t-1})



Incremental MC Updates

1 Collect one episode (S1, A1, R2, ..., ST)


2 For each state St with computed return Gt:

N(S_t) ← N(S_t) + 1
v(S_t) ← v(S_t) + \frac{1}{N(S_t)} \big( G_t − v(S_t) \big)

3 Or use a running mean (old episodes are forgotten). Good for


non-stationary problems.

v (St ) ← v (St ) + α(Gt − v (St ))
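
A short sketch of both update rules: the exact incremental mean with visit counts N(s), and the constant-α running mean that gradually forgets old episodes. The (state, return) pair format of an episode is an assumption made for illustration.

```python
from collections import defaultdict

def mc_update_exact(v, counts, episode_returns):
    """Incremental MC with the exact mean: v(s) += (G - v(s)) / N(s)."""
    for s, g in episode_returns:          # episode_returns: list of (state, return G_t) pairs
        counts[s] += 1
        v[s] += (g - v[s]) / counts[s]

def mc_update_running(v, episode_returns, alpha=0.1):
    """Constant step-size MC: v(s) += alpha * (G - v(s)); good for non-stationary problems."""
    for s, g in episode_returns:
        v[s] += alpha * (g - v[s])

v, counts = defaultdict(float), defaultdict(int)
mc_update_exact(v, counts, [("s4", 2.5), ("s5", 0.0)])
mc_update_running(v, [("s4", 1.25)])
print(dict(v))
```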



Difference between DP and MC for policy evaluation

1 Dynamic Programming (DP) computes v_i by bootstrapping the rest of the expected return with the value estimate v_{i−1}
2 Iteration on Bellman expectation backup:
v_i(s) ← \sum_{a \in A} π(a|s) \Big( R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{i-1}(s') \Big)



Difference between DP and MC for policy evaluation

1 MC updates the empirical mean return with one sampled episode

v (St ) ← v (St ) + α(Gi,t − v (St ))



Advantages of MC over DP

1 MC works when the environment is unknown


2 Working with sample episodes has a huge advantage, even when one has complete knowledge of the environment’s dynamics, for example when the transition probabilities are complex to compute
3 The cost of estimating a single state’s value is independent of the total number of states, so you can sample episodes starting from the states of interest and then average the returns



Introduction of Temporal-Difference (TD) Learning

1 TD methods learn directly from episodes of experience


2 TD is model-free: no knowledge of MDP transitions/rewards
3 TD learns from incomplete episodes, by bootstrapping



Introduction of Temporal-Difference (TD) Learning

1 Objective: learn vπ online from experience under policy π


2 Simplest TD algorithm: TD(0)
1 Update v (St ) toward estimated return Rt+1 + γv (St+1 )

v (St ) ← v (St ) + α(Rt+1 + γv (St+1 ) − v (St ))

3 Rt+1 + γv(St+1) is called the TD target


4 δt = Rt+1 + γv (St+1 ) − v (St ) is called the TD error
5 Comparison: Incremental Monte-Carlo
1 Update v (St ) toward actual return Gt given an episode i

v (St ) ← v (St ) + α(Gi,t − v (St ))
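
A minimal sketch of the TD(0) evaluation loop above, assuming a generic environment whose reset() returns a state and whose step(a) returns (next_state, reward, done); this interface is an assumption for illustration, not the course's exact API.

```python
from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """TD(0): after every step, move v(S_t) toward the TD target R_{t+1} + gamma * v(S_{t+1})."""
    v = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)                    # assumed (state, reward, done) interface
            td_target = r + gamma * v[s_next] * (not done)   # do not bootstrap past a terminal state
            v[s] += alpha * (td_target - v[s])               # alpha * TD error
            s = s_next
    return v
```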



Advantages of TD over MC



Comparison of TD and MC

1 TD can learn online after every step


2 MC must wait until end of episode before return is known

3 TD can learn from incomplete sequences


4 MC can only learn from complete sequences

5 TD works in continuing (non-terminating) environments


6 MC only works for episodic (terminating) environments

7 TD exploits Markov property, more efficient in Markov environments


8 MC does not exploit Markov property, more effective in non-Markov
environments



n-step TD

1 n-step TD methods generalize both one-step TD and MC.


2 We can shift from one to the other smoothly as needed to meet the
demands of a particular task.



n-step TD prediction

1 Consider the following n-step returns for n = 1, 2, ∞


n = 1 (TD):   G_t^{(1)} = R_{t+1} + γ v(S_{t+1})
n = 2:        G_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 v(S_{t+2})
...
n = ∞ (MC):   G_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step return is defined as

G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n v(S_{t+n})

 
3 n-step TD: v(S_t) ← v(S_t) + α \big( G_t^{(n)} − v(S_t) \big)
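
A sketch of the n-step return and the corresponding tabular update, computed from a recorded episode; the list layout (states[k] = S_k, rewards[k] = R_{k+1}) is an assumption. The same pattern gives n-step Sarsa later in this lecture by replacing v(S_{t+n}) with Q(S_{t+n}, A_{t+n}).

```python
def n_step_return(rewards, states, v, t, n, gamma=0.99):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n v(S_{t+n}),
    truncated to the Monte Carlo return if the episode ends before t + n."""
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:                                   # bootstrap only if S_{t+n} is non-terminal
        g += gamma**n * v[states[t + n]]
    return g

def n_step_td_update(v, states, rewards, t, n, alpha=0.1, gamma=0.99):
    g = n_step_return(rewards, states, v, t, n, gamma)
    v[states[t]] += alpha * (g - v[states[t]])      # move v(S_t) toward the n-step return
```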



Bootstrapping and Sampling for DP, MC, and TD

1 Bootstrapping: update involves an estimate


1 MC does not bootstrap
2 DP bootstraps
3 TD bootstraps
2 Sampling: update samples an expectation
1 MC samples
2 DP does not sample
3 TD samples



Unified View: Dynamic Programming Backup

v (St ) ← Eπ [Rt+1 + γv (St+1 )]



Unified View: Monte-Carlo Backup

v (St ) ← v (St ) + α(Gt − v (St ))



Unified View: Temporal-Difference Backup

TD(0): v(St) ← v(St) + α(Rt+1 + γv(St+1) − v(St))



Unified View of Reinforcement Learning



A Short Summary

1 Model-free prediction
1 Evaluate the state value by only interacting with the environment
2 Many algorithms can do it: Temporal Difference Learning and
Monte-Carlo method



Model-free Control for MDP

1 Model-free control:
1 Optimize the value function of an unknown MDP
2 Generate an optimal control policy
2 Generalized Policy Iteration (GPI) with MC or TD in the loop



Revisiting Policy Iteration

1 Iterate through the two steps:


1 Evaluate the policy π (computing v given current π)
2 Improve the policy by acting greedily with respect to vπ

π' = greedy(v_π)    (2)



Policy Iteration for a Known MDP
1 compute the state-action value of a policy π:
q_{π_i}(s, a) = R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{π_i}(s')

2 Compute new policy πi+1 for all s ∈ S following


π_{i+1}(s) = \arg\max_a q_{π_i}(s, a)    (3)

3 Problem: What to do if neither R(s, a) nor P(s'|s, a) is known/available?
Generalized Policy Iteration with Action-Value Function
Monte Carlo version of policy iteration

1 Policy evaluation: Monte-Carlo policy evaluation Q = qπ


2 Policy improvement: Greedy policy improvement?

π(s) = \arg\max_a q(s, a)



Monte Carlo with Exploring Starts

1 One assumption to obtain the guarantee of convergence in PI: episodes have exploring starts
2 Exploring starts can ensure all actions are selected infinitely often



Monte Carlo with ε-Greedy Exploration

1 Trade-off between exploration and exploitation (we will talk about this in a later lecture)
2 ε-Greedy Exploration: Ensuring continual exploration (see the sketch below)
1 All actions are tried with non-zero probability
2 With probability 1 − ε choose the greedy action
3 With probability ε choose an action at random

π'(a|s) = \begin{cases} ε/|A| + 1 − ε & \text{if } a^* = \arg\max_{a \in A} Q(s, a) \\ ε/|A| & \text{otherwise} \end{cases}
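
A small sketch of the ε-greedy action selection defined above; Q is assumed to be a dictionary keyed by (state, action) pairs.

```python
import numpy as np

def epsilon_greedy(Q, state, num_actions, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one.
    Every action then has probability at least epsilon/|A|, and the greedy action
    gets the extra 1 - epsilon mass, matching the formula above."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax([Q[(state, a)] for a in range(num_actions)]))
```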



Monte Carlo with ε-Greedy Exploration

1 ε-Greedy policy improvement theorem (textbook p. 101): For any ε-greedy policy π, the ε-greedy policy π' with respect to q_π is an improvement, v_{π'}(s) ≥ v_π(s)
q_π(s, π'(s)) = \sum_{a \in A} π'(a|s) \, q_π(s, a)
             = \frac{ε}{|A|} \sum_{a \in A} q_π(s, a) + (1 − ε) \max_a q_π(s, a)
             ≥ \frac{ε}{|A|} \sum_{a \in A} q_π(s, a) + (1 − ε) \sum_{a \in A} \frac{π(a|s) − ε/|A|}{1 − ε} q_π(s, a)
             = \sum_{a \in A} π(a|s) \, q_π(s, a) = v_π(s)

Therefore, from the policy improvement theorem, v_{π'}(s) ≥ v_π(s)



Monte Carlo with ε-Greedy Exploration

Algorithm 2
1: Initialize Q(S, A) = 0, N(S, A) = 0, ε = 1, k = 1
2: π_k = ε-greedy(Q)
3: loop
4:   Sample the k-th episode (S1, A1, R2, ..., ST) ∼ π_k
5:   for each state St and action At in the episode do
6:     N(St, At) ← N(St, At) + 1
7:     Q(St, At) ← Q(St, At) + \frac{1}{N(St, At)} (Gt − Q(St, At))
8:   end for
9:   k ← k + 1, ε ← 1/k
10:  π_k = ε-greedy(Q)
11: end loop
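
A hedged Python rendering of Algorithm 2 (tabular, every-visit MC control with a decaying ε), assuming the same reset()/step() environment interface as in the earlier sketches.

```python
from collections import defaultdict
import random

def mc_epsilon_greedy_control(env, num_actions, num_episodes=1000, gamma=0.99):
    Q, N = defaultdict(float), defaultdict(int)
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                          # decay exploration, as in Algorithm 2
        # Sample the k-th episode with the current epsilon-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Every-visit MC update of Q toward the observed returns
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            N[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]
    return Q
```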



MC vs. TD for Prediction and Control

1 Temporal-difference (TD) learning has several advantages over


Monte-Carlo (MC)
1 Lower variance
2 Online
3 Incomplete sequences
2 So we can use TD instead of MC in our control loop
1 Apply TD to Q(S, A)
2 Use ε-greedy policy improvement
3 Update every time-step rather than at the end of one episode



Recall: TD Prediction

1 An episode consists of an alternating sequence of states and


state–action pairs:

2 TD(0) method for estimating the value function V (S)

At ← action given by π for St


Take action At , observe Rt+1 and St+1
V (St ) ← V (St ) + α[Rt+1 + γV (St+1 ) − V (St )]

3 How about estimating action value function Q(S, A)?



Sarsa: On-Policy TD Control

1 An episode consists of an alternating sequence of states and


state–action pairs:

2 ε-greedy policy for one step, then bootstrap the action value function:

Q(S_t, A_t) ← Q(S_t, A_t) + α \big[ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) \big]

3 The update is done after every transition from a nonterminal state St


4 The TD target is Rt+1 + γQ(St+1, At+1)



Sarsa algorithm
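
The full Sarsa pseudocode appears as a figure in the original deck; below is a minimal tabular sketch of the same on-policy loop, again assuming a reset()/step() environment interface.

```python
from collections import defaultdict
import random

def sarsa(env, num_actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: pick A' at S' with the same epsilon-greedy policy,
    then update Q(S, A) toward R + gamma * Q(S', A')."""
    Q = defaultdict(float)

    def act(s):
        if random.random() < epsilon:
            return random.randrange(num_actions)
        return max(range(num_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s_next, r, done = env.step(a)                    # assumed interface
            a_next = act(s_next)
            target = r + gamma * Q[(s_next, a_next)] * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```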



n-step Sarsa
1 Consider the following n-step Q-returns for n = 1, 2, ∞
n = 1 (Sarsa): q_t^{(1)} = R_{t+1} + γ Q(S_{t+1}, A_{t+1})
n = 2:         q_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 Q(S_{t+2}, A_{t+2})
...
n = ∞ (MC):    q_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step Q-return is defined as


q_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})

3 n-step Sarsa updates Q(s,a) towards the n-step Q-return:


 
Q(S_t, A_t) ← Q(S_t, A_t) + α \big( q_t^{(n)} − Q(S_t, A_t) \big)



On-policy Learning vs. Off-policy Learning

1 On-policy learning: Learn about policy π from the experience


collected from π
1 Behave non-optimally in order to explore all actions, then reduce the exploration, e.g., ε-greedy
2 Another important approach is off-policy learning, which essentially uses two different policies:
1 the one which is being learned about and becomes the optimal policy
2 the other one which is more exploratory and is used to generate
trajectories
3 Off-policy learning: Learn about policy π from the experience sampled
from another policy µ
1 π: target policy
2 µ: behavior policy



Off-policy Learning

1 Follow the behavior policy µ(a|s) to collect data

S1 , A1 , R2 , ..., ST ∼ µ
Update π using S1 , A1 , R2 , ..., ST
2 It leads to many benefits:
1 Learn about optimal policy while following exploratory policy
2 Learn from observing humans or other agents
3 Re-use experience generated from old policies π1 , π2 , ..., πt−1
Off-Policy Control with Q Learning

1 Off-policy learning of action values Q(s, a)


2 No importance sampling is needed
3 The next action in the TD target is selected as an alternative action A' ∼ π(·|S_{t+1})
4 Update Q(St, At) towards the value of the alternative action:

Q(S_t, A_t) ← Q(S_t, A_t) + α \big( R_{t+1} + γ Q(S_{t+1}, A') − Q(S_t, A_t) \big)






Off-Policy Control with Q-Learning
1 We allow both behavior and target policies to improve
2 The target policy π is greedy on Q(s, a)

π(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')

3 The behavior policy µ could be totally random, but we let it improve by following ε-greedy on Q(s, a)
4 Thus Q-learning target:

R_{t+1} + γ Q(S_{t+1}, A') = R_{t+1} + γ Q(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a'))
                           = R_{t+1} + γ \max_{a'} Q(S_{t+1}, a')

5 Thus the Q-Learning update

Q(S_t, A_t) ← Q(S_t, A_t) + α \big[ R_{t+1} + γ \max_a Q(S_{t+1}, a) − Q(S_t, A_t) \big]



Q-learning algorithm
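
The Q-learning pseudocode is likewise shown as a figure in the original deck; here is a minimal tabular sketch of the loop (same assumed environment interface). The only change from Sarsa is that the bootstrap uses max_a Q(S', a), the greedy target policy, rather than the action actually taken.

```python
from collections import defaultdict
import random

def q_learning(env, num_actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                       # behavior policy: epsilon-greedy
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)                       # assumed interface
            best_next = max(Q[(s_next, a_)] for a_ in range(num_actions))
            target = r + gamma * best_next * (not done)         # bootstrap from the greedy target policy
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```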



Comparison of Sarsa and Q-Learning

1 Sarsa: On-Policy TD control


Choose action At from St using policy derived from Q with ε-greedy
Take action At, observe Rt+1 and St+1
Choose action At+1 from St+1 using policy derived from Q with ε-greedy

Q(S_t, A_t) ← Q(S_t, A_t) + α \big[ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) \big]

2 Q-Learning: Off-Policy TD control


Choose action At from St using policy derived from Q with ε-greedy
Take action At, observe Rt+1 and St+1
Then ‘imagine’ At+1 as \arg\max_{a'} Q(S_{t+1}, a') in the update target

Q(S_t, A_t) ← Q(S_t, A_t) + α \big[ R_{t+1} + γ \max_a Q(S_{t+1}, a) − Q(S_t, A_t) \big]



Comparison of Sarsa and Q-Learning

1 Backup diagram for Sarsa and Q-learning

2 In Sarsa, A and A’ are sampled from the same policy so it is on-policy


3 In Q Learning, A and A’ are from different policies, with A being
more exploratory and A’ determined directly by the max operator



Example on Cliff Walk (Example 6.6 from Textbook)
https://github.com/ucla-rlcourse/RLexample/blob/master/modelfree/cliffwalk.py



Summary of DP and TD

Expected Update (DP)                                 Sample Update (TD)

Iterative Policy Evaluation                          TD Learning
V(s) ← E[R + γ V(S') | s]                            V(S) ←α R + γ V(S')

Q-Policy Iteration                                   Sarsa
Q(S, A) ← E[R + γ Q(S', A') | s, a]                  Q(S, A) ←α R + γ Q(S', A')

Q-Value Iteration                                    Q-Learning
Q(S, A) ← E[R + γ max_{a'∈A} Q(S', a') | s, a]       Q(S, A) ←α R + γ max_{a'∈A} Q(S', a')

where x ←α y is defined as x ← x + α(y − x)



Code Example of Sarsa and Q-Learning

https://github.com/ucla-rlcourse/RLexample/tree/master/modelfree



Off-policy Learning with Importance Sampling
What is importance sampling?
1 Estimate the expectation of a function f (x)
E_{x∼P}[f(x)] = \int f(x) P(x) \, dx ≈ \frac{1}{n} \sum_i f(x_i)

2 But sometimes it is difficult to sample x from P(x); we can instead sample x from another distribution Q(x) and then correct with the importance weight:
E_{x∼P}[f(x)] = \int P(x) f(x) \, dx
             = \int Q(x) \frac{P(x)}{Q(x)} f(x) \, dx
             = E_{x∼Q}\Big[ \frac{P(x)}{Q(x)} f(x) \Big] ≈ \frac{1}{n} \sum_i \frac{P(x_i)}{Q(x_i)} f(x_i)
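
A small numerical sketch of the estimator above, with made-up target and proposal distributions (P = N(0, 1), Q = N(1, 2²)); since E_{x∼P}[x²] = 1, the printed estimate should come out close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

def p_pdf(x):                     # density of the target P = N(0, 1)
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):                     # density of the proposal Q = N(1, 2^2)
    return np.exp(-(x - 1) ** 2 / 8) / np.sqrt(8 * np.pi)

x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # x_i ~ Q
weights = p_pdf(x) / q_pdf(x)                      # importance weights P(x_i) / Q(x_i)
print(np.mean(weights * f(x)))                     # approximately E_{x~P}[f(x)] = 1
```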



Off-policy Learning with Importance Sampling

1 Expected return: E_{τ∼π}[r(τ)], where r(·) is the reward function and π is the policy
2 Estimate the expectation of return using trajectories τi sampled from
another policy (behavior policy µ)
E_{τ∼π}[r(τ)] = \int P_π(τ) r(τ) \, dτ
             = \int P_µ(τ) \frac{P_π(τ)}{P_µ(τ)} r(τ) \, dτ
             = E_{τ∼µ}\Big[ \frac{P_π(τ)}{P_µ(τ)} r(τ) \Big]
             ≈ \frac{1}{n} \sum_i \frac{P_π(τ_i)}{P_µ(τ_i)} r(τ_i)



Off-Policy Monte Carlo with Importance Sampling

1 Generate an episode from the behavior policy µ and compute the return Gt

S1 , A1 , R2 , ..., ST ∼ µ

2 Weight return Gt according to similarity between policies


1 Multiply the importance sampling corrections along the whole episode:

G_t^{π/µ} = \frac{π(A_t|S_t)}{µ(A_t|S_t)} \frac{π(A_{t+1}|S_{t+1})}{µ(A_{t+1}|S_{t+1})} \cdots \frac{π(A_T|S_T)}{µ(A_T|S_T)} G_t

3 Update value towards the corrected return


V(S_t) ← V(S_t) + α \big( G_t^{π/µ} − V(S_t) \big)



Off-Policy TD with Importance Sampling

1 Use TD targets generated from µ to evaluate π


2 Weight TD target R + γV(S') by importance sampling
3 Only need a single importance sampling correction

V(S_t) ← V(S_t) + α \Big( \frac{π(A_t|S_t)}{µ(A_t|S_t)} \big( R_{t+1} + γ V(S_{t+1}) \big) − V(S_t) \Big)

4 Policies only need to be similar over a single step
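
A one-function sketch of the single-step correction above; pi_prob(a, s) and mu_prob(a, s) are assumed callables returning the action probabilities π(a|s) and µ(a|s).

```python
def off_policy_td_update(v, s, a, r, s_next, done, pi_prob, mu_prob,
                         alpha=0.1, gamma=0.99):
    """One off-policy TD(0) update: weight the TD target R + gamma * V(S')
    by the single-step importance ratio pi(A|S) / mu(A|S)."""
    rho = pi_prob(a, s) / mu_prob(a, s)
    td_target = r + gamma * v[s_next] * (not done)
    v[s] += alpha * (rho * td_target - v[s])
```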



Why not use importance sampling in Q-Learning?
1 Off-policy TD
V(S_t) ← V(S_t) + α \Big( \frac{π(A_t|S_t)}{µ(A_t|S_t)} \big( R_{t+1} + γ V(S_{t+1}) \big) − V(S_t) \Big)    (4)
2 Short answer: (1) Q-learning uses a deterministic (greedy) target policy, so there is no action probability to correct; (2) Q-learning does not make expected value estimates over the policy distribution.
3 Remember the Bellman optimality backup from value iteration:

Q(s, a) = R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \max_{a'} Q(s', a')    (5)
1 Q-learning can be considered a sample-based update of value iteration: instead of taking the expected value over the transition dynamics, we use a sample collected from the environment

Q(s, a) ← r + γ \max_{a'} Q(s', a')    (6)

2 The sampling in Q-learning is over the transition distribution, not over the policy distribution, thus there is no need to correct for a different policy distribution
Eligibility Traces (Textbook Chapter 12)

1 Remember that in TD learning, the return for n-step TD is

G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n v(S_{t+n})    (7)

2 A backup can be done toward a target that is half of a two-step return and half of a four-step return:

\frac{1}{2} G_{t:t+2} + \frac{1}{2} G_{t:t+4}    (8)
3 Such averaging gives another way of interrelating TD and Monte Carlo methods



Eligibility Traces

1 The λ-return is then defined as G_t^λ = (1 − λ) \sum_{n=1}^{∞} λ^{n−1} G_{t:t+n}

2 For λ = 1, updating according to the λ−return is a Monte Carlo


algorithm. On the other hand, if λ = 0 it becomes a one-step TD
method.
3 The λ−return gives an alternative way of moving smoothly between
Monte Carlo and one-step TD methods
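
A sketch that computes the λ-return of an episodic trajectory directly from its truncated n-step returns, which makes the two limits easy to check numerically (λ = 0 reduces to the one-step TD target, λ = 1 to the Monte Carlo return); the episode layout is the same assumption as in the n-step TD sketch.

```python
def lambda_return(rewards, states, v, t, lam=0.9, gamma=0.99):
    """G_t^lambda = (1 - lambda) * sum_{n >= 1} lambda^{n-1} * G_{t:t+n};
    for an episode of length T, all returns with t + n >= T equal the full MC return."""
    T = len(rewards)

    def n_step(n):
        steps = min(n, T - t)
        g = sum(gamma**k * rewards[t + k] for k in range(steps))
        if t + n < T:                                    # bootstrap if S_{t+n} is non-terminal
            g += gamma**n * v[states[t + n]]
        return g

    g_lambda = sum((1 - lam) * lam**(n - 1) * n_step(n) for n in range(1, T - t))
    g_lambda += lam**(T - t - 1) * n_step(T - t)         # remaining weight goes to the MC return
    return g_lambda
```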



Eligibility Traces

1 Based on the λ−return we can derive TD(λ), SARSA(λ),


Q-learning(λ), see Chapter 12
2 Eligibility traces provide an efficient, incremental way of shifting and
choosing between Monte Carlo and TD methods
3 It is similar to n-step method but offers different computational
complexity tradeoffs.



Unified View of Reinforcement Learning



Summary

1 Model-free prediction and control


2 Monte Carlo method
3 Temporal Difference learning
1 SARSA (on-policy TD control)
2 Q-Learning (off-policy TD control)
4 Importance Sampling
5 Eligibility traces
6 Next week:
1 Value function approximation and Deep Q-learning
2 Textbook Chapter 9

