
Week 3: Model-free Prediction and Control

Bolei Zhou

UCLA

October 10, 2023



Announcement

1 Assignment 1 is out, due by the end of Week 4.


1 https://github.com/ucla-rlcourse/assignment-2023fall
2 Please start early!
2 RL examples: https://github.com/ucla-rlcourse/RLexample



This Week’s Plan

1 Last week
1 MDP, policy evaluation, policy iteration and value iteration for solving
a known MDP
2 This week
1 Model-free prediction: Estimate value function of an unknown MDP
2 Model-free control: Optimize value function of an unknown MDP



Review of control in a known MDP
1 When the MDP is known:
1 Both R and P are exposed to the agent
2 Therefore we can run policy iteration and value iteration
2 Policy iteration: Given a known MDP, compute the optimal policy
and the optimal value function
1 Policy evaluation: iteration on the Bellman expectation backup
v_t(s) = \sum_{a \in A} π(a|s) \Big( R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{t-1}(s') \Big)

2 Policy improvement: greedy on action-value function q


q_{π_t}(s, a) = R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{π_t}(s')
π_{t+1}(s) = \arg\max_a q_{π_t}(s, a)



Review of control in a known MDP (introduced last week)

1 Value iteration: Given a known MDP, compute the optimal value function
2 Iteration on the Bellman optimality backup
v_{t+1}(s) ← \max_{a \in A} \Big( R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_t(s') \Big)

3 To retrieve the optimal policy after the value iteration:


π^*(s) ← \arg\max_a \Big( R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{end}(s') \Big)    (1)
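
As a concrete illustration of the two reviewed steps (Bellman optimality backup, then greedy policy retrieval), here is a minimal value-iteration sketch in Python for a small tabular MDP. The toy transition tensor P, reward table R, and discount factor are made-up placeholders for illustration, not anything specified in the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Value iteration on a known tabular MDP.
    P: transition probabilities with shape (S, A, S); R: rewards with shape (S, A)."""
    v = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup: v(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
        q = R + gamma * P @ v              # state-action values, shape (S, A)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    # Retrieve the optimal policy by acting greedily on the final value estimate (Eq. 1)
    policy = (R + gamma * P @ v).argmax(axis=1)
    return v, policy

# Toy 2-state, 2-action MDP (made-up numbers, for illustration only)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
v_star, pi_star = value_iteration(P, R)
print(v_star, pi_star)
```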



RL when we know how the world works

1 Both policy iteration and value iteration assume direct access to the dynamics and rewards of the environment

2 In many real-world problems, the MDP model is either unknown, or known but too big or too complex to use
1 Atari games, the game of Go, helicopter control, portfolio management, etc.



Model-free RL: Learning through interactions

1 Model-free RL solves these problems through interaction with the environment

2 No more direct access to the known transition dynamics and reward


function
3 Trajectories/episodes are collected by the agent’s interaction with the
environment
4 Each trajectory/episode contains {S1 , A1 , R2 , S2 , A2 , R3 , ..., ST }



Model-free prediction: policy evaluation without the access
to the model

1 Estimating the expected return of a particular policy if we don’t have


access to the MDP models
1 Monte Carlo policy evaluation
2 Temporal Difference (TD) learning



What is Monte-Carlo Method?

1 A broad class of computational algorithms that rely on repeated


random sampling to obtain numerical results
2 Example: Estimate the value of π



Monte-Carlo Policy Evaluation

1 Return: Gt = Rt+1 + γRt+2 + γ^2 Rt+3 + ... under policy π


2 v^π(s) = E_{τ∼π}[Gt | st = s], i.e., the expectation is over trajectories τ generated by following π
3 MC simulation: we can simply sample a lot of trajectories, compute
the actual returns for all the trajectories, then average them
4 MC policy evaluation uses empirical mean return instead of expected
return
5 MC does not require MDP dynamics/rewards, no bootstrapping, and
does not assume state is Markov.
6 Only applies to episodic MDPs (each episode terminates)



Example: Monte Carlo Algorithm for Computing the Value of an MRP
Algorithm 1 Monte Carlo simulation to calculate the MRP value function
1: i ← 0, G_t ← 0
2: while i ≠ N do
3:   generate an episode, starting from state s and time t
4:   using the generated episode, calculate the return g = \sum_{i=t}^{H-1} γ^{i−t} r_i
5:   G_t ← G_t + g, i ← i + 1
6: end while
7: V_t(s) ← G_t / N

1 For example: to calculate V(s4) we can generate a lot of trajectories
then take the average of the returns:
1 return for s4, s5, s6, s7: 0 + (1/2) × 0 + (1/4) × 10 = 2.5
2 return for s4, s3, s2, s1: 0 + (1/2) × 0 + (1/4) × 5 = 1.25
3 return for s4, s5, s6, s6: 0
4 more trajectories
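
A minimal Python sketch of Algorithm 1. The episode generator sample_episode is a hypothetical stand-in for the MRP simulator and is assumed to return the list of rewards observed from state s until termination; with γ = 1/2 it reproduces the returns in the example above.

```python
def mc_value(sample_episode, s, num_episodes=1000, gamma=0.5):
    """Monte Carlo estimate of V(s) for an MRP: average the sampled returns."""
    total_return = 0.0
    for _ in range(num_episodes):
        rewards = sample_episode(s)                           # [r_t, r_{t+1}, ...] until termination
        g = sum(gamma**k * r for k, r in enumerate(rewards))  # discounted return of this episode
        total_return += g
    return total_return / num_episodes

# A fake generator that always returns the s4 -> s5 -> s6 -> s7 trajectory from the example
v_s4 = mc_value(lambda s: [0, 0, 10], "s4", num_episodes=10, gamma=0.5)
print(v_s4)  # 2.5, matching the first return computed above
```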
Monte-Carlo Policy Evaluation

1 To evaluate state v (s)


1 Every time-step t that state s is visited in an episode,
2 Increment counter N(s) ← N(s) + 1
3 Increment total return S(s) ← S(s) + Gt
4 Value is estimated by mean return v (s) = S(s)/N(s)

2 By the law of large numbers, v(s) → v^π(s) as N(s) → ∞



Monte-Carlo Policy Evaluation

1 How to calculate G(s): compute the returns backward from the end of the episode



Incremental Mean

The running mean µ_t of samples x1, x2, ... can be computed incrementally:


µ_t = \frac{1}{t} \sum_{j=1}^{t} x_j
    = \frac{1}{t} \Big( x_t + \sum_{j=1}^{t-1} x_j \Big)
    = \frac{1}{t} \big( x_t + (t − 1) µ_{t-1} \big)
    = µ_{t-1} + \frac{1}{t} (x_t − µ_{t-1})



Incremental MC Updates

1 Collect one episode (S1, A1, R2, ..., ST)


2 For each state St with computed return Gt:

N(S_t) ← N(S_t) + 1
v(S_t) ← v(S_t) + \frac{1}{N(S_t)} \big( G_t − v(S_t) \big)

3 Or use a running mean (old episodes are forgotten). Good for


non-stationary problems.

v (St ) ← v (St ) + α(Gt − v (St ))
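
A short sketch of both update rules: the exact incremental mean with visit counts N(s), and the constant-α running mean that gradually forgets old episodes. The (state, return) pair format of an episode is an assumption made for illustration.

```python
from collections import defaultdict

def mc_update_exact(v, counts, episode_returns):
    """Incremental MC with the exact mean: v(s) += (G - v(s)) / N(s)."""
    for s, g in episode_returns:          # episode_returns: list of (state, return G_t) pairs
        counts[s] += 1
        v[s] += (g - v[s]) / counts[s]

def mc_update_running(v, episode_returns, alpha=0.1):
    """Constant step-size MC: v(s) += alpha * (G - v(s)); good for non-stationary problems."""
    for s, g in episode_returns:
        v[s] += alpha * (g - v[s])

v, counts = defaultdict(float), defaultdict(int)
mc_update_exact(v, counts, [("s4", 2.5), ("s5", 0.0)])
mc_update_running(v, [("s4", 1.25)])
print(dict(v))
```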



Difference between DP and MC for policy evaluation

1 Dynamic Programming (DP) computes v_i by bootstrapping the rest of the expected return with the value estimate v_{i−1}
2 Iteration on Bellman expectation backup:
v_i(s) ← \sum_{a \in A} π(a|s) \Big( R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{i-1}(s') \Big)



Difference between DP and MC for policy evaluation

1 MC updates the empirical mean return with one sampled episode

v (St ) ← v (St ) + α(Gi,t − v (St ))



Advantages of MC over DP

1 MC works when the environment is unknown


2 Working with sample episodes has a huge advantage, even when one has complete knowledge of the environment’s dynamics, for example when the transition probabilities are complex to compute
3 The cost of estimating a single state’s value is independent of the total number of states, so you can sample episodes starting from the states of interest and then average the returns



Introduction of Temporal-Difference (TD) Learning

1 TD methods learn directly from episodes of experience


2 TD is model-free: no knowledge of MDP transitions/rewards
3 TD learns from incomplete episodes, by bootstrapping



Introduction of Temporal-Difference (TD) Learning

1 Objective: learn vπ online from experience under policy π


2 Simplest TD algorithm: TD(0)
1 Update v (St ) toward estimated return Rt+1 + γv (St+1 )

v (St ) ← v (St ) + α(Rt+1 + γv (St+1 ) − v (St ))

3 Rt+1 + γv(St+1) is called the TD target


4 δt = Rt+1 + γv (St+1 ) − v (St ) is called the TD error
5 Comparison: Incremental Monte-Carlo
1 Update v (St ) toward actual return Gt given an episode i

v (St ) ← v (St ) + α(Gi,t − v (St ))
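
A minimal sketch of the TD(0) evaluation loop above, assuming a generic environment whose reset() returns a state and whose step(a) returns (next_state, reward, done); this interface is an assumption for illustration, not the course's exact API.

```python
from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """TD(0): after every step, move v(S_t) toward the TD target R_{t+1} + gamma * v(S_{t+1})."""
    v = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)                    # assumed (state, reward, done) interface
            td_target = r + gamma * v[s_next] * (not done)   # do not bootstrap past a terminal state
            v[s] += alpha * (td_target - v[s])               # alpha * TD error
            s = s_next
    return v
```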



Advantages of TD over MC



Comparison of TD and MC

1 TD can learn online after every step


2 MC must wait until end of episode before return is known

3 TD can learn from incomplete sequences


4 MC can only learn from complete sequences

5 TD works in continuing (non-terminating) environments


6 MC only works for episodic (terminating) environments

7 TD exploits Markov property, more efficient in Markov environments


8 MC does not exploit Markov property, more effective in non-Markov
environments



n-step TD

1 n-step TD methods generalize both one-step TD and MC.


2 We can shift from one to the other smoothly as needed to meet the
demands of a particular task.



n-step TD prediction

1 Consider the following n-step returns for n = 1, 2, ∞


n = 1 (TD):   G_t^{(1)} = R_{t+1} + γ v(S_{t+1})
n = 2:        G_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 v(S_{t+2})
...
n = ∞ (MC):   G_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step return is defined as

G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n v(S_{t+n})

 
3 n-step TD: v(S_t) ← v(S_t) + α \big( G_t^{(n)} − v(S_t) \big)
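
A sketch of the n-step return and the corresponding tabular update, computed from a recorded episode; the list layout (states[k] = S_k, rewards[k] = R_{k+1}) is an assumption. The same pattern gives n-step Sarsa later in this lecture by replacing v(S_{t+n}) with Q(S_{t+n}, A_{t+n}).

```python
def n_step_return(rewards, states, v, t, n, gamma=0.99):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n v(S_{t+n}),
    truncated to the Monte Carlo return if the episode ends before t + n."""
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:                                   # bootstrap only if S_{t+n} is non-terminal
        g += gamma**n * v[states[t + n]]
    return g

def n_step_td_update(v, states, rewards, t, n, alpha=0.1, gamma=0.99):
    g = n_step_return(rewards, states, v, t, n, gamma)
    v[states[t]] += alpha * (g - v[states[t]])      # move v(S_t) toward the n-step return
```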



Bootstrapping and Sampling for DP, MC, and TD

1 Bootstrapping: update involves an estimate


1 MC does not bootstrap
2 DP bootstraps
3 TD bootstraps
2 Sampling: update samples an expectation
1 MC samples
2 DP does not sample
3 TD samples



Unified View: Dynamic Programming Backup

v (St ) ← Eπ [Rt+1 + γv (St+1 )]



Unified View: Monte-Carlo Backup

v (St ) ← v (St ) + α(Gt − v (St ))



Unified View: Temporal-Difference Backup

TD(0): v(St) ← v(St) + α(Rt+1 + γv(St+1) − v(St))



Unified View of Reinforcement Learning



A Short Summary

1 Model-free prediction
1 Evaluate the state value by only interacting with the environment
2 Many algorithms can do it: Temporal Difference Learning and
Monte-Carlo method



Model-free Control for MDP

1 Model-free control:
1 Optimize the value function of an unknown MDP
2 Generate an optimal control policy
2 Generalized Policy Iteration (GPI) with MC or TD in the loop



Revisiting Policy Iteration

1 Iterate through the two steps:


1 Evaluate the policy π (computing v given current π)
2 Improve the policy by acting greedily with respect to vπ

π' = greedy(v_π)    (2)



Policy Iteration for a Known MDP
1 compute the state-action value of a policy π:
q_{π_i}(s, a) = R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \, v_{π_i}(s')

2 Compute new policy πi+1 for all s ∈ S following


π_{i+1}(s) = \arg\max_a q_{π_i}(s, a)    (3)

3 Problem: What to do if neither R(s, a) nor P(s'|s, a) is known/available?
Generalized Policy Iteration with Action-Value Function
Monte Carlo version of policy iteration

1 Policy evaluation: Monte-Carlo policy evaluation Q = qπ


2 Policy improvement: Greedy policy improvement?

π(s) = \arg\max_a q(s, a)



Monte Carlo with Exploring Starts

1 One assumption to obtain the guarantee of convergence in PI: episodes have exploring starts
2 Exploring starts can ensure all actions are selected infinitely often



Monte Carlo with ε-Greedy Exploration

1 Trade-off between exploration and exploitation (we will talk about this in a later lecture)
2 ε-Greedy Exploration: Ensuring continual exploration (see the sketch below)
1 All actions are tried with non-zero probability
2 With probability 1 − ε choose the greedy action
3 With probability ε choose an action at random

π'(a|s) = \begin{cases} ε/|A| + 1 − ε & \text{if } a^* = \arg\max_{a \in A} Q(s, a) \\ ε/|A| & \text{otherwise} \end{cases}
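
A small sketch of the ε-greedy action selection defined above; Q is assumed to be a dictionary keyed by (state, action) pairs.

```python
import numpy as np

def epsilon_greedy(Q, state, num_actions, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one.
    Every action then has probability at least epsilon/|A|, and the greedy action
    gets the extra 1 - epsilon mass, matching the formula above."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax([Q[(state, a)] for a in range(num_actions)]))
```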



Monte Carlo with ε-Greedy Exploration

1 ε-Greedy policy improvement theorem (textbook p. 101): For any ε-greedy policy π, the ε-greedy policy π' with respect to q_π is an improvement, v_{π'}(s) ≥ v_π(s)
q_π(s, π'(s)) = \sum_{a \in A} π'(a|s) \, q_π(s, a)
             = \frac{ε}{|A|} \sum_{a \in A} q_π(s, a) + (1 − ε) \max_a q_π(s, a)
             ≥ \frac{ε}{|A|} \sum_{a \in A} q_π(s, a) + (1 − ε) \sum_{a \in A} \frac{π(a|s) − ε/|A|}{1 − ε} q_π(s, a)
             = \sum_{a \in A} π(a|s) \, q_π(s, a) = v_π(s)

Therefore, from the policy improvement theorem, v_{π'}(s) ≥ v_π(s)



Monte Carlo with ε-Greedy Exploration

Algorithm 2
1: Initialize Q(S, A) = 0, N(S, A) = 0, ε = 1, k = 1
2: π_k = ε-greedy(Q)
3: loop
4:   Sample the k-th episode (S1, A1, R2, ..., ST) ∼ π_k
5:   for each state St and action At in the episode do
6:     N(St, At) ← N(St, At) + 1
7:     Q(St, At) ← Q(St, At) + \frac{1}{N(St, At)} (Gt − Q(St, At))
8:   end for
9:   k ← k + 1, ε ← 1/k
10:  π_k = ε-greedy(Q)
11: end loop
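
A hedged Python rendering of Algorithm 2 (tabular, every-visit MC control with a decaying ε), assuming the same reset()/step() environment interface as in the earlier sketches.

```python
from collections import defaultdict
import random

def mc_epsilon_greedy_control(env, num_actions, num_episodes=1000, gamma=0.99):
    Q, N = defaultdict(float), defaultdict(int)
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                          # decay exploration, as in Algorithm 2
        # Sample the k-th episode with the current epsilon-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Every-visit MC update of Q toward the observed returns
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            N[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]
    return Q
```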



MC vs. TD for Prediction and Control

1 Temporal-difference (TD) learning has several advantages over


Monte-Carlo (MC)
1 Lower variance
2 Online
3 Incomplete sequences
2 So we can use TD instead of MC in our control loop
1 Apply TD to Q(S, A)
2 Use ε-greedy policy improvement
3 Update every time-step rather than at the end of one episode



Recall: TD Prediction

1 An episode consists of an alternating sequence of states and


state–action pairs:

2 TD(0) method for estimating the value function V (S)

At ← action given by π for St


Take action At , observe Rt+1 and St+1
V (St ) ← V (St ) + α[Rt+1 + γV (St+1 ) − V (St )]

3 How about estimating action value function Q(S, A)?



Sarsa: On-Policy TD Control

1 An episode consists of an alternating sequence of states and


state–action pairs:

2 ε-greedy policy for one step, then bootstrap the action value function:

Q(S_t, A_t) ← Q(S_t, A_t) + α \big[ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) \big]

3 The update is done after every transition from a nonterminal state St


4 The TD target is Rt+1 + γQ(St+1, At+1)



Sarsa algorithm
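
The full Sarsa pseudocode appears as a figure in the original deck; below is a minimal tabular sketch of the same on-policy loop, again assuming a reset()/step() environment interface.

```python
from collections import defaultdict
import random

def sarsa(env, num_actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: pick A' at S' with the same epsilon-greedy policy,
    then update Q(S, A) toward R + gamma * Q(S', A')."""
    Q = defaultdict(float)

    def act(s):
        if random.random() < epsilon:
            return random.randrange(num_actions)
        return max(range(num_actions), key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s_next, r, done = env.step(a)                    # assumed interface
            a_next = act(s_next)
            target = r + gamma * Q[(s_next, a_next)] * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```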



n-step Sarsa
1 Consider the following n-step Q-returns for n = 1, 2, ∞
n = 1 (Sarsa): q_t^{(1)} = R_{t+1} + γ Q(S_{t+1}, A_{t+1})
n = 2:         q_t^{(2)} = R_{t+1} + γ R_{t+2} + γ^2 Q(S_{t+2}, A_{t+2})
...
n = ∞ (MC):    q_t^{(∞)} = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step Q-return is defined as


q_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})

3 n-step Sarsa updates Q(s,a) towards the n-step Q-return:


 
Q(S_t, A_t) ← Q(S_t, A_t) + α \big( q_t^{(n)} − Q(S_t, A_t) \big)



On-policy Learning vs. Off-policy Learning

1 On-policy learning: Learn about policy π from the experience


collected from π
1 Behave non-optimally in order to explore all actions, then reduce the exploration, e.g., ε-greedy
2 Another important approach is off-policy learning, which essentially uses two different policies:
1 the one which is being learned about and becomes the optimal policy
2 the other one which is more exploratory and is used to generate
trajectories
3 Off-policy learning: Learn about policy π from the experience sampled
from another policy µ
1 π: target policy
2 µ: behavior policy



Off-policy Learning

1 Follow the behavior policy µ(a|s) to collect data

S1 , A1 , R2 , ..., ST ∼ µ
Update π using S1 , A1 , R2 , ..., ST
2 It leads to many benefits:
1 Learn about optimal policy while following exploratory policy
2 Learn from observing humans or other agents
3 Re-use experience generated from old policies π1 , π2 , ..., πt−1
Off-Policy Control with Q Learning

1 Off-policy learning of action values Q(s, a)


2 No importance sampling is needed
3 The next action in the TD target is selected as an alternative action A' ∼ π(·|S_{t+1})
4 Update Q(St, At) towards the value of the alternative action:

Q(S_t, A_t) ← Q(S_t, A_t) + α \big( R_{t+1} + γ Q(S_{t+1}, A') − Q(S_t, A_t) \big)






Off-Policy Control with Q-Learning
1 We allow both behavior and target policies to improve
2 The target policy π is greedy on Q(s, a)

π(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')

3 The behavior policy µ could be totally random, but we let it improve by following ε-greedy on Q(s, a)
4 Thus Q-learning target:

R_{t+1} + γ Q(S_{t+1}, A') = R_{t+1} + γ Q(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a'))
                           = R_{t+1} + γ \max_{a'} Q(S_{t+1}, a')

5 Thus the Q-Learning update

Q(S_t, A_t) ← Q(S_t, A_t) + α \big[ R_{t+1} + γ \max_a Q(S_{t+1}, a) − Q(S_t, A_t) \big]



Q-learning algorithm
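
The Q-learning pseudocode is likewise shown as a figure in the original deck; here is a minimal tabular sketch of the loop (same assumed environment interface). The only change from Sarsa is that the bootstrap uses max_a Q(S', a), the greedy target policy, rather than the action actually taken.

```python
from collections import defaultdict
import random

def q_learning(env, num_actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                       # behavior policy: epsilon-greedy
                a = random.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)                       # assumed interface
            best_next = max(Q[(s_next, a_)] for a_ in range(num_actions))
            target = r + gamma * best_next * (not done)         # bootstrap from the greedy target policy
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```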



Comparison of Sarsa and Q-Learning

1 Sarsa: On-Policy TD control


Choose action At from St using policy derived from Q with ε-greedy
Take action At, observe Rt+1 and St+1
Choose action At+1 from St+1 using policy derived from Q with ε-greedy

Q(S_t, A_t) ← Q(S_t, A_t) + α \big[ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) \big]

2 Q-Learning: Off-Policy TD control


Choose action At from St using policy derived from Q with ε-greedy
Take action At, observe Rt+1 and St+1
Then ‘imagine’ At+1 as \arg\max_{a'} Q(S_{t+1}, a') in the update target

Q(S_t, A_t) ← Q(S_t, A_t) + α \big[ R_{t+1} + γ \max_a Q(S_{t+1}, a) − Q(S_t, A_t) \big]



Comparison of Sarsa and Q-Learning

1 Backup diagram for Sarsa and Q-learning

2 In Sarsa, A and A’ are sampled from the same policy so it is on-policy


3 In Q Learning, A and A’ are from different policies, with A being
more exploratory and A’ determined directly by the max operator



Example on Cliff Walk (Example 6.6 from Textbook)
https://github.com/ucla-rlcourse/RLexample/blob/master/modelfree/cliffwalk.py



Summary of DP and TD

Expected Update (DP)                                 Sample Update (TD)

Iterative Policy Evaluation                          TD Learning
V(s) ← E[R + γ V(S') | s]                            V(S) ←α R + γ V(S')

Q-Policy Iteration                                   Sarsa
Q(S, A) ← E[R + γ Q(S', A') | s, a]                  Q(S, A) ←α R + γ Q(S', A')

Q-Value Iteration                                    Q-Learning
Q(S, A) ← E[R + γ max_{a'∈A} Q(S', a') | s, a]       Q(S, A) ←α R + γ max_{a'∈A} Q(S', a')

where x ←α y is defined as x ← x + α(y − x)



Code Example of Sarsa and Q-Learning

https://github.com/ucla-rlcourse/RLexample/tree/master/modelfree



Off-policy Learning with Importance Sampling
What is importance sampling?
1 Estimate the expectation of a function f (x)
E_{x∼P}[f(x)] = \int f(x) P(x) \, dx ≈ \frac{1}{n} \sum_i f(x_i)

2 But sometimes it is difficult to sample x from P(x); we can instead sample x from another distribution Q(x) and then correct with the importance weight:
E_{x∼P}[f(x)] = \int P(x) f(x) \, dx
             = \int Q(x) \frac{P(x)}{Q(x)} f(x) \, dx
             = E_{x∼Q}\Big[ \frac{P(x)}{Q(x)} f(x) \Big] ≈ \frac{1}{n} \sum_i \frac{P(x_i)}{Q(x_i)} f(x_i)
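
A small numerical sketch of the estimator above, with made-up target and proposal distributions (P = N(0, 1), Q = N(1, 2²)); since E_{x∼P}[x²] = 1, the printed estimate should come out close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

def p_pdf(x):                     # density of the target P = N(0, 1)
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):                     # density of the proposal Q = N(1, 2^2)
    return np.exp(-(x - 1) ** 2 / 8) / np.sqrt(8 * np.pi)

x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # x_i ~ Q
weights = p_pdf(x) / q_pdf(x)                      # importance weights P(x_i) / Q(x_i)
print(np.mean(weights * f(x)))                     # approximately E_{x~P}[f(x)] = 1
```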



Off-policy Learning with Importance Sampling

1 Expected return: E_{τ∼π}[r(τ)], where r(·) is the reward function and π is the policy
2 Estimate the expectation of return using trajectories τi sampled from
another policy (behavior policy µ)
E_{τ∼π}[r(τ)] = \int P_π(τ) r(τ) \, dτ
             = \int P_µ(τ) \frac{P_π(τ)}{P_µ(τ)} r(τ) \, dτ
             = E_{τ∼µ}\Big[ \frac{P_π(τ)}{P_µ(τ)} r(τ) \Big]
             ≈ \frac{1}{n} \sum_i \frac{P_π(τ_i)}{P_µ(τ_i)} r(τ_i)



Off-Policy Monte Carlo with Importance Sampling

1 Generate an episode from the behavior policy µ and compute the return Gt

S1 , A1 , R2 , ..., ST ∼ µ

2 Weight return Gt according to similarity between policies


1 Multiply the importance sampling corrections along the whole episode:

G_t^{π/µ} = \frac{π(A_t|S_t)}{µ(A_t|S_t)} \frac{π(A_{t+1}|S_{t+1})}{µ(A_{t+1}|S_{t+1})} \cdots \frac{π(A_T|S_T)}{µ(A_T|S_T)} G_t

3 Update value towards the corrected return


V(S_t) ← V(S_t) + α \big( G_t^{π/µ} − V(S_t) \big)



Off-Policy TD with Importance Sampling

1 Use TD targets generated from µ to evaluate π


2 Weight TD target R + γV(S') by importance sampling
3 Only need a single importance sampling correction

V(S_t) ← V(S_t) + α \Big( \frac{π(A_t|S_t)}{µ(A_t|S_t)} \big( R_{t+1} + γ V(S_{t+1}) \big) − V(S_t) \Big)

4 Policies only need to be similar over a single step
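
A one-function sketch of the single-step correction above; pi_prob(a, s) and mu_prob(a, s) are assumed callables returning the action probabilities π(a|s) and µ(a|s).

```python
def off_policy_td_update(v, s, a, r, s_next, done, pi_prob, mu_prob,
                         alpha=0.1, gamma=0.99):
    """One off-policy TD(0) update: weight the TD target R + gamma * V(S')
    by the single-step importance ratio pi(A|S) / mu(A|S)."""
    rho = pi_prob(a, s) / mu_prob(a, s)
    td_target = r + gamma * v[s_next] * (not done)
    v[s] += alpha * (rho * td_target - v[s])
```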



Why not use importance sampling in Q-Learning?
1 Off-policy TD
V(S_t) ← V(S_t) + α \Big( \frac{π(A_t|S_t)}{µ(A_t|S_t)} \big( R_{t+1} + γ V(S_{t+1}) \big) − V(S_t) \Big)    (4)
2 Short answer: (1) Q-learning uses a deterministic (greedy) target policy, so there is no action probability to correct; (2) Q-learning does not make expected value estimates over the policy distribution.
3 Remember the Bellman optimality backup from value iteration:

Q(s, a) = R(s, a) + γ \sum_{s' \in S} P(s'|s, a) \max_{a'} Q(s', a')    (5)
1 Q-learning can be considered a sample-based update of value iteration: instead of taking the expected value over the transition dynamics, we use a sample collected from the environment

Q(s, a) ← r + γ \max_{a'} Q(s', a')    (6)

2 The sampling in Q-learning is over the transition distribution, not over the policy distribution, thus there is no need to correct for a different policy distribution
Eligibility Traces (Textbook Chapter 12)

1 Remember that in TD learning, the return for n-step TD is

G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n v(S_{t+n})    (7)

2 A backup can be done toward a target that is half of a two-step return and half of a four-step return:

\frac{1}{2} G_{t:t+2} + \frac{1}{2} G_{t:t+4}    (8)
3 Such averaging gives another way of interrelating TD and Monte Carlo methods



Eligibility Traces

1 The λ-return is then defined as G_t^λ = (1 − λ) \sum_{n=1}^{∞} λ^{n−1} G_{t:t+n}

2 For λ = 1, updating according to the λ−return is a Monte Carlo


algorithm. On the other hand, if λ = 0 it becomes a one-step TD
method.
3 The λ−return gives an alternative way of moving smoothly between
Monte Carlo and one-step TD methods
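
A sketch that computes the λ-return of an episodic trajectory directly from its truncated n-step returns, which makes the two limits easy to check numerically (λ = 0 reduces to the one-step TD target, λ = 1 to the Monte Carlo return); the episode layout is the same assumption as in the n-step TD sketch.

```python
def lambda_return(rewards, states, v, t, lam=0.9, gamma=0.99):
    """G_t^lambda = (1 - lambda) * sum_{n >= 1} lambda^{n-1} * G_{t:t+n};
    for an episode of length T, all returns with t + n >= T equal the full MC return."""
    T = len(rewards)

    def n_step(n):
        steps = min(n, T - t)
        g = sum(gamma**k * rewards[t + k] for k in range(steps))
        if t + n < T:                                    # bootstrap if S_{t+n} is non-terminal
            g += gamma**n * v[states[t + n]]
        return g

    g_lambda = sum((1 - lam) * lam**(n - 1) * n_step(n) for n in range(1, T - t))
    g_lambda += lam**(T - t - 1) * n_step(T - t)         # remaining weight goes to the MC return
    return g_lambda
```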



Eligibility Traces

1 Based on the λ−return we can derive TD(λ), SARSA(λ),


Q-learning(λ), see Chapter 12
2 Eligibility traces provide an efficient, incremental way of shifting and
choosing between Monte Carlo and TD methods
3 It is similar to n-step method but offers different computational
complexity tradeoffs.



Unified View of Reinforcement Learning



Summary

1 Model-free prediction and control


2 Monte Carlo method
3 Temporal Difference learning
1 SARSA (on-policy TD control)
2 Q-Learning (off-policy TD control)
4 Importance Sampling
5 Eligibility traces
6 Next week:
1 Value function approximation and Deep Q-learning
2 Textbook Chapter 9

