
Some slides are from: David Silver (DeepMind), Katerina Fragkiadaki (CMU),

Emma Brunskill (Stanford), Bolei Zhou (UCLA), Hado van Hasselt (DeepMind)

COMP 4901Z: Reinforcement Learning


2.2 Model-Free Control

Long Chen (Dept. of CSE)


Model-free Reinforcement Learning

• Last lecture
• Model-free prediction
• Estimate the value function of an unknown MDP
• This lecture
• Model-free control
• Optimize the value function of an unknown MDP

2
Recap: DP vs. MC vs. TD Learning
MC: Sample average return
• Remember: approximates the expectation
  V_π(s) = 𝔼_π[G_t | S_t = s]
         = 𝔼_π[∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]
         = 𝔼_π[R_{t+1} + γ ∑_{k=0}^∞ γ^k R_{t+k+2} | S_t = s]
         = 𝔼_π[R_{t+1} + γ V_π(S_{t+1}) | S_t = s]

TD: combines both: sample expected values and use a current estimate V(S_{t+1}) of the true V_π(S_{t+1})
DP: the expected values are provided by a model, but we use a current estimate V(S_{t+1}) of the true V_π(S_{t+1})

3
Recap: Monte-Carlo Backup

V(S_t) ← V(S_t) + α [G_t − V(S_t)]

4
Recap: Temporal-Difference Backup

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

5
Recap: Dynamic Programming Backup

V(S_t) ← 𝔼_π[R_{t+1} + γ V(S_{t+1})]

6
Recap: n-Step Return

• Consider the following n-step returns for n = 1, 2, …, ∞:

  n = 1 (TD)   G_t^(1) = R_{t+1} + γ V(S_{t+1})
  n = 2        G_t^(2) = R_{t+1} + γ R_{t+2} + γ² V(S_{t+2})
  ⋮
  n = ∞ (MC)   G_t^(∞) = R_{t+1} + γ R_{t+2} + ⋯ + γ^{T−1} R_T

• Define the n-step return
  G_t^(n) = R_{t+1} + γ R_{t+2} + ⋯ + γ^{n−1} R_{t+n} + γ^n V(S_{t+n})
• n-step temporal-difference learning
  V(S_t) ← V(S_t) + α [G_t^(n) − V(S_t)]
7
Recap: 𝜆-return

• The λ-return G_t^λ combines all n-step returns G_t^(n)
• Using weight (1 − λ) λ^{n−1}
  G_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} G_t^(n)
• Forward-view TD(λ)
  V(S_t) ← V(S_t) + α [G_t^λ − V(S_t)]

8
2.2 Model-Free Control
Outline

• Introduction
• On-Policy Monte-Carlo Control
• Off-Policy Monte-Carlo Control
• On-Policy Temporal-Difference Learning
• Off-Policy Temporal-Difference Learning
• Summary

10
Uses of Model-Free Control

• Some example problems that can be modelled as MDPs


• Elevator • Robocup Soccer
• Parallel Parking • Quake
• Ship Steering • Portfolio Management
• Bioreactor • Protein Folding
• Helicopter • Robot Walking
• Aeroplane Logistics • Game of Go

• For most of these problems, either:


• MDP model is unknown, but experience can be sampled
• MDP model is known, but is too big to use, except by samples
• Model-free control can solve these problems.
11
Monte-Carlo Control
Recap: Generalized Policy Iteration

• Policy evaluation: estimate V_π
  • Iterative policy evaluation
• Policy improvement: generate π′ ≥ π
  • Greedy policy improvement

13
Generalized Policy Iteration with Monte-Carlo Evaluation

• Monte Carlo version of policy iteration

• Policy evaluation: Monte-Carlo policy evaluation, V = V_π?


• Policy improvement: Greedy policy improvement?

14
Model-Free Policy Iteration using Action-Value Function

• There are two types of value functions (e.g., V(s) and Q(s, a)); which one to use for policy improvement?
• Greedy policy improvement over V(s) requires a model of the MDP
  π′(s) = argmax_{a∈A} [R(s, a) + γ ∑_{s′} P(s′ | s, a) V(s′)]
• Greedy policy improvement over Q(s, a) is model-free
  π′(s) = argmax_{a∈A} Q(s, a)

15
Generalized Policy Iteration with Action-Value Function

• Policy evaluation: Monte-Carlo policy evaluation, Q = Q_π


• Policy improvement: Greedy policy improvement?

16
Convergence of MC Control

• The greedified policy meets the conditions for policy improvement:
• For all s ∈ S:
  Q_{π_k}(s, π_{k+1}(s)) = Q_{π_k}(s, argmax_a Q_{π_k}(s, a))
                         = max_a Q_{π_k}(s, a)
                         ≥ Q_{π_k}(s, π_k(s))
                         ≥ V_{π_k}(s)
• And thus π_{k+1} ≥ π_k
• This assumes exploring starts and an infinite number of episodes for MC policy evaluation
17
Monte Carlo Control

• Generalized Policy Iteration


  π_0 →(E) Q_{π_0} →(I) π_1 →(E) Q_{π_1} →(I) π_2 →(E) ⋯ →(I) π_* →(E) Q_*
  (E = policy evaluation, I = policy improvement)
• Two unlikely assumptions are needed to obtain the convergence guarantee for Monte-Carlo methods:
  1. The episodes have exploring starts
  2. Policy evaluation is done with an infinite number of episodes
• In DP there are two ways to handle the second issue (the same apply to MC):
  • Take enough steps in each policy evaluation to make the error bounds sufficiently small (requires too many episodes to be practical)
  • Similar to value iteration, stop early and only move the value function towards Q_{π_k}
18
Recap: Greedy Policy

• For any action-value function Q, the corresponding greedy policy is the one that, for each s, deterministically chooses an action with maximal action value:
  π(s) = argmax_a Q(s, a)
• Policy improvement can then be done by constructing each π_{k+1} as the greedy policy with respect to Q_{π_k}

19
MC Estimation of Action Values 𝑄

• Monte Carlo (MC) is most useful when a model is not available


• We want to learn Q*(s, a) because then we can get an optimal policy without knowing the dynamics.
• Q_π(s, a): average return starting from state s and action a, then following π
  Q_π(s, a) = 𝔼_π[R_{t+1} + γ V_π(S_{t+1}) | S_t = s, A_t = a]
            = ∑_{s′, r} P(s′, r | s, a) [r + γ V_π(s′)]

• Converges asymptotically if every state-action pair is visited.


• Q: Is that possible if we are using a deterministic policy?

20
Recap: The Exploration Problem

Example of Greedy Action Selection


• There are two doors in front of you.
• You open the left door and get reward 0
𝑉 𝑙𝑒𝑓𝑡 = 0
• You open the right door and get reward +1
𝑉 𝑟𝑖𝑔ℎ𝑡 = +1
• You open the right door and get reward +3
𝑉 𝑟𝑖𝑔ℎ𝑡 = +2


• Are you sure you’ve chosen the best door?
21
Recap: The Exploration Problem

• If we always follow the deterministic policy to collect experience, we will never have the opportunity to see and evaluate (estimate Q for) alternative actions…
• ALL learning methods face a dilemma: they seek to learn action values conditioned on subsequent optimal behavior, but they need to act suboptimally in order to explore all actions (to discover the optimal actions). This is the exploration-exploitation dilemma.
• Q: Does a learning algorithm know when the optimal policy has been reached to
stop exploring?

22
Recap: The Exploration Problem

• If we always follow the deterministic policy to collect experience, we will never have the opportunity to see and evaluate (estimate Q for) alternative actions…
• ALL learning methods face a dilemma: they seek to learn action values conditioned on subsequent optimal behavior, but they need to act suboptimally in order to explore all actions (to discover the optimal actions). This is the exploration-exploitation dilemma.
• Solutions:
  1. Exploring starts: every state-action pair has a non-zero probability of being the starting pair
  2. Give up on deterministic policies and only search over ε-soft policies
  3. Off-policy: use a different policy to collect experience than the one you care to evaluate
23
Monte Carlo ES (Exploring Starts)

Q: The pseudocode is inefficient, how can we further improve the efficiency?

24
Monte Carlo with 𝜖-Greedy Exploration

• Trade-off between exploration and exploitation


• 𝝐-greedy exploration: ensuring continual exploration
• All actions are tried with non-zero probability
• With probability 1 − 𝜖 choose the greedy action
• With probability 𝜖 choose an action at random

  π(a|s) = ε/|A| + 1 − ε   if a = argmax_{a′∈A} Q(s, a′)
           ε/|A|           otherwise

25
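As a small illustration of the rule above, here is a minimal sketch of ε-greedy action selection over a tabular Q; the array shapes, tie-breaking, and example numbers are assumptions for illustration, not from the slides.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Pick an action for `state` from a tabular Q of shape [n_states, n_actions]:
    with probability epsilon choose uniformly at random, otherwise act greedily.
    Because the random branch can also pick the greedy action, the greedy action
    ends up with probability epsilon/|A| + 1 - epsilon, matching the slide."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit (ties broken by first max)

rng = np.random.default_rng(0)
Q = np.array([[0.0, 1.0, 0.5]])               # one state, three actions (toy numbers)
print(epsilon_greedy_action(Q, state=0, epsilon=0.1, rng=rng))
```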
𝜖-Greedy Policy Improvement
Theorem
For any ε-greedy policy π, the ε-greedy policy π′ with respect to Q_π is an improvement, V_{π′}(s) ≥ V_π(s)

  Q_π(s, π′(s)) = ∑_{a∈A} π′(a|s) Q_π(s, a)
                = (ε/|A|) ∑_{a∈A} Q_π(s, a) + (1 − ε) max_{a∈A} Q_π(s, a)
                ≥ (ε/|A|) ∑_{a∈A} Q_π(s, a) + (1 − ε) ∑_{a∈A} [(π(a|s) − ε/|A|) / (1 − ε)] Q_π(s, a)
                = ∑_{a∈A} π(a|s) Q_π(s, a) = V_π(s)

• Therefore, from the policy improvement theorem, V_{π′}(s) ≥ V_π(s)


26
Monte-Carlo Policy Iteration

• Policy evaluation: Monte-Carlo policy evaluation, Q = Q_π


• Policy improvement: 𝜖-greedy policy improvement

27
Monte-Carlo Control

Every episode:
• Policy evaluation: Monte-Carlo policy evaluation, Q ≈ Q_π
• Policy improvement: 𝜖-greedy policy improvement

28
On-Policy First-Visit MC Control (without exploring starts)

29
GLIE
Definition (Greedy in the Limit with Infinite Exploration (GLIE))
• All state-action pairs are explored infinitely many times,
  lim_{k→∞} N_k(s, a) = ∞
• The policy converges on a greedy policy,
  lim_{k→∞} π_k(a|s) = 1(a = argmax_{a′∈A} Q_k(s, a′))
• For example, ε-greedy is GLIE if ε reduces to zero at ε_k = 1/k

Theorem
GLIE model-free control converges to the optimal action-value function, Q_t → Q*

30
GLIE Monte-Carlo Control

• Sample the k-th episode using π: {S_1, A_1, R_2, …, S_T} ~ π
• For each state S_t and action A_t in the episode,
  N(S_t, A_t) ← N(S_t, A_t) + 1
  Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) [G_t − Q(S_t, A_t)]
• Improve the policy based on the new action-value function
  ε ← 1/k,  π ← ε-greedy(Q)

Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, Q(s, a) → Q*(s, a)
31
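A sketch of GLIE Monte-Carlo control using the incremental update and ε_k = 1/k schedule above; the environment interface (env.reset() → state, env.step(a) → (next_state, reward, done)), the tabular encoding, and the episode budget are assumptions for illustration.

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, gamma=1.0, n_episodes=10_000, seed=0):
    """GLIE Monte-Carlo control with first-visit incremental updates and eps_k = 1/k."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                   # GLIE schedule
        # generate one episode with the eps-greedy policy w.r.t. the current Q
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # backward pass to compute returns, then first-visit incremental updates
        G, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append((s, a, G))
        seen = set()
        for (s, a, G) in reversed(returns):             # forward (time) order for first-visit check
            if (s, a) not in seen:
                seen.add((s, a))
                N[s, a] += 1
                Q[s, a] += (G - Q[s, a]) / N[s, a]      # Q <- Q + (G - Q) / N
    return Q
```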
Summary So Far

• MC has several advantages over DP


• Can learn directly from interaction with environment
• No need for full models
• No need to learn about ALL states (no bootstrapping)
• MC methods provide an alternative policy evaluation process
• One issue to watch for: maintaining sufficient exploration:
• Exploring starts, soft policies

32
Off-Policy MC
On and Off-Policy Learning

• On-policy Learning
• “Learn on the job”
• Learn about behavior policy 𝜋 from experience sampled from 𝜋
• On-policy methods attempt to evaluate or improve the policy that is used to make decisions.
• Off-policy Learning
• “Look over someone’s shoulder”
• Learn about target policy 𝜋 from experience sampled from 𝜇
• Off-policy methods evaluate or improve a policy different from that
used to generate the data.
34
On and Off-Policy Learning

• On-policy Learning
• Learn about behavior policy 𝜋 from experience sampled from 𝜋
• Off-policy Learning
• Learn about target policy 𝜋 from experience sampled from 𝜇
• Learn “counterfactually” about other things you could do: “what if…?”
  • E.g., “What if I turned left?” => new observations, rewards?
  • E.g., “What if I played more defensively?” => different win probability?
  • E.g., “What if I kept going forward?” => how long until I bump into a wall?

35
Monte Carlo Control without Exploring Starts

• On-policy method: on-policy methods attempt to evaluate or improve the policy that is used to make decisions.
• Off-policy method: Off-policy methods evaluate or improve a policy
different from that used to generate the data.

• Q: Monte Carlo ES method is an on-policy or off-policy method?


• A: On-policy!

36
Off-Policy Methods

• Evaluate the target policy π(a|s) to compute V_π(s) or Q_π(s, a)
• While using a behavior policy μ(a|s) to generate actions
• Why is this important?
• Learn from observing humans or other agents (e.g., from logged data)
• Re-use experience from old policies (e.g., from your own past experience)
• Learn about multiple policies while following one policy
• Learn about greedy policy while following exploratory policy
• Q-learning estimates the value of the greedy policy (More details in the
following lectures)
• Acting greedy all the time would not explore sufficiently
37
Off-Policy Methods

• Key Question:
• Can we average returns as before to obtain the value function of π?
• Idea: Importance Sampling:
• Weight each return by the ratio of the probabilities of the trajectory
under the two policies.

38
Background: Estimating Expectations

• General idea: draw independent samples z^(1), …, z^(N) from the distribution p(z) to approximate the expectation:
  𝔼[f] = ∫ f(z) p(z) dz ≈ (1/N) ∑_{n=1}^N f(z^(n)) = f̂
  Note that 𝔼[f̂] = 𝔼[f]
• So the estimator has the correct mean (unbiased)
• The variance decreases as 1/N:  var[f̂] = (1/N) 𝔼[(f − 𝔼[f])²]
• Remark: the accuracy of the estimator does not depend on the dimensionality of z.


39
Background: Importance Sampling

• Suppose we have an easy-to-sample proposal distribution q(z), such that q(z) > 0 whenever p(z) > 0
  𝔼[f] = ∫ f(z) p(z) dz
       = ∫ f(z) [p(z)/q(z)] q(z) dz
       ≈ (1/N) ∑_{n=1}^N [p(z^(n))/q(z^(n))] f(z^(n)),   z^(n) ~ q(z)
• This is useful when we can evaluate the probability p but it is hard to sample from it
• The quantities w^(n) = p(z^(n))/q(z^(n)) are known as importance weights
40
Background: Importance Sampling Summary

Summary
• Estimate the expectation of a function
  𝔼_{x~P}[f(x)] = ∫ f(x) P(x) dx ≈ (1/n) ∑_i f(x_i)
• But sometimes it is difficult to sample x from P(x); then we can sample x from another distribution Q(x) and correct with the weight
  𝔼_{x~P}[f(x)] = ∫ f(x) P(x) dx = ∫ Q(x) [P(x)/Q(x)] f(x) dx ≈ (1/n) ∑_i [P(x_i)/Q(x_i)] f(x_i)

41
Background: Importance Sampling

• Using importance sampling to reduce the error and variance of Monte Carlo simulations.
Example
• We want to find the probability that a random variable X drawn from the standard normal distribution is greater than 3.
  P(X > 3) = ∫_3^∞ f_X(t) dt = (1/√(2π)) ∫_3^∞ e^{−t²/2} dt
https://acme.byu.edu/0000017a-1bb8-db63-a97e-7bfa0bea0000/vol1lab16montecarlo2-pdf
42
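A small sketch of this example, comparing naive Monte Carlo with importance sampling; the choice of a N(4, 1) proposal is an assumption for illustration (any q with mass on the tail works).

```python
import numpy as np

def phi(x, mu=0.0, sigma=1.0):
    """Normal probability density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
N = 100_000

# Naive Monte Carlo: almost all samples fall below 3, so the estimate is very noisy
x = rng.standard_normal(N)
naive = np.mean(x > 3)

# Importance sampling with proposal q = N(4, 1), which puts mass on the tail of interest
z = rng.normal(4.0, 1.0, size=N)
w = phi(z, 0.0, 1.0) / phi(z, 4.0, 1.0)      # importance weights p(z)/q(z)
is_est = np.mean((z > 3) * w)

print(f"naive={naive:.6f}  importance={is_est:.6f}  (true value ≈ 0.001350)")
```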
Background: Importance Sampling

https://acme.byu.edu/0000017a-1bb8-db63-a97e-7bfa0bea0000/vol1lab16montecarlo2-pdf
43
Importance Sampling for Off-Policy RL

• Estimate the expectation of the return using trajectories sampled from another policy (the behavior policy)
  𝔼_{τ~π}[G(τ)] = ∫ π(τ) G(τ) dτ
               = ∫ μ(τ) [π(τ)/μ(τ)] G(τ) dτ
               = 𝔼_{τ~μ}[ (π(τ)/μ(τ)) G(τ) ]
               ≈ (1/n) ∑_i [π(τ_i)/μ(τ_i)] G(τ_i)

44
Importance Sampling Ratio

• Given a starting state S_t, the probability of the subsequent state-action trajectory A_t, S_{t+1}, A_{t+1}, …, S_T occurring under any policy π is
  P(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_{t:T−1} ~ π)
    = π(A_t|S_t) P(S_{t+1}|S_t, A_t) π(A_{t+1}|S_{t+1}) ⋯ P(S_T|S_{T−1}, A_{T−1})
    = ∏_{k=t}^{T−1} π(A_k|S_k) P(S_{k+1}|S_k, A_k)
• Importance sampling ratio ρ
  ρ_{t:T−1} = [∏_{k=t}^{T−1} π(A_k|S_k) P(S_{k+1}|S_k, A_k)] / [∏_{k=t}^{T−1} μ(A_k|S_k) P(S_{k+1}|S_k, A_k)] = ∏_{k=t}^{T−1} π(A_k|S_k) / μ(A_k|S_k)
45
Importance Sampling

• We wish to estimate the expected returns (values) under the target policy, but all we have are returns G_t due to the behavior policy.
  𝔼[G_t | S_t = s] = V_μ(s)
  vs.
  𝔼[ρ_{t:T−1} G_t | S_t = s] = V_π(s)

46
Importance Sampling

• Ordinary importance sampling forms the estimate:
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / |𝒯(s)|
  where T(t) is the first time of termination following time t, and G_t is the return after t up through T(t)
• Every-visit method: 𝒯(s) is the set of all time steps in which state s is visited
• First-visit method: 𝒯(s) includes only time steps that were first visits to s within their episodes

47
Importance Sampling

• Ordinary importance sampling forms the estimate:
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / |𝒯(s)|
• New notation: time steps increase across episode boundaries

48
Importance Sampling Ratio

• All importance sampling ratios have expected value 1:
  𝔼_{A_k~μ}[π(A_k|S_k)/μ(A_k|S_k)] = ∑_a μ(a|S_k) π(a|S_k)/μ(a|S_k) = ∑_a π(a|S_k) = 1
• Note: importance sampling can have high (or infinite) variance
• Consider the estimates of the first-visit method after observing a single return from state s

49
Two Types of Importance Sampling

• Ordinary Importance Sampling
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / |𝒯(s)|
• Weighted Importance Sampling
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / ∑_{t∈𝒯(s)} ρ_{t:T(t)−1}
• Weighted IS is a biased estimator
  • For the first-visit method with a single return, its expectation is V_μ(s) rather than V_π(s).
• Ordinary IS is an unbiased estimator
  • For the first-visit method, its expectation is always V_π(s)
50
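A short sketch contrasting the two estimators, given per-episode returns G_t and ratios ρ_{t:T(t)−1} for first visits to a fixed state s; the example numbers are illustrative only.

```python
import numpy as np

def ordinary_is(returns, rhos):
    """Ordinary IS: sum of rho*G divided by the number of returns (unbiased, high variance)."""
    returns, rhos = np.asarray(returns), np.asarray(rhos)
    return np.sum(rhos * returns) / len(returns)

def weighted_is(returns, rhos):
    """Weighted IS: sum of rho*G divided by the sum of rhos (biased, lower variance)."""
    returns, rhos = np.asarray(returns), np.asarray(rhos)
    denom = np.sum(rhos)
    return np.sum(rhos * returns) / denom if denom > 0 else 0.0

# Example: three first visits to s, with returns G_t and ratios rho_{t:T-1}
G = [1.0, 0.0, 1.0]
rho = [10.0, 0.5, 0.0]
print(ordinary_is(G, rho))   # a single large ratio inflates the ordinary estimate
print(weighted_is(G, rho))   # the weighted estimate stays within the range of observed returns
```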
Two Types of Importance Sampling

• Ordinary Importance Sampling
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / |𝒯(s)|
• Weighted Importance Sampling
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / ∑_{t∈𝒯(s)} ρ_{t:T(t)−1}
• The variance of ordinary IS is in general unbounded, whereas in the weighted estimator the largest weight on any single return is one.
• Suppose the ratio were ten: the ordinary importance-sampling estimate would be ten times the observed return.
51
Off-Policy MC Prediction

Q: Is the ordinary importance sampling or weighted importance sampling?


52
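If the missing pseudocode is the off-policy MC prediction algorithm from Sutton & Barto (Chapter 5), it is the weighted version, implemented incrementally with cumulative weights C(s, a). A sketch under that assumption; the episode format and the probability functions are assumptions for illustration.

```python
import numpy as np

def offpolicy_mc_prediction(episodes, n_states, n_actions, target_prob, behavior_prob, gamma=1.0):
    """Incremental weighted-IS off-policy MC prediction for Q ~ Q_pi.
    `episodes` is a list of [(s, a, r), ...]; target_prob(a, s) and behavior_prob(a, s)
    return pi(a|s) and mu(a|s)."""
    Q = np.zeros((n_states, n_actions))
    C = np.zeros((n_states, n_actions))      # cumulative sum of importance weights
    for episode in episodes:
        G, W = 0.0, 1.0
        for (s, a, r) in reversed(episode):  # work backwards from the end of the episode
            G = gamma * G + r
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
            W *= target_prob(a, s) / behavior_prob(a, s)
            if W == 0.0:                     # all remaining updates would be zero
                break
    return Q
```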
Recap: Blackjack Example

• State (200 of them)


• Current sum (12-21)
• Dealer’s showing card (ace-10)
• Do I have a “useable” ace (yes-no)

• Action stick: Stop receiving cards (and terminate)

• Action draw: Take another card (no replacement)

• Reward for stick:


• +1/0/-1 if sum of cards >/=/< sum of dealer cards

• Reward for draw:


• -1 for sum of cards > 21 (and terminate), 0
otherwise

• Transitions: automatically draw if sum of cards <12


53
Example: Off-policy Estimation of a Blackjack State Value

• The dealer is showing a deuce, the sum of the player’s cards is 13.

Book Page 106 (S. & B.)


54
Example: Infinite Variance

• Off-policy First-visit MC Prediction (ordinary importance sampling)

Book Page 107 (S. & B.)


55
Example: Infinite Variance

Variance
• Var[X] = 𝔼[(X − X̄)²] = 𝔼[X²] − X̄²
• 𝔼_μ[ ( ∏_{t=0}^{T−1} π(A_t|S_t)/μ(A_t|S_t) · G_0 )² ]
  = (1/2) · 0.1 · (1/0.5)²                                              (the length-1 episodes)
  + (1/2) · 0.9 · (1/2) · 0.1 · (1/0.5 · 1/0.5)²                        (the length-2 episodes)
  + (1/2) · 0.9 · (1/2) · 0.9 · (1/2) · 0.1 · (1/0.5 · 1/0.5 · 1/0.5)²  (the length-3 episodes)
  + ⋯
  = 0.1 ∑_{k=0}^∞ 0.9^k · 2^k · 2 = 0.2 ∑_{k=0}^∞ 1.8^k = ∞

56
Off-Policy MC Control

Q: Why does the inner loop exit early?


57
So Far

• MC has several advantages over DP:


• Can learn directly from interaction with environment
• No need for full models
• MC methods provide an alternate policy evaluation process
• One issue to watch for: maintaining sufficient exploration
• Looked at distinction between on-policy and off-policy methods

58
TD Control
MC vs. TD Control

• Temporal-difference (TD) learning has several advantages over Monte-


Carlo (MC)
• Lower variance
• Online
• Incomplete sequences
• Natural idea: Use TD instead of MC in our control loop
• Apply TD to 𝑄(𝑆, 𝐴)
• Use 𝜖-greedy policy improvement
• Update every time step

60
Updating Action-value Functions with SARSA

Q(S, A) ← Q(S, A) + α [R + γ Q(S′, A′) − Q(S, A)]

61
On-Policy Control with SARSA

Every time-step:
• Policy evaluation: SARSA, Q ≈ Q_π
• Policy improvement: 𝜖-greedy policy improvement

62
SARSA Algorithm for On-Policy Control

63
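A sketch of the tabular SARSA control loop with ε-greedy action selection; the env.reset()/env.step(a) → (next_state, reward, done) interface and the hyperparameters are assumptions for illustration.

```python
import numpy as np

def sarsa(env, n_states, n_actions, alpha=0.1, gamma=1.0, epsilon=0.1,
          n_episodes=500, seed=0):
    """Tabular SARSA: on-policy TD control with eps-greedy action selection."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next) if not done else 0   # next action (unused when terminal)
            # SARSA update: bootstrap from the action actually taken next
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```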
Convergence of SARSA
Theorem
SARSA converges to the optimal action-value function, 𝑄 𝑠, 𝑎 → 𝑄∗ 𝑠, 𝑎 ,
under the following conditions:
• GLIE sequence of policies π_t(a|s)
• Robbins-Monro sequence of step sizes α_t:
  ∑_{t=1}^∞ α_t = ∞,   ∑_{t=1}^∞ α_t² < ∞

Convergence Results for Single-step On-Policy Reinforcement-Learning Algorithms. Machine Learning, 2000.

64
Example: Windy Gridworld Example

• Reward = −1 per time-step until reaching goal


• Undiscounted

65
SARSA on the Windy Gridworld Example

Q: Can a policy result in infinite loops? What will MC policy iteration do then?
• If the policy leads to a loop over states, MC control gets trapped because the episode never terminates
• Instead, TD control can continually update the state-action values during the episode and switch to a different policy

66
𝑛-step SARSA

• Consider the following n-step Q-returns for n = 1, 2, …, ∞:

  n = 1 (SARSA)  Q_t^(1) = R_{t+1} + γ Q(S_{t+1}, A_{t+1})
  n = 2          Q_t^(2) = R_{t+1} + γ R_{t+2} + γ² Q(S_{t+2}, A_{t+2})
  ⋮
  n = ∞ (MC)     Q_t^(∞) = R_{t+1} + γ R_{t+2} + ⋯ + γ^{T−1} R_T

• Define the n-step Q-return
  Q_t^(n) = R_{t+1} + γ R_{t+2} + ⋯ + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})
• n-step SARSA updates Q(s, a) towards the n-step Q-return
  Q(S_t, A_t) ← Q(S_t, A_t) + α [Q_t^(n) − Q(S_t, A_t)]
67
Forward View SARSA(𝜆)

• The Q^λ return combines all n-step Q-returns Q_t^(n)
• Using weight (1 − λ) λ^{n−1}
  Q_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} Q_t^(n)
• Forward-view SARSA(λ)
  Q(S_t, A_t) ← Q(S_t, A_t) + α [Q_t^λ − Q(S_t, A_t)]

68
Off-Policy TD Control
Recap: Importance Sampling for Off-Policy MC

Off-Policy Monte-Carlo
• Multiple importance sampling corrections along the whole episode
  ρ_{t:T−1} = [∏_{k=t}^{T−1} π(A_k|S_k) P(S_{k+1}|S_k, A_k)] / [∏_{k=t}^{T−1} μ(A_k|S_k) P(S_{k+1}|S_k, A_k)] = ∏_{k=t}^{T−1} π(A_k|S_k) / μ(A_k|S_k)
• Update the value towards the corrected return
  V(S_t) ← V(S_t) + α [ρ_{t:T−1} G_t − V(S_t)]

72
Importance Sampling for Off-Policy TD

Off-Policy TD
• Weight TD target 𝑅 + 𝛾𝑉(𝑆′) by importance sampling
• Only need a single importance sampling correction

  V(S_t) ← V(S_t) + α [ (π(A_t|S_t) / μ(A_t|S_t)) (R_{t+1} + γ V(S_{t+1})) − V(S_t) ]

• Much lower variance than Monte-Carlo importance sampling


• Policies only need to be similar over a single step

73
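A one-step sketch of this update; pi_prob and mu_prob stand in for π(a|s) and μ(a|s) and, like the tabular V, are assumptions for illustration.

```python
def offpolicy_td_update(V, s, a, r, s_next, alpha, gamma, pi_prob, mu_prob, done=False):
    """Off-policy TD(0) with a single importance-sampling correction on the TD target."""
    rho = pi_prob(a, s) / mu_prob(a, s)                      # pi(a|s) / mu(a|s)
    target = rho * (r + (0.0 if done else gamma * V[s_next]))
    V[s] += alpha * (target - V[s])
    return V
```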
Importance Sampling for Off-Policy TD Updates

𝔼_μ[ (π(A_t|S_t)/μ(A_t|S_t)) (R_{t+1} + γ V(S_{t+1})) − V(S_t) | S_t = s ]
  = ∑_a μ(a|s) [ (π(a|s)/μ(a|s)) 𝔼[R_{t+1} + γ V(S_{t+1}) | S_t = s, A_t = a] − V(s) ]
  = ∑_a π(a|s) 𝔼[R_{t+1} + γ V(S_{t+1}) | S_t = s, A_t = a] − ∑_a μ(a|s) V(s)
  = ∑_a π(a|s) 𝔼[R_{t+1} + γ V(S_{t+1}) | S_t = s, A_t = a] − ∑_a π(a|s) V(s)
  = ∑_a π(a|s) [ 𝔼[R_{t+1} + γ V(S_{t+1}) | S_t = s, A_t = a] − V(s) ]
  = 𝔼_π[R_{t+1} + γ V(S_{t+1}) − V(s) | S_t = s]
74
Q-Learning

• We now consider off-policy learning of action-values Q(S, A)
• No importance sampling is required
• The next action is chosen using the behavior policy, A_{t+1} ~ μ(·|S_t)
• But we consider an alternative successor action A′ ~ π(·|S_t)
• And update Q(S_t, A_t) towards the value of the alternative action
  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A′) − Q(S_t, A_t)]

75
Q-Learning Control Algorithm

Q(S, A) ← Q(S, A) + α [R + γ max_{a′} Q(S′, a′) − Q(S, A)]

77
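A sketch of tabular Q-learning with an ε-greedy behavior policy, under the same assumed environment interface and illustrative hyperparameters as the SARSA sketch above.

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=1.0, epsilon=0.1,
               n_episodes=500, seed=0):
    """Tabular Q-learning: off-policy TD control, bootstrapping from max_a' Q(S', a')."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # behavior policy: eps-greedy with respect to the current Q
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # target uses the greedy (target-policy) value of the next state
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```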
Q-Learning Algorithm for Off-Policy Control

• Q-Learning:  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

• SARSA:       Q(S, A) ← Q(S, A) + α [R + γ Q(S′, A′) − Q(S, A)]
78
Why don't we use importance sampling in Q-Learning?

• Off-Policy TD:  V(S_t) ← V(S_t) + α [ (π(A_t|S_t)/μ(A_t|S_t)) (R_{t+1} + γ V(S_{t+1})) − V(S_t) ]
• Short answer: because Q-Learning does not make expected-value estimates over the policy distribution. For the full answer click here.
• Remember the Bellman optimality backup from value iteration
  Q(s, a) = R(s, a) + γ ∑_{s′∈S} P(s′ | s, a) max_{a′} Q(s′, a′)
• Q-Learning can be considered a sample-based version of value iteration, except that instead of taking the expected value over the transition dynamics, we use the sample collected from the environment
  Q(s, a) ← R(s, a) + γ max_{a′} Q(s′, a′)
• Q-learning's expectation is over the transition distribution, not over the policy distribution, so there is no need to correct for a different policy distribution
79
Expected SARSA

• Instead of the sample value-of-the-next-state, use the expectation!

  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ 𝔼_π[Q(S_{t+1}, A_{t+1}) | S_{t+1}] − Q(S_t, A_t)]
              ← Q(S_t, A_t) + α [R_{t+1} + γ ∑_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)]

• Expected SARSA performs better than SARSA (but costs more)


• Q: Why?
• Q: Is expected SARSA on policy or off policy?
• Q: What if 𝜋 is the greedy deterministic policy?
• Sometimes called “General Q-learning”
80
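A sketch of the expected-SARSA update for an ε-greedy target policy; choosing ε-greedy for π here is an assumption for illustration.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon, done=False):
    """Expected SARSA: bootstrap from the expectation of Q(S', .) under an eps-greedy pi."""
    n_actions = Q.shape[1]
    if done:
        expected_next = 0.0
    else:
        probs = np.full(n_actions, epsilon / n_actions)
        probs[np.argmax(Q[s_next])] += 1.0 - epsilon     # eps-greedy pi(a|S')
        expected_next = float(np.dot(probs, Q[s_next]))
    Q[s, a] += alpha * (r + gamma * expected_next - Q[s, a])
    return Q
```

With ε = 0 the target policy is greedy and the expectation reduces to max_a Q(S′, a), i.e. the Q-learning target, which connects to the last question above.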
Q̂-Learning

• If we do not follow the optimal policy faithfully, and still want to model exploration:
  • “S” is the starting state, “G” indicates the goal state, which gives a reward of 9 units
  • Exploration: 30% of the time the agent took a random action
  • γ = 0.9, β = 0.5

When the best move isn't optimal: Q-learning with exploration. In AAAI, 1994.

81
Example: SARSA vs. Q-Learning

83
Example: Cliff Walking

84
Relationship Between DP and TD

85
Relationship Between DP and TD

86
Q-Learning Variants
Maximization Bias

• We often need to maximize over our value estimates. The estimated maxima suffer from maximization bias
• Consider a state for which Q*(s, a) = 0 for all actions a. Our estimates Q(s, a) are uncertain: some are positive and some negative
• Intuitively (related to Jensen's inequality):
  𝔼[max_i μ̂_i] ≥ max_i 𝔼[μ̂_i]
• This is because we use the same estimate Q both to choose the argmax and to evaluate it

88
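A tiny numerical check of this inequality: the true value of every action is 0, yet the max over noisy estimates is clearly positive on average. The noise scale and number of actions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_samples_per_estimate, n_trials = 10, 5, 10_000

# True Q*(s, a) = 0 for every action; each estimate is a small-sample average of noisy rewards
estimates = rng.normal(0.0, 1.0, size=(n_trials, n_actions, n_samples_per_estimate)).mean(axis=2)

print("E[max_a Q_hat(s, a)] ≈", estimates.max(axis=1).mean())   # clearly positive
print("max_a E[Q_hat(s, a)] ≈", estimates.mean(axis=0).max())   # close to 0
```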
Double Q-Learning

• Train two action-value functions, Q1 and Q2


• Do Q-learning on both, but
• Never on the same time steps (Q1 and Q2 are independent)
• Pick Q1 or Q2 at random to be updated on each step
• If updating Q1, use Q2 for the value of the next state:
• Q₁(S_t, A_t) ← Q₁(S_t, A_t) + α [R_{t+1} + γ Q₂(S_{t+1}, argmax_a Q₁(S_{t+1}, a)) − Q₁(S_t, A_t)]

• Action selections are 𝜖-greedy with respect to the sum of Q1 and Q2

89
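A sketch of the tabular double Q-learning step and its behavior policy as described above; the tabular setup mirrors the earlier sketches and is an assumption for illustration.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng, done=False):
    """One double Q-learning step: flip a coin to decide which table to update,
    select the argmax with that table, and evaluate it with the other one."""
    if rng.random() < 0.5:
        A, B = Q1, Q2
    else:
        A, B = Q2, Q1
    best = int(np.argmax(A[s_next]))
    target = r + (0.0 if done else gamma * B[s_next, best])
    A[s, a] += alpha * (target - A[s, a])

def double_q_action(Q1, Q2, s, epsilon, rng):
    """Behavior policy: eps-greedy with respect to Q1 + Q2."""
    n_actions = Q1.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q1[s] + Q2[s]))
```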
Double Tabular Q-Learning

90
Double Q-Learning

Double Q-learning. In NIPS, 2010.


Double-Q estimate ≤ max_i 𝔼[X_i] ≤ 𝔼[max_i X_i] (single-Q estimate)
91
Double Q-Learning vs. Q-Learning

92
Example: Roulette

• Roulette: gambling game


• Here, 171 actions: bet $1 on one of 170 options, or ”stop”
• ”Stop” ends the episode, with $0
• All other actions have high-variance rewards, with negative expected value
• Betting actions do not end the episode, instead can bet again

93
Example: Q-Learning vs. Double Q-Learning

94
Extra Reading Materials

• Chapter 5, Monte Carlo Methods, Reinforcement Learning: An


Introduction, Sutton & Barto
• Chapter 6, Temporal-Difference Learning, Reinforcement Learning: An
Introduction, Sutton & Barto
• Double Q-learning. In NIPS, 2010.

96
Thanks & Q&A
