
Some slides are from: David Silver (DeepMind), Katerina Fragkiadaki (CMU),

Emma Brunskill (Stanford), Bolei Zhou (UCLA), Hado van Hasselt (DeepMind)

COMP 4901Z: Reinforcement Learning


2.2 Model-Free Control

Long Chen (Dept. of CSE)


Model-free Reinforcement Learning

• Last lecture
• Model-free prediction
• Estimate the value function of an unknown MDP
• This lecture
• Model-free control
• Optimize the value function of an unknown MDP

2
Recap: DP vs. MC vs. TD Learning
MC: Sample average return
• Remember: approximates the expectation
  V_π(s) = 𝔼_π[G_t | S_t = s]
         = 𝔼_π[∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]
         = 𝔼_π[R_{t+1} + γ ∑_{k=0}^∞ γ^k R_{t+k+2} | S_t = s]
         = 𝔼_π[R_{t+1} + γ V_π(S_{t+1}) | S_t = s]

TD: combines both: sample expected values and use a current estimate V(S_{t+1}) of the true V_π(S_{t+1})
DP: the expected values are provided by a model, but we use a current estimate V(S_{t+1}) of the true V_π(S_{t+1})

3
Recap: Monte-Carlo Backup

V(S_t) ← V(S_t) + α [G_t − V(S_t)]

4
Recap: Temporal-Difference Backup

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

5
Recap: Dynamic Programming Backup

V(S_t) ← 𝔼_π[R_{t+1} + γ V(S_{t+1})]

6
Recap: n-Step Return

• Consider the following n-step returns for n = 1, 2, …, ∞:

  n = 1 (TD)   G_t^(1) = R_{t+1} + γ V(S_{t+1})
  n = 2        G_t^(2) = R_{t+1} + γ R_{t+2} + γ² V(S_{t+2})
  ⋮
  n = ∞ (MC)   G_t^(∞) = R_{t+1} + γ R_{t+2} + ⋯ + γ^{T−1} R_T

• Define the n-step return
  G_t^(n) = R_{t+1} + γ R_{t+2} + ⋯ + γ^{n−1} R_{t+n} + γ^n V(S_{t+n})
• n-step temporal-difference learning
  V(S_t) ← V(S_t) + α [G_t^(n) − V(S_t)]
7
Recap: 𝜆-return

• The λ-return G_t^λ combines all n-step returns G_t^(n)
• Using weight (1 − λ) λ^{n−1}
  G_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} G_t^(n)
• Forward-view TD(λ)
  V(S_t) ← V(S_t) + α [G_t^λ − V(S_t)]

8
2.2 Model-Free Control
Outline

• Introduction
• On-Policy Monte-Carlo Control
• Off-Policy Monte-Carlo Control
• On-Policy Temporal-Difference Learning
• Off-Policy Temporal-Difference Learning
• Summary

10
Uses of Model-Free Control

• Some example problems that can be modelled as MDPs


• Elevator • Robocup Soccer
• Parallel Parking • Quake
• Ship Steering • Portfolio Management
• Bioreactor • Protein Folding
• Helicopter • Robot Walking
• Aeroplane Logistics • Game of Go

• For most of these problems, either:


• MDP model is unknown, but experience can be sampled
• MDP model is known, but is too big to use, except by samples
• Model-free control can solve these problems.
11
Monte-Carlo Control
Recap: Generalized Policy Iteration

• Policy evaluation: estimate V_π
  • Iterative policy evaluation
• Policy improvement: generate π′ ≥ π
  • Greedy policy improvement

13
Generalized Policy Iteration with Monte-Carlo Evaluation

• Monte Carlo version of policy iteration

• Policy evaluation: Monte-Carlo policy evaluation, V = V_π?


• Policy improvement: Greedy policy improvement?

14
Model-Free Policy Iteration using Action-Value Function

• There are two types of value functions (e.g., V(s) and Q(s, a)); which one to use for policy improvement?
• Greedy policy improvement over V(s) requires a model of the MDP
  π′(s) = argmax_{a∈A} [R(s, a) + γ ∑_{s′} P(s′ | s, a) V(s′)]
• Greedy policy improvement over Q(s, a) is model-free
  π′(s) = argmax_{a∈A} Q(s, a)

15
Generalized Policy Iteration with Action-Value Function

• Policy evaluation: Monte-Carlo policy evaluation, Q = Q_π


• Policy improvement: Greedy policy improvement?

16
Convergence of MC Control

• The greedified policy meets the conditions for policy improvement:
• For all s ∈ S:
  Q_{π_k}(s, π_{k+1}(s)) = Q_{π_k}(s, argmax_a Q_{π_k}(s, a))
                         = max_a Q_{π_k}(s, a)
                         ≥ Q_{π_k}(s, π_k(s))
                         ≥ V_{π_k}(s)
• And thus π_{k+1} ≥ π_k
• This assumes exploring starts and an infinite number of episodes for MC policy evaluation
17
Monte Carlo Control

• Generalized Policy Iteration


  π_0 →(E) Q_{π_0} →(I) π_1 →(E) Q_{π_1} →(I) π_2 →(E) ⋯ →(I) π_* →(E) Q_*
  (E = policy evaluation, I = policy improvement)
• Two unlikely assumptions are needed to obtain the convergence guarantee for Monte-Carlo methods:
  1. The episodes have exploring starts
  2. Policy evaluation is done with an infinite number of episodes
• In DP there are two ways to handle the second issue (the same apply to MC):
  • Take enough steps in each policy evaluation to make the error bounds sufficiently small (requires too many episodes to be practical)
  • Similar to value iteration, stop early and only move the value function towards Q_{π_k}
18
Recap: Greedy Policy

• For any action-value function Q, the corresponding greedy policy is the one that, for each s, deterministically chooses an action with maximal action value:
  π(s) = argmax_a Q(s, a)
• Policy improvement can then be done by constructing each π_{k+1} as the greedy policy with respect to Q_{π_k}

19
MC Estimation of Action Values 𝑄

• Monte Carlo (MC) is most useful when a model is not available


• We want to learn Q*(s, a) because then we can get an optimal policy without knowing the dynamics.
• Q_π(s, a): average return starting from state s and action a, then following π
  Q_π(s, a) = 𝔼_π[R_{t+1} + γ V_π(S_{t+1}) | S_t = s, A_t = a]
            = ∑_{s′, r} P(s′, r | s, a) [r + γ V_π(s′)]

• Converges asymptotically if every state-action pair is visited.


• Q: Is that possible if we are using a deterministic policy?

20
Recap: The Exploration Problem

Example of Greedy Action Selection


• There are two doors in front of you.
• You open the left door and get reward 0
𝑉 𝑙𝑒𝑓𝑡 = 0
• You open the right door and get reward +1
𝑉 𝑟𝑖𝑔ℎ𝑡 = +1
• You open the right door and get reward +3
𝑉 𝑟𝑖𝑔ℎ𝑡 = +2


• Are you sure you’ve chosen the best door?
21
Recap: The Exploration Problem

• If we always follow the deterministic policy to collect experience, we will never have the opportunity to see and evaluate (estimate Q for) alternative actions…
• ALL learning methods face a dilemma: they seek to learn action values conditioned on subsequent optimal behavior, but they need to act suboptimally in order to explore all actions (to discover the optimal actions). This is the exploration-exploitation dilemma.
• Q: Does a learning algorithm know when the optimal policy has been reached to
stop exploring?

22
Recap: The Exploration Problem

• If we always follow the deterministic policy to collect experience, we will never have the opportunity to see and evaluate (estimate Q for) alternative actions…
• ALL learning methods face a dilemma: they seek to learn action values conditioned on subsequent optimal behavior, but they need to act suboptimally in order to explore all actions (to discover the optimal actions). This is the exploration-exploitation dilemma.
• Solutions:
  1. Exploring starts: every state-action pair has a non-zero probability of being the starting pair
  2. Give up on deterministic policies and only search over ε-soft policies
  3. Off-policy: use a different policy to collect experience than the one you care to evaluate
23
Monte Carlo ES (Exploring Starts)

Q: The pseudocode is inefficient, how can we further improve the efficiency?

24
Monte Carlo with 𝜖-Greedy Exploration

• Trade-off between exploration and exploitation


• 𝝐-greedy exploration: ensuring continual exploration
• All actions are tried with non-zero probability
• With probability 1 − 𝜖 choose the greedy action
• With probability 𝜖 choose an action at random

  π(a|s) = ε/|A| + 1 − ε   if a = argmax_{a′∈A} Q(s, a′)
           ε/|A|           otherwise

25
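As a small illustration of the rule above, here is a minimal sketch of ε-greedy action selection over a tabular Q; the array shapes, tie-breaking, and example numbers are assumptions for illustration, not from the slides.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Pick an action for `state` from a tabular Q of shape [n_states, n_actions]:
    with probability epsilon choose uniformly at random, otherwise act greedily.
    Because the random branch can also pick the greedy action, the greedy action
    ends up with probability epsilon/|A| + 1 - epsilon, matching the slide."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit (ties broken by first max)

rng = np.random.default_rng(0)
Q = np.array([[0.0, 1.0, 0.5]])               # one state, three actions (toy numbers)
print(epsilon_greedy_action(Q, state=0, epsilon=0.1, rng=rng))
```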
𝜖-Greedy Policy Improvement
Theorem
For any ε-greedy policy π, the ε-greedy policy π′ with respect to Q_π is an improvement, V_{π′}(s) ≥ V_π(s)

  Q_π(s, π′(s)) = ∑_{a∈A} π′(a|s) Q_π(s, a)
                = (ε/|A|) ∑_{a∈A} Q_π(s, a) + (1 − ε) max_{a∈A} Q_π(s, a)
                ≥ (ε/|A|) ∑_{a∈A} Q_π(s, a) + (1 − ε) ∑_{a∈A} [(π(a|s) − ε/|A|) / (1 − ε)] Q_π(s, a)
                = ∑_{a∈A} π(a|s) Q_π(s, a) = V_π(s)

• Therefore, from the policy improvement theorem, V_{π′}(s) ≥ V_π(s)


26
Monte-Carlo Policy Iteration

• Policy evaluation: Monte-Carlo policy evaluation, Q = Q_π


• Policy improvement: 𝜖-greedy policy improvement

27
Monte-Carlo Control

Every episode:
• Policy evaluation: Monte-Carlo policy evaluation, Q ≈ Q_π
• Policy improvement: 𝜖-greedy policy improvement

28
On-Policy First-Visit MC Control (without exploring starts)

29
GLIE
Definition (Greedy in the Limit with Infinite Exploration (GLIE))
• All state-action pairs are explored infinitely many times,
  lim_{k→∞} N_k(s, a) = ∞
• The policy converges on a greedy policy,
  lim_{k→∞} π_k(a|s) = 1(a = argmax_{a′∈A} Q_k(s, a′))
• For example, ε-greedy is GLIE if ε reduces to zero at ε_k = 1/k

Theorem
GLIE model-free control converges to the optimal action-value function, Q_t → Q*

30
GLIE Monte-Carlo Control

• Sample the k-th episode using π: {S_1, A_1, R_2, …, S_T} ~ π
• For each state S_t and action A_t in the episode,
  N(S_t, A_t) ← N(S_t, A_t) + 1
  Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) [G_t − Q(S_t, A_t)]
• Improve the policy based on the new action-value function
  ε ← 1/k,  π ← ε-greedy(Q)

Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, Q(s, a) → Q*(s, a)
31
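A sketch of GLIE Monte-Carlo control using the incremental update and ε_k = 1/k schedule above; the environment interface (env.reset() → state, env.step(a) → (next_state, reward, done)), the tabular encoding, and the episode budget are assumptions for illustration.

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, gamma=1.0, n_episodes=10_000, seed=0):
    """GLIE Monte-Carlo control with first-visit incremental updates and eps_k = 1/k."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                                   # GLIE schedule
        # generate one episode with the eps-greedy policy w.r.t. the current Q
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # backward pass to compute returns, then first-visit incremental updates
        G, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append((s, a, G))
        seen = set()
        for (s, a, G) in reversed(returns):             # forward (time) order for first-visit check
            if (s, a) not in seen:
                seen.add((s, a))
                N[s, a] += 1
                Q[s, a] += (G - Q[s, a]) / N[s, a]      # Q <- Q + (G - Q) / N
    return Q
```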
Summary So Far

• MC has several advantages over DP


• Can learn directly from interaction with environment
• No need for full models
• No need to learn about ALL states (no bootstrapping)
• MC methods provide an alternative policy evaluation process
• One issue to watch for: maintaining sufficient exploration:
• Exploring starts, soft policies

32
Off-Policy MC
On and Off-Policy Learning

• On-policy Learning
• “Learn on the job”
• Learn about behavior policy 𝜋 from experience sampled from 𝜋
• On-policy methods attempt to evaluate or improve the policy that is used to make decisions.
• Off-policy Learning
• “Look over someone’s shoulder”
• Learn about target policy 𝜋 from experience sampled from 𝜇
• Off-policy methods evaluate or improve a policy different from that
used to generate the data.
34
On and Off-Policy Learning

• On-policy Learning
• Learn about behavior policy 𝜋 from experience sampled from 𝜋
• Off-policy Learning
• Learn about target policy 𝜋 from experience sampled from 𝜇
• Learn “counterfactually” about other things you could do: “what if…?”
  • E.g., “What if I turned left?” => new observations, rewards?
  • E.g., “What if I played more defensively?” => different win probability?
  • E.g., “What if I kept going forward?” => how long until I bump into a wall?

35
Monte Carlo Control without Exploring Starts

• On-policy method: on-policy methods attempt to evaluate or improve the policy that is used to make decisions.
• Off-policy method: Off-policy methods evaluate or improve a policy
different from that used to generate the data.

• Q: Monte Carlo ES method is an on-policy or off-policy method?


• A: On-policy!

36
Off-Policy Methods

• Evaluate the target policy π(a|s) to compute V_π(s) or Q_π(s, a)
• While using a behavior policy μ(a|s) to generate actions
• Why is this important?
• Learn from observing humans or other agents (e.g., from logged data)
• Re-use experience from old policies (e.g., from your own past experience)
• Learn about multiple policies while following one policy
• Learn about greedy policy while following exploratory policy
• Q-learning estimates the value of the greedy policy (More details in the
following lectures)
• Acting greedy all the time would not explore sufficiently
37
Off-Policy Methods

• Key Question:
• Can we average returns as before to obtain the value function of π?
• Idea: Importance Sampling:
• Weight each return by the ratio of the probabilities of the trajectory
under the two policies.

38
Background: Estimating Expectations

• General idea: draw independent samples z^(1), …, z^(N) from the distribution p(z) to approximate the expectation:
  𝔼[f] = ∫ f(z) p(z) dz ≈ (1/N) ∑_{n=1}^N f(z^(n)) = f̂
  Note that 𝔼[f̂] = 𝔼[f]
• So the estimator has the correct mean (unbiased)
• The variance decreases as 1/N:  var[f̂] = (1/N) 𝔼[(f − 𝔼[f])²]
• Remark: the accuracy of the estimator does not depend on the dimensionality of z.


39
Background: Importance Sampling

• Suppose we have an easy-to-sample proposal distribution q(z), such that q(z) > 0 whenever p(z) > 0
  𝔼[f] = ∫ f(z) p(z) dz
       = ∫ f(z) [p(z)/q(z)] q(z) dz
       ≈ (1/N) ∑_{n=1}^N [p(z^(n))/q(z^(n))] f(z^(n)),   z^(n) ~ q(z)
• This is useful when we can evaluate the probability p but it is hard to sample from it
• The quantities w^(n) = p(z^(n))/q(z^(n)) are known as importance weights
40
Background: Importance Sampling Summary

Summary
• Estimate the expectation of a function
  𝔼_{x~P}[f(x)] = ∫ f(x) P(x) dx ≈ (1/n) ∑_i f(x_i)
• But sometimes it is difficult to sample x from P(x); then we can sample x from another distribution Q(x) and correct with the weight
  𝔼_{x~P}[f(x)] = ∫ f(x) P(x) dx = ∫ Q(x) [P(x)/Q(x)] f(x) dx ≈ (1/n) ∑_i [P(x_i)/Q(x_i)] f(x_i)

41
Background: Importance Sampling

• Using importance sampling to reduce the error and variance of Monte Carlo simulations.
Example
• We want to find the probability that a random variable X drawn from the standard normal distribution is greater than 3.
  P(X > 3) = ∫_3^∞ f_X(t) dt = (1/√(2π)) ∫_3^∞ e^{−t²/2} dt
https://acme.byu.edu/0000017a-1bb8-db63-a97e-7bfa0bea0000/vol1lab16montecarlo2-pdf
42
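A small sketch of this example, comparing naive Monte Carlo with importance sampling; the choice of a N(4, 1) proposal is an assumption for illustration (any q with mass on the tail works).

```python
import numpy as np

def phi(x, mu=0.0, sigma=1.0):
    """Normal probability density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
N = 100_000

# Naive Monte Carlo: almost all samples fall below 3, so the estimate is very noisy
x = rng.standard_normal(N)
naive = np.mean(x > 3)

# Importance sampling with proposal q = N(4, 1), which puts mass on the tail of interest
z = rng.normal(4.0, 1.0, size=N)
w = phi(z, 0.0, 1.0) / phi(z, 4.0, 1.0)      # importance weights p(z)/q(z)
is_est = np.mean((z > 3) * w)

print(f"naive={naive:.6f}  importance={is_est:.6f}  (true value ≈ 0.001350)")
```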
Background: Importance Sampling

https://acme.byu.edu/0000017a-1bb8-db63-a97e-7bfa0bea0000/vol1lab16montecarlo2-pdf
43
Importance Sampling for Off-Policy RL

• Estimate the expectation of the return using trajectories sampled from another policy (the behavior policy)
  𝔼_{τ~π}[G(τ)] = ∫ π(τ) G(τ) dτ
               = ∫ μ(τ) [π(τ)/μ(τ)] G(τ) dτ
               = 𝔼_{τ~μ}[ (π(τ)/μ(τ)) G(τ) ]
               ≈ (1/n) ∑_i [π(τ_i)/μ(τ_i)] G(τ_i)

44
Importance Sampling Ratio

• Given a starting state S_t, the probability of the subsequent state-action trajectory A_t, S_{t+1}, A_{t+1}, …, S_T occurring under any policy π is
  P(A_t, S_{t+1}, A_{t+1}, …, S_T | S_t, A_{t:T−1} ~ π)
    = π(A_t|S_t) P(S_{t+1}|S_t, A_t) π(A_{t+1}|S_{t+1}) ⋯ P(S_T|S_{T−1}, A_{T−1})
    = ∏_{k=t}^{T−1} π(A_k|S_k) P(S_{k+1}|S_k, A_k)
• Importance sampling ratio ρ
  ρ_{t:T−1} = [∏_{k=t}^{T−1} π(A_k|S_k) P(S_{k+1}|S_k, A_k)] / [∏_{k=t}^{T−1} μ(A_k|S_k) P(S_{k+1}|S_k, A_k)] = ∏_{k=t}^{T−1} π(A_k|S_k) / μ(A_k|S_k)
45
Importance Sampling

• We wish to estimate the expected returns (values) under the target policy, but all we have are returns G_t due to the behavior policy.
  𝔼[G_t | S_t = s] = V_μ(s)
  vs.
  𝔼[ρ_{t:T−1} G_t | S_t = s] = V_π(s)

46
Importance Sampling

• Ordinary importance sampling forms the estimate:
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / |𝒯(s)|
  where T(t) is the first time of termination following time t, and G_t is the return after t up through T(t)
• Every-visit method: 𝒯(s) is the set of all time steps in which state s is visited
• First-visit method: 𝒯(s) includes only time steps that were first visits to s within their episodes

47
Importance Sampling

• Ordinary importance sampling forms the estimate:
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / |𝒯(s)|
• New notation: time steps increase across episode boundaries

48
Importance Sampling Ratio

• All importance sampling ratios have expected value 1:
  𝔼_{A_k~μ}[π(A_k|S_k)/μ(A_k|S_k)] = ∑_a μ(a|S_k) π(a|S_k)/μ(a|S_k) = ∑_a π(a|S_k) = 1
• Note: importance sampling can have high (or infinite) variance
• Consider the estimates of the first-visit method after observing a single return from state s

49
Two Types of Importance Sampling

• Ordinary Importance Sampling
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / |𝒯(s)|
• Weighted Importance Sampling
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / ∑_{t∈𝒯(s)} ρ_{t:T(t)−1}
• Weighted IS is a biased estimator
  • For the first-visit method with a single return, its expectation is V_μ(s) rather than V_π(s).
• Ordinary IS is an unbiased estimator
  • For the first-visit method, its expectation is always V_π(s)
50
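A short sketch contrasting the two estimators, given per-episode returns G_t and ratios ρ_{t:T(t)−1} for first visits to a fixed state s; the example numbers are illustrative only.

```python
import numpy as np

def ordinary_is(returns, rhos):
    """Ordinary IS: sum of rho*G divided by the number of returns (unbiased, high variance)."""
    returns, rhos = np.asarray(returns), np.asarray(rhos)
    return np.sum(rhos * returns) / len(returns)

def weighted_is(returns, rhos):
    """Weighted IS: sum of rho*G divided by the sum of rhos (biased, lower variance)."""
    returns, rhos = np.asarray(returns), np.asarray(rhos)
    denom = np.sum(rhos)
    return np.sum(rhos * returns) / denom if denom > 0 else 0.0

# Example: three first visits to s, with returns G_t and ratios rho_{t:T-1}
G = [1.0, 0.0, 1.0]
rho = [10.0, 0.5, 0.0]
print(ordinary_is(G, rho))   # a single large ratio inflates the ordinary estimate
print(weighted_is(G, rho))   # the weighted estimate stays within the range of observed returns
```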
Two Types of Importance Sampling

• Ordinary Importance Sampling
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / |𝒯(s)|
• Weighted Importance Sampling
  V(s) = ∑_{t∈𝒯(s)} ρ_{t:T(t)−1} G_t / ∑_{t∈𝒯(s)} ρ_{t:T(t)−1}
• The variance of ordinary IS is in general unbounded, whereas in the weighted estimator the largest weight on any single return is one.
• Suppose the ratio were ten: the ordinary importance-sampling estimate would be ten times the observed return.
51
Off-Policy MC Prediction

Q: Is the ordinary importance sampling or weighted importance sampling?


52
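If the missing pseudocode is the off-policy MC prediction algorithm from Sutton & Barto (Chapter 5), it is the weighted version, implemented incrementally with cumulative weights C(s, a). A sketch under that assumption; the episode format and the probability functions are assumptions for illustration.

```python
import numpy as np

def offpolicy_mc_prediction(episodes, n_states, n_actions, target_prob, behavior_prob, gamma=1.0):
    """Incremental weighted-IS off-policy MC prediction for Q ~ Q_pi.
    `episodes` is a list of [(s, a, r), ...]; target_prob(a, s) and behavior_prob(a, s)
    return pi(a|s) and mu(a|s)."""
    Q = np.zeros((n_states, n_actions))
    C = np.zeros((n_states, n_actions))      # cumulative sum of importance weights
    for episode in episodes:
        G, W = 0.0, 1.0
        for (s, a, r) in reversed(episode):  # work backwards from the end of the episode
            G = gamma * G + r
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
            W *= target_prob(a, s) / behavior_prob(a, s)
            if W == 0.0:                     # all remaining updates would be zero
                break
    return Q
```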
Recap: Blackjack Example

• State (200 of them)


• Current sum (12-21)
• Dealer’s showing card (ace-10)
• Do I have a “useable” ace (yes-no)

• Action stick: Stop receiving cards (and terminate)

• Action draw: Take another card (no replacement)

• Reward for stick:


• +1/0/-1 if sum of cards >/=/< sum of dealer cards

• Reward for draw:


• -1 for sum of cards > 21 (and terminate), 0
otherwise

• Transitions: automatically draw if sum of cards <12


53
Example: Off-policy Estimation of a Blackjack State Value

• The dealer is showing a deuce, the sum of the player’s cards is 13.

Book Page 106 (S. & B.)


54
Example: Infinite Variance

• Off-policy First-visit MC Prediction (ordinary importance sampling)

Book Page 107 (S. & B.)


55
Example: Infinite Variance

Variance
• Var[X] = 𝔼[(X − X̄)²] = 𝔼[X²] − X̄²
• 𝔼_μ[ ( ∏_{t=0}^{T−1} π(A_t|S_t)/μ(A_t|S_t) · G_0 )² ]
  = (1/2) · 0.1 · (1/0.5)²                                              (the length-1 episodes)
  + (1/2) · 0.9 · (1/2) · 0.1 · (1/0.5 · 1/0.5)²                        (the length-2 episodes)
  + (1/2) · 0.9 · (1/2) · 0.9 · (1/2) · 0.1 · (1/0.5 · 1/0.5 · 1/0.5)²  (the length-3 episodes)
  + ⋯
  = 0.1 ∑_{k=0}^∞ 0.9^k · 2^k · 2 = 0.2 ∑_{k=0}^∞ 1.8^k = ∞

56
Off-Policy MC Control

Q: Why does the inner loop exit early?


57
So Far

• MC has several advantages over DP:


• Can learn directly from interaction with environment
• No need for full models
• MC methods provide an alternate policy evaluation process
• One issue to watch for: maintaining sufficient exploration
• Looked at distinction between on-policy and off-policy methods

58
TD Control
MC vs. TD Control

• Temporal-difference (TD) learning has several advantages over Monte-


Carlo (MC)
• Lower variance
• Online
• Incomplete sequences
• Natural idea: Use TD instead of MC in our control loop
• Apply TD to 𝑄(𝑆, 𝐴)
• Use 𝜖-greedy policy improvement
• Update every time step

60
Updating Action-value Functions with SARSA

Q(S, A) ← Q(S, A) + α [R + γ Q(S′, A′) − Q(S, A)]

61
On-Policy Control with SARSA

Every time-step:
• Policy evaluation: SARSA, Q ≈ Q_π
• Policy improvement: 𝜖-greedy policy improvement

62
SARSA Algorithm for On-Policy Control

63
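A sketch of the tabular SARSA control loop with ε-greedy action selection; the env.reset()/env.step(a) → (next_state, reward, done) interface and the hyperparameters are assumptions for illustration.

```python
import numpy as np

def sarsa(env, n_states, n_actions, alpha=0.1, gamma=1.0, epsilon=0.1,
          n_episodes=500, seed=0):
    """Tabular SARSA: on-policy TD control with eps-greedy action selection."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next) if not done else 0   # next action (unused when terminal)
            # SARSA update: bootstrap from the action actually taken next
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```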
Convergence of SARSA
Theorem
SARSA converges to the optimal action-value function, 𝑄 𝑠, 𝑎 → 𝑄∗ 𝑠, 𝑎 ,
under the following conditions:
• GLIE sequence of policies π_t(a|s)
• Robbins-Monro sequence of step sizes α_t:
  ∑_{t=1}^∞ α_t = ∞,   ∑_{t=1}^∞ α_t² < ∞

Convergence Results for Single-step On-Policy Reinforcement-Learning Algorithms. Machine Learning, 2000.

64
Example: Windy Gridworld Example

• Reward = −1 per time-step until reaching goal


• Undiscounted

65
SARSA on the Windy Gridworld Example

Q: Can a policy result in infinite loops? What will MC policy iteration do then?
• If the policy leads to a loop over states, MC control gets trapped because the episode never terminates
• Instead, TD control can continually update the state-action values during the episode and switch to a different policy

66
𝑛-step SARSA

• Consider the following n-step Q-returns for n = 1, 2, …, ∞:

  n = 1 (SARSA)  Q_t^(1) = R_{t+1} + γ Q(S_{t+1}, A_{t+1})
  n = 2          Q_t^(2) = R_{t+1} + γ R_{t+2} + γ² Q(S_{t+2}, A_{t+2})
  ⋮
  n = ∞ (MC)     Q_t^(∞) = R_{t+1} + γ R_{t+2} + ⋯ + γ^{T−1} R_T

• Define the n-step Q-return
  Q_t^(n) = R_{t+1} + γ R_{t+2} + ⋯ + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})
• n-step SARSA updates Q(s, a) towards the n-step Q-return
  Q(S_t, A_t) ← Q(S_t, A_t) + α [Q_t^(n) − Q(S_t, A_t)]
67
Forward View SARSA(𝜆)

• The Q^λ return combines all n-step Q-returns Q_t^(n)
• Using weight (1 − λ) λ^{n−1}
  Q_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} Q_t^(n)
• Forward-view SARSA(λ)
  Q(S_t, A_t) ← Q(S_t, A_t) + α [Q_t^λ − Q(S_t, A_t)]

68
Off-Policy TD Control
Recap: Importance Sampling for Off-Policy MC

Off-Policy Monte-Carlo
• Multiple importance sampling corrections along the whole episode
  ρ_{t:T−1} = [∏_{k=t}^{T−1} π(A_k|S_k) P(S_{k+1}|S_k, A_k)] / [∏_{k=t}^{T−1} μ(A_k|S_k) P(S_{k+1}|S_k, A_k)] = ∏_{k=t}^{T−1} π(A_k|S_k) / μ(A_k|S_k)
• Update the value towards the corrected return
  V(S_t) ← V(S_t) + α [ρ_{t:T−1} G_t − V(S_t)]

72
Importance Sampling for Off-Policy TD

Off-Policy TD
• Weight TD target 𝑅 + 𝛾𝑉(𝑆′) by importance sampling
• Only need a single importance sampling correction

  V(S_t) ← V(S_t) + α [ (π(A_t|S_t) / μ(A_t|S_t)) (R_{t+1} + γ V(S_{t+1})) − V(S_t) ]

• Much lower variance than Monte-Carlo importance sampling


• Policies only need to be similar over a single step

73
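A one-step sketch of this update; pi_prob and mu_prob stand in for π(a|s) and μ(a|s) and, like the tabular V, are assumptions for illustration.

```python
def offpolicy_td_update(V, s, a, r, s_next, alpha, gamma, pi_prob, mu_prob, done=False):
    """Off-policy TD(0) with a single importance-sampling correction on the TD target."""
    rho = pi_prob(a, s) / mu_prob(a, s)                      # pi(a|s) / mu(a|s)
    target = rho * (r + (0.0 if done else gamma * V[s_next]))
    V[s] += alpha * (target - V[s])
    return V
```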
Importance Sampling for Off-Policy TD Updates

𝔼_μ[ (π(A_t|S_t)/μ(A_t|S_t)) (R_{t+1} + γ V(S_{t+1})) − V(S_t) | S_t = s ]
  = ∑_a μ(a|s) [ (π(a|s)/μ(a|s)) 𝔼[R_{t+1} + γ V(S_{t+1}) | S_t = s, A_t = a] − V(s) ]
  = ∑_a π(a|s) 𝔼[R_{t+1} + γ V(S_{t+1}) | S_t = s, A_t = a] − ∑_a μ(a|s) V(s)
  = ∑_a π(a|s) 𝔼[R_{t+1} + γ V(S_{t+1}) | S_t = s, A_t = a] − ∑_a π(a|s) V(s)
  = ∑_a π(a|s) [ 𝔼[R_{t+1} + γ V(S_{t+1}) | S_t = s, A_t = a] − V(s) ]
  = 𝔼_π[R_{t+1} + γ V(S_{t+1}) − V(s) | S_t = s]
74
Q-Learning

• We now consider off-policy learning of action-values Q(S, A)
• No importance sampling is required
• The next action is chosen using the behavior policy, A_{t+1} ~ μ(·|S_t)
• But we consider an alternative successor action A′ ~ π(·|S_t)
• And update Q(S_t, A_t) towards the value of the alternative action
  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A′) − Q(S_t, A_t)]

75
Q-Learning Control Algorithm

Q(S, A) ← Q(S, A) + α [R + γ max_{a′} Q(S′, a′) − Q(S, A)]

77
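A sketch of tabular Q-learning with an ε-greedy behavior policy, under the same assumed environment interface and illustrative hyperparameters as the SARSA sketch above.

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=1.0, epsilon=0.1,
               n_episodes=500, seed=0):
    """Tabular Q-learning: off-policy TD control, bootstrapping from max_a' Q(S', a')."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # behavior policy: eps-greedy with respect to the current Q
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # target uses the greedy (target-policy) value of the next state
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```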
Q-Learning Algorithm for Off-Policy Control

• Q-Learning:  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

• SARSA:       Q(S, A) ← Q(S, A) + α [R + γ Q(S′, A′) − Q(S, A)]
78
Why don't we use importance sampling in Q-Learning?

• Off-Policy TD:  V(S_t) ← V(S_t) + α [ (π(A_t|S_t)/μ(A_t|S_t)) (R_{t+1} + γ V(S_{t+1})) − V(S_t) ]
• Short answer: because Q-Learning does not make expected-value estimates over the policy distribution. For the full answer click here.
• Remember the Bellman optimality backup from value iteration
  Q(s, a) = R(s, a) + γ ∑_{s′∈S} P(s′ | s, a) max_{a′} Q(s′, a′)
• Q-Learning can be considered a sample-based version of value iteration, except that instead of taking the expected value over the transition dynamics, we use the sample collected from the environment
  Q(s, a) ← R(s, a) + γ max_{a′} Q(s′, a′)
• Q-learning's expectation is over the transition distribution, not over the policy distribution, so there is no need to correct for a different policy distribution
79
Expected SARSA

• Instead of the sample value-of-the-next-state, use the expectation!

  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ 𝔼_π[Q(S_{t+1}, A_{t+1}) | S_{t+1}] − Q(S_t, A_t)]
              ← Q(S_t, A_t) + α [R_{t+1} + γ ∑_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)]

• Expected SARSA performs better than SARSA (but costs more)


• Q: Why?
• Q: Is expected SARSA on policy or off policy?
• Q: What if 𝜋 is the greedy deterministic policy?
• Sometimes called “General Q-learning”
80
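A sketch of the expected-SARSA update for an ε-greedy target policy; choosing ε-greedy for π here is an assumption for illustration.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon, done=False):
    """Expected SARSA: bootstrap from the expectation of Q(S', .) under an eps-greedy pi."""
    n_actions = Q.shape[1]
    if done:
        expected_next = 0.0
    else:
        probs = np.full(n_actions, epsilon / n_actions)
        probs[np.argmax(Q[s_next])] += 1.0 - epsilon     # eps-greedy pi(a|S')
        expected_next = float(np.dot(probs, Q[s_next]))
    Q[s, a] += alpha * (r + gamma * expected_next - Q[s, a])
    return Q
```

With ε = 0 the target policy is greedy and the expectation reduces to max_a Q(S′, a), i.e. the Q-learning target, which connects to the last question above.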
Q̂-Learning

• If we do not follow the optimal policy faithfully, and still want to model exploration:
  • “S” is the starting state, “G” indicates the goal state, which gives a reward of 9 units
  • Exploration: 30% of the time the agent took a random action
  • γ = 0.9, β = 0.5

When the best move isn't optimal: Q-learning with exploration. In AAAI, 1994.

81
Example: SARSA vs. Q-Learning

83
Example: Cliff Walking

84
Relationship Between DP and TD

85
Relationship Between DP and TD

86
Q-Learning Variants
Maximization Bias

• We often need to maximize over our value estimates. The estimated maxima suffer from maximization bias
• Consider a state for which Q*(s, a) = 0 for all actions a. Our estimates Q(s, a) are uncertain: some are positive and some negative
• Intuitively (related to Jensen's inequality):
  𝔼[max_i μ̂_i] ≥ max_i 𝔼[μ̂_i]
• This is because we use the same estimate Q both to choose the argmax and to evaluate it

88
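A tiny numerical check of this inequality: the true value of every action is 0, yet the max over noisy estimates is clearly positive on average. The noise scale and number of actions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_samples_per_estimate, n_trials = 10, 5, 10_000

# True Q*(s, a) = 0 for every action; each estimate is a small-sample average of noisy rewards
estimates = rng.normal(0.0, 1.0, size=(n_trials, n_actions, n_samples_per_estimate)).mean(axis=2)

print("E[max_a Q_hat(s, a)] ≈", estimates.max(axis=1).mean())   # clearly positive
print("max_a E[Q_hat(s, a)] ≈", estimates.mean(axis=0).max())   # close to 0
```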
Double Q-Learning

• Train two action-value functions, Q1 and Q2


• Do Q-learning on both, but
• Never on the same time steps (Q1 and Q2 are independent)
• Pick Q1 or Q2 at random to be updated on each step
• If updating Q1, use Q2 for the value of the next state:
• Q₁(S_t, A_t) ← Q₁(S_t, A_t) + α [R_{t+1} + γ Q₂(S_{t+1}, argmax_a Q₁(S_{t+1}, a)) − Q₁(S_t, A_t)]

• Action selections are 𝜖-greedy with respect to the sum of Q1 and Q2

89
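A sketch of the tabular double Q-learning step and its behavior policy as described above; the tabular setup mirrors the earlier sketches and is an assumption for illustration.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng, done=False):
    """One double Q-learning step: flip a coin to decide which table to update,
    select the argmax with that table, and evaluate it with the other one."""
    if rng.random() < 0.5:
        A, B = Q1, Q2
    else:
        A, B = Q2, Q1
    best = int(np.argmax(A[s_next]))
    target = r + (0.0 if done else gamma * B[s_next, best])
    A[s, a] += alpha * (target - A[s, a])

def double_q_action(Q1, Q2, s, epsilon, rng):
    """Behavior policy: eps-greedy with respect to Q1 + Q2."""
    n_actions = Q1.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q1[s] + Q2[s]))
```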
Double Tabular Q-Learning

90
Double Q-Learning

Double Q-learning. In NIPS, 2010.


Double-Q estimate ≤ max_i 𝔼[X_i] ≤ 𝔼[max_i X_i] (single-Q estimate)
91
Double Q-Learning vs. Q-Learning

92
Example: Roulette

• Roulette: gambling game


• Here, 171 actions: bet $1 on one of 170 options, or ”stop”
• ”Stop” ends the episode, with $0
• All other actions have high-variance rewards, with negative expected value
• Betting actions do not end the episode, instead can bet again

93
Example: Q-Learning vs. Double Q-Learning

94
Extra Reading Materials

• Chapter 5, Monte Carlo Methods, Reinforcement Learning: An


Introduction, Sutton & Barto
• Chapter 6, Temporal-Difference Learning, Reinforcement Learning: An
Introduction, Sutton & Barto
• Double Q-learning. In NIPS, 2010.

96
Thanks & Q&A
