2.2 Model-Free Control
Emma Brunskill (Stanford), Bolei Zhou (UCLA), Hado van Hasselt (DeepMind)
• Last lecture
• Model-free prediction
• Estimate the value function of an unknown MDP
• This lecture
• Model-free control
• Optimize the value function of an unknown MDP
2
Recap: DP vs. MC vs. TD Learning
MC: Sample average return
• Remember: approximates expectation
$$V_\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big]
= \mathbb{E}_\pi\Big[\textstyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big]
= \mathbb{E}_\pi\Big[R_{t+1} + \gamma \textstyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \,\Big|\, S_t = s\Big]
= \mathbb{E}_\pi\big[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s\big]$$
TD: combines both: sample the expected values and use a current estimate $V(S_{t+1})$ of the true $V_\pi(S_{t+1})$
DP: the expected values are provided by a model, but we use a current estimate $V(S_{t+1})$ of the true $V_\pi(S_{t+1})$
3
Recap: Monte-Carlo Backup
$$V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$$
4
Recap: Temporal-Difference Backup
$$V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$$
5
Recap: Dynamic Programming Backup
$$V(S_t) \leftarrow \mathbb{E}_\pi\big[R_{t+1} + \gamma V(S_{t+1})\big]$$
6
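The three recap backups above differ only in the target that replaces the true return. A minimal tabular sketch of the two sample-based updates (the function and table names are illustrative, not from the lecture):

```python
# Minimal sketch of the two sample-based backups above (names are illustrative).
def mc_update(V, state, G, alpha):
    """Monte-Carlo backup: move V(S_t) toward the sampled return G_t."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, gamma, alpha):
    """TD(0) backup: move V(S_t) toward the bootstrapped target R_{t+1} + gamma * V(S_{t+1})."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
```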
Recap: n-Step Return
• Forward-view TD(λ):
$$V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^\lambda - V(S_t)\big)$$
8
2.2 Model-Free Control
Outline
• Introduction
• On-Policy Monte-Carlo Control
• Off-Policy Monte-Carlo Control
• On-Policy Temporal-Difference Learning
• Off-Policy Temporal-Difference Learning
• Summary
10
Uses of Model-Free Control
13
Generalized Policy Iteration with Monte-Carlo Evaluation
14
Model-Free Policy Iteration using Action-Value Function
• There are two types of value functions, $V(s)$ and $Q(s, a)$. Which one should we use for policy improvement?
• Greedy policy improvement over $V(s)$ requires a model of the MDP:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big)$$
15
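Greedy improvement over $Q(s, a)$, by contrast, is model-free: $\pi'(s) = \arg\max_a Q(s, a)$. A small sketch of the contrast, assuming hypothetical tabular arrays P (transition probabilities), R (expected rewards), V and Q:

```python
import numpy as np

# Contrast of the two improvement steps; the arrays P, R, V, Q are illustrative placeholders.
def greedy_from_V(P, R, V, s, gamma):
    """Greedy improvement over V(s): needs the model P[s, a, s'] and R[s, a]."""
    return int(np.argmax(R[s] + gamma * P[s] @ V))   # P[s] has shape (A, S'), V has shape (S,)

def greedy_from_Q(Q, s):
    """Greedy improvement over Q(s, a): model-free, just a table lookup."""
    return int(np.argmax(Q[s]))
```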
Generalized Policy Iteration with Action-Value Function
16
Convergence of MC Control
19
MC Estimation of Action Values 𝑄
$$Q_\pi(s, a) = \sum_{s',\, r} P(s', r \mid s, a)\,\big[r + \gamma V_\pi(s')\big]$$
20
Recap: The Exploration Problem
…
• Are you sure you’ve chosen the best door?
21
Recap: The Exploration Problem
22
Recap: The Exploration Problem
24
Monte Carlo with 𝜖-Greedy Exploration
$$\pi(a \mid s) = \begin{cases} \dfrac{\epsilon}{|\mathcal{A}|} + 1 - \epsilon & \text{if } a = a^{*} = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\[4pt] \dfrac{\epsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
25
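A minimal sketch of sampling from this $\epsilon$-greedy policy over a tabular Q (names are illustrative): with probability $\epsilon$ pick uniformly, otherwise pick the greedy action, which matches the probabilities $\epsilon/|\mathcal{A}|$ plus $1-\epsilon$ on the greedy action above.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, n_actions):
    """Sample an action from the epsilon-greedy policy above:
    each action gets probability epsilon/|A|, and the greedy action gets an extra 1 - epsilon."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))   # explore: uniform over all actions
    return int(np.argmax(Q[state]))                # exploit: greedy w.r.t. the current Q estimate
```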
𝜖-Greedy Policy Improvement
Theorem
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $Q_\pi$ is an improvement, $V_{\pi'}(s) \ge V_\pi(s)$.
$$\cdots = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, Q_\pi(s, a) = V_\pi(s)$$
27
Monte-Carlo Control
Every episode:
• Policy evaluation: Monte-Carlo policy evaluation, $Q \approx Q_\pi$
• Policy improvement: 𝜖-greedy policy improvement
28
On-Policy First-Visit MC Control (without exploring starts)
29
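A hedged sketch of on-policy first-visit MC control with $\epsilon$-greedy improvement, assuming an older gym-style environment (reset() returns a hashable state, step() returns (state, reward, done, info)) and reusing the epsilon_greedy_action helper sketched above:

```python
from collections import defaultdict
import numpy as np

def mc_control_first_visit(env, n_episodes, gamma=1.0, epsilon=0.1):
    """Sketch of on-policy first-visit MC control without exploring starts."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))
    returns_count = defaultdict(int)

    for _ in range(n_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q, state, epsilon, nA)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit MC updates: walk the episode backwards accumulating the return G.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:                          # update only on the first visit
                returns_count[(s, a)] += 1
                Q[s][a] += (G - Q[s][a]) / returns_count[(s, a)]  # incremental sample mean
    return Q
```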
GLIE
Definition (Greedy in the Limit with Infinite Exploration (GLIE))
• All state-action pairs are explored infinitely many times,
$$\lim_{k \to \infty} N_k(s, a) = \infty$$
• The policy converges on a greedy policy,
$$\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\big(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\big)$$
• For example, $\epsilon$-greedy is GLIE if $\epsilon$ is reduced to zero at the rate $\epsilon_k = \frac{1}{k}$
Theorem
GLIE model-free control converges to the optimal action-value function,
$$Q_t \to Q_*$$
30
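A one-line illustration of the GLIE schedule $\epsilon_k = 1/k$ from the definition above (the function name is just for illustration):

```python
def glie_epsilon(k):
    """GLIE-style schedule: epsilon_k = 1/k decays to zero while
    every state-action pair can still be visited infinitely often."""
    return 1.0 / k   # k = episode index, starting at 1
```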
GLIE Monte-Carlo Control
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function,
$$Q(s, a) \to Q_*(s, a)$$
31
Summary So Far
32
Off-Policy MC
On and Off-Policy Learning
• On-policy Learning
• “Learn on the job”
• Learn about behavior policy 𝜋 from experience sampled from 𝜋
• On-policy methods attempt to evaluate or improve the policy that is
used to make decisions.
• Off-policy Learning
• “Look over someone’s shoulder”
• Learn about target policy 𝜋 from experience sampled from 𝜇
• Off-policy methods evaluate or improve a policy different from that
used to generate the data.
34
On and Off-Policy Learning
• On-policy Learning
• Learn about behavior policy 𝜋 from experience sampled from 𝜋
• Off-policy Learning
• Learn about target policy 𝜋 from experience sampled from 𝜇
• Learn “counterfactually” about other things you could do: “what if..?”
• E.g., “What if I would turn left?” => new observations, rewards?
• E.g., “What if I would play more defensively?” => different win probability?
• E.g., “What if I would continue to go forward?” => how long until I bump into a wall?
35
Monte Carlo Control without Exploring Starts
36
Off-Policy Methods
• Key Question:
• Can we average returns as before to obtain the value function of $\pi$?
• Idea: Importance Sampling:
• Weight each return by the ratio of the probabilities of the trajectory
under the two policies.
38
Background: Estimating Expectations
Note that $\mathbb{E}_p[f] = \mathbb{E}_q\big[\tfrac{p}{q}\, f\big]$:
$$\mathbb{E}_p[f] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p(z^{(i)})}{q(z^{(i)})}\, f(z^{(i)}), \qquad z^{(i)} \sim q(z)$$
• This is useful when we can evaluate the probability $p$ but it is hard to sample from it.
• The quantities $w^{(i)} = p(z^{(i)})/q(z^{(i)})$ are known as importance weights.
40
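A small sketch of this estimator; the callable names (f, p_pdf, q_pdf, q_sampler) are assumptions for illustration:

```python
import numpy as np

def importance_sampling_estimate(f, p_pdf, q_pdf, q_sampler, n=10_000):
    """Estimate E_p[f(z)] using samples from q:  (1/N) * sum_i w_i * f(z_i),
    with importance weights w_i = p(z_i) / q(z_i)."""
    z = q_sampler(n)             # z_i ~ q
    w = p_pdf(z) / q_pdf(z)      # importance weights
    return float(np.mean(w * f(z)))
```

For example, p_pdf and q_pdf could be the densities of two Gaussians and q_sampler a vectorized sampler from the second one; the estimate then recovers the expectation under the first Gaussian without ever sampling from it.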
Background: Importance Sampling Summary
Summary
• Estimate the expectation of a function:
$$\mathbb{E}_{x \sim P}\big[f(x)\big] = \int f(x)\, P(x)\, dx \approx \frac{1}{n} \sum_{i} f(x_i), \qquad x_i \sim P$$
41
Background: Importance Sampling
https://acme.byu.edu/0000017a-1bb8-db63-a97e-7bfa0bea0000/vol1lab16montecarlo2-pdf
42
Importance Sampling for Off-Policy RL
• Estimate the expectation of return using trajectories sampled from another policy
(behavior policy)
$$\mathbb{E}_{\tau \sim \pi}\big[G(\tau)\big] = \int \pi(\tau)\, G(\tau)\, d\tau = \int \mu(\tau)\, \frac{\pi(\tau)}{\mu(\tau)}\, G(\tau)\, d\tau = \mathbb{E}_{\tau \sim \mu}\!\left[\frac{\pi(\tau)}{\mu(\tau)}\, G(\tau)\right] \approx \frac{1}{n} \sum_{i} \frac{\pi(\tau_i)}{\mu(\tau_i)}\, G(\tau_i)$$
44
Importance Sampling Ratio
• We wish to estimate the expected returns (values) under the target policy $\pi$, but all we have are returns $G_t$ generated by the behavior policy $\mu$:
$$\mathbb{E}\big[G_t \mid S_t = s\big] = V_\mu(s) \qquad \text{vs.} \qquad \mathbb{E}\big[\rho_{t:T-1}\, G_t \mid S_t = s\big] = V_\pi(s)$$
46
Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}$$
• Every-visit method: $\mathcal{T}(s)$ is the set of all time steps at which state $s$ is visited.
• First-visit method: $\mathcal{T}(s)$ contains only the time steps that were first visits to $s$ within their episodes.
47
Importance Sampling
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}$$
• New notation: time steps increase across episode boundaries, and $T(t)$ denotes the time of termination of the episode containing time $t$.
48
Importance Sampling Ratio
49
Two Types of Importance Sampling
• Ordinary importance sampling:
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}$$
• Weighted importance sampling:
$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
• Example (Blackjack): the dealer is showing a deuce, and the sum of the player's cards is 13.
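A sketch of the two estimators for a single state $s$, assuming we are given the per-visit returns $G_t$ and the matching ratios $\rho_{t:T(t)-1}$ (argument names are illustrative):

```python
import numpy as np

def ordinary_is(returns, rhos):
    """Ordinary importance sampling: unbiased, but can have very high (even infinite) variance."""
    returns, rhos = np.asarray(returns, float), np.asarray(rhos, float)
    return float(np.sum(rhos * returns) / len(returns))   # divide by the number of visits |T(s)|

def weighted_is(returns, rhos):
    """Weighted importance sampling: biased, but typically much lower variance."""
    returns, rhos = np.asarray(returns, float), np.asarray(rhos, float)
    denom = np.sum(rhos)
    return float(np.sum(rhos * returns) / denom) if denom > 0 else 0.0
```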
Variance
• $\mathrm{Var}[X] = \mathbb{E}\big[(X - \bar{X})^2\big] = \mathbb{E}[X^2] - \bar{X}^2$
$$\mathbb{E}_\mu\!\left[\left(\prod_{t=0}^{T-1} \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\, G_0\right)^{\!2}\right]
= \frac{1}{2}\cdot 0.1\left(\frac{1}{0.5}\right)^{2} \qquad \text{(the length-1 episode)}$$
$$+\ \frac{1}{2}\cdot 0.9\cdot\frac{1}{2}\cdot 0.1\left(\frac{1}{0.5}\cdot\frac{1}{0.5}\right)^{2} \qquad \text{(the length-2 episode)}$$
$$+\ \frac{1}{2}\cdot 0.9\cdot\frac{1}{2}\cdot 0.9\cdot\frac{1}{2}\cdot 0.1\left(\frac{1}{0.5}\cdot\frac{1}{0.5}\cdot\frac{1}{0.5}\right)^{2} \qquad \text{(the length-3 episode)}$$
$$+\ \cdots \ =\ 0.1 \sum_{k=0}^{\infty} 0.9^{k}\cdot 2^{k}\cdot 2 \ =\ 0.2 \sum_{k=0}^{\infty} 1.8^{k} \ =\ \infty$$
56
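The sum above matches a one-state example (as in Sutton and Barto's infinite-variance example); the MDP description here is an assumption recovered from the numbers on the slide: the target policy always takes "left"; "left" loops back with probability 0.9 and reward 0, or terminates with probability 0.1 and reward +1; "right" terminates immediately with reward 0; the behavior policy picks each action with probability 0.5 and $\gamma = 1$. A simulation sketch under those assumptions shows the ordinary importance-sampling estimate is unbiased (mean near 1) while its empirical variance keeps jumping with occasional huge samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_episode():
    """One episode of the assumed one-state MDP, returning (rho, G) for the ordinary IS estimate."""
    rho = 1.0
    while True:
        if rng.random() < 0.5:       # behavior policy picks 'right': pi(right)=0, so rho becomes 0
            return 0.0, 0.0          # return G = 0 as well
        rho *= 1.0 / 0.5             # behavior picked 'left': ratio pi/mu = 1 / 0.5
        if rng.random() < 0.1:       # 'left' terminates with reward +1
            return rho, 1.0          # otherwise loop back to the same state with reward 0

def ordinary_is_estimate(n_episodes=100_000):
    samples = np.array([rho * G for rho, G in (one_episode() for _ in range(n_episodes))])
    return samples.mean(), samples.var()   # mean is near v_pi(s) = 1, variance does not settle
```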
Off-Policy MC Control
58
TD Control
MC vs. TD Control
60
Updating Action-value Functions with SARSA
$$Q(S, A) \leftarrow Q(S, A) + \alpha\big(R + \gamma\, Q(S', A') - Q(S, A)\big)$$
61
On-Policy Control with SARSA
Every time-step:
• Policy evaluation: SARSA, $Q \approx Q_\pi$
• Policy improvement: 𝜖-greedy policy improvement
62
SARSA Algorithm for On-Policy Control
63
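A tabular SARSA sketch under the same assumed gym-style interface as earlier, again reusing the epsilon_greedy_action helper sketched above:

```python
from collections import defaultdict
import numpy as np

def sarsa(env, n_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Sketch of tabular SARSA for on-policy control."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))
    for _ in range(n_episodes):
        state = env.reset()
        action = epsilon_greedy_action(Q, state, epsilon, nA)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy_action(Q, next_state, epsilon, nA)
            # On-policy target uses the action actually selected next: R + gamma * Q(S', A')
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q
```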
Convergence of SARSA
Theorem
SARSA converges to the optimal action-value function, $Q(s, a) \to Q_*(s, a)$,
under the following conditions:
• GLIE sequence of policies $\pi_t(a \mid s)$
• Robbins-Monro sequence of step sizes $\alpha_t$:
$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
Convergence Results for Single-step On-Policy Reinforcement-Learning Algorithms. Machine Learning, 2000.
64
Example: Windy Gridworld
65
SARSA on the Windy Gridworld Example
Q: Can a policy result in infinite loops? What will MC policy iteration do then?
• If the policy leads to an infinite loop of states, MC control gets trapped because the episode never terminates
• TD control, in contrast, can continually update the state-action values and switch to a different policy mid-episode
66
𝑛-step SARSA
• $n = \infty$ (MC): $\;Q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T$
• Forward-view SARSA(λ):
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big(Q_t^\lambda - Q(S_t, A_t)\big)$$
68
Off-Policy TD Control
Recap: Importance Sampling for Off-Policy MC
Off-Policy Monte-Carlo
• Multiple importance sampling corrections along whole episode
$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, P(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, P(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
• Update value towards the corrected return:
$$V(S_t) \leftarrow V(S_t) + \alpha\big(\rho_{t:T-1}\, G_t - V(S_t)\big)$$
72
Importance Sampling for Off-Policy TD
Off-Policy TD
• Weight TD target 𝑅 + 𝛾𝑉(𝑆′) by importance sampling
• Only need a single importance sampling correction
$$V(S_t) \leftarrow V(S_t) + \alpha\left(\frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\big(R_{t+1} + \gamma V(S_{t+1})\big) - V(S_t)\right)$$
73
Importance Sampling for Off-Policy TD Updates
$$\mathbb{E}_\mu\!\left[\frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\big(R_{t+1} + \gamma V(S_{t+1})\big) - V(S_t) \,\middle|\, S_t = s\right]
= \sum_{a} \mu(a \mid s)\, \frac{\pi(a \mid s)}{\mu(a \mid s)}\, \mathbb{E}\big[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s, A_t = a\big] - V(s)$$
75
Q-Learning Control Algorithm
$$Q(S, A) \leftarrow Q(S, A) + \alpha\Big(R + \gamma \max_{a'} Q(S', a') - Q(S, A)\Big)$$
77
Q-Learning Algorithm for Off-Policy Control
• SARSA: $Q(S, A) \leftarrow Q(S, A) + \alpha\big(R + \gamma\, Q(S', A') - Q(S, A)\big)$
78
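A matching tabular Q-learning sketch under the same assumed interface; the only change from the SARSA sketch is that the target bootstraps from the greedy next action rather than the action the behavior policy actually takes:

```python
from collections import defaultdict
import numpy as np

def q_learning(env, n_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Sketch of tabular Q-learning (off-policy control): epsilon-greedy behavior, greedy target."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q, state, epsilon, nA)
            next_state, reward, done, _ = env.step(action)
            # Off-policy target bootstraps from the greedy action, not the behavior policy's next action.
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```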
Why Not Use Importance Sampling in Q-Learning?
• Off-policy TD: $V(S_t) \leftarrow V(S_t) + \alpha\left(\dfrac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\big(R_{t+1} + \gamma V(S_{t+1})\big) - V(S_t)\right)$
• Short answer: because Q-learning does not form expected-value estimates over the policy distribution.
• Remember the Bellman optimality backup from value iteration:
$$Q(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \max_{a'} Q(s', a')$$
• Q-learning can be seen as a sample-based version of value iteration, except that instead of taking the expectation over the transition dynamics, we use the sampled next state collected from the environment:
$$Q(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q(s', a')$$
• The expectation in Q-learning is over the transition distribution, not the policy distribution, so there is no need to correct for a different policy distribution.
79
Expected SARSA
When the best move isn't optimal: Q-learning with exploration. In AAAI, 1994.
81
Example: SARSA vs. Q-Learning
83
Example: Cliff Walking
84
Relationship Between DP and TD
85
Relationship Between DP and TD
86
Q-Learning Variants
Maximization Bias
• This is because we use the same estimate 𝑄 both to choose the argmax and
to evaluate it
88
Double Q-Learning
89
Double Tabular Q-Learning
90
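A sketch of tabular Double Q-learning under the same assumed interface: one estimate selects the argmax, the other evaluates it, so the update no longer uses a single shared estimate for both roles (which is what causes the maximization bias above).

```python
from collections import defaultdict
import numpy as np

def double_q_learning(env, n_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Sketch of tabular Double Q-learning with two independent estimates Q1 and Q2."""
    nA = env.action_space.n
    Q1 = defaultdict(lambda: np.zeros(nA))
    Q2 = defaultdict(lambda: np.zeros(nA))
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2.
            if np.random.rand() < epsilon:
                action = int(np.random.randint(nA))
            else:
                action = int(np.argmax(Q1[state] + Q2[state]))
            next_state, reward, done, _ = env.step(action)
            if np.random.rand() < 0.5:
                a_star = int(np.argmax(Q1[next_state]))                               # Q1 selects...
                target = reward + (0.0 if done else gamma * Q2[next_state][a_star])   # ...Q2 evaluates
                Q1[state][action] += alpha * (target - Q1[state][action])
            else:
                a_star = int(np.argmax(Q2[next_state]))                               # Q2 selects...
                target = reward + (0.0 if done else gamma * Q1[next_state][a_star])   # ...Q1 evaluates
                Q2[state][action] += alpha * (target - Q2[state][action])
            state = next_state
    return Q1, Q2
```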
Double Q-Learning
92
Example: Roulette
93
Example: Q-Learning vs. Double Q-Learning
94
Extra Reading Materials
96
Thanks & Q&A