Lecture 5: Model-Free Control
David Silver
Outline
1 Introduction
2 On-Policy Monte-Carlo Control
3 On-Policy Temporal-Difference Learning
4 Off-Policy Learning
5 Summary
Introduction
Last lecture:
Model-free prediction
Estimate the value function of an unknown MDP
This lecture:
Model-free control
Optimise the value function of an unknown MDP
On-policy learning
“Learn on the job”
Learn about policy π from experience sampled from π
Off-policy learning
“Look over someone’s shoulder”
Learn about policy π from experience sampled from µ
On-Policy Monte-Carlo Control
Generalised Policy Iteration
[Figure: generalised policy iteration — starting from Q, π, repeated policy evaluation Q = qπ and greedy policy improvement π = greedy(Q) converge to q*, π*]
ε-Greedy Exploration
With probability 1 − ε choose the greedy action; with probability ε choose an action uniformly at random over the m = |A| actions:
π(a|s) = ε/m + 1 − ε   if a = argmax_{a′∈A} Q(s, a′)
π(a|s) = ε/m           otherwise
Theorem
For any ε-greedy policy π, the ε-greedy policy π′ with respect to qπ is an improvement, vπ′(s) ≥ vπ(s).

qπ(s, π′(s)) = Σ_{a∈A} π′(a|s) qπ(s, a)
             = ε/m Σ_{a∈A} qπ(s, a) + (1 − ε) max_{a∈A} qπ(s, a)
             ≥ ε/m Σ_{a∈A} qπ(s, a) + (1 − ε) Σ_{a∈A} [(π(a|s) − ε/m) / (1 − ε)] qπ(s, a)
             = Σ_{a∈A} π(a|s) qπ(s, a) = vπ(s)

Therefore vπ′(s) ≥ vπ(s) by the policy improvement theorem.
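As a concrete illustration, here is a minimal sketch of ε-greedy action selection over a tabular action-value function; the dictionary layout Q[(state, action)] and the name epsilon_greedy are illustrative assumptions rather than anything from the lecture:

import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick an action epsilon-greedily from a tabular action-value function.

    Q: dict mapping (state, action) -> value estimate (illustrative layout)
    actions: list of the m available actions
    """
    if random.random() < epsilon:
        # explore: uniform random action, probability epsilon/m each
        return random.choice(actions)
    # exploit: greedy action, chosen with the remaining 1 - epsilon
    # (plus its epsilon/m share of the random choice)
    return max(actions, key=lambda a: Q[(state, a)])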
[Figure: generalised policy iteration with ε-greedy improvement — policy evaluation Q = qπ, policy improvement π = ε-greedy(Q), converging to q*, π*]
Monte-Carlo Control
[Figure: Monte-Carlo control — starting from Q, alternating Monte-Carlo policy evaluation and ε-greedy improvement π = ε-greedy(Q) converges to q*, π*]
Every episode:
Policy evaluation Monte-Carlo policy evaluation, Q ≈ qπ
Policy improvement ε-greedy policy improvement
GLIE
Definition
Greedy in the Limit with Infinite Exploration (GLIE)
All state-action pairs are explored infinitely many times,
lim_{k→∞} Nk(s, a) = ∞
The policy converges on a greedy policy,
lim_{k→∞} πk(a|s) = 1(a = argmax_{a′∈A} Qk(s, a′))
For example, ε-greedy is GLIE if ε reduces to zero at εk = 1/k
Theorem
GLIE Monte-Carlo control converges to the optimal action-value
function, Q(s, a) → q∗ (s, a)
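The GLIE Monte-Carlo control updates themselves are not reproduced above, so the following is a minimal tabular sketch under the usual choices (every-visit updates Q(St, At) ← Q(St, At) + (Gt − Q(St, At)) / N(St, At) and εk = 1/k); the env.reset()/env.step() interface returning (next_state, reward, done) and all identifiers are illustrative assumptions:

import random
from collections import defaultdict

def glie_mc_control(env, actions, num_episodes, gamma=1.0):
    """Sketch of GLIE Monte-Carlo control with epsilon_k = 1/k."""
    Q = defaultdict(float)   # Q[(s, a)] action-value estimates
    N = defaultdict(int)     # visit counts per (s, a)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k    # GLIE schedule: epsilon decays to zero

        # Sample the k-th episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Every-visit Monte-Carlo updates towards the return G_t
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]

    return Q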
Blackjack Example

On-Policy Temporal-Difference Learning
MC vs. TD Control
[Figure: Sarsa backup — from (S, A), observe reward R and next state S′, then sample next action A′]
Q(S, A) ← Q(S, A) + α (R + γ Q(S′, A′) − Q(S, A))
[Figure: Sarsa-based generalised policy iteration — starting from Q, every time-step alternates Sarsa evaluation Q ≈ qπ and ε-greedy improvement π = ε-greedy(Q), converging to q*, π*]
Every time-step:
Policy evaluation Sarsa, Q ≈ qπ
Policy improvement ε-greedy policy improvement
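A minimal sketch of this on-policy Sarsa control loop, acting ε-greedily and applying the Sarsa update every time-step; the env.reset()/env.step() interface returning (next_state, reward, done) is an illustrative assumption:

import random
from collections import defaultdict

def sarsa_control(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Sketch of on-policy Sarsa control with a fixed epsilon (a GLIE schedule
    such as epsilon_k = 1/k is needed for the convergence theorem)."""
    Q = defaultdict(float)

    def pick(state):
        # epsilon-greedy behaviour w.r.t. the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = pick(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = pick(next_state)
            # Sarsa update: bootstrap from the action actually taken next
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action

    return Q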
Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function,
Q(s, a) → q∗ (s, a), under the following conditions:
GLIE sequence of policies πt (a|s)
Robbins-Monro sequence of step-sizes αt
Σ_{t=1}^∞ αt = ∞
Σ_{t=1}^∞ αt² < ∞
n-Step Sarsa
Consider the following n-step returns for n = 1, 2, ..., ∞:
n = 1 (Sarsa)   qt(1) = Rt+1 + γ Q(St+1, At+1)
n = 2           qt(2) = Rt+1 + γ Rt+2 + γ² Q(St+2, At+2)
...
n = ∞ (MC)      qt(∞) = Rt+1 + γ Rt+2 + ... + γ^{T−1} RT
Define the n-step Q-return
qt(n) = Rt+1 + γ Rt+2 + ... + γ^{n−1} Rt+n + γ^n Q(St+n, At+n)
n-step Sarsa updates Q(s, a) towards the n-step Q-return,
Q(St, At) ← Q(St, At) + α (qt(n) − Q(St, At))
Forward-view Sarsa(λ)
The qtλ return combines all n-step Q-returns qt(n), using weight (1 − λ) λ^{n−1}:
qtλ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} qt(n)
Forward-view Sarsa(λ) updates Q(s, a) towards the λ-return:
Q(St, At) ← Q(St, At) + α (qtλ − Q(St, At))
Backward View Sarsa(λ)
Sarsa(λ) keeps one eligibility trace for each state-action pair:
E0(s, a) = 0
Et(s, a) = γλ Et−1(s, a) + 1(St = s, At = a)
Q(s, a) is updated for every state s and action a, in proportion to the TD-error δt and the eligibility trace Et(s, a):
δt = Rt+1 + γ Q(St+1, At+1) − Q(St, At)
Q(s, a) ← Q(s, a) + α δt Et(s, a)
Sarsa(λ) Algorithm
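The pseudocode figure for the Sarsa(λ) algorithm is not reproduced here; below is a minimal sketch of tabular backward-view Sarsa(λ) with accumulating traces, under the same hypothetical env.reset()/env.step() interface as the earlier sketches:

import random
from collections import defaultdict

def sarsa_lambda(env, actions, num_episodes, alpha=0.1, gamma=1.0,
                 lam=0.9, epsilon=0.1):
    """Sketch of backward-view Sarsa(lambda) with accumulating traces."""
    Q = defaultdict(float)

    def pick(state):
        # epsilon-greedy behaviour w.r.t. the current Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        E = defaultdict(float)          # eligibility traces, reset per episode
        state = env.reset()
        action = pick(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = pick(next_state)

            # TD-error: delta = R + gamma * Q(S', A') - Q(S, A)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            delta = target - Q[(state, action)]

            E[(state, action)] += 1.0   # accumulate trace for the visited pair

            # Update every (s, a) in proportion to its trace, then decay traces
            for sa in list(E):
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam

            state, action = next_state, next_action

    return Q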
Off-Policy Learning
Evaluate target policy π(a|s) to compute vπ(s) or qπ(s, a), while following behaviour policy µ(a|s):
{S1, A1, R2, ..., ST} ∼ µ
Importance Sampling
Use TD targets generated from µ to evaluate π: weight the TD target R + γV(S′) by a single importance sampling correction,
V(St) ← V(St) + α ( [π(At|St) / µ(At|St)] (Rt+1 + γ V(St+1)) − V(St) )
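A minimal sketch of a single off-policy TD(0) update with this importance-sampling correction; target_policy(a, s) and behaviour_policy(a, s) returning π(a|s) and µ(a|s) are illustrative assumptions:

def off_policy_td_update(V, s, a, r, s_next, target_policy, behaviour_policy,
                         alpha=0.1, gamma=1.0):
    """One TD(0) update of V towards the target policy pi, from a transition
    (s, a, r, s_next) generated by the behaviour policy mu."""
    rho = target_policy(a, s) / behaviour_policy(a, s)   # importance ratio pi/mu
    td_target = rho * (r + gamma * V.get(s_next, 0.0))   # corrected TD target
    V[s] = V.get(s, 0.0) + alpha * (td_target - V.get(s, 0.0))
    return V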
Q-Learning
We now consider off-policy learning of action-values Q(s, a). The next action actually taken is chosen from the behaviour policy, At+1 ∼ µ(·|St+1), but Q(St, At) is updated towards the value of an alternative successor action A′ ∼ π(·|St+1):
Q(St, At) ← Q(St, At) + α (Rt+1 + γ Q(St+1, A′) − Q(St, At))
In Q-learning control both behaviour and target policies improve. The target policy π is greedy with respect to Q(s, a), so the Q-learning target simplifies:
Rt+1 + γ Q(St+1, A′)
= Rt+1 + γ Q(St+1, argmax_{a′} Q(St+1, a′))
= Rt+1 + γ max_{a′} Q(St+1, a′)
[Figure: Q-learning backup — from (S, A), observe R and S′, then back up over the maximising successor action a′]
Q(S, A) ← Q(S, A) + α (R + γ max_{a′} Q(S′, a′) − Q(S, A))
Theorem
Q-learning control converges to the optimal action-value function,
Q(s, a) → q∗ (s, a)
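A minimal sketch of tabular Q-learning control with an ε-greedy behaviour policy and a greedy target policy, under the same hypothetical env interface as the earlier sketches:

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Sketch of off-policy Q-learning control: behaviour is epsilon-greedy,
    target is greedy w.r.t. the current Q."""
    Q = defaultdict(float)

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy action actually executed
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)

            # Target policy is greedy: back up over the maximising action
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state

    return Q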
Q-Learning Demo
[Table: relationship between DP (full backups) and TD (sample backups) — the Bellman expectation equation for vπ(s) gives iterative policy evaluation (DP) and TD learning (TD); for qπ(s, a) it gives Q-policy iteration (DP) and Sarsa (TD); the Bellman optimality equation for q*(s, a) gives Q-value iteration (DP) and Q-learning (TD)]
where x ←α y ≡ x ← x + α(y − x)
Summary
Questions?