FA II
Seungyul Han
UNIST
[email protected]
Spring 2024
[Diagram: GPI with function approximation, with parameter vector w: value estimate Q̂_w ≈ Q^*, policy ε-greedy(Q̂_w) ≈ π^*.]
⇒ Even without a convergence guarantee, we 1) still apply GPI, or 2) devise algorithms with convergence guarantees in restricted cases such as linear function approximation.
Δw = −(1/2)∇J(w_t) = (Q^π(s_t, a_t) − Q̂_{w_t}(s_t, a_t)) ∇Q̂_{w_t}(s_t, a_t)
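As a minimal illustration (not from the slides), this update for a linear Q̂_w(s, a) = x(s, a)^T w, paired with an ε-greedy behavior policy, might look as follows; the feature vector x_sa, the target estimate q_target, and the step size alpha are illustrative assumptions.

import numpy as np

def epsilon_greedy(q_values, eps=0.1, rng=np.random.default_rng()):
    # Behavior policy: with probability eps pick a random action, otherwise the greedy one.
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def semi_gradient_q_step(w, x_sa, q_target, alpha=0.1):
    # Delta w = alpha * (target - Q_hat_w(s,a)) * grad Q_hat_w(s,a);
    # for linear Q_hat_w(s,a) = x(s,a)^T w the gradient is simply x(s,a).
    # In GPI the unknown Q^pi(s_t, a_t) is replaced by a sampled target (MC return or TD target).
    return w + alpha * (q_target - x_sa @ w) * x_sa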
We set a′ = π′(s̃) for the state s̃. This step can be done by a′ = arg max_a Q^π(s̃, a).
For all other s (≠ s̃) ∈ S, we set π′(s) = π(s).
Then this π′ and π satisfy the condition of the policy improvement theorem.
Hence, π′ > π.
The Policy Improvement Theorem (PIT) Condition
Q-Function
[Figure: look-up table Q(s, a) with rows s^1, s^2, ..., s^{|S|} and columns a_1, a_2, ..., a_{|A|}; ex. tile coding.]
Function approximation inherently yields generalization.
With generalization, a Q-value update for one state s_t affects the Q-values of many other states. Hence, the condition for PIT is broken.
So, value-based GPI (on-policy/off-policy) with function approximation does not theoretically guarantee convergence to the optimal solution.
Note that MC uses an unbiased estimate G_t of V^π(s) in SGD prediction. In this case, SGD with the MC estimate converges.
There exist examples, such as Baird's example, of divergence of off-policy semi-gradient TD prediction.
We will show the convergence of on-policy linear semi-gradient TD prediction.
Convergence Issues in Control with Function Approximation
Directions:
Development of convergent algorithms under restrictions, e.g., linear function approximation.
⇒ On-policy linear TD prediction (convergence guaranteed)
⇒ On-policy/off-policy linear gradient TD learning (prediction, control) (convergence guaranteed)
⇒ Linear least-squares batch methods
Linear FA:
V̂_w(s) = Σ_{i=1}^{d} w_i x_i(s) = x(s)^T w,
where x_t = x(s_t).
w_{t+1} = (I − αC) w_t + αb, where A := I − αC.
If the square matrix A has eigenvalues whose magnitudes are all less than one, this state-space system converges. It is shown that A has eigenvalues whose magnitudes are less than one (Sutton, 1988).
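A quick numerical check of this eigenvalue condition (a sketch, not from the slides). The specific forms C = X^T D (I − γP) X and b = X^T D r̄ are the standard expected-update quantities for on-policy linear TD(0) and are an assumption here, since the slide only names C and b.

import numpy as np

gamma, alpha = 0.9, 0.1
P = np.array([[0.1, 0.9, 0.0],      # on-policy transition matrix of a small 3-state chain
              [0.0, 0.1, 0.9],
              [0.9, 0.0, 0.1]])
r_bar = np.array([0.0, 0.0, 1.0])   # expected reward per state
X = np.array([[1.0, 0.0],           # feature matrix (|S| x d)
              [1.0, 1.0],
              [0.0, 1.0]])

# Stationary distribution mu of P (left eigenvector of eigenvalue 1), D = diag(mu).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
D = np.diag(mu)

C = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r_bar
A = np.eye(2) - alpha * C            # expected-update matrix: w_{t+1} = A w_t + alpha*b

print("eigenvalue magnitudes of A:", np.abs(np.linalg.eigvals(A)))   # all below 1 for small alpha
print("linear TD fixed point w* = C^{-1} b:", np.linalg.solve(C, b))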
At Convergence: Linear TD Fixed-Point
Q(s_t, a_t) ← Q(s_t, a_t) + α ( G_t · Π_{i=t+1}^{T−1} [π(a_i|s_i) / β(a_i|s_i)] − Q(s_t, a_t) )
[Diagram: s_t → a_t ∼ β → s_{t+1}]
Weight the TD target r_{t+1} + γV(s_{t+1}) by importance sampling:
V(s_t) ← V(s_t) + α ( [π(a_t|s_t) / β(a_t|s_t)] (r_{t+1} + γV(s_{t+1})) − V(s_t) )
Note that s_{t+1} ∼ P(·|s_t, a_t) and r_{t+1} = R(s_t, a_t, s_{t+1}), and both are functions of a_t given s_t.
Action a_t is drawn from the behavior policy β.
The current value function estimate V(·) is an external function (look-up table) not depending on the current action (it is a result of all previous states and actions).
Only a single importance sampling correction is needed (see the sketch below).
Lower variance than MC importance sampling.
Policies π and β need to be similar over a single step only.
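A minimal tabular sketch of this update (assumed interface, not from the slides: pi(a, s) and beta(a, s) return action probabilities; alpha and gamma are illustrative):

import numpy as np

def off_policy_td0_update(V, s, a, r, s_next, pi, beta, alpha=0.1, gamma=0.99):
    # One-step importance sampling ratio for the single action taken.
    rho = pi(a, s) / beta(a, s)
    # Importance-weight the TD target r + gamma * V(s') as on the slide.
    target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V

# Example usage (pi, beta, n_states defined elsewhere):
# V = np.zeros(n_states); V = off_policy_td0_update(V, s=2, a=0, r=1.0, s_next=3, pi=pi, beta=beta)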
Off-Policy Prediction and Control with Function Approximation
Off-Policy TD Update:
Ordinary Importance Sampling (OIS):
V(s_t) ← V(s_t) + α ( [π(a_t|s_t) / β(a_t|s_t)] (r_{t+1} + γV(s_{t+1})) − V(s_t) )
Weighted Importance Sampling (WIS), with the ratio multiplying the whole TD error:
V(s_t) ← V(s_t) + α [π(a_t|s_t) / β(a_t|s_t)] ( r_{t+1} + γV(s_{t+1}) − V(s_t) )
This update formula has the advantage that the value is not updated when π(a_t|s_t) is zero for a sample action a_t generated from β(a_t|s_t) > 0. WIS is widely used.
In practice, when the ISR π(a_t|s_t)/β(a_t|s_t) is away from 1, the behavior and target probabilities of a_t given s_t are very different. So, we clip the ISR from below and from above when it is away from 1 (a sketch follows).
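A small sketch of such clipping; the thresholds 0.8 and 1.25 are illustrative choices, not values from the slide.

def clipped_isr(pi_prob, beta_prob, c_low=0.8, c_high=1.25):
    # Clip the importance sampling ratio pi(a|s)/beta(a|s) into [c_low, c_high].
    rho = pi_prob / beta_prob
    return min(max(rho, c_low), c_high)

def clipped_td0_update(V, s, r, s_next, pi_prob, beta_prob, alpha=0.1, gamma=0.99):
    # Clipped ratio multiplying the whole TD error (the weighted form above).
    rho = clipped_isr(pi_prob, beta_prob)
    V[s] += alpha * rho * (r + gamma * V[s_next] - V[s])
    return V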
[Diagram: s_t → a_t → s_{t+1} → a_{t+1}, with a_{t+1} given by the target policy π]
Note that the importance sampling ratio does not appear in the one-step TD Q update, even though a_t from s_t was drawn by the behavior policy β: the update conditions on (s_t, a_t), and the next action is evaluated under π.
Q(s_t, a_t) ← Q(s_t, a_t) + α ( r_{t+1} + γ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a) − Q(s_t, a_t) )
w_{t+1} = w_t + α ( r_{t+1} + γ Σ_a π(a|s_{t+1}) Q̂_{w_t}(s_{t+1}, a) − Q̂_{w_t}(s_t, a_t) ) ∇Q̂_{w_t}(s_t, a_t)
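A linear-FA sketch of the second update (assumed interface, not from the slides: x(s, a) returns the feature vector and pi(a, s) the target-policy probability; alpha and gamma are illustrative):

import numpy as np

def expected_sarsa_semi_gradient_step(w, x, pi, actions, s, a, r, s_next,
                                      alpha=0.1, gamma=0.99):
    # Q_hat_w(s, a) = x(s, a)^T w, so grad Q_hat_w(s, a) = x(s, a).
    q_sa = x(s, a) @ w
    # Expectation of Q over the target policy at s_{t+1}; no importance sampling ratio is needed.
    exp_q_next = sum(pi(a2, s_next) * (x(s_next, a2) @ w) for a2 in actions)
    td_error = r + gamma * exp_q_next - q_sa
    return w + alpha * td_error * x(s, a)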
Function approximation
Bootstrapping
Off-policy learning
[Figure: Baird's counterexample: seven states with linear state values such as w_7 + 2w_8; π(solid|·) = 1, β(solid|·) = 1/7, β(dashed|·) = 6/7, γ = 0.99.]
Baird’s Counterexample
[Figure repeated, together with a matrix of transition/behavior probabilities whose entries are 1/7 and 6/7.]
Baird’s Counterexample
         w1   w2   w3   w4   w5   w6   w7   w8
V(1)      2                                  1
V(2)           2                             1
V(3)                2                        1
V(4)                     2                   1
V(5)                          2              1
V(6)                               2         1
V(7)                                    1    2
w_sol,1   0    0    0    0    0    0    0    0
w_sol,2  1/2  1/2  1/2  1/2  1/2  1/2   2   −1

Note that the vector w_sol,2 is a null vector of the above coefficient matrix [∂V(i)/∂w_j].
In fact, any vector in the null space is a linear weight solution to V(s) ≡ 0.
So, there exist infinitely many weight solutions to this problem.
Baird’s Counterexample
[Figure: transition from state s^i (value 2w_i + w_8) to state s^7 (value w_7 + 2w_8).]
Consider the transition from s^i to s^7, i = 1, ..., 6.
Linear approximation for V^π:
         w_i   w_7   w_8
V(s^i)    2           1
V(s^7)          1     2
Off-policy semi-gradient TD update:
w_{t+1} = w_t + α [π(a_t|s_t)/β(a_t|s_t)] ( r_{t+1} + γ V̂_{w_t}(s_{t+1} = s^7) − V̂_{w_t}(s_t = s^i) ) ∇V̂_{w_t}(s_t = s^i)
Component-wise (with r_{t+1} = 0 and ratio 1/(1/7) = 7):
[w_i, w_7, w_8]_{t+1} = [w_i, w_7, w_8]_t + 7α ( γ(w_7 + 2w_8) − (2w_i + w_8) ) [2, 0, 1]^T
The update of w_i occurs once every 6 such transitions on average, but w_8 is updated at every transition (w_7 does not change). So, if we start with a large w_7 and all small w_1 = ... = w_6 = w_8, then w_8 keeps growing, and each w_i grows too because of the growing w_8 (which grows 6 times faster than w_1, ..., w_6).
Baird’s Counterexample
[Plot: "The Value of Weights", w_1, ..., w_8 versus update steps, growing to the order of 10^6 (divergence).]
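A short simulation sketch that reproduces this behavior (not from the slides; the initial weights, step size, and number of steps are illustrative, while the features, policies, and γ follow the figure above):

import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.99, 0.01
n_states, d = 7, 8

# Features: V(s^i) = 2*w_i + w_8 for i = 1..6, V(s^7) = w_7 + 2*w_8.
X = np.zeros((n_states, d))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # large w7, small remaining weights
s = int(rng.integers(n_states))

for t in range(1000):
    solid = rng.random() < 1.0 / 7.0               # behavior policy beta
    s_next = 6 if solid else int(rng.integers(6))  # solid -> s^7, dashed -> uniform over s^1..s^6
    rho = 7.0 if solid else 0.0                    # pi(solid|s) = 1, so rho = 1/(1/7) or 0/(6/7)
    delta = gamma * X[s_next] @ w - X[s] @ w       # all rewards are 0
    w += alpha * rho * delta * X[s]                # off-policy semi-gradient TD(0)
    s = s_next

print(np.round(w, 1))  # the weight vector keeps growing, as in the plot above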
Note that MC uses an unbiased estimate G_t of V^π(s) in SGD prediction. In this case, SGD with the MC estimate converges.
We have shown the convergence of on-policy linear semi-gradient TD prediction.
There exist examples, such as Baird's example, of divergence of off-policy semi-gradient TD prediction.
Control:
Algorithm     Tabular   Linear FA   Nonlinear FA
MC            O         △           X
SARSA         O         △           X
Q-learning    O         X           X
Recall
Mean Square Value Error:
[Figure: V^π(s) and V̂_w(s) plotted over s ∈ S.]
J(w) = MSVE(w) = E_π[ (V^π(s) − V̂_w(s))^2 ] = Σ_{s∈S} µ^π(s) (V^π(s) − V̂_w(s))^2,
where µ^π(s) is the on-policy distribution.
We know that under MSVE, even linear semi-gradient TD prediction can diverge in the off-policy case (Baird's example).
[Figure: geometry of value-function space in the plane of (w_1, w_2): V^π, its projection ΠV^π, the estimate V̂_w, the backup T^π V̂_w, and its projection ΠT^π V̂_w; BE denotes the Bellman error and PBE the projected Bellman error.]
Projection: Π = X (X^T D X)^{-1} X^T D.
δ̄_w(s) = E[ r_{t+1} + γ V̂_w(s_{t+1}) − V̂_w(s_t) | s_t = s ] = expected TD error
Mean Square Bellman Error (Baird et al., 1995):
MSBE(w) = Σ_s µ(s) |δ̄_w(s)|^2
w_{t+1} = w_t + α ( r_t + γV_{w_t}(s_{t+1}) − V_{w_t}(s_t) ) ∇( r_t + γV_{w_t}(s_{t+1}) − V_{w_t}(s_t) )
This stochastic approximation to the true gradient is not good, since the same sample s_{t+1} appears in both factors and they are correlated. (This will be explained later.)
Double Sampling:
w_{t+1} = w_t + α ( r_t + γV_{w_t}(s′_{t+1}) − V_{w_t}(s_t) ) ∇( r_t + γV_{w_t}(s″_{t+1}) − V_{w_t}(s_t) )
However, this algorithm minimizing the Bellman error is not a stable algorithm. Instead, we minimize the projected Bellman error to obtain a stable algorithm.
Gradient of MSPBE:
∇MSPBE(w) = 2 [ ∇(X^T D δ̄_w)^T (X^T D X)^{-1} (X^T D δ̄_w) ]
where
X^T = [ x(s^1), ..., x(s^{|S|}) ] (a d × |S| matrix with (i, j) entry x_i(s^j)),
D = diag( µ(s^1), ..., µ(s^{|S|}) ), and δ̄_w = [ δ̄_w(s^1), ..., δ̄_w(s^{|S|}) ]^T.
X^T D δ̄_w = Σ_s µ(s) x(s) δ̄_w(s) = E_β[ ρ_t δ_t x_t ],  where x_t = x(s_t).
To estimate the term (X^T D X)^{-1} (X^T D δ̄_w) with a secondary weight vector v_t, set up a least-squares problem with target d_t:
y_t = x_t^T v_t
e_t = d_t − x_t^T v_t
E[e_t^2] = E[ d_t^2 − 2 d_t x_t^T v_t + v_t^T x_t x_t^T v_t ]
v_t^⋆ = E[ x_t x_t^T ]^{-1} E[ d_t x_t ]
v_{t+1} = v_t + β ( d_t − v_t^T x_t ) x_t
Gradient TD:
w_{t+1} = w_t + α ρ_t ( x_t − γ x_{t+1} ) x_t^T v_t
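A compact sketch of one such gradient-TD step. One assumption not spelled out on the slide: the LMS target is taken as d_t = ρ_t δ_t, so that v estimates (X^T D X)^{-1} X^T D δ̄_w; the step sizes are illustrative.

import numpy as np

def gradient_td_step(w, v, x, x_next, r, rho, alpha=0.01, beta=0.05, gamma=0.99):
    delta = r + gamma * x_next @ w - x @ w                # TD error under the current w
    d = rho * delta                                       # assumed LMS target d_t = rho_t * delta_t
    v = v + beta * (d - v @ x) * x                        # secondary-weight LMS update
    w = w + alpha * rho * (x - gamma * x_next) * (x @ v)  # gradient-TD update for w
    return w, v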
J(w) = E_D[ (V^π(s) − V̂_w(s))^2 ]  or  E_D[ (Q^π(s, a) − Q̂_w(s, a))^2 ]
     = Σ_{t=1}^{T} ( V^π(s_t) − V̂_w(s_t) )^2,    (4)
J(w) = Σ_{t=1}^{T} ( U(s_t) − V̂_w(s_t) )^2,  where U(s_t) is the target.
     = || u − X w ||^2,  where
u = [ U(s_1), ..., U(s_T) ]^T,  X is the T × d matrix whose t-th row is [ x_1(s_t), ..., x_d(s_t) ],  and w = [ w_1, ..., w_d ]^T.
Normal Equation:
[Figure: projection of u onto span{x_1, x_2}: Πu = w_1^* x_1 + w_2^* x_2.]
x_i^T ( u − X w_LS^* ) = 0,  i = 1, ..., d   ⟺   X^T ( u − X w^* ) = 0
Setting Target:
LSMC:     U(s_t) = G_t
LSTD:     U(s_t) = r_{t+1} + γ V̂_w(s_{t+1})   (Bradtke and Barto, 1996)
LSTD(λ):  U(s_t) = G_t^λ
Normal Equation:
X^T ( u − X w_LS^* ) = 0
Linear LS Prediction:
MC:  X^T ( g − X w_LS^* ) = 0   ⟺   w_LS^* = (X^T X)^{-1} X^T g
TD:
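A sketch of the MC case above in code (the TD case, whose normal equation is not shown here, is omitted); lstsq is used instead of an explicit inverse purely for numerical stability, and the example data are illustrative.

import numpy as np

def ls_mc_prediction(X, g):
    # Solve X^T (g - X w) = 0, i.e., w_LS = (X^T X)^{-1} X^T g.
    w, *_ = np.linalg.lstsq(X, g, rcond=None)
    return w

# Example: T = 3 samples, d = 2 features; rows of X are x(s_t)^T, g holds the returns G_t.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
g = np.array([1.0, 2.0, 1.5])
print(ls_mc_prediction(X, g))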
LS loss over D:
J(w) = E_D[ (Q^π(s, a) − Q̂_w(s, a))^2 ] = Σ_{t=1}^{T} ( Q^π(s_t, a_t) − Q̂_w(s_t, a_t) )^2,    (5)
where χ(s_t, a_t) = [ x_1(s_t, a_t), ..., x_d(s_t, a_t) ]^T is the t-th column of X^T.
function LSPI-TD(D, π_0)
    π′ ← π_0
    repeat
        π ← π′
        Q_w ← LSTDQ(π, D)                        (policy evaluation)
        for all s ∈ S do
            π′(s) ← arg max_{a∈A} Q_w(s, a)      (policy improvement)
        end for
    until (π ≈ π′)
    return π
end function
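A Python sketch of this loop. The LSTDQ construction below follows Lagoudakis & Parr (2003) and is an assumption, since the pseudocode only calls LSTDQ(π, D); phi(s, a), the action set, and the regularizer are illustrative.

import numpy as np

def lstdq(samples, phi, policy, d, gamma=0.99, reg=1e-6):
    # Fit Q_w(s, a) = phi(s, a)^T w for the given policy from the batch D of (s, a, r, s') samples.
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, a, r, s_next in samples:
        x = phi(s, a)
        x_next = phi(s_next, policy(s_next))
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)

def lspi_td(samples, phi, actions, d, gamma=0.99, n_iters=20):
    w = np.zeros(d)
    greedy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
    for _ in range(n_iters):
        w_new = lstdq(samples, phi, greedy, d, gamma)                           # policy evaluation
        greedy = lambda s, w=w_new: max(actions, key=lambda a: phi(s, a) @ w)   # policy improvement
        if np.allclose(w, w_new, atol=1e-4):                                    # pi ≈ pi' (via weights)
            w = w_new
            break
        w = w_new
    return greedy, w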
RL + ANN:
Tesauro, 1995 (TD-Gammon)
Riedmiller, 2005 (NFQ); Hafner & Riedmiller, 2011 (NFQCA)
Deep Learning:
Krizhevsky, Sutskever, and Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", 2012
[Figure: DQN network architecture: 4×84×84 input, 16 8×8 convolutional filters, a fully-connected layer of rectified linear units.]
[Figure: Q̂_w(s, a) approximating Q^π(s, a) over (s, a) ∈ S × A.]
Goal: find parameter vector w minimizing the MSVE loss:
J(w) = E_π[ (Q^π(s, a) − Q̂_w(s, a))^2 ] = Σ_{s,a} µ^π(s, a) ( Q^π(s, a) − Q̂_w(s, a) )^2
Gradient descent:
w_{t+1} = w_t − (1/2) α ∇_w J(w)|_{w=w_t}
        = w_t + α Σ_{s,a} µ^π(s, a) ( Q^π(s, a) − Q̂_{w_t}(s, a) ) ∇_w Q̂_w(s, a)|_{w=w_t}
Batch Methods
Experience buffer D with size B:  e_t = (s_t, a_t, r_{t+1}, s_{t+1}),  D = [e_1, e_2, e_3, ...]
Mini-batch:  M = [e_1^m, e_2^m, ..., e_M^m] ∼ U(D),  with targets U = [U_1^m, U_2^m, ..., U_M^m]
Mini-batch SGD update:
w ← w + α Σ_{i=1}^{M} [ U_i^m(s_i, a_i) − Q_w(s_i, a_i) ] ∇Q_w(s_i, a_i),  where U_i^m(s_i, a_i) is the target.
To handle the instability of off-policy Q-learning with non-linear FA, DQN uses two techniques: experience replay and a target network.
1. Take action a_t using the ε-greedy policy w.r.t. the current Q function (i.e., the behavior policy).
2. Store the sample (s_t, a_t, r_{t+1}, s_{t+1}) in the replay buffer D.
3. Sample a random mini-batch M_i = {(s, a, r, s′)} from D.
4. Compute the Q-learning targets using the target network, i.e., a Q-network with the same structure but target parameter w− (updated every C steps and held fixed in between).
5. Update the network parameter w to minimize the LS loss by SGD (a sketch follows):
L_i(w_i) = E_{s,a,r,s′ ∼ M_i}[ ( r + γ max_{a′} Q_{w−}(s′, a′) − Q_w(s, a) )^2 ],
where r + γ max_{a′} Q_{w−}(s′, a′) is the target.
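A minimal PyTorch sketch of one such update (not the original DQN code): the network is a small MLP rather than the convolutional architecture above, terminal-state handling is omitted, and all sizes and hyperparameters are illustrative.

import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, batch_size = 4, 2, 0.99, 32
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())           # target parameters w^-
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                            # replay buffer D
# Step 2 happens elsewhere: replay.append((s_t, a_t, r_t1, s_t1)) with torch tensors for the states.

def act(state, eps=0.1):
    # Step 1: epsilon-greedy behavior policy w.r.t. the current Q-network.
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state).argmax().item())

def dqn_update():
    # Steps 3-5: sample a mini-batch, build targets with the target network, take an SGD step.
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([b[3] for b in batch])
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every C steps: target_net.load_state_dict(q_net.state_dict())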
Improvement of DQN
References