
AI512/EE633: Reinforcement Learning

Lecture 7 - Value-Based Control with Function Approximation

Seungyul Han

UNIST
[email protected]

Spring 2024



Contents

1 On-Policy Control with Function Approximation

2 Convergence Issues in Control with Function Approximation

3 Off-Policy Prediction and Control with Function Approximation

4 (Linear) Gradient TD (gTD) Learning

5 (Linear) Least Squares Batch Methods (LSTD/LSPI)

6 Deep Q Network (DQN)

*In Lecture 6, we covered on-policy prediction with function approximation.



On-Policy Control with Function Approximation


[Figure: GPI loop with function approximation: the approximate value Q̂_w (parameter w) and the ǫ-greedy(Q̂_w) policy alternately update each other, aiming at Q* and π*; does a fixed point of this loop exist?]

Policy Evaluation: Approximate Qπ with Q̂w


For the given current policy π_t (represented by the current parameter w_t of Q̂_{w_t}), evaluate the action-value function to obtain Q̂_{w_{t+1}}.
Does this prediction step converge to the true Q-function of π_t under function approximation?
Policy improvement: π_{t+1} = ǫ-greedy(Q̂_{w_{t+1}})
Does the policy improvement theorem still hold under function approximation?

⇒ Even without a convergence guarantee, we either 1) still apply GPI, or 2) devise algorithms with convergence guarantees in restricted cases such as linear function approximation.



On-Policy Control with Function Approximation

Action-Value Function Approximation for Qπ (s, a)

Approximate the action-value function:

Qπ (s, a) ≈ Q̂w (s, a)

Loss: Mean-squared value error (MSVE)

J(w) = E_π[(Q^π(s,a) − Q̂_w(s,a))²] = Σ_{s,a} μ^π(s,a) (Q^π(s,a) − Q̂_w(s,a))²

−(1/2) ∇J(w) = Σ_{s,a} μ^π(s,a) (Q^π(s,a) − Q̂_w(s,a)) ∇Q̂_w(s,a)

Apply stochastic gradient descent or semi-gradient descent by sampling


(st , at ) ∼ π:

Δw = −(1/2) ∇̂J(w) = (Q^π(s_t,a_t) − Q̂_{w_t}(s_t,a_t)) ∇Q̂_{w_t}(s_t,a_t),

where ∇̂J(w) denotes the single-sample (stochastic) estimate of the gradient.



On-Policy Control with Function Approximation

Stochastic (Semi)-Gradient Descent

For the unknown true Qπ (st , at ), use a target:


For gradient MC, the target = the return Gt

Δw = α (G_t − Q̂_{w_t}(s_t,a_t)) ∇Q̂_{w_t}(s_t,a_t)

For semi-gradient TD, the target = the TD target


r_{t+1} + γ Q̂_{w_t}(s_{t+1}, a_{t+1}), where a_{t+1} is sampled from the current policy ǫ-greedy(Q̂_{w_t}(s_{t+1}, ·)) (on-policy control)

Δw = α (r_{t+1} + γ Q̂_{w_t}(s_{t+1}, a_{t+1}) − Q̂_{w_t}(s_t, a_t)) ∇Q̂_{w_t}(s_t, a_t)



On-Policy Control with Function Approximation

On-Policy TD Control: Semi-Gradient SARSA

Input: a differentiable action-value function Q̂ : S × A × R^d → R, step size α

Initialize w ∈ R^d

Loop for each episode:
    Initialize (s, a)
    Loop for each time step of the episode:
        Take action a, observe r, s′
        If s′ is not terminal:
            Choose action a′ from ǫ-greedy(Q̂_w(s′, ·))
            w ← w + α (r + γ Q̂_w(s′, a′) − Q̂_w(s, a)) ∇Q̂_w(s, a)
            (s, a) ← (s′, a′)
        Else (s′ is terminal):
            w ← w + α (r − Q̂_w(s, a)) ∇Q̂_w(s, a)
            Go to the next episode
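As a concrete illustration, here is a minimal sketch of this algorithm with a linear approximator Q̂_w(s,a) = wᵀx(s,a) and one-hot features; the toy chain environment, feature map, and hyperparameters are illustrative assumptions and not part of the lecture.

import numpy as np

# Toy episodic chain: states 0..4, actions 0 (left) / 1 (right); reaching state 4 gives reward +1.
n_states, n_actions, gamma, alpha, eps = 5, 2, 0.99, 0.1, 0.1
d = n_states * n_actions

def x(s, a):                      # one-hot feature vector for (s, a)
    phi = np.zeros(d)
    phi[s * n_actions + a] = 1.0
    return phi

def step(s, a):                   # returns (reward, next_state, done)
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == n_states - 1 else 0.0), s2, s2 == n_states - 1

def eps_greedy(w, s, rng):
    if rng.random() < eps:
        return rng.integers(n_actions)
    return int(np.argmax([w @ x(s, a) for a in range(n_actions)]))

rng = np.random.default_rng(0)
w = np.zeros(d)
for episode in range(500):
    s = 0
    a = eps_greedy(w, s, rng)
    while True:
        r, s2, done = step(s, a)
        if done:
            w += alpha * (r - w @ x(s, a)) * x(s, a)           # terminal update
            break
        a2 = eps_greedy(w, s2, rng)                            # on-policy: a' from eps-greedy(Q̂_w(s',.))
        td_target = r + gamma * (w @ x(s2, a2))
        w += alpha * (td_target - w @ x(s, a)) * x(s, a)       # semi-gradient SARSA update
        s, a = s2, a2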





Convergence Issues in Control with Function Approximation

Convergence of Control with Function Approximation


Recall Control in the Tabular Case: Policy Improvement Step
[Backup diagrams: V^π(s̃) (from s̃, follow π) versus Q^π(s̃, a′ = π′(s̃)) (from s̃, take a′ and follow π thereafter).]
For some s̃ ∈ S such that

Q^π(s̃, π′(s̃)) > V^π(s̃), with π′(s̃) ≠ π(s̃),

we set a′ = π′(s̃) for s̃. This step can be done by a′ = arg max_a Q^π(s̃, a).
For all other s (≠ s̃) ∈ S, we set π′(s) = π(s).
Then, this π′ and π satisfy the condition of the policy improvement theorem.
Hence, π′ > π.
The Policy Improvement Theorem (PIT) Condition

Q^π(s, π′(·|s)) ≥ V^π(s) = Q^π(s, π(·|s)), ∀s ∈ S.



Convergence Issues in Control with Function Approximation

Convergence of Control with Function Approximation


Sequential Implementation of the Idea on the Previous Page: On-Policy Case

[Diagram: from s_1, take a_1 ∼ greedy π_1 and receive r_1 (policy update π_0 → π_1, then value update Q^(0) → Q^(1) at (s_1, a_1)); from s_2, take a_2 ∼ greedy π_2 and receive r_2 (π_1 → π_2, Q^(1) → Q^(2) at (s_2, a_2)); then s_3, and so on.]

Policy update (PU): π_{t−1} → π_t = greedy(Q^{(t−1)}) at the visited state only:

π_t(s) = arg max_a Q^{(t−1)}(s_t, a) if s = s_t,  and  π_t(s) = π_{t−1}(s) if s ≠ s_t.

Value update (VU): Q^{(t−1)} → Q^{(t)} (Bellman backup)

[Figure: Q-table with rows s^1, ..., s^{|S|} and columns a^1, ..., a^{|A|}; only the row for s_t is highlighted.]

Note that from π_{t−1} to π_t, only the action at s_t changes by the PU, and the actions at all other states remain the same so as to satisfy the PIT condition. This means that in value-based tabular methods, only the row corresponding to s_t in the Q-table is updated and all other rows should not be changed.
Convergence Issues in Control with Function Approximation

Convergence of Control with Function Approximation


Control with Function Approximation: Impact of Generalization
[Figure: a large state space S covered by tile coding; a Q-function table with rows s^1, ..., s^{|S|} and columns a^1, ..., a^{|A|}, where an update at s_t spills over to many other rows.]

Function approximation inherently yields generalization (e.g., tile coding).
With generalization, a Q-value update for one state s_t affects the Q-values of many other states.
Hence, the condition for the PIT is broken.
So, value-based GPI (on-policy or off-policy) with function approximation does not theoretically guarantee convergence to an optimal solution.



Convergence Issues in Control with Function Approximation

Convergence of Methods with Func. Approx.


Control:
Algorithm Tabular Linear FA Nonlinear FA
MC O △ X
SARSA O △ X
Q-learning O X X

△ = chatters around a near-optimal value function


Prediction:
On/Off-Policy   Algorithm   Tabular   Linear FA   Nonlinear FA
On-Policy       MC          O         O           O
                TD          O         O           X
Off-Policy      MC          O         O           O
                TD          O         X           X

Note that MC uses an unbiased estimate Gt for V π (s) in SGD prediction. In this case,
SGD with MC estimate converges.
There exist examples such as Baird’s example for divergence of off-policy semi-gradient
TD prediction.
We will show the convergence of linear semi-gradient TD prediction.
Convergence Issues in Control with Function Approximation

Control with Function Approximation: Considerations

Two Sides of the Generalization of Function Approximation (FA)


Useful to evaluate the values of unseen states for large MDPs
Destroys the PIT condition for GPI in value-based methods

Directions:
Development of converging algorithms under restrictions, e.g., linear function
approximation.
⇒ On-policy linear TD prediction (convergence guaranteed)
⇒ On-policy/Off-policy linear gradient TD learning (prediction, control)
(convergence guaranteed)
⇒ Linear least squares batch method



Convergence Issues in Control with Function Approximation

On-Policy Linear Semi-gradient TD Prediction


In Baird’s Counterexample, we will see that off-policy semi-gradient TD prediction with
linear FA diverges. However, we can show that on-policy semi-gradient TD prediction
can converge with linear FA.

Linear FA:
V̂_w(s) = Σ_{i=1}^{d} w_i x_i(s) = x(s)^T w,

where x(s) = [x_1(s), ..., x_d(s)]^T.

Linear Semi-Gradient TD Prediction:

∇V̂_w(s) = x(s)

w_{t+1} = w_t + α [U_t − V̂_{w_t}(s_t)] ∇V̂_{w_t}(s_t)
w_{t+1} = w_t + α [r_{t+1} + γ x(s_{t+1})^T w_t − x(s_t)^T w_t] x(s_t)
        = w_t + α [r_{t+1} x_t − x_t (x_t − γ x_{t+1})^T w_t],

where x_t = x(s_t).



Convergence Issues in Control with Function Approximation

On-Policy Linear Semi-gradient TD Prediction


Average System Behavior:

w_{t+1} = w_t + α [r_{t+1} x_t − x_t (x_t − γ x_{t+1})^T w_t]

E[w_{t+1} | w_t] = w_t + α ( E[r_{t+1} x_t] − E[x_t (x_t − γ x_{t+1})^T] w_t ),  with b := E[r_{t+1} x_t] and C := E[x_t (x_t − γ x_{t+1})^T]
                = (I − αC) w_t + αb,  with A := I − αC
                = A w_t + αb   (a typical state-space equation in linear systems)

If the square matrix A has eigenvalues whose magnitudes are all less than one, this state-space system converges. It is shown that, for an appropriately small step size α and the on-policy distribution, A has eigenvalues whose magnitudes are less than one (Sutton, 1988).
At Convergence: Linear TD Fixed-Point

E[w_{t+1} | w_t] = (I − αC) w_t + αb = w_t

w_TD = C^{−1} b
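A small numerical sketch of this fixed point: estimate b = E[r_{t+1}x_t] and C = E[x_t(x_t − γx_{t+1})^T] from on-policy samples and solve w_TD = C^{−1}b. The 3-state Markov reward process, its rewards, and the random 2-dimensional features below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.1, 0.6, 0.3],    # assumed 3-state on-policy transition matrix
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])
R = np.array([0.0, 1.0, -0.5])    # assumed expected reward when leaving each state
X = rng.normal(size=(3, 2))       # feature vectors x(s) as rows (d = 2)

b, C, s = np.zeros(2), np.zeros((2, 2)), 0
n = 200_000
for t in range(n):                # sample transitions and accumulate b and C
    s2 = rng.choice(3, p=P[s])
    r = R[s]
    b += r * X[s]
    C += np.outer(X[s], X[s] - gamma * X[s2])
    s = s2
b /= n
C /= n
w_td = np.linalg.solve(C, b)      # the linear TD fixed point w_TD = C^{-1} b
print("w_TD =", w_td)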





Off-Policy Prediction and Control with Function Approximation

Recall Off-Policy MC Control with Importance Sampling


For Q^π(s_t, a_t), the action a_t is given in addition to the given s_t.
Trajectory: s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, ..., s_T, with a_i ∼ β(a_i|s_i) for i ≥ t+1 (behavior policy) and s_{i+1} ∼ P(s_{i+1}|s_i, a_i).
Return: G_t(τ_t) = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· + γ^{T−t−1} r_T, where τ_t = (s_t, a_t, s_{t+1}, a_{t+1}, ..., s_T).
Importance Sampling Technique

Q^π(s, a) = E_π[G_t(τ_t) | s_t = s, a_t = a] = ∫ G_t(τ_t) P_π(τ_t) dτ_t = ∫ G_t(τ_t) (P_π(τ_t)/P_β(τ_t)) P_β(τ_t) dτ_t
 ≈ SampleMean( G_t(τ_t) P_π(τ_t)/P_β(τ_t) ),  where τ_t ∼ P_β(τ_t)
 = SampleMean( G_t(τ_t) · Π_{i=t+1}^{T−1} π(a_i|s_i) / Π_{i=t+1}^{T−1} β(a_i|s_i) ),  where τ_t ∼ P_β(τ_t)   (1)

Q(s_t, a_t) ← Q(s_t, a_t) + α ( G_t · Π_{i=t+1}^{T−1} π(a_i|s_i) / Π_{i=t+1}^{T−1} β(a_i|s_i) − Q(s_t, a_t) )



Off-Policy Prediction and Control with Function Approximation

Off-Policy TD Prediction (V π (s)): Importance Sampling


Behavior policy: β, target policy: π
Transition: s_t, a_t ∼ β → r_{t+1}, s_{t+1}
Weight the TD target r_{t+1} + γV(s_{t+1}) by importance sampling.
Note that s_{t+1} ∼ P(·|s_t, a_t) and r_{t+1} = R(s_t, a_t, s_{t+1}); both are functions of a_t given s_t.
The action a_t is drawn from the behavior policy β.
The current value function estimate V(·) is an external function (a look-up table) that does not depend on the current action (it is a result of all previous states and actions).
Hence, only a single importance sampling correction is needed:

V(s_t) ← V(s_t) + α ( (π(a_t|s_t)/β(a_t|s_t)) (r_{t+1} + γV(s_{t+1})) − V(s_t) )
Lower variance than MC importance sampling
Policies π and β need to be similar over a single step only
Off-Policy Prediction and Control with Function Approximation

Off-Policy TD Prediction (V π (s)): Importance Sampling

Off-Policy TD Update:
Ordinary Importance Sampling (OIS):

V(s_t) ← V(s_t) + α ( (π(a_t|s_t)/β(a_t|s_t)) (r_{t+1} + γV(s_{t+1})) − V(s_t) )

Weighted Importance Sampling (WIS):

V(s_t) ← V(s_t) + α (π(a_t|s_t)/β(a_t|s_t)) ( r_{t+1} + γV(s_{t+1}) − V(s_t) )

This update formula has the advantage that the value is not updated when π(a_t|s_t) is zero for a sample action a_t generated from β(a_t|s_t) > 0. WIS is widely used.
In practice, when the ISR π(a_t|s_t)/β(a_t|s_t) is far from 1, the behavior and target probabilities of a_t given s_t are very different. So, we clip the ISR from below and from above when it is away from 1.
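A minimal sketch of the WIS-style tabular update above with this clipping; the clipping bounds, step size, and the tabular policies pi and beta passed in are illustrative assumptions.

import numpy as np

def offpolicy_td_update(V, s, a, r, s2, pi, beta, alpha=0.1, gamma=0.99,
                        rho_min=0.2, rho_max=5.0):
    """One WIS-style off-policy TD(0) update with a clipped importance ratio."""
    rho = pi[s, a] / beta[s, a]                  # importance sampling ratio pi/beta
    rho = np.clip(rho, rho_min, rho_max)         # clip when far from 1 (assumed bounds)
    td_error = r + gamma * V[s2] - V[s]
    V[s] += alpha * rho * td_error               # ratio multiplies the whole TD error
    return V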



Off-Policy Prediction and Control with Function Approximation

Off-Policy TD Prediction (V^π(s)) with Function Approximation

Importance Sampling-Based Off-Policy TD Prediction:

Target policy π, behavior policy β

Tabular case:

V(s_t) ← V(s_t) + α ( (π(a_t|s_t)/β(a_t|s_t)) (r_{t+1} + γV(s_{t+1})) − V(s_t) )
V(s_t) ← V(s_t) + α (π(a_t|s_t)/β(a_t|s_t)) ( r_{t+1} + γV(s_{t+1}) − V(s_t) )

Function approximation case: off-policy semi-gradient TD prediction

w_{t+1} = w_t + α ( (π(a_t|s_t)/β(a_t|s_t)) (r_{t+1} + γ V̂_{w_t}(s_{t+1})) − V̂_{w_t}(s_t) ) ∇V̂_{w_t}(s_t)
w_{t+1} = w_t + α (π(a_t|s_t)/β(a_t|s_t)) ( r_{t+1} + γ V̂_{w_t}(s_{t+1}) − V̂_{w_t}(s_t) ) ∇V̂_{w_t}(s_t)



Off-Policy Prediction and Control with Function Approximation

Off-Policy (One-Step) TD Control


Behavior policy: β, target policy: π, current Q-table: Q
rt+1

st at st+1 at+1
given target policy π

For Qπ (st , at ) estimation, current action at is given in addition to st


Note that s_{t+1} ∼ P(·|s_t, a_t) and r_{t+1} = R(s_t, a_t, s_{t+1}). Hence, the (behavior or target) policy plays no role in obtaining s_{t+1} and r_{t+1}.
Once s_{t+1} is reached, we apply bootstrapping from the Q-table, i.e., Q(s_{t+1}, a_{t+1}) is read from the Q look-up table. Here, a_{t+1} is not an actually generated sample action but a dummy variable for bootstrapping, so we can follow the target policy directly to generate the bootstrapping dummy action a_{t+1}. The TD target and the corresponding update are then given by

TD target = r_{t+1} + γ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a)

Q(s_t, a_t) ← Q(s_t, a_t) + α ( r_{t+1} + γ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a) − Q(s_t, a_t) )

Note that the importance sampling ratio does not appear in the one-step TD Q-update, even though a_t at s_t was drawn by the behavior policy β.
Off-Policy Prediction and Control with Function Approximation

Off-Policy TD Control (Action-Value Function)

Off-Policy TD Control: for the computation of Q^π(s_t, a_t), the action a_t is given.

Tabular case (this is called expected SARSA):

Q(s_t, a_t) ← Q(s_t, a_t) + α ( r_{t+1} + γ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a) − Q(s_t, a_t) )

Function approximation case:

w_{t+1} = w_t + α ( r_{t+1} + γ Σ_a π(a|s_{t+1}) Q̂_{w_t}(s_{t+1}, a) − Q̂_{w_t}(s_t, a_t) ) ∇Q̂_{w_t}(s_t, a_t)

When the target policy π is the greedy policy, the corresponding off-policy control becomes Q-learning.
For the one-step off-policy TD Q-update, importance sampling is not required. However, for n-step TD, the importance sampling ratio appears.
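A sketch of the function-approximation updates above with a linear Q̂_w(s,a) = wᵀx(s,a); when π is greedy, the expectation collapses to a max and the update becomes semi-gradient Q-learning. The feature map x and the policy function pi are placeholders assumed by this sketch.

import numpy as np

def expected_sarsa_update(w, x, s, a, r, s2, pi, n_actions, alpha=0.1, gamma=0.99):
    """Semi-gradient expected-SARSA update: target uses sum_a' pi(a'|s') Q_w(s', a')."""
    q_next = np.array([w @ x(s2, a2) for a2 in range(n_actions)])
    expected_q = pi(s2) @ q_next          # pi(s2) returns target-policy probabilities over actions
    td_target = r + gamma * expected_q
    w += alpha * (td_target - w @ x(s, a)) * x(s, a)
    return w

def q_learning_update(w, x, s, a, r, s2, n_actions, alpha=0.1, gamma=0.99):
    """Special case: greedy target policy, i.e., semi-gradient Q-learning."""
    td_target = r + gamma * max(w @ x(s2, a2) for a2 in range(n_actions))
    w += alpha * (td_target - w @ x(s, a)) * x(s, a)
    return w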



Off-Policy Prediction and Control with Function Approximation

The Deadly Triad

Function approximation
Bootstrapping
Off-policy learning

The combination of the above three, known as the deadly triad, can make learning unstable and is a non-trivial problem to tackle.



Off-Policy Prediction and Control with Function Approximation

Divergence of Off-Policy Prediction with Linear FA (Baird)


Consider off-policy prediction, i.e., V^π estimation, for the following MDP:
Linear function approximation for the state-value function with parameters w_1, ..., w_8.
For each state, there exist two actions: solid and dashed.
Rewards for all transitions are zero.
π: target policy, β: behavior policy. This is Baird's example.

[Figure: 7-state MDP; the approximate values are 2w_1 + w_8, ..., 2w_6 + w_8 for states 1–6 and w_7 + 2w_8 for state 7; π(solid|·) = 1, β(solid|·) = 1/7, β(dashed|·) = 6/7, γ = 0.99.]



Off-Policy Prediction and Control with Function Approximation

Baird’s Counterexample

State distribution μ^π(s) for π:

μ^π(s) = 1 if s = 7, and 0 otherwise.

Initial distribution: uniform.

Behavior state distribution: uniform, i.e., μ_β(s) = 1/7 for all s. (Under β, the dashed action, taken with probability 6/7, moves uniformly to states 1–6, and the solid action, taken with probability 1/7, moves to state 7, so the uniform row vector [1/7, ..., 1/7] is stationary.)

Since all rewards are zero, the true value function is V^π(s) = 0 for all s.



Off-Policy Prediction and Control with Function Approximation

Baird’s Counterexample

Linear Function Approximation for State-Value Function

          w1   w2   w3   w4   w5   w6   w7   w8
V(1)       2    0    0    0    0    0    0    1
V(2)       0    2    0    0    0    0    0    1
V(3)       0    0    2    0    0    0    0    1
V(4)       0    0    0    2    0    0    0    1
V(5)       0    0    0    0    2    0    0    1
V(6)       0    0    0    0    0    2    0    1
V(7)       0    0    0    0    0    0    1    2
w_sol,1    0    0    0    0    0    0    0    0
w_sol,2   1/2  1/2  1/2  1/2  1/2  1/2   2   −1

Note that the vector w_sol,2 is a null vector of the above coefficient matrix (rows V(1), ..., V(7)).
In fact, any vector in the null space is a linear weight solution to V(s) ≡ 0.
So, there exist infinitely many weight solutions to this problem.



Off-Policy Prediction and Control with Function Approximation

Baird’s Counterexample
Consider a transition from s^i to s^7, i = 1, ..., 6 (the solid action). Linear approximation for V^π: V(s^i) = 2w_i + w_8 and V(s^7) = w_7 + 2w_8.

w_{t+1} = w_t + α (π(a_t|s_t)/β(a_t|s_t)) ( r_{t+1} + γ V̂_{w_t}(s_{t+1} = s^7) − V̂_{w_t}(s_t = s^i) ) ∇V̂_{w_t}(s_t = s^i)

In components (w_i, w_7, w_8), with importance ratio π/β = 7 and zero reward:

[w_i, w_7, w_8]^T_{t+1} = [w_i, w_7, w_8]^T_t + 7α ( γ(w_7^t + 2w_8^t) − (2w_i^t + w_8^t) ) [2, 0, 1]^T
                       = [w_i, w_7, w_8]^T_t + 7α [ 2(−2w_i^t + (2γ−1)w_8^t + γw_7^t), 0, −2w_i^t + (2γ−1)w_8^t + γw_7^t ]^T

An update of w_i occurs once every 6 transitions on average, but w_8 is updated at every such transition (w_7 does not change). So, if we start with a large w_7 and small w_1 = ··· = w_6 = w_8, then w_8 keeps growing, and each w_i grows too due to the growing w_8 (which grows about 6 times faster than w_1, ..., w_6).
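The divergence can be reproduced with a short simulation of off-policy semi-gradient TD(0) on this MDP. This is a sketch following the setup above; the step size, the number of steps, and sampling states uniformly at each step (to mimic the uniform behavior state distribution) are assumptions.

import numpy as np

gamma, alpha, n_steps = 0.99, 0.01, 5000
# Feature matrix: rows are x(s) for states 1..7, so V_w(s) = x(s)^T w (Baird's parameterization).
X = np.zeros((7, 8))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0          # V(s^i) = 2 w_i + w_8
X[6, 6], X[6, 7] = 1.0, 2.0              # V(s^7) = w_7 + 2 w_8

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # large w_7, small others (as in the slide)
rng = np.random.default_rng(0)
for t in range(n_steps):
    s = rng.integers(7)                  # behavior state distribution: uniform over the 7 states
    if rng.random() < 1.0 / 7.0:         # solid action (prob 1/7 under beta) -> state 7
        s2, rho = 6, 7.0                 # rho = pi(solid|s)/beta(solid|s) = 1/(1/7) = 7
    else:                                # dashed action (prob 6/7) -> uniform over states 1..6
        s2, rho = rng.integers(6), 0.0   # rho = pi(dashed|s)/beta(dashed|s) = 0
    delta = 0.0 + gamma * X[s2] @ w - X[s] @ w    # all rewards are zero
    w += alpha * rho * delta * X[s]               # off-policy semi-gradient TD(0) update
print("||w|| after", n_steps, "steps:", np.linalg.norm(w))   # grows without bound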
Off-Policy Prediction and Control with Function Approximation

Baird’s Counterexample

[Figure: the values of the weights w_1, ..., w_8 (vertical axis, scale 1e6, up to about 4.0) versus time steps (0 to 5000); the weights grow without bound under off-policy semi-gradient TD.]



Off-Policy Prediction and Control with Function Approximation

Convergence of Methods with Func. Approx.


Prediction
On/Off-Policy   Algorithm   Tabular   Linear FA   Nonlinear FA
On-Policy       MC          O         O           O
                TD          O         O           X
Off-Policy      MC          O         O           O
                TD          O         X           X

Note that MC uses an unbiased estimate G_t of V^π(s) in SGD prediction. In this case, SGD with the MC estimate converges.
We have shown the convergence of linear semi-gradient TD prediction.
There exist examples such as Baird’s example for divergence of off-policy semi-gradient
TD prediction.
Control
Algorithm Tabular Linear FA Nonlinear FA
MC O △ X
SARSA O △ X
Q-learning O X X

△ = chatters around a near-optimal value function




(Linear) Gradient TD (gTD) Learning

Gradient TD Learning (Sutton et al. 2009)

Off-policy semi-gradient TD prediction, even with linear function approximation, is not guaranteed to converge.
This is because semi-gradient TD is not a stochastic version of the true gradient.
Gradient TD is an approach that guarantees convergence of off-policy learning with linear function approximation.
Gradient TD considers the projected Bellman error instead of the value error.
Gradient TD can be applied to on-policy learning too.
There is an extension of gradient TD to off-policy control.



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction

Recall the Mean Square Value Error:

[Figure: V^π(s) and V̂_w(s) plotted over s ∈ S.]

J(w) = MSVE(w) = E_π[ (V^π(s) − V̂_w(s))² ] = Σ_{s∈S} μ^π(s) (V^π(s) − V̂_w(s))²,

where μ^π(s) is the on-policy distribution.
We know that, under the MSVE objective, off-policy linear semi-gradient TD prediction can diverge (Baird's example).



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction


Bellman Error (BE) and Projected Bellman Error (PBE):

[Figure: geometry in the space of value functions. V̂_w lies in the plane spanned by the features (coordinates w_1, w_2); T^π V̂_w generally leaves this plane, and its distance from V̂_w is the BE; ΠT^π V̂_w and ΠV^π are projections back onto the plane, and the distance from V̂_w to ΠT^π V̂_w is the PBE.]

Π: projection operator onto the linear subspace L(w_1, w_2)
T^π: Bellman backup operator
Bellman equation:

V^π(s) = (T^π V^π)(s) = Σ_a π(a|s) Σ_{r,s′} p(r, s′|s, a) [r + γ V^π(s′)]



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction


Weighted Norm:

||v||²_μ = Σ_s μ(s) |v(s)|² = v^T D v,

where μ(s) is the considered distribution on s ∈ S (e.g., the on-policy distribution in the on-policy case, or the state distribution induced by the behavior policy in the off-policy case).

Projection operator and projection matrix onto L(w_1, ..., w_d):

Πv = V̂_{w*} such that w* = arg min_w ||v − V̂_w||²_μ

Π = X (X^T D X)^{−1} X^T D,

where D = diag(μ(s¹), ..., μ(s^{|S|})) and the feature matrix X is the |S| × d matrix whose rows are x^T(s¹), ..., x^T(s^{|S|}), with x^T(s) = [x_1(s), ..., x_d(s)].
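A small numerical sketch of these definitions (the random features and distribution below are assumptions): build X and D, form Π = X(XᵀDX)⁻¹XᵀD, and check that it is idempotent and leaves representable functions unchanged.

import numpy as np

rng = np.random.default_rng(0)
n_states, d = 5, 2
X = rng.normal(size=(n_states, d))            # feature matrix, rows x(s)^T
mu = rng.random(n_states); mu /= mu.sum()     # state distribution mu(s)
D = np.diag(mu)

Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D # projection onto the feature span w.r.t. ||.||_mu

assert np.allclose(Pi @ Pi, Pi)               # idempotent: projecting twice changes nothing
v_in_span = X @ rng.normal(size=d)
assert np.allclose(Pi @ v_in_span, v_in_span) # functions of the form Xw are fixed points of Pi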



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction


Bellman Error at State s:

BE(s) = δ̄_w(s) = (T^π V̂_w)(s) − V̂_w(s)
      = Σ_a π(a|s) Σ_{r,s′} p(r, s′|s, a) [r + γ V̂_w(s′)] − V̂_w(s)
      = E_{a_t∼π}[ r_{t+1} + γ V̂_w(s_{t+1}) − V̂_w(s_t) | s_t = s ]
      = E_{a_t∼β}[ ρ_t δ_t | s_t = s ],  where ρ_t = π(a_t|s_t)/β(a_t|s_t) and δ_t = r_{t+1} + γ V̂_w(s_{t+1}) − V̂_w(s_t)
      = expected TD error

Mean Square Bellman Error (Baird et al., 1995):

MSBE(w) = Σ_s μ(s) |δ̄_w(s)|²

Mean Square Projected Bellman Error (Sutton et al., 2009):

MSPBE(w) = Σ_s μ(s) |(Π δ̄_w)(s)|²



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction


Mean Square Bellman Error:

MSBE(w) = Σ_s μ(s) |δ̄_w(s)|²
∇_w MSBE(w) = 2 Σ_s μ(s) δ̄_w(s) ∇_w δ̄_w(s)

Minimizing MSBE (Baird et al., 1995), on-policy case:

w_{t+1} = w_t − α ( r_{t+1} + γ V_{w_t}(s_{t+1}) − V_{w_t}(s_t) ) ∇( r_{t+1} + γ V_{w_t}(s_{t+1}) − V_{w_t}(s_t) )

This stochastic approximation of the true gradient is not good, since the same sample s_{t+1} appears in both factors, which are therefore correlated. (This is explained later.)
Double sampling:

w_{t+1} = w_t − α ( r_{t+1} + γ V_{w_t}(s′_{t+1}) − V_{w_t}(s_t) ) ∇( r_{t+1} + γ V_{w_t}(s″_{t+1}) − V_{w_t}(s_t) ),

where s′_{t+1} and s″_{t+1} are two independently drawn next-state samples. However, this algorithm minimizing the Bellman error is not a stable algorithm. Instead, we minimize the projected Bellman error to obtain a stable algorithm.



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction

Minimizing MSPBE (Sutton et al., 2009):

MSPBE(w) = ||Π δ̄_w||²_μ
         = (Π δ̄_w)^T D (Π δ̄_w)
         = δ̄_w^T Π^T D Π δ̄_w
         = δ̄_w^T [X (X^T D X)^{−1} X^T D]^T D X (X^T D X)^{−1} X^T D δ̄_w
         = δ̄_w^T D X (X^T D X)^{−T} X^T D X (X^T D X)^{−1} X^T D δ̄_w
         = (X^T D δ̄_w)^T (X^T D X)^{−1} (X^T D δ̄_w)   (quadratic in w)

Gradient of MSPBE:

∇MSPBE(w) = 2 [∇(X^T D δ̄_w)^T] (X^T D X)^{−1} (X^T D δ̄_w)



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction

Gradient of MSPBE:

∇MSPBE(w) = 2 [∇(X^T D δ̄_w)^T] (X^T D X)^{−1} (X^T D δ̄_w),

where

X^T = [x(s¹), ..., x(s^{|S|})]   (a d × |S| matrix whose columns are the feature vectors, with entries x_i(s^j)),
D = diag(μ(s¹), ..., μ(s^{|S|})),   and   δ̄_w = [δ̄_w(s¹), ..., δ̄_w(s^{|S|})]^T.



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction

Gradient of MSPBE: Each Component

X^T D δ̄_w = Σ_s μ(s) x(s) δ̄_w(s) = E_β[ρ_t δ_t x_t],  where x_t = x(s_t)

∇(X^T D δ̄_w)^T = ∇E_β[ρ_t δ_t x_t]^T = E_β[ρ_t (∇δ_t) x_t^T] = E_β[ρ_t (∇(r_{t+1} + γ w^T x_{t+1} − w^T x_t)) x_t^T] = E_β[ρ_t (γ x_{t+1} − x_t) x_t^T]

X^T D X = Σ_s μ(s) x(s) x(s)^T = E_β[x_t x_t^T]

Gradient of MSPBE:

∇MSPBE(w) = 2 E_β[ρ_t (γ x_{t+1} − x_t) x_t^T] (E_β[x_t x_t^T])^{−1} E_β[ρ_t δ_t x_t]



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction

Gradient of MSPBE:

∇MSPBE(w) = 2 E_β[ρ_t (γ x_{t+1} − x_t) x_t^T] (E_β[x_t x_t^T])^{−1} E_β[ρ_t δ_t x_t]   (2)

Handling Correlation in Sampling:


Suppose that X ∼ p(x) with E[X] = 0.
Then, y = E[X]E[X] = 0.
Now, suppose we approximate the quantity y using a single sample x_1 drawn from p(x) as ŷ = x_1 x_1. Then, ŷ does not have zero mean: E[ŷ] = E[X²] = Var(X), which is nonzero in general.
One way to circumvent this bias is to sample x_1 ∼ p(x) and then independently sample x_2 ∼ p(x), using x_1 as a sample estimate of the first E[X] and x_2 as a sample estimate of the second E[X]. Then ŷ = x_1 x_2, which yields E[ŷ] = 0, so there is no bias.
Note that the first and third expectations in the gradient contain x_{t+1} = x(s_{t+1}). Hence, if for a given s_t we sample a_t (∼ β), r_{t+1} = r(s_t, a_t, s_{t+1}), s_{t+1} ∼ p(s_{t+1}|s_t, a_t) just once and use this single sample to approximate the right-hand side of (3), then we have a similar bias effect to the above.



(Linear) Gradient TD (gTD) Learning

Gradient TD Learning for Prediction


Gradient of MSPBE:

∇MSPBE(w) = 2 E_β[ρ_t (γ x_{t+1} − x_t) x_t^T] · (E_β[x_t x_t^T])^{−1} E_β[ρ_t δ_t x_t]   (3)

The second factor, v_t := (E_β[x_t x_t^T])^{−1} E_β[ρ_t δ_t x_t], is computed first by a separate recursion.

Least Mean Square (LMS) Algorithm: Linear Adaptive Filtering

y_t = x_t^T v_t
e_t = d_t − x_t^T v_t
Goal: min E[|e_t|²]

E[|e_t|²] = E[(d_t − x_t^T v_t)^T (d_t − x_t^T v_t)] = E[d_t² − 2 d_t x_t^T v_t + v_t^T x_t x_t^T v_t]
∇_v E[|e_t|²] = E[−2 d_t x_t + 2 x_t x_t^T v_t]
v* = E[x_t x_t^T]^{−1} E[d_t x_t]
LMS update: v_{t+1} = v_t + β (d_t − v_t^T x_t) x_t

Taking d_t = ρ_t δ_t so that v* matches the second factor above, the LMS recursion becomes v_{t+1} = v_t + β (ρ_t δ_t − v_t^T x_t) x_t.

Gradient TD:

w_{t+1} = w_t + α ρ_t (x_t − γ x_{t+1}) x_t^T v_t
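A sketch of the resulting two-time-scale linear gradient-TD recursion, combining the main update written above with the LMS secondary update using d_t = ρ_tδ_t; the step sizes are assumptions.

import numpy as np

def gradient_td_step(w, v, x_t, x_tp1, r_tp1, rho_t, alpha=0.01, beta=0.05, gamma=0.99):
    """One gradient-TD update of the main weights w and the auxiliary weights v."""
    delta = r_tp1 + gamma * (x_tp1 @ w) - (x_t @ w)                 # TD error delta_t
    w = w + alpha * rho_t * (x_t - gamma * x_tp1) * (x_t @ v)       # w += a*rho*(x_t - g*x_{t+1}) x_t^T v
    v = v + beta * (rho_t * delta - x_t @ v) * x_t                  # LMS recursion with target d_t = rho_t*delta_t
    return w, v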





(Linear) Least Squares Batch Methods (LSTD/LSPI)

Linear Least Squares Prediction (Bradtke and Barto, 1996)


Experience Buffer D with Size T
e1 e2 e3 ··· eT

Suppose that we collected experience samples e1 , e2 , · · · , eT in buffer, where


et = (st , at , rt+1 , st+1 )
We want to approximate the value function (V π or Qπ ) using a function
approximator based on all data samples in the buffer by minimizing the loss:

J(w) = E_D[(V^π(s) − V̂_w(s))²]  or  E_D[(Q^π(s,a) − Q̂_w(s,a))²]
     = Σ_{t=1}^{T} (V^π(s_t) − V̂_w(s_t))²,   (4)

where E_D means expectation with respect to the samples in the buffer (i.e., the empirical mean).


If we adopt a linear function approximator V̂_w(s) = Σ_i x_i(s) w_i = x(s)^T w, then minimizing (4) reduces to good old least squares estimation (LSE)!
Efficient solutions are available for LSE, such as recursive least squares (RLS).
(Linear) Least Squares Batch Methods (LSTD/LSPI)

Linear Least Squares Prediction

Least Squares: Overdetermined Case (T ≥ d)

J(w) = Σ_{t=1}^{T} ( U(s_t) − V̂_w(s_t) )²,   where U(s_t) is the target
     = || u − X w ||²,

with u = [U(s_1), U(s_2), ..., U(s_T)]^T, w = [w_1, ..., w_d]^T, and X the T × d matrix whose t-th row is [x_1(s_t), ..., x_d(s_t)].



(Linear) Least Squares Batch Methods (LSTD/LSPI)

Linear Least Squares Prediction

Normal Equation:

[Figure: the projection Πu = w_1* x_1 + w_2* x_2 of the target vector u onto the plane spanned by the feature columns x_1 and x_2.]

Error ⊥ spanning space:

x_i^T (u − X w*_LS) = 0, i = 1, ..., d   ⟺   X^T (u − X w*_LS) = 0



(Linear) Least Squares Batch Methods (LSTD/LSPI)

Linear Least Squares Prediction

Setting Target:

LSMC:     U(s_t) = G_t
LSTD:     U(s_t) = r_{t+1} + γ V̂_w(s_{t+1})   (Bradtke and Barto, 1996)
LSTD(λ):  U(s_t) = G_t^λ

Normal Equation:

X^T (u − X w*_LS) = 0

Linear LS Prediction:
MC:
X^T (g − X w*_LS) = 0   ⟺   w*_LS = (X^T X)^{−1} X^T g
TD:
X_t^T ( r + γ X_{t+1} w*_LS − X_t w*_LS ) = 0   ⟺   w*_LS = [ X_t^T (X_t − γ X_{t+1}) ]^{−1} X_t^T r,

where g stacks the returns G_t, r stacks the rewards r_{t+1}, X_t stacks the feature rows x^T(s_t), and X_{t+1} stacks x^T(s_{t+1}).
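A sketch of batch LSTD from a buffer of transitions, solving the TD normal equation above directly; the feature function, the data format, and the small ridge term for numerical stability are assumptions of this sketch.

import numpy as np

def lstd(transitions, feat, gamma=0.99, reg=1e-6):
    """Batch LSTD: w_LS = [X_t^T (X_t - gamma X_{t+1})]^{-1} X_t^T r.

    transitions: list of (s, r, s_next) collected under the evaluated policy;
    feat(s) returns the feature vector x(s); reg is a small ridge term (an assumption).
    """
    X_t = np.array([feat(s) for s, _, _ in transitions])
    X_tp1 = np.array([feat(s2) for _, _, s2 in transitions])
    r = np.array([rew for _, rew, _ in transitions])
    A = X_t.T @ (X_t - gamma * X_tp1)
    b = X_t.T @ r
    return np.linalg.solve(A + reg * np.eye(A.shape[0]), b)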



(Linear) Least Squares Batch Methods (LSTD/LSPI)

Least Squares Control

Generalized Policy Iteration Based on Batch-Processing Least Squares:


Policy evaluation
LS Q-learning
Policy improvement
Greedy with respect to estimated Q



(Linear) Least Squares Batch Methods (LSTD/LSPI)

Linear Least Squares Qπ Estimation


Experience Buffer D with Size T
e1 e2 e3 ··· eT , et = (st , at , rt+1 , st+1 )

LS loss over D:

J(w) = E_D[(Q^π(s,a) − Q̂_w(s,a))²] = Σ_{t=1}^{T} (Q^π(s_t, a_t) − Q̂_w(s_t, a_t))²,   (5)

where E_D means expectation with respect to the samples in the buffer (i.e., the empirical mean).

Linear function approximator: Q̂_w(s,a) = Σ_i x_i(s,a) w_i = x^T(s,a) w; then minimizing (5) yields a batch LS solution.
From the normal equation, LSTDQ(π, D):

w_LS = [ X_t^T (X_t − γ X_{t+1}) ]^{−1} X_t^T r
     = [ Σ_{t=1}^{T} χ(s_t, a_t) ( χ^T(s_t, a_t) − γ χ^T(s_{t+1}, π(s_{t+1})) ) ]^{−1} Σ_{t=1}^{T} χ(s_t, a_t) r_{t+1},

where χ(s_t, a_t) = [x_1(s_t, a_t), ..., x_d(s_t, a_t)]^T is the t-th column of X_t^T.
T



(Linear) Least Squares Batch Methods (LSTD/LSPI)

Linear Least Squares Policy Iteration (LSPI) Algorithm

Least Squares Policy Iteration (LSPI, Lagoudakis and Parr, 2003):

A popular batch RL algorithm


Use LSTDQ for policy evaluation of policy π for given experience D (off-policy)
Repeatedly use experience D to evaluate updated policies

function LSPI-TD(D, π0 )
π ′ ← π0
repeat
π ← π′
Qw ← LSTDQ(π, D) (policy evaluation)
for all s ∈ S do
π ′ (s) ← arg max Qw (s, a) (policy improvement)
a∈A
end for
until (π ≈ π ′ )
return π
end function
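A compact Python sketch of this loop using the LSTDQ solution from the previous slide. The feature map feat(s, a), the state list, the data format of D, and the simplified stopping test (exact policy equality) are assumptions, not the lecture's specification.

import numpy as np

def lstdq(D, feat, policy, gamma=0.99, reg=1e-6):
    """Evaluate `policy` from experience D = [(s, a, r, s_next), ...] with linear Q_w = chi(s,a)^T w."""
    d = feat(D[0][0], 0).size
    A, b = reg * np.eye(d), np.zeros(d)
    for s, a, r, s2 in D:
        chi, chi2 = feat(s, a), feat(s2, policy(s2))
        A += np.outer(chi, chi - gamma * chi2)
        b += chi * r
    return np.linalg.solve(A, b)

def lspi(D, feat, states, n_actions, gamma=0.99, n_iter=20):
    """LSPI: alternate LSTDQ policy evaluation and greedy policy improvement on the fixed buffer D."""
    pi = {s: 0 for s in states}                          # arbitrary initial policy pi_0
    for _ in range(n_iter):
        w = lstdq(D, feat, lambda s: pi[s], gamma)       # policy evaluation
        new_pi = {s: int(np.argmax([feat(s, a) @ w for a in range(n_actions)])) for s in states}
        if new_pi == pi:                                 # pi close to pi' -> stop
            break
        pi = new_pi                                      # policy improvement
    return pi, w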



(Linear) Least Squares Batch Methods (LSTD/LSPI)

Convergence of Control Algorithms

Algorithm Tabular Linear FA Non-Linear FA


MC Control O △ X
SARSA O △ X
Q-learning O X X
Gradient Q-learning (GQ) O O X
LSPI O △ -

△ : wanders around near-optimal value function





Deep Q Network (DQN)

Previous Attempts to Combine RL and Artificial Neural Networks

RL + ANN
Tesauro, 1995
Riedmiller, 2005, Hafner & Riedmiller, 2011 - NFQCA

Deep Learning:
Krizhevsky, Sutskever, and Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", 2012

Deep Q-Network: Successful Incorporation of Deep NN Function Approximators


Mnih et al., DQN, 2015



Deep Q Network (DQN)

Deep Q-Networks (DQN, Mnih et al., 2015)


[Figure: DQN architecture. Input: a stack of the 4 previous 84×84 frames (4×84×84); then a convolutional layer of rectified linear units with 16 8×8 filters; a convolutional layer of rectified linear units with 32 4×4 filters; a fully-connected layer of 256 rectified linear units; and a fully-connected linear output layer.]

Nonlinear function approximation with a DNN (CNN) for the action-value function.

Large S with small A: eight directions + button pressed/unpressed (4 to 18 actions per game).
To handle the instability of off-policy Q-learning with nonlinear FA, DQN uses two techniques:
Off-policy learning with experience replay, which reduces the correlation between successive samples present in the standard on-line sequential SGD update.
A target network, which decorrelates the target value from the current estimate. (Semi-gradient TD ⇒ gradient TD: recall that the derivation of the gradient formula assumes the target is independent of the parameter being updated.)
Deep Q Network (DQN)

Recall On-Policy On-Line Stochastic Gradient Descent

[Figure: Q^π(s,a) and Q̂_w(s,a) plotted over (s,a) ∈ S × A.]

Goal: find the parameter vector w minimizing the MSVE loss:

J(w) = E_π[(Q^π(s,a) − Q̂_w(s,a))²] = Σ_{s,a} μ^π(s,a) ( Q^π(s,a) − Q̂_w(s,a) )²

Gradient descent:

w_{t+1} = w_t − (α/2) ∇_w J(w)|_{w=w_t}
        = w_t + α Σ_{s,a} μ^π(s,a) ( Q^π(s,a) − Q̂_{w_t}(s,a) ) ∇_w Q̂_w(s,a)|_{w=w_t}

On-line stochastic gradient descent:

w_{t+1} = w_t + α ( Q^π(s_t,a_t) − Q̂_{w_t}(s_t,a_t) ) ∇_w Q̂_w(s_t,a_t)|_{w=w_t}
⇒ Single-sample approximation to true expectation and sequential step-by-step
update!
Deep Q Network (DQN)

Batch Methods
Experience buffer with size B: e_1, e_2, e_3, ···, where e_t = (s_t, a_t, r_{t+1}, s_{t+1})
Mini-batch: e^m_1, e^m_2, e^m_3, ···, e^m_M drawn from the buffer

Standard on-line SGD is a sequential, sample-by-sample update and is not sample-efficient.
Batch methods seek to find the best-fitting value function by batch processing from the experience buffer D:
Experience is stored in the buffer D,
a mini-batch is drawn from the buffer, and
a stochastic (semi-)gradient update is performed based on the mini-batch.
This update can be done multiple times based on multiple mini-batch draws.
Samples are used multiple times on average ⇒ sample efficiency is increased.
Deep Q Network (DQN)

Update with Mini-Batch: Least Squares


Update with Mini-batch:
Mini-batch of size M:

M = [e^m_1, e^m_2, ···, e^m_M] ∼ U(D)

Target values associated with the mini-batch:

U = [U^m_1, U^m_2, ···, U^m_M]

Least-Squares Value Error Loss:

L(w) = E_M[ (U^m(s,a) − Q_w(s,a))² ] = Σ_{i=1}^{M} ( U^m_i(s_i, a_i) − Q_w(s_i, a_i) )²

Stochastic (semi-)gradient descent:

w ← w + α Σ_{i=1}^{M} ( U^m_i(s_i, a_i) − Q_w(s_i, a_i) ) ∇Q_w(s_i, a_i)



Deep Q Network (DQN)

Deep Q-Networks: Target Network

To handle the instability of off-policy Q-learning with nonlinear FA, DQN uses two techniques:

Off-policy learning with experience replay, which reduces the correlation between successive samples present in the standard on-line sequential SGD update.
A target network, which decorrelates the target value from the current estimate. (Semi-gradient TD ⇒ gradient TD: recall that the derivation of the gradient formula assumes the target is independent of the parameter being updated.)

w ← w + α Σ_{i=1}^{M} ( U^m_i(s_i, a_i) − Q_w(s_i, a_i) ) ∇Q_w(s_i, a_i),   where U^m_i(s_i, a_i) is the target

Two networks: the on-line network Q_w(s, a) and the target network Q_{w⁻}(s, a), with

U^m_i(s_i, a_i) = r_i + γ max_{a′} Q_{w⁻}(s′_i, a′)



Deep Q Network (DQN)

Deep Q-Networks (DQN)

1 Take action at using ǫ-greedy policy w.r.t. current Q function (i.e., behavior
policy)
2 Store sample (st , at , rt+1 , st+1 ) to the replay buffer D
3 Sample random mini-batch Mi = {(s, a, r, s′ )} from D
4 Compute Q-learning targets using target network, i.e., Q-network with the
same structure but target parameter w− (updated every C steps and held
fixed in-between)
5 Update the network parameter w to minimize the LS loss by using SGD:

L_i(w_i) = E_{(s,a,r,s′)∼M_i}[ ( r + γ max_{a′} Q_{w⁻}(s′, a′) − Q_w(s, a) )² ],

where r + γ max_{a′} Q_{w⁻}(s′, a′) is the target.

6 Go to next time step t + 1
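A minimal PyTorch-style sketch of these six steps. The environment interface (reset/step), the small fully-connected network, and all hyperparameters are assumptions; frame stacking, preprocessing, and error handling are omitted.

import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))
    def forward(self, x):
        return self.net(x)

def train_dqn(env, obs_dim, n_actions, gamma=0.99, eps=0.1, lr=1e-3,
              batch_size=32, buffer_size=100_000, target_update_C=500, n_steps=50_000):
    q, q_target = QNet(obs_dim, n_actions), QNet(obs_dim, n_actions)
    q_target.load_state_dict(q.state_dict())                     # w^- <- w
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    buffer = deque(maxlen=buffer_size)                           # replay buffer D
    s = env.reset()                                              # assumed env interface
    for t in range(n_steps):
        # 1) epsilon-greedy behavior policy w.r.t. the current Q network
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = int(q(torch.as_tensor(s, dtype=torch.float32)).argmax())
        s2, r, done = env.step(a)                                # assumed env interface
        buffer.append((s, a, r, s2, done))                       # 2) store transition in D
        s = env.reset() if done else s2
        if len(buffer) < batch_size:
            continue
        batch = random.sample(buffer, batch_size)                # 3) sample a random mini-batch
        S, A, R, S2, Dn = map(np.array, zip(*batch))
        S, S2 = torch.as_tensor(S, dtype=torch.float32), torch.as_tensor(S2, dtype=torch.float32)
        A = torch.as_tensor(A, dtype=torch.int64)
        R, Dn = torch.as_tensor(R, dtype=torch.float32), torch.as_tensor(Dn, dtype=torch.float32)
        with torch.no_grad():                                    # 4) targets from the target network w^-
            target = R + gamma * (1.0 - Dn) * q_target(S2).max(dim=1).values
        q_sa = q(S).gather(1, A.unsqueeze(1)).squeeze(1)
        loss = ((target - q_sa) ** 2).mean()                     # 5) SGD step on the squared loss
        opt.zero_grad(); loss.backward(); opt.step()
        if t % target_update_C == 0:                             # update w^- every C steps
            q_target.load_state_dict(q.state_dict())
    return q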



Deep Q Network (DQN)

DQN Results in Atari Games



Deep Q Network (DQN)

DQN: Ablation Study

Game             Replay,        Replay,      No replay,     No replay,
                 Target Q Net   Q-learning   Target Q Net   Q-learning
Enduro           1006.3         832.2        141.9          29.1
River Raid       7446.6         4102.8       2867.7         1453
Seaquest         2894.4         822.6        1003           275.8
Space Invaders   1088.9         826.3        373.2          302



Deep Q Network (DQN)

Improvements of DQN

Prioritized replay mechanism (Schaul et al. 2016)


Double DQN trick (Van Hasselt et al. 2016): See Lecture 5 about
maximization bias and double learning.
···



Deep Q Network (DQN)

References

Textbook: Sutton and Barto, Reinforcement Learning: An Introduction, The MIT


Press, Cambridge MA, 2018
Dr. David Silver’s course material.
