FA II
Seungyul Han
UNIST
[email protected]
Spring 2024
[Diagram: GPI with function approximation, with parameter vector w: value estimate Q̂_w ≈ Q^*, policy ε-greedy(Q̂_w) ≈ π^*.]
⇒ Even without a convergence guarantee, we 1) still apply GPI, or 2) devise algorithms with convergence guarantees in restricted cases such as linear function approximation.
Δw = −(1/2)∇J(w_t) = (Q^π(s_t, a_t) − Q̂_{w_t}(s_t, a_t)) ∇Q̂_{w_t}(s_t, a_t)
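As a minimal illustration (not from the slides), this update for a linear Q̂_w(s, a) = x(s, a)^T w, paired with an ε-greedy behavior policy, might look as follows; the feature vector x_sa, the target estimate q_target, and the step size alpha are illustrative assumptions.

import numpy as np

def epsilon_greedy(q_values, eps=0.1, rng=np.random.default_rng()):
    # Behavior policy: with probability eps pick a random action, otherwise the greedy one.
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def semi_gradient_q_step(w, x_sa, q_target, alpha=0.1):
    # Delta w = alpha * (target - Q_hat_w(s,a)) * grad Q_hat_w(s,a);
    # for linear Q_hat_w(s,a) = x(s,a)^T w the gradient is simply x(s,a).
    # In GPI the unknown Q^pi(s_t, a_t) is replaced by a sampled target (MC return or TD target).
    return w + alpha * (q_target - x_sa @ w) * x_sa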
We set a′ = π′(s̃) for the state s̃. This step can be done by a′ = arg max_a Q^π(s̃, a).
For all other s (≠ s̃) ∈ S, we set π′(s) = π(s).
Then this π′ and π satisfy the condition of the policy improvement theorem.
Hence, π′ > π.
The Policy Improvement Theorem (PIT) Condition
Q-Function
[Figure: look-up table Q(s, a) with rows s^1, s^2, ..., s^{|S|} and columns a_1, a_2, ..., a_{|A|}; ex. tile coding.]
Function approximation inherently yields generalization.
With generalization, a Q-value update for one state s_t affects the Q-values of many other states. Hence, the condition for PIT is broken.
So, value-based GPI (on-policy/off-policy) with function approximation does not theoretically guarantee convergence to the optimal solution.
Note that MC uses an unbiased estimate G_t of V^π(s) in SGD prediction. In this case, SGD with the MC estimate converges.
There exist examples, such as Baird's example, of divergence of off-policy semi-gradient TD prediction.
We will show the convergence of on-policy linear semi-gradient TD prediction.
Convergence Issues in Control with Function Approximation
Directions:
Development of convergent algorithms under restrictions, e.g., linear function approximation.
⇒ On-policy linear TD prediction (convergence guaranteed)
⇒ On-policy/off-policy linear gradient TD learning (prediction, control) (convergence guaranteed)
⇒ Linear least-squares batch methods
Linear FA:
V̂_w(s) = Σ_{i=1}^{d} w_i x_i(s) = x(s)^T w,
where x_t = x(s_t).
w_{t+1} = (I − αC) w_t + αb, where A := I − αC.
If the square matrix A has eigenvalues whose magnitudes are all less than one, this state-space system converges. It is shown that A has eigenvalues whose magnitudes are less than one (Sutton, 1988).
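A quick numerical check of this eigenvalue condition (a sketch, not from the slides). The specific forms C = X^T D (I − γP) X and b = X^T D r̄ are the standard expected-update quantities for on-policy linear TD(0) and are an assumption here, since the slide only names C and b.

import numpy as np

gamma, alpha = 0.9, 0.1
P = np.array([[0.1, 0.9, 0.0],      # on-policy transition matrix of a small 3-state chain
              [0.0, 0.1, 0.9],
              [0.9, 0.0, 0.1]])
r_bar = np.array([0.0, 0.0, 1.0])   # expected reward per state
X = np.array([[1.0, 0.0],           # feature matrix (|S| x d)
              [1.0, 1.0],
              [0.0, 1.0]])

# Stationary distribution mu of P (left eigenvector of eigenvalue 1), D = diag(mu).
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
D = np.diag(mu)

C = X.T @ D @ (np.eye(3) - gamma * P) @ X
b = X.T @ D @ r_bar
A = np.eye(2) - alpha * C            # expected-update matrix: w_{t+1} = A w_t + alpha*b

print("eigenvalue magnitudes of A:", np.abs(np.linalg.eigvals(A)))   # all below 1 for small alpha
print("linear TD fixed point w* = C^{-1} b:", np.linalg.solve(C, b))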
At Convergence: Linear TD Fixed-Point
Q(s_t, a_t) ← Q(s_t, a_t) + α ( G_t · Π_{i=t+1}^{T−1} [π(a_i|s_i) / β(a_i|s_i)] − Q(s_t, a_t) )
[Diagram: s_t → a_t ∼ β → s_{t+1}]
Weight the TD target r_{t+1} + γV(s_{t+1}) by importance sampling:
V(s_t) ← V(s_t) + α ( [π(a_t|s_t) / β(a_t|s_t)] (r_{t+1} + γV(s_{t+1})) − V(s_t) )
Note that s_{t+1} ∼ P(·|s_t, a_t) and r_{t+1} = R(s_t, a_t, s_{t+1}), and both are functions of a_t given s_t.
Action a_t is drawn from the behavior policy β.
The current value function estimate V(·) is an external function (look-up table) not depending on the current action (it is a result of all previous states and actions).
Only a single importance sampling correction is needed (see the sketch below).
Lower variance than MC importance sampling.
Policies π and β need to be similar over a single step only.
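A minimal tabular sketch of this update (assumed interface, not from the slides: pi(a, s) and beta(a, s) return action probabilities; alpha and gamma are illustrative):

import numpy as np

def off_policy_td0_update(V, s, a, r, s_next, pi, beta, alpha=0.1, gamma=0.99):
    # One-step importance sampling ratio for the single action taken.
    rho = pi(a, s) / beta(a, s)
    # Importance-weight the TD target r + gamma * V(s') as on the slide.
    target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V

# Example usage (pi, beta, n_states defined elsewhere):
# V = np.zeros(n_states); V = off_policy_td0_update(V, s=2, a=0, r=1.0, s_next=3, pi=pi, beta=beta)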
Off-Policy Prediction and Control with Function Approximation
Off-Policy TD Update:
Ordinary Importance Sampling (OIS):
V(s_t) ← V(s_t) + α ( [π(a_t|s_t) / β(a_t|s_t)] (r_{t+1} + γV(s_{t+1})) − V(s_t) )
Weighted Importance Sampling (WIS), with the ratio multiplying the whole TD error:
V(s_t) ← V(s_t) + α [π(a_t|s_t) / β(a_t|s_t)] ( r_{t+1} + γV(s_{t+1}) − V(s_t) )
This update formula has the advantage that the value is not updated when π(a_t|s_t) is zero for a sample action a_t generated from β(a_t|s_t) > 0. WIS is widely used.
In practice, when the ISR π(a_t|s_t)/β(a_t|s_t) is away from 1, the behavior and target probabilities of a_t given s_t are very different. So, we clip the ISR from below and from above when it is away from 1 (a sketch follows).
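A small sketch of such clipping; the thresholds 0.8 and 1.25 are illustrative choices, not values from the slide.

def clipped_isr(pi_prob, beta_prob, c_low=0.8, c_high=1.25):
    # Clip the importance sampling ratio pi(a|s)/beta(a|s) into [c_low, c_high].
    rho = pi_prob / beta_prob
    return min(max(rho, c_low), c_high)

def clipped_td0_update(V, s, r, s_next, pi_prob, beta_prob, alpha=0.1, gamma=0.99):
    # Clipped ratio multiplying the whole TD error (the weighted form above).
    rho = clipped_isr(pi_prob, beta_prob)
    V[s] += alpha * rho * (r + gamma * V[s_next] - V[s])
    return V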
[Diagram: s_t → a_t → s_{t+1} → a_{t+1}, with a_{t+1} given by the target policy π]
Note that the importance sampling ratio does not appear in the one-step TD Q update, even though a_t from s_t was drawn by the behavior policy β: the update conditions on (s_t, a_t), and the next action is evaluated under π.
Q(s_t, a_t) ← Q(s_t, a_t) + α ( r_{t+1} + γ Σ_a π(a|s_{t+1}) Q(s_{t+1}, a) − Q(s_t, a_t) )
w_{t+1} = w_t + α ( r_{t+1} + γ Σ_a π(a|s_{t+1}) Q̂_{w_t}(s_{t+1}, a) − Q̂_{w_t}(s_t, a_t) ) ∇Q̂_{w_t}(s_t, a_t)
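A linear-FA sketch of the second update (assumed interface, not from the slides: x(s, a) returns the feature vector and pi(a, s) the target-policy probability; alpha and gamma are illustrative):

import numpy as np

def expected_sarsa_semi_gradient_step(w, x, pi, actions, s, a, r, s_next,
                                      alpha=0.1, gamma=0.99):
    # Q_hat_w(s, a) = x(s, a)^T w, so grad Q_hat_w(s, a) = x(s, a).
    q_sa = x(s, a) @ w
    # Expectation of Q over the target policy at s_{t+1}; no importance sampling ratio is needed.
    exp_q_next = sum(pi(a2, s_next) * (x(s_next, a2) @ w) for a2 in actions)
    td_error = r + gamma * exp_q_next - q_sa
    return w + alpha * td_error * x(s, a)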
Function approximation
Bootstrapping
Off-policy learning
[Figure: Baird's counterexample: seven states with linear state values such as w_7 + 2w_8; π(solid|·) = 1, β(solid|·) = 1/7, β(dashed|·) = 6/7, γ = 0.99.]
Baird’s Counterexample
[Figure repeated, together with a matrix of transition/behavior probabilities whose entries are 1/7 and 6/7.]
Baird’s Counterexample
         w1   w2   w3   w4   w5   w6   w7   w8
V(1)      2                                  1
V(2)           2                             1
V(3)                2                        1
V(4)                     2                   1
V(5)                          2              1
V(6)                               2         1
V(7)                                    1    2
w_sol,1   0    0    0    0    0    0    0    0
w_sol,2  1/2  1/2  1/2  1/2  1/2  1/2   2   −1

Note that the vector w_sol,2 is a null vector of the above coefficient matrix [∂V(i)/∂w_j].
In fact, any vector in the null space is a linear weight solution to V(s) ≡ 0.
So, there exist infinitely many weight solutions to this problem.
Baird’s Counterexample
[Figure: transition from state s^i (value 2w_i + w_8) to state s^7 (value w_7 + 2w_8).]
Consider the transition from s^i to s^7, i = 1, ..., 6.
Linear approximation for V^π:
         w_i   w_7   w_8
V(s^i)    2           1
V(s^7)          1     2
Off-policy semi-gradient TD update:
w_{t+1} = w_t + α [π(a_t|s_t)/β(a_t|s_t)] ( r_{t+1} + γ V̂_{w_t}(s_{t+1} = s^7) − V̂_{w_t}(s_t = s^i) ) ∇V̂_{w_t}(s_t = s^i)
Component-wise (with r_{t+1} = 0 and ratio 1/(1/7) = 7):
[w_i, w_7, w_8]_{t+1} = [w_i, w_7, w_8]_t + 7α ( γ(w_7 + 2w_8) − (2w_i + w_8) ) [2, 0, 1]^T
The update of w_i occurs once every 6 such transitions on average, but w_8 is updated at every transition (w_7 does not change). So, if we start with a large w_7 and all small w_1 = ... = w_6 = w_8, then w_8 keeps growing, and each w_i grows too because of the growing w_8 (which grows 6 times faster than w_1, ..., w_6).
Baird’s Counterexample
[Plot: "The Value of Weights", w_1, ..., w_8 versus update steps, growing to the order of 10^6 (divergence).]
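A short simulation sketch that reproduces this behavior (not from the slides; the initial weights, step size, and number of steps are illustrative, while the features, policies, and γ follow the figure above):

import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.99, 0.01
n_states, d = 7, 8

# Features: V(s^i) = 2*w_i + w_8 for i = 1..6, V(s^7) = w_7 + 2*w_8.
X = np.zeros((n_states, d))
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

w = np.array([1., 1., 1., 1., 1., 1., 10., 1.])   # large w7, small remaining weights
s = int(rng.integers(n_states))

for t in range(1000):
    solid = rng.random() < 1.0 / 7.0               # behavior policy beta
    s_next = 6 if solid else int(rng.integers(6))  # solid -> s^7, dashed -> uniform over s^1..s^6
    rho = 7.0 if solid else 0.0                    # pi(solid|s) = 1, so rho = 1/(1/7) or 0/(6/7)
    delta = gamma * X[s_next] @ w - X[s] @ w       # all rewards are 0
    w += alpha * rho * delta * X[s]                # off-policy semi-gradient TD(0)
    s = s_next

print(np.round(w, 1))  # the weight vector keeps growing, as in the plot above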
Note that MC uses an unbiased estimate G_t of V^π(s) in SGD prediction. In this case, SGD with the MC estimate converges.
We have shown the convergence of on-policy linear semi-gradient TD prediction.
There exist examples, such as Baird's example, of divergence of off-policy semi-gradient TD prediction.
Control:
Algorithm     Tabular   Linear FA   Nonlinear FA
MC            O         △           X
SARSA         O         △           X
Q-learning    O         X           X
Recall
Mean Square Value Error:
[Figure: V^π(s) and V̂_w(s) plotted over s ∈ S.]
J(w) = MSVE(w) = E_π[ (V^π(s) − V̂_w(s))^2 ] = Σ_{s∈S} µ^π(s) (V^π(s) − V̂_w(s))^2,
where µ^π(s) is the on-policy distribution.
We know that under MSVE, even linear semi-gradient TD prediction can diverge in the off-policy case (Baird's example).
[Figure: geometry of value-function space in the plane of (w_1, w_2): V^π, its projection ΠV^π, the estimate V̂_w, the backup T^π V̂_w, and its projection ΠT^π V̂_w; BE denotes the Bellman error and PBE the projected Bellman error.]
Projection: Π = X (X^T D X)^{-1} X^T D.
δ̄_w(s) = E[ r_{t+1} + γ V̂_w(s_{t+1}) − V̂_w(s_t) | s_t = s ] = expected TD error
Mean Square Bellman Error (Baird et al., 1995):
MSBE(w) = Σ_s µ(s) |δ̄_w(s)|^2
w_{t+1} = w_t + α ( r_t + γV_{w_t}(s_{t+1}) − V_{w_t}(s_t) ) ∇( r_t + γV_{w_t}(s_{t+1}) − V_{w_t}(s_t) )
This stochastic approximation to the true gradient is not good, since the same sample s_{t+1} appears in both factors and they are correlated. (This will be explained later.)
Double Sampling:
w_{t+1} = w_t + α ( r_t + γV_{w_t}(s′_{t+1}) − V_{w_t}(s_t) ) ∇( r_t + γV_{w_t}(s″_{t+1}) − V_{w_t}(s_t) )
However, this algorithm minimizing the Bellman error is not a stable algorithm. Instead, we minimize the projected Bellman error to obtain a stable algorithm.
Gradient of MSPBE:
∇MSPBE(w) = 2 [ ∇(X^T D δ̄_w)^T (X^T D X)^{-1} (X^T D δ̄_w) ]
where
X^T = [ x(s^1), ..., x(s^{|S|}) ] (a d × |S| matrix with (i, j) entry x_i(s^j)),
D = diag( µ(s^1), ..., µ(s^{|S|}) ), and δ̄_w = [ δ̄_w(s^1), ..., δ̄_w(s^{|S|}) ]^T.
X^T D δ̄_w = Σ_s µ(s) x(s) δ̄_w(s) = E_β[ ρ_t δ_t x_t ],  where x_t = x(s_t).
To estimate the term (X^T D X)^{-1} (X^T D δ̄_w) with a secondary weight vector v_t, set up a least-squares problem with target d_t:
y_t = x_t^T v_t
e_t = d_t − x_t^T v_t
E[e_t^2] = E[ d_t^2 − 2 d_t x_t^T v_t + v_t^T x_t x_t^T v_t ]
v_t^⋆ = E[ x_t x_t^T ]^{-1} E[ d_t x_t ]
v_{t+1} = v_t + β ( d_t − v_t^T x_t ) x_t
Gradient TD:
w_{t+1} = w_t + α ρ_t ( x_t − γ x_{t+1} ) x_t^T v_t
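A compact sketch of one such gradient-TD step. One assumption not spelled out on the slide: the LMS target is taken as d_t = ρ_t δ_t, so that v estimates (X^T D X)^{-1} X^T D δ̄_w; the step sizes are illustrative.

import numpy as np

def gradient_td_step(w, v, x, x_next, r, rho, alpha=0.01, beta=0.05, gamma=0.99):
    delta = r + gamma * x_next @ w - x @ w                # TD error under the current w
    d = rho * delta                                       # assumed LMS target d_t = rho_t * delta_t
    v = v + beta * (d - v @ x) * x                        # secondary-weight LMS update
    w = w + alpha * rho * (x - gamma * x_next) * (x @ v)  # gradient-TD update for w
    return w, v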
J(w) = E_D[ (V^π(s) − V̂_w(s))^2 ]  or  E_D[ (Q^π(s, a) − Q̂_w(s, a))^2 ]
     = Σ_{t=1}^{T} ( V^π(s_t) − V̂_w(s_t) )^2,    (4)
J(w) = Σ_{t=1}^{T} ( U(s_t) − V̂_w(s_t) )^2,  where U(s_t) is the target.
     = || u − X w ||^2,  where
u = [ U(s_1), ..., U(s_T) ]^T,  X is the T × d matrix whose t-th row is [ x_1(s_t), ..., x_d(s_t) ],  and w = [ w_1, ..., w_d ]^T.
Normal Equation:
[Figure: projection of u onto span{x_1, x_2}: Πu = w_1^* x_1 + w_2^* x_2.]
x_i^T ( u − X w_LS^* ) = 0,  i = 1, ..., d   ⟺   X^T ( u − X w^* ) = 0
Setting Target:
LSMC:     U(s_t) = G_t
LSTD:     U(s_t) = r_{t+1} + γ V̂_w(s_{t+1})   (Bradtke and Barto, 1996)
LSTD(λ):  U(s_t) = G_t^λ
Normal Equation:
X^T ( u − X w_LS^* ) = 0
Linear LS Prediction:
MC:  X^T ( g − X w_LS^* ) = 0   ⟺   w_LS^* = (X^T X)^{-1} X^T g
TD:
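A sketch of the MC case above in code (the TD case, whose normal equation is not shown here, is omitted); lstsq is used instead of an explicit inverse purely for numerical stability, and the example data are illustrative.

import numpy as np

def ls_mc_prediction(X, g):
    # Solve X^T (g - X w) = 0, i.e., w_LS = (X^T X)^{-1} X^T g.
    w, *_ = np.linalg.lstsq(X, g, rcond=None)
    return w

# Example: T = 3 samples, d = 2 features; rows of X are x(s_t)^T, g holds the returns G_t.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
g = np.array([1.0, 2.0, 1.5])
print(ls_mc_prediction(X, g))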
LS loss over D:
J(w) = E_D[ (Q^π(s, a) − Q̂_w(s, a))^2 ] = Σ_{t=1}^{T} ( Q^π(s_t, a_t) − Q̂_w(s_t, a_t) )^2,    (5)
where χ(s_t, a_t) = [ x_1(s_t, a_t), ..., x_d(s_t, a_t) ]^T is the t-th column of X^T.
function LSPI-TD(D, π_0)
    π′ ← π_0
    repeat
        π ← π′
        Q_w ← LSTDQ(π, D)                        (policy evaluation)
        for all s ∈ S do
            π′(s) ← arg max_{a∈A} Q_w(s, a)      (policy improvement)
        end for
    until (π ≈ π′)
    return π
end function
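A Python sketch of this loop. The LSTDQ construction below follows Lagoudakis & Parr (2003) and is an assumption, since the pseudocode only calls LSTDQ(π, D); phi(s, a), the action set, and the regularizer are illustrative.

import numpy as np

def lstdq(samples, phi, policy, d, gamma=0.99, reg=1e-6):
    # Fit Q_w(s, a) = phi(s, a)^T w for the given policy from the batch D of (s, a, r, s') samples.
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, a, r, s_next in samples:
        x = phi(s, a)
        x_next = phi(s_next, policy(s_next))
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)

def lspi_td(samples, phi, actions, d, gamma=0.99, n_iters=20):
    w = np.zeros(d)
    greedy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
    for _ in range(n_iters):
        w_new = lstdq(samples, phi, greedy, d, gamma)                           # policy evaluation
        greedy = lambda s, w=w_new: max(actions, key=lambda a: phi(s, a) @ w)   # policy improvement
        if np.allclose(w, w_new, atol=1e-4):                                    # pi ≈ pi' (via weights)
            w = w_new
            break
        w = w_new
    return greedy, w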
RL + ANN:
Tesauro, 1995 (TD-Gammon)
Riedmiller, 2005 (NFQ); Hafner & Riedmiller, 2011 (NFQCA)
Deep Learning:
Krizhevsky, Sutskever, and Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", 2012
[Figure: DQN network architecture: 4×84×84 input, 16 8×8 convolutional filters, a fully-connected layer of rectified linear units.]
[Figure: Q̂_w(s, a) approximating Q^π(s, a) over (s, a) ∈ S × A.]
Goal: find parameter vector w minimizing the MSVE loss:
J(w) = E_π[ (Q^π(s, a) − Q̂_w(s, a))^2 ] = Σ_{s,a} µ^π(s, a) ( Q^π(s, a) − Q̂_w(s, a) )^2
Gradient descent:
w_{t+1} = w_t − (1/2) α ∇_w J(w)|_{w=w_t}
        = w_t + α Σ_{s,a} µ^π(s, a) ( Q^π(s, a) − Q̂_{w_t}(s, a) ) ∇_w Q̂_w(s, a)|_{w=w_t}
Batch Methods
Experience buffer D with size B:  e_t = (s_t, a_t, r_{t+1}, s_{t+1}),  D = [e_1, e_2, e_3, ...]
Mini-batch:  M = [e_1^m, e_2^m, ..., e_M^m] ∼ U(D),  with targets U = [U_1^m, U_2^m, ..., U_M^m]
Mini-batch SGD update:
w ← w + α Σ_{i=1}^{M} [ U_i^m(s_i, a_i) − Q_w(s_i, a_i) ] ∇Q_w(s_i, a_i),  where U_i^m(s_i, a_i) is the target.
To handle the instability of off-policy Q-learning with non-linear FA, DQN uses two techniques: experience replay and a target network.
1. Take action a_t using the ε-greedy policy w.r.t. the current Q function (i.e., the behavior policy).
2. Store the sample (s_t, a_t, r_{t+1}, s_{t+1}) in the replay buffer D.
3. Sample a random mini-batch M_i = {(s, a, r, s′)} from D.
4. Compute the Q-learning targets using the target network, i.e., a Q-network with the same structure but target parameter w− (updated every C steps and held fixed in between).
5. Update the network parameter w to minimize the LS loss by SGD (a sketch follows):
L_i(w_i) = E_{s,a,r,s′ ∼ M_i}[ ( r + γ max_{a′} Q_{w−}(s′, a′) − Q_w(s, a) )^2 ],
where r + γ max_{a′} Q_{w−}(s′, a′) is the target.
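A minimal PyTorch sketch of one such update (not the original DQN code): the network is a small MLP rather than the convolutional architecture above, terminal-state handling is omitted, and all sizes and hyperparameters are illustrative.

import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, batch_size = 4, 2, 0.99, 32
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())           # target parameters w^-
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                            # replay buffer D
# Step 2 happens elsewhere: replay.append((s_t, a_t, r_t1, s_t1)) with torch tensors for the states.

def act(state, eps=0.1):
    # Step 1: epsilon-greedy behavior policy w.r.t. the current Q-network.
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state).argmax().item())

def dqn_update():
    # Steps 3-5: sample a mini-batch, build targets with the target network, take an SGD step.
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([b[3] for b in batch])
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every C steps: target_net.load_state_dict(q_net.state_dict())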
Improvement of DQN
References