Lecture 6: Value Function Approximation
David Silver
Outline
1 Introduction
2 Incremental Methods
3 Batch Methods
Introduction
Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g.
Backgammon: 10^20 states
Computer Go: 10^170 states
Helicopter: continuous state space
How can we scale up the model-free methods for prediction and control from the last two lectures?
Value Function Approximation
So far we have represented value function by a lookup table
Every state s has an entry V (s)
Or every state-action pair s, a has an entry Q(s, a)
Problem with large MDPs:
There are too many states and/or actions to store in memory
It is too slow to learn the value of each state individually
Solution for large MDPs:
Estimate value function with function approximation
v̂(s, w) ≈ v_π(s)
or q̂(s, a, w) ≈ q_π(s, a)
Generalise from seen states to unseen states
Update parameter w using MC or TD learning
Types of Value Function Approximation
[Figure: three architectures for value function approximation: state in, value out (v̂(s, w)); state and action in, value out (q̂(s, a, w)); state in, one value out per action (q̂(s, a_1, w), ..., q̂(s, a_m, w)). In each case the approximator is parameterised by w.]
Which Function Approximator?
There are many function approximators, e.g.
Linear combinations of features
Neural network
Decision tree
Nearest neighbour
Fourier / wavelet bases
...
We consider differentiable function approximators, e.g. linear combinations of features and neural networks
Furthermore, we require a training method that is suitable for non-stationary, non-iid data
Incremental Methods
Gradient Descent
!"#$%"&'(%")*+,#'-'!"#$%%" (%")*+,#'.+/0+,#
Let J(w) be a di↵erentiable function of
parameter vector w
Define!"#$%&'('%$&#()&*+$,*$#&&-&$$$$."'%"$
the gradient of J(w) to be
'*$%-/0'*,('-*$.'("$("#$1)*%('-*$
0 1
@J(w)
,22&-3'/,(-& %,*$0#$)+#4$(-$
B @w. 1 C
rw J(w) = @ .. C
B
%&#,(#$,*$#&&-&$1)*%('-*$$$$$$$$$$$$$
A
@J(w)
@wn
!"#$2,&(',5$4'11#&#*(',5$-1$("'+$#&&-&$
To find a local minimum of J(w)
1)*%('-*$$$$$$$$$$$$$$$$6$("#$7&,4'#*($
Adjust w in direction of -ve gradient
%,*$*-.$0#$)+#4$(-$)24,(#$("#$
'*(#&*,5$8,&',05#+$'*$("#$1)*%('-*$
1
w= ↵rw J(w)
,22&-3'/,(-&29,*4$%&'('%:;$$$$$$
where ↵ is a step-size parameter
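To make the update concrete, here is a minimal sketch (not from the lecture) that applies Δw = −½ α ∇_w J(w) to a simple quadratic objective; the objective J, its gradient, the step size and the number of steps are all arbitrary choices for illustration.

```python
import numpy as np

def J(w):
    # A simple differentiable objective: squared distance from a fixed target.
    target = np.array([1.0, -2.0, 0.5])
    return np.sum((w - target) ** 2)

def grad_J(w):
    # Analytic gradient of J(w) with respect to w.
    target = np.array([1.0, -2.0, 0.5])
    return 2.0 * (w - target)

alpha = 0.1          # step-size parameter
w = np.zeros(3)      # initial parameter vector

for step in range(100):
    # Adjust w in the direction of the negative gradient: Δw = -1/2 α ∇_w J(w)
    w += -0.5 * alpha * grad_J(w)

print(J(w))  # should be close to 0 after enough steps
```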
Value Function Approx. By Stochastic Gradient Descent
Goal: find parameter vector w minimising mean-squared error between approximate value fn v̂(S, w) and true value fn v_π(S)
J(w) = E_π[ (v_π(S) − v̂(S, w))^2 ]
Gradient descent finds a local minimum
Δw = −(1/2) α ∇_w J(w) = α E_π[ (v_π(S) − v̂(S, w)) ∇_w v̂(S, w) ]
Stochastic gradient descent samples the gradient
Δw = α (v_π(S) − v̂(S, w)) ∇_w v̂(S, w)
Expected update is equal to full gradient update
Linear Function Approximation
Feature Vectors
Represent state by a feature vector
x(S) = ( x_1(S), ..., x_n(S) )^T
For example:
Distance of robot from landmarks
Trends in the stock market
Piece and pawn configurations in chess
Linear Value Function Approximation
Represent value function by a linear combination of features
v̂(S, w) = x(S)^T w = Σ_{j=1}^{n} x_j(S) w_j
Objective function is quadratic in parameters w
J(w) = E_π[ (v_π(S) − x(S)^T w)^2 ]
Stochastic gradient descent converges on global optimum
Update rule is particularly simple
∇_w v̂(S, w) = x(S)
Δw = α (v_π(S) − v̂(S, w)) x(S)
Update = step-size × prediction error × feature value
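As an illustration of how simple the linear update is, the following sketch (illustrative only) assumes a hypothetical feature map `features(s)` and a supplied target value, and performs Δw = α (target − v̂(S, w)) x(S).

```python
import numpy as np

def features(s, n=8):
    # Hypothetical feature map: a coarse one-hot binning of a scalar state.
    x = np.zeros(n)
    x[int(s) % n] = 1.0
    return x

def v_hat(s, w):
    # Linear value function approximation: v̂(S, w) = x(S)ᵀ w
    return features(s).dot(w)

def sgd_update(w, s, target, alpha=0.1):
    # Δw = α (target − v̂(S, w)) x(S): step-size × prediction error × feature value
    x = features(s)
    return w + alpha * (target - x.dot(w)) * x

w = np.zeros(8)
w = sgd_update(w, s=3, target=1.0)   # one stochastic gradient step
print(v_hat(3, w))                   # the prediction moves towards the target
```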
Table Lookup Features
Table lookup is a special case of linear value function
approximation
Using table lookup features
x_table(S) = ( 1(S = s_1), ..., 1(S = s_n) )^T
Parameter vector w gives value of each individual state
v̂(S, w) = ( 1(S = s_1), ..., 1(S = s_n) ) · ( w_1, ..., w_n )^T
Incremental Prediction Algorithms
Have assumed true value function v_π(s) given by supervisor
But in RL there is no supervisor, only rewards
In practice, we substitute a target for v_π(s)
For MC, the target is the return G_t
Δw = α (G_t − v̂(S_t, w)) ∇_w v̂(S_t, w)
For TD(0), the target is the TD target R_{t+1} + γ v̂(S_{t+1}, w)
Δw = α (R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w)) ∇_w v̂(S_t, w)
For TD(λ), the target is the λ-return G_t^λ
Δw = α (G_t^λ − v̂(S_t, w)) ∇_w v̂(S_t, w)
Monte-Carlo with Value Function Approximation
Return G_t is an unbiased, noisy sample of true value v_π(S_t)
Can therefore apply supervised learning to “training data”:
⟨S_1, G_1⟩, ⟨S_2, G_2⟩, ..., ⟨S_T, G_T⟩
For example, using linear Monte-Carlo policy evaluation
Δw = α (G_t − v̂(S_t, w)) ∇_w v̂(S_t, w) = α (G_t − v̂(S_t, w)) x(S_t)
Monte-Carlo evaluation converges to a local optimum
Even when using non-linear value function approximation
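A minimal sketch of linear Monte-Carlo policy evaluation along these lines; the toy episodes, discount factor and one-hot feature map are assumptions made purely for illustration.

```python
import numpy as np

def features(s, n=5):
    # Hypothetical table-lookup features: one-hot encoding of a small state space.
    x = np.zeros(n)
    x[s] = 1.0
    return x

def mc_evaluate(episodes, w, alpha=0.1, gamma=1.0):
    # For each visited state, use the return G_t as the target:
    # Δw = α (G_t − v̂(S_t, w)) x(S_t)
    for episode in episodes:            # episode = [(state, reward on leaving it), ...]
        G = 0.0
        for s, r in reversed(episode):  # compute returns backwards through time
            G = r + gamma * G
            x = features(s)
            w = w + alpha * (G - x.dot(w)) * x
    return w

# Toy "training data": two episodes of (state, reward) pairs.
episodes = [[(0, 0.0), (1, 0.0), (2, 1.0)],
            [(0, 0.0), (3, 0.0), (2, 1.0)]]
w = mc_evaluate(episodes, w=np.zeros(5))
print(w)
```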
TD Learning with Value Function Approximation
The TD target R_{t+1} + γ v̂(S_{t+1}, w) is a biased sample of true value v_π(S_t)
Can still apply supervised learning to “training data”:
⟨S_1, R_2 + γ v̂(S_2, w)⟩, ⟨S_2, R_3 + γ v̂(S_3, w)⟩, ..., ⟨S_{T−1}, R_T⟩
For example, using linear TD(0)
Δw = α (R + γ v̂(S′, w) − v̂(S, w)) ∇_w v̂(S, w) = α δ x(S)
Linear TD(0) converges (close) to global optimum
TD(λ) with Value Function Approximation
The λ-return G_t^λ is also a biased sample of true value v_π(s)
Can again apply supervised learning to “training data”:
⟨S_1, G_1^λ⟩, ⟨S_2, G_2^λ⟩, ..., ⟨S_{T−1}, G_{T−1}^λ⟩
Forward view linear TD(λ)
Δw = α (G_t^λ − v̂(S_t, w)) ∇_w v̂(S_t, w) = α (G_t^λ − v̂(S_t, w)) x(S_t)
Backward view linear TD(λ)
δ_t = R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w)
E_t = γλ E_{t−1} + x(S_t)
Δw = α δ_t E_t
Forward view and backward view linear TD(λ) are equivalent
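A sketch of the backward view with an accumulating eligibility trace for a linear v̂; the toy transitions, feature map and constants are illustrative, and setting λ = 0 recovers linear TD(0).

```python
import numpy as np

def features(s, n=5):
    # Hypothetical one-hot features over a small state space.
    x = np.zeros(n)
    x[s] = 1.0
    return x

def td_lambda_update(w, e, s, r, s_next, done, alpha=0.1, gamma=0.9, lam=0.8):
    # δ_t = R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w)
    v_next = 0.0 if done else features(s_next).dot(w)
    delta = r + gamma * v_next - features(s).dot(w)
    # E_t = γ λ E_{t−1} + x(S_t)
    e = gamma * lam * e + features(s)
    # Δw = α δ_t E_t
    w = w + alpha * delta * e
    return w, e

w, e = np.zeros(5), np.zeros(5)
transitions = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]
for s, r, s_next, done in transitions:
    w, e = td_lambda_update(w, e, s, r, s_next, done)
print(w)
```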
Incremental Control Algorithms
Control with Value Function Approximation
[Figure: generalised policy iteration with value function approximation. Starting from w, alternate approximate policy evaluation q_w ≈ q_π with ε-greedy improvement, converging towards q_*.]
Policy evaluation: approximate policy evaluation, q̂(·, ·, w) ≈ q_π
Policy improvement: ε-greedy policy improvement
Action-Value Function Approximation
Approximate the action-value function
q̂(S, A, w) ≈ q_π(S, A)
Minimise mean-squared error between approximate action-value fn q̂(S, A, w) and true action-value fn q_π(S, A)
J(w) = E_π[ (q_π(S, A) − q̂(S, A, w))^2 ]
Use stochastic gradient descent to find a local minimum
−(1/2) ∇_w J(w) = (q_π(S, A) − q̂(S, A, w)) ∇_w q̂(S, A, w)
Δw = α (q_π(S, A) − q̂(S, A, w)) ∇_w q̂(S, A, w)
Linear Action-Value Function Approximation
Represent state and action by a feature vector
x(S, A) = ( x_1(S, A), ..., x_n(S, A) )^T
Represent action-value fn by linear combination of features
q̂(S, A, w) = x(S, A)^T w = Σ_{j=1}^{n} x_j(S, A) w_j
Stochastic gradient descent update
∇_w q̂(S, A, w) = x(S, A)
Δw = α (q_π(S, A) − q̂(S, A, w)) x(S, A)
Incremental Control Algorithms
Like prediction, we must substitute a target for q_π(S, A)
For MC, the target is the return G_t
Δw = α (G_t − q̂(S_t, A_t, w)) ∇_w q̂(S_t, A_t, w)
For TD(0), the target is the TD target R_{t+1} + γ q̂(S_{t+1}, A_{t+1}, w)
Δw = α (R_{t+1} + γ q̂(S_{t+1}, A_{t+1}, w) − q̂(S_t, A_t, w)) ∇_w q̂(S_t, A_t, w)
For forward-view TD(λ), the target is the action-value λ-return q_t^λ
Δw = α (q_t^λ − q̂(S_t, A_t, w)) ∇_w q̂(S_t, A_t, w)
For backward-view TD(λ), the equivalent update is
δ_t = R_{t+1} + γ q̂(S_{t+1}, A_{t+1}, w) − q̂(S_t, A_t, w)
E_t = γλ E_{t−1} + ∇_w q̂(S_t, A_t, w)
Δw = α δ_t E_t
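Putting these pieces together, here is a compact SARSA(λ)-style control loop with a linear q̂ and ε-greedy improvement. The `ToyChain` environment, feature map and hyperparameters are invented for illustration; this is a generic sketch of the idea rather than the lecture's Mountain Car experiments.

```python
import numpy as np

class ToyChain:
    # Tiny illustrative chain environment: states 0..4, actions 0 (left) / 1 (right),
    # reward +1 for reaching the right end, episode ends at either end.
    def reset(self):
        self.s = 2
        return self.s
    def step(self, a):
        self.s += 1 if a == 1 else -1
        done = self.s in (0, 4)
        reward = 1.0 if self.s == 4 else 0.0
        return self.s, reward, done

def x(s, a, n_states=5, n_actions=2):
    # State-action features: one-hot over (state, action) pairs.
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0
    return v

def epsilon_greedy(w, s, n_actions=2, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax([x(s, a).dot(w) for a in range(n_actions)]))

def sarsa_lambda(env, episodes=200, alpha=0.1, gamma=0.9, lam=0.8):
    w = np.zeros(10)
    for _ in range(episodes):
        e = np.zeros_like(w)                 # eligibility trace
        s = env.reset()
        a = epsilon_greedy(w, s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(w, s2) if not done else 0
            q_next = 0.0 if done else x(s2, a2).dot(w)
            delta = r + gamma * q_next - x(s, a).dot(w)   # TD error δ_t
            e = gamma * lam * e + x(s, a)                 # accumulate trace
            w = w + alpha * delta * e                     # Δw = α δ_t E_t
            s, a = s2, a2
    return w

w = sarsa_lambda(ToyChain())
print(w.reshape(5, 2))   # learned q̂ values for each (state, action)
```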
Linear Sarsa with Coarse Coding in Mountain Car
Linear Sarsa with Radial Basis Functions in Mountain Car
Study of λ: Should We Bootstrap?
Convergence
Baird’s Counterexample
Parameter Divergence in Baird’s Counterexample
Convergence of Prediction Algorithms
On/Off-Policy   Algorithm   Table Lookup   Linear   Non-Linear
On-Policy       MC          ✓              ✓        ✓
                TD(0)       ✓              ✓        ✗
                TD(λ)       ✓              ✓        ✗
Off-Policy      MC          ✓              ✓        ✓
                TD(0)       ✓              ✗        ✗
                TD(λ)       ✓              ✗        ✗
Gradient Temporal-Difference Learning
TD does not follow the gradient of any objective function
This is why TD can diverge when off-policy or using non-linear function approximation
Gradient TD follows true gradient of projected Bellman error
On/Off-Policy   Algorithm     Table Lookup   Linear   Non-Linear
On-Policy       MC            ✓              ✓        ✓
                TD            ✓              ✓        ✗
                Gradient TD   ✓              ✓        ✓
Off-Policy      MC            ✓              ✓        ✓
                TD            ✓              ✗        ✗
                Gradient TD   ✓              ✓        ✓
Convergence of Control Algorithms
Algorithm             Table Lookup   Linear   Non-Linear
Monte-Carlo Control   ✓              (✓)      ✗
Sarsa                 ✓              (✓)      ✗
Q-learning            ✓              ✗        ✗
Gradient Q-learning   ✓              ✓        ✗
(✓) = chatters around near-optimal value function
Batch Methods
Batch Reinforcement Learning
Gradient descent is simple and appealing
But it is not sample efficient
Batch methods seek to find the best fitting value function
Given the agent’s experience (“training data”)
Least Squares Prediction
Given value function approximation v̂(s, w) ≈ v_π(s)
And experience D consisting of ⟨state, value⟩ pairs
D = {⟨s_1, v_1^π⟩, ⟨s_2, v_2^π⟩, ..., ⟨s_T, v_T^π⟩}
Which parameters w give the best fitting value fn v̂(s, w)?
Least squares algorithms find parameter vector w minimising sum-squared error between v̂(s_t, w) and target values v_t^π,
LS(w) = Σ_{t=1}^{T} (v_t^π − v̂(s_t, w))^2 = E_D[ (v^π − v̂(s, w))^2 ]
Stochastic Gradient Descent with Experience Replay
Given experience consisting of ⟨state, value⟩ pairs
D = {⟨s_1, v_1^π⟩, ⟨s_2, v_2^π⟩, ..., ⟨s_T, v_T^π⟩}
Repeat:
1. Sample state, value from experience: ⟨s, v^π⟩ ∼ D
2. Apply stochastic gradient descent update: Δw = α (v^π − v̂(s, w)) ∇_w v̂(s, w)
Converges to least squares solution
w^π = argmin_w LS(w)
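A minimal sketch of this replay loop for a linear v̂, assuming a stored dataset of ⟨state, value⟩ pairs; the data, feature map and step size are made up for illustration.

```python
import numpy as np

def features(s, n=4):
    # Hypothetical one-hot features.
    x = np.zeros(n)
    x[s] = 1.0
    return x

# Experience D: a list of <state, value> pairs (targets assumed given).
D = [(0, 0.0), (1, 0.5), (2, 1.0), (3, 0.25)]

w, alpha = np.zeros(4), 0.1
for _ in range(1000):
    s, v_target = D[np.random.randint(len(D))]    # 1. sample <s, v^π> ~ D
    x = features(s)
    w = w + alpha * (v_target - x.dot(w)) * x     # 2. SGD step towards the target
print(w)   # converges to the least-squares fit of the stored targets
```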
Experience Replay in Deep Q-Networks (DQN)
DQN uses experience replay and fixed Q-targets
Take action a_t according to ε-greedy policy
Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample random mini-batch of transitions (s, a, r, s′) from D
Compute Q-learning targets w.r.t. old, fixed parameters w^−
Optimise MSE between Q-network and Q-learning targets
L_i(w_i) = E_{s,a,r,s′ ∼ D_i}[ ( r + γ max_{a′} Q(s′, a′; w_i^−) − Q(s, a; w_i) )^2 ]
Using variant of stochastic gradient descent
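The sketch below reproduces only the structure described on this slide: a replay memory, ε-greedy acting, and Q-learning targets computed from old, fixed parameters w⁻ that are refreshed periodically. It is not the DQN implementation; to stay self-contained it substitutes a linear Q-function and a toy environment for the deep convolutional network and Atari frames, and all names and constants are illustrative.

```python
import random
from collections import deque
import numpy as np

class ToyEnv:
    # Illustrative stand-in environment: 6 states, 2 actions, reward at the right end.
    n_states, n_actions = 6, 2
    def reset(self):
        self.s = 3
        return self.s
    def step(self, a):
        self.s = min(max(self.s + (1 if a == 1 else -1), 0), 5)
        done = self.s in (0, 5)
        return self.s, (1.0 if self.s == 5 else 0.0), done

def phi(s, n=6):
    x = np.zeros(n); x[s] = 1.0
    return x

def q_values(w, s):
    # Linear Q-function stand-in for the deep Q-network: one weight row per action.
    return w.dot(phi(s))

env = ToyEnv()
w = np.zeros((env.n_actions, env.n_states))   # online parameters w_i
w_old = w.copy()                              # fixed target parameters w⁻
memory = deque(maxlen=10_000)                 # replay memory D
alpha, gamma, eps, batch = 0.1, 0.9, 0.1, 32

s = env.reset()
for t in range(5000):
    # Act ε-greedily and store the transition (s, a, r, s') in D.
    a = random.randrange(env.n_actions) if random.random() < eps else int(np.argmax(q_values(w, s)))
    s2, r, done = env.step(a)
    memory.append((s, a, r, s2, done))
    s = env.reset() if done else s2

    if len(memory) >= batch:
        # Sample a random mini-batch and regress towards fixed-parameter targets.
        for (bs, ba, br, bs2, bdone) in random.sample(list(memory), batch):
            target = br + (0.0 if bdone else gamma * np.max(q_values(w_old, bs2)))
            td_error = target - q_values(w, bs)[ba]
            w[ba] += alpha * td_error * phi(bs)   # (stochastic) gradient step on the MSE

    if t % 200 == 0:
        w_old = w.copy()                          # refresh the fixed Q-targets

print(np.round(w, 2))
```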
DQN in Atari
End-to-end learning of values Q(s, a) from pixels s
Input state s is stack of raw pixels from last 4 frames
Output is Q(s, a) for 18 joystick/button positions
Reward is change in score for that step
Network architecture and hyperparameters fixed across all games
DQN Results in Atari
How much does DQN help?
Game             Replay + Fixed-Q   Replay + Q-learning   No replay + Fixed-Q   No replay + Q-learning
Breakout         316.81             240.73                10.16                 3.17
Enduro           1006.3             831.25                141.89                29.1
River Raid       7446.62            4102.81               2867.66               1453.02
Seaquest         2894.4             822.55                1003                  275.81
Space Invaders   1088.94            826.33                373.22                301.99
Linear Least Squares Prediction
Experience replay finds least squares solution
But it may take many iterations
Using linear value function approximation v̂(s, w) = x(s)^T w
We can solve for the least squares solution directly
Linear Least Squares Prediction (2)
At minimum of LS(w), the expected update must be zero
E_D[Δw] = 0
α Σ_{t=1}^{T} x(s_t) (v_t^π − x(s_t)^T w) = 0
Σ_{t=1}^{T} x(s_t) v_t^π = Σ_{t=1}^{T} x(s_t) x(s_t)^T w
w = ( Σ_{t=1}^{T} x(s_t) x(s_t)^T )^{−1} Σ_{t=1}^{T} x(s_t) v_t^π
For N features, direct solution time is O(N^3)
Incremental solution time is O(N^2) using Sherman-Morrison
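A sketch of the direct batch solution, assuming the targets v_t^π are given; the small regularisation term is an implementation-side assumption added to keep the matrix invertible (the Sherman-Morrison variant would instead maintain the inverse incrementally).

```python
import numpy as np

def least_squares_weights(X, v, reg=1e-6):
    # X: T×N matrix whose rows are feature vectors x(s_t)ᵀ
    # v: length-T vector of targets v_t^π
    # Solves  (Σ_t x(s_t) x(s_t)ᵀ) w = Σ_t x(s_t) v_t^π  directly, O(N³) in the features.
    A = X.T @ X + reg * np.eye(X.shape[1])
    b = X.T @ v
    return np.linalg.solve(A, b)

# Toy data: 100 random feature vectors and noisy linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
true_w = np.array([1.0, -0.5, 0.0, 2.0])
v = X @ true_w + 0.01 * rng.normal(size=100)
print(np.round(least_squares_weights(X, v), 2))   # recovers approximately true_w
```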
Linear Least Squares Prediction Algorithms
We do not know the true values v_t^π
In practice, our “training data” must use noisy or biased samples of v_t^π
LSMC: Least Squares Monte-Carlo uses the return, v_t^π ≈ G_t
LSTD: Least Squares Temporal-Difference uses the TD target, v_t^π ≈ R_{t+1} + γ v̂(S_{t+1}, w)
LSTD(λ): Least Squares TD(λ) uses the λ-return, v_t^π ≈ G_t^λ
In each case solve directly for fixed point of MC / TD / TD(λ)
Linear Least Squares Prediction Algorithms (2)
LSMC:    0 = Σ_{t=1}^{T} α (G_t − v̂(S_t, w)) x(S_t)
         w = ( Σ_{t=1}^{T} x(S_t) x(S_t)^T )^{−1} Σ_{t=1}^{T} x(S_t) G_t
LSTD:    0 = Σ_{t=1}^{T} α (R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w)) x(S_t)
         w = ( Σ_{t=1}^{T} x(S_t) (x(S_t) − γ x(S_{t+1}))^T )^{−1} Σ_{t=1}^{T} x(S_t) R_{t+1}
LSTD(λ): 0 = Σ_{t=1}^{T} α δ_t E_t
         w = ( Σ_{t=1}^{T} E_t (x(S_t) − γ x(S_{t+1}))^T )^{−1} Σ_{t=1}^{T} E_t R_{t+1}
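A sketch of batch LSTD along these lines: accumulate A = Σ_t x(S_t)(x(S_t) − γ x(S_{t+1}))^T and b = Σ_t x(S_t) R_{t+1}, then solve A w = b. The transition data, feature map and regularisation are illustrative assumptions.

```python
import numpy as np

def lstd(transitions, phi, n_features, gamma=0.9, reg=1e-6):
    # transitions: list of (s, r, s_next, done) generated by the policy being evaluated.
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        x = phi(s)
        x_next = np.zeros(n_features) if done else phi(s_next)
        A += np.outer(x, x - gamma * x_next)   # Σ x(S_t)(x(S_t) − γ x(S_{t+1}))ᵀ
        b += x * r                             # Σ x(S_t) R_{t+1}
    return np.linalg.solve(A, b)               # w = A⁻¹ b

def phi(s, n=4):
    # Hypothetical one-hot features over a 4-state chain.
    x = np.zeros(n); x[s] = 1.0
    return x

# Toy batch of transitions (state, reward, next state, terminal?).
transitions = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True),
               (1, 0.0, 0, False), (0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]
print(np.round(lstd(transitions, phi, n_features=4), 3))
```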
Convergence of Linear Least Squares Prediction Algorithms
On/Off-Policy   Algorithm   Table Lookup   Linear   Non-Linear
On-Policy       MC          ✓              ✓        ✓
                LSMC        ✓              ✓        -
                TD          ✓              ✓        ✗
                LSTD        ✓              ✓        -
Off-Policy      MC          ✓              ✓        ✓
                LSMC        ✓              ✓        -
                TD          ✓              ✗        ✗
                LSTD        ✓              ✓        -
Least Squares Control
Least Squares Policy Iteration
[Figure: least squares policy iteration. Starting from w, alternate policy evaluation q_w ≈ q_π with greedy improvement, converging towards q_*.]
Policy evaluation: policy evaluation by least squares Q-learning
Policy improvement: greedy policy improvement
Least Squares Action-Value Function Approximation
Approximate action-value function q_π(s, a) using linear combination of features x(s, a)
q̂(s, a, w) = x(s, a)^T w ≈ q_π(s, a)
Minimise least squares error between q̂(s, a, w) and q_π(s, a)
from experience generated using policy π, consisting of ⟨(state, action), value⟩ pairs
D = {⟨(s_1, a_1), v_1^π⟩, ⟨(s_2, a_2), v_2^π⟩, ..., ⟨(s_T, a_T), v_T^π⟩}
Least Squares Control
For policy evaluation, we want to efficiently use all experience
For control, we also want to improve the policy
This experience is generated from many policies
So to evaluate q_π(S, A) we must learn off-policy
We use the same idea as Q-learning:
Use experience generated by old policy: S_t, A_t, R_{t+1}, S_{t+1} ∼ π_old
Consider alternative successor action A′ = π_new(S_{t+1})
Update q̂(S_t, A_t, w) towards value of alternative action
R_{t+1} + γ q̂(S_{t+1}, A′, w)
Least Squares Q-Learning
Consider the following linear Q-learning update
δ = R_{t+1} + γ q̂(S_{t+1}, π(S_{t+1}), w) − q̂(S_t, A_t, w)
Δw = α δ x(S_t, A_t)
LSTDQ algorithm: solve for total update = zero
0 = Σ_{t=1}^{T} α (R_{t+1} + γ q̂(S_{t+1}, π(S_{t+1}), w) − q̂(S_t, A_t, w)) x(S_t, A_t)
w = ( Σ_{t=1}^{T} x(S_t, A_t) (x(S_t, A_t) − γ x(S_{t+1}, π(S_{t+1})))^T )^{−1} Σ_{t=1}^{T} x(S_t, A_t) R_{t+1}
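A sketch of LSTDQ under the same assumptions: given a batch of transitions and a policy π to evaluate, accumulate A = Σ_t x(S_t, A_t)(x(S_t, A_t) − γ x(S_{t+1}, π(S_{t+1})))^T and b = Σ_t x(S_t, A_t) R_{t+1}, then solve for w. The feature map, policy and toy data are invented for illustration.

```python
import numpy as np

def lstdq(transitions, policy, phi, n_features, gamma=0.9, reg=1e-6):
    # Solve for the fixed point of the Q-learning update over the whole batch.
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next, done in transitions:
        x = phi(s, a)
        x_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(x, x - gamma * x_next)
        b += x * r
    return np.linalg.solve(A, b)   # q̂(s, a, w) = x(s, a)ᵀ w

def phi(s, a, n_states=3, n_actions=2):
    # Hypothetical one-hot state-action features.
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

policy = lambda s: 1                      # π to evaluate: always take action 1
transitions = [(0, 1, 0.0, 1, False), (1, 1, 0.0, 2, False), (2, 1, 1.0, 0, True),
               (0, 0, 0.0, 0, False), (1, 0, 0.0, 0, False)]
w = lstdq(transitions, policy, phi, n_features=6)
print(np.round(w, 3))
```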
Least Squares Policy Iteration Algorithm
The following pseudocode uses LSTDQ for policy evaluation
It repeatedly re-evaluates experience D with different policies
function LSPI-TD(D, π_0)
    π′ ← π_0
    repeat
        π ← π′
        Q ← LSTDQ(π, D)
        for all s ∈ S do
            π′(s) ← argmax_{a ∈ A} Q(s, a)
        end for
    until (π ≈ π′)
    return π
end function
Convergence of Control Algorithms
Algorithm             Table Lookup   Linear   Non-Linear
Monte-Carlo Control   ✓              (✓)      ✗
Sarsa                 ✓              (✓)      ✗
Q-learning            ✓              ✗        ✗
LSPI                  ✓              (✓)      -
(✓) = chatters around near-optimal value function
Chain Walk Example
[Figure: the chain walk MDP. Actions L and R move as intended with probability 0.9 and in the opposite direction with probability 0.1; rewards r = 0, 1, 1, 0 across the four states shown.]
Consider the 50-state version of this problem
Reward +1 in states 10 and 41, 0 elsewhere
Optimal policy: R (1-9), L (10-25), R (26-41), L (42-50)
Features: 10 evenly spaced Gaussians (σ = 4) for each action
Experience: 10,000 steps from a random walk policy
LSPI in Chain Walk: Action-Value Function
[Figure: LSPI on the 50-state chain walk, showing the approximate action-value function at iterations 1-7.]
LSPI in Chain Walk: Policy
[Figure: LSPI on the 50-state chain walk, showing the greedy policy at iterations 1-7. Caption: LSPI iterations on a 50-state chain with a radial basis function approximator (reward only in states 10 and 41); the top panels show the state-action value function of the policy being evaluated at each iteration (LSPI approximation shown as solid lines).]
Questions?