07 FA Methods
Abir Das
IIT Kharagpur
Agenda
Resources
Tabular Methods
§ In the last few lectures we have seen ‘tabular methods’ for solving RL
problems.
§ The state or action values are all represented by look-up tables.
Either every state s or every state-action pair (s, a) has an entry in
the form of V (s) or Q(s, a).
§ The approach, in general, updates old values in these tables towards a
target value in each iteration.
§ Let the notation $s \mapsto u$ denote an individual update, where s is the state updated and u is the target (a tabular TD(0) sketch follows this list).
I Monte Carlo update: $s_t \mapsto G_t$
I TD(0) update: $s_t \mapsto R_{t+1} + \gamma V(s_{t+1})$
I DP update: $s_t \mapsto \sum_{a \in \mathcal{A}} \pi(a \mid s_t) \left[ R(s_t, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s_t, a)\, v(s') \right]$
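Below is a minimal Python sketch of the tabular TD(0) update $s_t \mapsto R_{t+1} + \gamma V(s_{t+1})$; the `env` object (with `reset()`/`step(a)`) and the policy `pi` are hypothetical interfaces, not from the lecture.

```python
from collections import defaultdict

def tabular_td0(env, pi, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): move V(s_t) towards the target R_{t+1} + gamma * V(s_{t+1})."""
    V = defaultdict(float)                    # look-up table: one entry per state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)                         # action from the fixed policy being evaluated
            s_next, r, done = env.step(a)     # sample one transition
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # update the old value towards the target
            s = s_next
    return V
```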
Types of Value Function Approximation
[Figure: three architectures — a state-value approximator v̂(s, w) taking s as input; an action-in action-value approximator q̂(s, a, w) taking s and a as input; and an action-out approximator producing q̂(s, a_1, w), …, q̂(s, a_m, w) from s alone. Each is parameterised by w.]
Figure credit: David Silver, DeepMind
§ Do we average?
§ In the classical supervised setting, the averaging by 1/N approximates the expected error by the empirical error.
§ With training data $\{(s^{(i)}, v_\pi(s^{(i)}))\}$ coming from a probability distribution $D(s, v_\pi(s))$,
I Expected error: $\int l(\hat{v}(s; w), v_\pi(s)) \, dD(s, v_\pi(s))$
I Empirical error: $\frac{1}{N} \sum_{i=1}^{N} l(\hat{v}(s^{(i)}; w), v_\pi(s^{(i)}))$
§ $\mu(s) > 0$, with $\sum_s \mu(s) = 1$, provides the probability of obtaining a state any time we sample an MRP (why an MRP?)
§ In a non-episodic task, the distribution is the stationary distribution
under π
§ In an episodic task, µ(s) is the fraction of time spent in s; a simple proof is given in [SB], Section 9.2. Under on-policy training, this is called the on-policy distribution.
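As an illustration (not from the lecture), the on-policy distribution can be estimated empirically as visit fractions under π; the `env`/`pi` interface below is the same hypothetical one used in the earlier sketch.

```python
from collections import Counter

def estimate_mu(env, pi, num_episodes=1000):
    """Estimate the on-policy distribution mu(s) as the fraction of time spent in each state."""
    counts = Counter()
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            counts[s] += 1
            s, _, done = env.step(pi(s))
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}
```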
§ This essentially tells us that for the states visited more often under the policy π, the approximation will be better.
§ For on-policy sampling, the following approximates the error measure described in the previous slide:
$$\frac{1}{N} \sum_{t=1}^{N} \left[ v_\pi(s_t) - \hat{v}(s_t; w) \right]^2$$
§ But what is the practical problem with it? Hint: a big assumption was made. Where will we get the target $v_\pi(s_t)$?
§ We will come back to this.
Data Preparation
§ Traditionally, supervised methods get data in the form $\langle x^{(i)}, f(x^{(i)}) \rangle$, from which a function $\hat{f}(x)$ is learned to predict for an unseen x.
§ In RL, data comes in the form $\langle s_t, a_t, R_{t+1}, s_{t+1}, a_{t+1}, R_{t+2}, \cdots \rangle$, from which the approximate targets have to be generated.
§ For MC targets, this is done as $\langle s_t, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \rangle$, $\langle s_{t+1}, R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \rangle$, etc.
§ For TD(0) targets, this is done as $\langle s_t, R_{t+1} + \gamma \hat{v}(s_{t+1}; w_k) \rangle$, $\langle s_{t+1}, R_{t+2} + \gamma \hat{v}(s_{t+2}; w_{k+1}) \rangle$, etc.
§ The last setting gives non-stationary training data, as the targets are generated using the function approximator itself.
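A minimal sketch of this target construction for one recorded episode; the lists `states` and `rewards` (with `rewards[t]` the reward following `states[t]`) and the current approximator `v_hat` are hypothetical names.

```python
def mc_targets(states, rewards, gamma=0.99):
    """Pair each state with its Monte Carlo return G_t = R_{t+1} + gamma*R_{t+2} + ..."""
    targets, G = [], 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        G = r + gamma * G                 # accumulate the return backwards through the episode
        targets.append((s, G))
    return list(reversed(targets))

def td0_targets(states, rewards, v_hat, gamma=0.99):
    """Pair each state with the bootstrapped target R_{t+1} + gamma * v_hat(s_{t+1})."""
    targets = []
    for t in range(len(rewards)):
        bootstrap = v_hat(states[t + 1]) if t + 1 < len(states) else 0.0
        targets.append((states[t], rewards[t] + gamma * bootstrap))
    return targets
```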
§ The key step in eqn. 1 relies on the target being independent of $w_t$.
§ An approximate way is to take into account the effect of changing $w_t$ on the estimate, but to ignore its effect on the target. Thus it is termed a semi-gradient descent method.
Semi-gradient TD(0)
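A minimal sketch of semi-gradient TD(0) with a linear approximator $\hat{v}(s; w) = x(s)^\top w$, assuming the same hypothetical `env`/`pi` interface as before and a feature map `x` returning a length-d vector; note that the gradient is taken only through the estimate, not through the bootstrapped target.

```python
import numpy as np

def semi_gradient_td0(env, pi, x, d, num_episodes=1000, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with linear value function v_hat(s; w) = x(s).w."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(pi(s))
            v_next = 0.0 if done else x(s_next) @ w    # bootstrap; gradient NOT taken through this
            delta = r + gamma * v_next - x(s) @ w      # TD error
            w += alpha * delta * x(s)                  # gradient of the linear v_hat w.r.t. w is x(s)
            s = s_next
    return w
```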
Feature Vectors
§ Feature vectors are representations of states. Corresponding to every state s, there is a real-valued vector $x(s) \in \mathbb{R}^d$, i.e., with the same number of components as w:
$$x(s) = \begin{pmatrix} x_1(s) \\ \vdots \\ x_d(s) \end{pmatrix}$$
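As an illustration (not from the lecture), two common ways to build x(s): one-hot features over a small discrete state space (which recovers the tabular case under linear FA) and a few hand-crafted features for a scalar state.

```python
import numpy as np

def one_hot_features(s, num_states):
    """One-hot x(s): linear FA with these features is exactly the tabular method."""
    x = np.zeros(num_states)
    x[s] = 1.0
    return x

def polynomial_features(s, degree=3):
    """Hand-crafted features for a scalar state, e.g. x(s) = (1, s, s^2, s^3)."""
    return np.array([float(s) ** k for k in range(degree + 1)])
```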
§ Let's say we get a reward at the end of some step. What the eligibility trace says is that the credit for the reward should trickle back, in decreasing proportion, all the way to the first state. The credit should be larger for the state-action pairs that were close to the rewarding step, and also for those state-action pairs that were visited frequently along the way.
§ Earlier we had eligibility values for all states. Now, we have two
options for eligibility traces in function approximation based methods.
§ Should the eligibility trace be on the features x(s) or the parameters
w?
§ Hint:
I What was eligibility doing in TD(λ) or SARSA(λ) algorithms? Was it
associated with something getting updated or not?
§ The update step updates the parameter values. So, eligibilities are
associated with the parameters.
§ From slide 16, the update rule without eligibility for linear models was $w_{k+1} = w_k + \alpha \delta_t \, x(s_t)$
§ So, for $Q(s_{t+1}, a_{t+1})$, $x^\top(s_{t+1}) \, w_{a_{t+1}}$ will be used, and for $Q(s_t, a_t)$, $x^\top(s_t) \, w_{a_t}$ will be used
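A minimal sketch of how an eligibility trace over the parameters fits into this linear update (backward-view TD(λ) for prediction, with an accumulating trace; the `env`/`pi`/`x` interface is the same hypothetical one as above).

```python
import numpy as np

def semi_gradient_td_lambda(env, pi, x, d, num_episodes=1000,
                            alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) with linear FA: the eligibility trace lives on the parameters."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        z = np.zeros(d)                     # eligibility trace, one entry per parameter
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(pi(s))
            v_next = 0.0 if done else x(s_next) @ w
            delta = r + gamma * v_next - x(s) @ w
            z = gamma * lam * z + x(s)      # accumulate: gradient of the linear v_hat is x(s)
            w += alpha * delta * z          # every parameter is updated through its eligibility
            s = s_next
    return w
```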
Afterstates
§ These options are manageable for a small number of actions.
§ What about a large number of actions, or continuous actions?
§ What about generalization?
§ The concept of ‘afterstates’ helps in this regard.
§ The afterstate concept is based on separating the deterministic effect of the agent’s move from the stochastic part of the state transition.
Afterstates
§ In such cases, the position-move pairs are different but produce the same “afterposition”.
§ A conventional action-value function would have to assess both pairs separately, whereas an afterstate value function would immediately assess both equally.
§ $S \times A \to S' \to S$: instead of learning action values over (s, a), a state-value function over afterstates is learned.
§ What are the advantages?
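A minimal sketch of TD(0) over afterstates, assuming a hypothetical deterministic `move(s, a)` that returns the afterstate, an `actions(s)` function listing legal moves, and an `env.step_from(after)` that resolves the environment's stochastic response (e.g., the opponent's reply in a board game).

```python
import random
from collections import defaultdict

def afterstate_td0(env, move, actions, num_episodes=1000,
                   alpha=0.1, gamma=1.0, epsilon=0.1):
    """TD(0) over afterstates: value the deterministic result of a move, not the (s, a) pair."""
    V = defaultdict(float)                       # one value per afterstate
    for _ in range(num_episodes):
        s = env.reset()
        done, pending = False, None              # pending = (afterstate, reward) awaiting its successor
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions(s))
            else:                                # greedy: compare afterstate values directly
                a = max(actions(s), key=lambda act: V[move(s, act)])
            after = move(s, a)                   # deterministic effect of the agent's move
            if pending is not None:              # update the previous afterstate towards r + gamma * V(after)
                prev_after, prev_r = pending
                V[prev_after] += alpha * (prev_r + gamma * V[after] - V[prev_after])
            s, r, done = env.step_from(after)    # stochastic environment response
            pending = (after, r)
        if pending is not None:                  # the terminal afterstate bootstraps on zero
            prev_after, prev_r = pending
            V[prev_after] += alpha * (prev_r - V[prev_after])
    return V
```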
Control with Value Function Approximation
[Figure: generalised policy iteration with action-value function approximation — approximate policy evaluation, $\hat{q}(\cdot, \cdot, w) \approx q_\pi$, alternated with ε-greedy policy improvement; starting from some w, the process converges towards $q_w \approx q_*$.]
Action-Value Function Approximation
$\hat{q}(S, A, w) \approx q_\pi(S, A)$
Incremental Control Algorithms
Like prediction, we must substitute a target for $q_\pi(S, A)$:
For MC, the target is the return $G_t$:
$$\Delta w = \alpha \left( G_t - \hat{q}(S_t, A_t, w) \right) \nabla_w \hat{q}(S_t, A_t, w)$$
For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w)$:
$$\Delta w = \alpha \left( R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S_t, A_t, w) \right) \nabla_w \hat{q}(S_t, A_t, w)$$
For forward-view TD(λ), the target is the action-value λ-return $q_t^\lambda$:
$$\Delta w = \alpha \left( q_t^\lambda - \hat{q}(S_t, A_t, w) \right) \nabla_w \hat{q}(S_t, A_t, w)$$
For backward-view TD(λ), the equivalent update is
$$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S_t, A_t, w), \qquad E_t = \gamma \lambda E_{t-1} + \nabla_w \hat{q}(S_t, A_t, w), \qquad \Delta w = \alpha \delta_t E_t$$
Slide credit: [David Silver: Deepmind]
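A minimal sketch of episodic semi-gradient SARSA (the TD(0) case above) with a linear action-value approximator $\hat{q}(s, a; w) = x_{sa}(s, a)^\top w$; the feature map `x_sa`, the list `actions`, and the `env` interface are assumptions, not part of the slide.

```python
import random
import numpy as np

def semi_gradient_sarsa(env, x_sa, actions, d, num_episodes=1000,
                        alpha=0.01, gamma=0.99, epsilon=0.1):
    """Episodic semi-gradient SARSA with linear q_hat(s, a; w) = x_sa(s, a).w."""
    w = np.zeros(d)
    q = lambda s, a: x_sa(s, a) @ w

    def policy(s):                                   # epsilon-greedy w.r.t. the current q_hat
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    for _ in range(num_episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                           # no bootstrap at the terminal step
            else:
                a_next = policy(s_next)
                target = r + gamma * q(s_next, a_next)
            w += alpha * (target - q(s, a)) * x_sa(s, a)   # semi-gradient update
            if not done:
                s, a = s_next, a_next
    return w
```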
§ We have also seen how this error function is minimized using gradient
descent updates.
§ But, if we look carefully, solving this is nothing but finding a least-squares solution, i.e., finding the parameter vector w minimising the sum-squared error between the target values and the predicted values.
§ We know that, for a linear function approximator (i.e., when $\hat{v}(s_t; w)$ is a linear function of w), the solution is exact and can be obtained in closed form.
§ For a given $w_t$, the expected value of the new weight $w_{t+1}$ can be written as (the expectation is over the different values of R and x)
$$\mathbb{E}[w_{t+1} \mid w_t] = w_t + \alpha (b - A w_t)$$
where, as in [SB], Section 9.4, $b = \mathbb{E}[R_{t+1} x_t]$ and $A = \mathbb{E}\!\left[ x_t (x_t - \gamma x_{t+1})^\top \right]$.
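At the fixed point the expected update is zero, giving the closed-form solution $w = A^{-1} b$ (the LSTD idea). A minimal sketch, assuming a list of already-featurized transitions `(x_t, r, x_next, done)`:

```python
import numpy as np

def lstd(transitions, d, gamma=0.99, reg=1e-6):
    """Least-squares TD: estimate A and b from data, then solve A w = b directly."""
    A = reg * np.eye(d)                             # small ridge term keeps A invertible
    b = np.zeros(d)
    for x_t, r, x_next, done in transitions:
        x_next = np.zeros(d) if done else x_next
        A += np.outer(x_t, x_t - gamma * x_next)    # A ~ E[x_t (x_t - gamma x_{t+1})^T]
        b += r * x_t                                # b ~ E[R_{t+1} x_t]
    return np.linalg.solve(A, b)                    # closed-form TD fixed point
```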
§ Repeat
I Sample state, value from experience: $\langle s^{(i)}, v_\pi(s^{(i)}) \rangle \sim \mathcal{D}$
I Apply minibatch gradient descent update:
$$w \leftarrow w + \alpha \sum_i \left[ v_\pi(s^{(i)}) - \hat{v}(s^{(i)}; w) \right] \nabla_w \hat{v}(s^{(i)}; w)$$
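A minimal sketch of this experience-replay loop with a linear approximator; `dataset` is a hypothetical list of (feature-vector, target) pairs built from experience.

```python
import random
import numpy as np

def replay_sgd(dataset, d, num_updates=10000, batch_size=32, alpha=0.01):
    """Repeatedly sample minibatches from stored experience and apply the SGD update."""
    w = np.zeros(d)
    for _ in range(num_updates):
        batch = random.sample(dataset, batch_size)   # sample <s, v_pi(s)> pairs from D
        grad = np.zeros(d)
        for x_s, target in batch:
            grad += (target - x_s @ w) * x_s         # [v_pi(s) - v_hat(s; w)] * grad of v_hat
        w += alpha * grad
    return w
```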
DQN in Atari
End-to-end learning of values Q(s, a) from pixels s
Input state s is stack of raw pixels from last 4 frames
Output is Q(s, a) for 18 joystick/button positions
Reward is change in score for that step
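As an illustration only (not the DeepMind implementation), the sketch below shows the flavour of a DQN-style minibatch update, using a linear stand-in for the convolutional Q-network; `phi`, the weight matrices `w` and `w_target` (a periodically-frozen copy of `w`), and the replay `buffer` of (s, a, r, s', done) tuples are all hypothetical names.

```python
import random
import numpy as np

def dqn_style_update(buffer, w, w_target, phi, batch_size=32, gamma=0.99, alpha=1e-4):
    """One DQN-style minibatch update with a linear Q-function Q(s, a) = phi(s) @ w[:, a].

    w and w_target are (d, num_actions) matrices; w_target is a frozen copy of w,
    refreshed only every so many updates to keep the bootstrapped target stable.
    """
    batch = random.sample(buffer, batch_size)
    for s, a, r, s_next, done in batch:
        # Q-learning target from the frozen copy: r + gamma * max_a' Q_target(s', a')
        target = r if done else r + gamma * np.max(phi(s_next) @ w_target)
        td_error = target - phi(s) @ w[:, a]
        w[:, a] += alpha * td_error * phi(s)     # semi-gradient step on the chosen action only
    return w
```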