
Function Approximation Methods

CS60077: Reinforcement Learning

Abir Das

IIT Kharagpur

Oct 21, 22, 2021



Agenda

§ Get started with the function approximation methods. Revisit risk minimization, gradient descent, etc. from the Machine Learning class.
§ Get familiar with different on-policy and off-policy evaluation and control methods with function approximation.


Resources

§ Reinforcement Learning by Balaraman Ravindran [Link]


§ Reinforcement Learning by David Silver [Link]
§ [SB]: Sutton and Barto, Chapter 9


Tabular Methods

§ In the last few lectures we have seen ‘tabular methods’ for solving RL
problems.
§ The state or action values are all represented by look-up tables.
Either every state s or every state-action pair (s, a) has an entry in
the form of V (s) or Q(s, a).
§ The approach, in general, updates old values in these tables towards a
target value in each iteration.
§ Let the notation s ↦ u denote an individual update where s is the state updated and u is the target.
  ▸ Monte Carlo update: st ↦ Gt
  ▸ TD(0) update: st ↦ Rt+1 + γ V(st+1)
  ▸ DP update: st ↦ ∑_{a∈A} π(a|st) [R(st, a) + γ ∑_{s′∈S} p(s′|st, a) V(s′)]


Tabular Methods - Limitations


§ What are some problems of tabular methods?
§ Large state spaces mean a huge memory requirement and also a huge amount of time needed to update each state enough times.
§ This also means continuous states and actions cannot be handled.
§ States not encountered previously will not have a sensible policy, i.e., generalisation is an issue.
§ Fortunately, a change of representation can address all these issues to some extent. Instead of representing the value functions as look-up tables, they are represented as parameterized functions.
§ We denote the approximate value function in parameterized form as
v̂(s; w) ≈ vπ (s) where w ∈ Rd is the learnable parameter vector.
§ Any form of function approximator e.g., linear function approximator,
multi-layer neural networks, decision trees, nearest neighbours etc.
can be used. However, in practice, some fit more easily into this role
than others.

Advantages and Specialities of Function Approximators


§ Handle large state spaces.
§ In tabular methods, learned values at each state are decoupled - an
update at one state affected no other state. Parameterized value
function representation means an update at one state affects many
others helping generalization.
§ In supervised machine learning terminology, an update of the form s ↦ u means training with input-output pairs where the output is u.
§ Viewing each update as a conventional training example has some downsides too.
§ The training set is not static, unlike in most traditional supervised learning settings. RL generally requires function approximation methods able to handle nonstationary target functions (target functions that change over time).
§ Control methods change the policy and thus the generated data. Even for evaluation, the target values of the training examples are non-stationary when bootstrapping is applied.
Types of Value Function Approximation

[Figure: three architectures for value function approximation: a network with parameters w that maps a state s to v̂(s, w); one that maps a state-action pair (s, a) to q̂(s, a, w); and one that maps a state s alone to q̂(s, a1, w), …, q̂(s, am, w), one output per action.]
Figure credit: David Silver, DeepMind


Error Measures in Function Approximators


§ We need to figure out a measure for the error that the function
approximator is making.
§ One candidate can be the sum of error squares.
  ∑_s [vπ(s) − v̂(s; w)]²

§ Do we average?
§ In the classical supervised setting, the averaging by 1/N is to approximate the expected error by the empirical error.
§ With training data {(s^(i), vπ(s^(i)))} coming from a probability distribution D(s, vπ(s)),
  ▸ Expected error: ∫ l(v̂(s; w), vπ(s)) dD(s, vπ(s))
  ▸ Empirical error: (1/N) ∑_{i=1}^{N} l(v̂(s^(i); w), vπ(s^(i)))


Error Measures in Function Approximators


§ This was assuming that the training data we get follows the true data distribution. So, if a data point is more likely under the true distribution, it is more likely to appear more often as a training example.
§ So, we need to average, but with a state distribution:
  ∑_s µ(s) [vπ(s) − v̂(s; w)]²
§ µ(s) > 0, ∑_s µ(s) = 1 gives the probability of obtaining a state any time we sample an MRP (why an MRP?).
§ In a non-episodic task, the distribution is the stationary distribution under π.
§ In an episodic task, µ(s) is the fraction of time spent in s. A simple proof is given in [SB], section 9.2. Under on-policy training, this is called the on-policy distribution.

Error Measures in Function Approximators

§ This essentially says that the approximation will be good for the states that are visited more often under the policy π.
§ For on-policy sampling, the following will approximate the error measure described in the previous slide:
  (1/N) ∑_{t=1}^{N} [vπ(st) − v̂(st; w)]²

§ But what is the practical problem with it? Hint: a big assumption was made. Where will we get the target vπ(st)?
§ We will come back to this.


Gradient Descent Primer


§ Let J(w) be a differentiable function of parameter vector w.
§ The gradient of J(w) is defined as
  ∇w J(w) = [∂J(w)/∂w1, …, ∂J(w)/∂wd]ᵀ
§ To find a local minimum of J(w), w is adjusted in the direction of the negative gradient,
  ∆w = −(1/2) α ∇w J(w),
  where α is the step size parameter.
§ Stochastic Gradient Descent (SGD) adjusts the weight vector after
each example by computing gradient of J only for that example.
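As a quick numeric illustration of this update, here is a minimal sketch assuming a toy objective J(w) = ‖w‖², which is not from the slides; its gradient is 2w.

```python
import numpy as np

def grad_J(w):
    # Gradient of the toy objective J(w) = ||w||^2 (an assumed example).
    return 2.0 * w

w = np.array([1.0, -2.0])   # initial parameter vector
alpha = 0.1                 # step size

for _ in range(100):
    w = w - 0.5 * alpha * grad_J(w)   # Delta w = -(1/2) * alpha * grad J(w)

print(w)   # approaches the minimiser w = 0
```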

Gradient Monte Carlo Algorithm


§ SGD update rule for the mean square error:
  wk+1 = wk − (1/2) α ∇wk [vπ(st) − v̂(st; wk)]²
       = wk + α [vπ(st) − v̂(st; wk)] ∇wk v̂(st; wk)    (1)

§ Let's look at different possibilities to approximate the target vπ(st).
  ▸ Monte Carlo target: Gt = Rt+1 + γRt+2 + γ²Rt+3 + ⋯
  ▸ TD(0) target: Rt+1 + γ v̂(st+1; wk)

Figure credit: [SB: Chapter 9]
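The algorithm box above is from [SB]. The following is a minimal sketch of gradient Monte Carlo prediction implementing eqn. (1) with the MC target Gt, assuming a linear v̂ so that ∇w v̂(s; w) = x(s); the `episodes` input format (a list of trajectories of (feature vector, next reward) pairs) is an assumption for illustration.

```python
import numpy as np

def gradient_mc_prediction(episodes, d, alpha=0.01, gamma=1.0):
    """episodes: list of trajectories, each a list of (x_t, R_{t+1}) pairs,
    where x_t is the d-dimensional feature vector of s_t (assumed format)."""
    w = np.zeros(d)
    for episode in episodes:
        # Compute returns G_t backwards from the end of the episode.
        G = 0.0
        for x_t, r_next in reversed(episode):
            G = r_next + gamma * G
            v_hat = w @ x_t                      # linear value estimate
            w += alpha * (G - v_hat) * w_grad(x_t)  # see note below
    return w

def w_grad(x_t):
    # For a linear approximator the gradient of v_hat w.r.t. w is just x(s).
    return x_t
```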


Data Preparation
§ Traditional supervised methods get data in the form ⟨x^(i), f(x^(i))⟩, from which a function f̂(x) is learned to predict the output for an unseen x.
§ In RL, data comes in the form of ⟨st, at, Rt+1, st+1, at+1, Rt+2, ⋯⟩, from which the approximate targets have to be generated.
§ For MC targets, this is done as ⟨st, Rt+1 + γRt+2 + γ²Rt+3 + ⋯⟩, ⟨st+1, Rt+2 + γRt+3 + γ²Rt+4 + ⋯⟩ etc.
§ For TD(0) targets, this is done as ⟨st, Rt+1 + γv̂(st+1; wk)⟩, ⟨st+1, Rt+2 + γv̂(st+2; wk+1)⟩ etc.
§ The last setting gives non-stationary training data, as the data is generated using the function approximator itself.
§ The key step in eqn. (1) relies on the target being independent of wk.
§ An approximate way is to take into account the effect of changing wk on the estimate but ignore its effect on the target. Thus it is termed a semi-gradient descent method.

Semi-gradient TD(0)

Figure credit: [SB: Chapter 9]
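The pseudocode box above is reproduced from [SB]. As a rough sketch, semi-gradient TD(0) under a simple state-aggregation approximator (so that ∇w v̂(s; w) is a one-hot vector over groups) could look as follows; `env`, `policy`, and `group_of` are assumed interfaces, not anything defined in the slides.

```python
import numpy as np

def semi_gradient_td0(env, policy, group_of, d, episodes=1000,
                      alpha=0.1, gamma=1.0):
    """group_of(s) -> index of the aggregation group of state s (assumed helper)."""
    w = np.zeros(d)                      # one value per group
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)       # assumed env interface
            v_s = w[group_of(s)]
            v_next = 0.0 if done else w[group_of(s_next)]
            # TD error; the target r + gamma*v_next also depends on w,
            # but that dependence is ignored (semi-gradient).
            delta = r + gamma * v_next - v_s
            w[group_of(s)] += alpha * delta      # gradient of v_hat is one-hot
            s = s_next
    return w
```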


Feature Vectors
§ Feature vectors are representations of states. Corresponding to every state s, there is a real-valued vector x(s) ∈ Rᵈ, i.e., with the same number of components as w:
  x(s) = [x1(s), …, xd(s)]ᵀ
§ For example, features can be,
  ▸ Distance of the robot from landmarks.
  ▸ Co-ordinates of cells in gridworlds.
  ▸ Piece and pawn configurations in chess.
§ Linear methods approximate the state-value function by the inner product between w and x(s):
  v̂(s; w) = wᵀx(s) = ∑_{i=1}^{d} wi xi(s)

Linear Function Approximator

§ The gradient w.r.t. w is, ∇w v̂(s; w) = x(s).


§ So, the SGD update rule in the linear function approximator case is
  wk+1 = wk + α [vπ(st) − v̂(st; wk)] x(st)
§ The gradient Monte Carlo algorithm converges to the global optimum under linear function approximation if α is reduced over time according to the usual conditions.
§ The semi-gradient TD(0) algorithm also converges under linear function approximation, but the convergence proof is not trivial (see section 9.4 in [SB]).
§ The update rule, in this case, is
  wk+1 = wk + α [Rt+1 + γ wkᵀx(st+1) − wkᵀx(st)] x(st)
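A minimal sketch of a single semi-gradient TD(0) step for this linear case; the feature function `x` mapping a state to a d-dimensional vector is an assumed input.

```python
import numpy as np

def linear_td0_step(w, x, s, r_next, s_next, alpha, gamma, terminal=False):
    """One update  w <- w + alpha * [R + gamma*w.x(s') - w.x(s)] * x(s)."""
    x_s = x(s)                                   # x is an assumed feature function
    v_s = w @ x_s
    v_next = 0.0 if terminal else w @ x(s_next)
    delta = r_next + gamma * v_next - v_s        # TD error
    return w + alpha * delta * x_s               # gradient of v_hat is x(s)
```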


Table Look-up Features

§ Table look-up is a special case of linear value function approximation.
§ Using table-lookup features
  x_table(s) = [1(s = s1), …, 1(s = sd)]ᵀ
§ The parameter vector w gives the value of each individual state,
  v̂(s; w) = x_table(s)ᵀ w = ∑_i 1(s = si) wi

§ State aggregation methods are used to save memory and ease computation.
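A tiny sketch showing that one-hot (table-lookup) features reduce the linear approximator to the tabular case; the state names and weights here are illustrative only.

```python
import numpy as np

states = ["s1", "s2", "s3"]   # illustrative state set
d = len(states)

def x_table(s):
    # One-hot feature vector: component i is 1(s == s_i).
    v = np.zeros(d)
    v[states.index(s)] = 1.0
    return v

w = np.array([1.5, -0.3, 0.7])        # one weight per state
print(w @ x_table("s2"))              # -0.3: picks out exactly w_2, the tabular value
```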


Looking back at Eligibility Traces

§ Let's say we get a reward at the end of some step. What the eligibility trace says is that the credit for the reward should trickle down, in proportion, all the way to the first state. The credit should be higher for the state-action pairs that were close to the rewarding step and also for those state-action pairs that were visited frequently along the way.

Eligibility Trace for Function Approximation

§ Earlier we had eligibility values for all states. Now, we have two
options for eligibility traces in function approximation based methods.
§ Should the eligibility trace be on the features x(s) or the parameters
w?
§ Hint:
  ▸ What was eligibility doing in TD(λ) or SARSA(λ) algorithms? Was it associated with something getting updated or not?
§ The update step updates the parameter values. So, eligibilities are
associated with the parameters.

Eligibility Trace for Function Approximation


§ From the semi-gradient TD(0) slide, the update rule without eligibility was
  wk+1 = wk + α δt ∇w v̂(st; wk), where δt = [Rt+1 + γ v̂(st+1; wk) − v̂(st; wk)]

§ Now it will be
  wk+1 = wk + α δt ∇w v̂(st; wk) ⊙ e(wk)    (elementwise product)

§ From the linear function approximator slide, the update rule without eligibility for linear models was
  wk+1 = wk + α δt x(st)

§ Now it will be
  wk+1 = wk + α δt x(st) ⊙ e(wk)    (elementwise product)

Eligibility Trace for Function Approximation

§ What the eligibility of a state (in the earlier cases, TD(0) say) signifies is how much that state is responsible for the final reward that is obtained.
§ Keeping this in mind, what is a quantity that performs a similar role in function approximation?
§ Gradients, i.e., the partial derivatives of the predicted value w.r.t. the parameters.
§ A gradient tells how much a particular parameter is responsible for the predicted output.
§ This similarity is used by replacing the usual increment of the eligibility by 1 with an accumulation of gradients.
Eligibility Trace for Function Approximation


§ The TD(λ) update with an eligibility trace on the parameters is
  δt = [Rt+1 + γ v̂(st+1; wk) − v̂(st; wk)]
  et = γλ et−1 + ∇w v̂(st; wk)
  wk+1 = wk + α δt et

§ For linear models, the second line changes to
  et = γλ et−1 + x(st)

§ Using the lookup table representation, this becomes
  et = γλ et−1 + [1(st = s1), …, 1(st = sd)]ᵀ
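A sketch of semi-gradient TD(λ) with an accumulating trace over the parameters for the linear case above; `env`, `policy`, and the feature function `x` are assumed interfaces, not defined in the slides.

```python
import numpy as np

def semi_gradient_td_lambda(env, policy, x, d, episodes=1000,
                            alpha=0.05, gamma=1.0, lam=0.8):
    w = np.zeros(d)
    for _ in range(episodes):
        s, done = env.reset(), False
        e = np.zeros(d)                          # eligibility trace on the parameters
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)        # assumed env interface
            v_s = w @ x(s)
            v_next = 0.0 if done else w @ x(s_next)
            delta = r + gamma * v_next - v_s     # TD error delta_t
            e = gamma * lam * e + x(s)           # e_t = gamma*lambda*e_{t-1} + grad v_hat
            w += alpha * delta * e               # w_{k+1} = w_k + alpha * delta_t * e_t
            s = s_next
    return w
```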
Control with Function Approximation

§ Like the MC or TD approaches, we first need to resort to the action value function Q(s, a).
§ For this, the very first thing that is needed is to encode the actions as well, i.e., we need to find x(s, a) instead of x(s).
§ Encoding the actions as one-hot vectors is often a possible choice.
§ Another option is to maintain a different set of parameters for each action, i.e., Q(s, a) = xᵀ(s) w_a.
§ (From the previous topic): the SARSA update rule is
  Q(st, at) ← Q(st, at) + α (Rt+1 + γ Q(st+1, at+1) − Q(st, at))
§ So, for Q(st+1, at+1) the product xᵀ(st+1) w_{at+1} will be used, and for Q(st, at) the product xᵀ(st) w_{at} will be used.
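A small sketch of the second option, one weight vector per action so that Q(s, a) = xᵀ(s) w_a, together with the SARSA update written in that parameterisation; the feature function `x` and the action set are assumptions.

```python
import numpy as np

class PerActionLinearQ:
    """Q(s, a) = x(s) . w_a with a separate weight vector per action."""
    def __init__(self, x, d, actions):
        self.x = x                                   # assumed state feature function
        self.w = {a: np.zeros(d) for a in actions}   # one w_a per action

    def q(self, s, a):
        return self.w[a] @ self.x(s)

    def sarsa_update(self, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
        target = r + (0.0 if terminal else gamma * self.q(s_next, a_next))
        delta = target - self.q(s, a)
        self.w[a] += alpha * delta * self.x(s)       # gradient w.r.t. w_a is x(s)
```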


Afterstates
§ These options are manageable for a small number of actions.
§ What about a large number of actions or continuous actions?
§ What about generalization?
§ The concept of ‘afterstates’ helps in this regard.
§ The afterstate concept is based on separating the stochasticity of the environment from the deterministic nature of the agent’s moves.

Figure credit: [SB: Chapter 6]


Afterstates

Figure credit: [SB: Chapter 6]

§ In such cases, the position-move pairs are different but produce the same “afterposition”.
§ A conventional action value function would have to separately assess both pairs, whereas an afterstate value function would immediately assess both equally.
§ S × A → S′ → S. Instead of learning action values over (s, a), a state value function over afterstates is learned.
§ What are the advantages?

Control with Value Function Approximation

[Figure: generalized policy iteration with function approximation. Starting from w, repeatedly alternate approximate policy evaluation (qw ≈ qπ) with ε-greedy policy improvement (π = ε-greedy(qw)), converging towards qw ≈ q*.]

Policy evaluation: approximate policy evaluation, q̂(·, ·, w) ≈ qπ
Policy improvement: ε-greedy policy improvement
Figure credit: [David Silver: Deepmind]


Action-Value Function Approximation

Approximate the action-value function

q̂(S, A, w) ≈ qπ (S, A)

Minimise mean-squared error between approximate action-value fn q̂(S, A, w) and true action-value fn qπ(S, A)

J(w) = Eπ[(qπ(S, A) − q̂(S, A, w))²]

Use stochastic gradient descent to find a local minimum

−(1/2) ∇w J(w) = (qπ(S, A) − q̂(S, A, w)) ∇w q̂(S, A, w)
∆w = α (qπ(S, A) − q̂(S, A, w)) ∇w q̂(S, A, w)
Slide credit: [David Silver: Deepmind]


Incremental Control Algorithms

Like prediction, we must substitute a target for qπ(S, A)
For MC, the target is the return Gt:
  ∆w = α (Gt − q̂(St, At, w)) ∇w q̂(St, At, w)
For TD(0), the target is the TD target Rt+1 + γ q̂(St+1, At+1, w):
  ∆w = α (Rt+1 + γ q̂(St+1, At+1, w) − q̂(St, At, w)) ∇w q̂(St, At, w)
For forward-view TD(λ), the target is the action-value λ-return q_t^λ:
  ∆w = α (q_t^λ − q̂(St, At, w)) ∇w q̂(St, At, w)
For backward-view TD(λ), the equivalent update is
  δt = Rt+1 + γ q̂(St+1, At+1, w) − q̂(St, At, w)
  Et = γλ Et−1 + ∇w q̂(St, At, w)
  ∆w = α δt Et
Slide credit: [David Silver: Deepmind]
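A sketch of one backward-view SARSA(λ) step with a linear q̂(S, A, w) = x(S, A)ᵀ w; the state-action feature function `x_sa` is an assumption.

```python
import numpy as np

def sarsa_lambda_step(w, e, x_sa, s, a, r, s_next, a_next,
                      alpha, gamma, lam, terminal=False):
    """One backward-view SARSA(lambda) update with linear q_hat = w . x(s, a)."""
    q_sa = w @ x_sa(s, a)
    q_next = 0.0 if terminal else w @ x_sa(s_next, a_next)
    delta = r + gamma * q_next - q_sa          # TD error delta_t
    e = gamma * lam * e + x_sa(s, a)           # E_t = gamma*lambda*E_{t-1} + grad q_hat
    w = w + alpha * delta * e                  # Delta w = alpha * delta_t * E_t
    return w, e
```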


Least Squares Solution

§ We have seen that the error minimized in function approximation based methods is
  (1/N) ∑_{t=1}^{N} [vπ(st) − v̂(st; w)]²
§ We have also seen how this error function is minimized using gradient descent updates.
§ But, if we look carefully, solving this is nothing but finding a least squares solution, i.e., finding the parameter vector w minimising the sum-squared error between the target values and the predicted values.
§ We know that, for a linear function approximator (i.e., when v̂(st; w) is a linear function of w), the solution is exact and is obtained in closed form.


Least Squares Solution


§ From the linear function approximator slide, we have seen that the update rule is
  wt+1 = wt + α ∆wt = wt + α [Rt+1 + γ wtᵀx(st+1) − wtᵀx(st)] x(st)
       = wt + α [Rt+1 + γ wtᵀxt+1 − wtᵀxt] xt        [using xt for x(st)]
       = wt + α [Rt+1 xt + γ (wtᵀxt+1) xt − (wtᵀxt) xt]    (the terms in parentheses are scalars)
       = wt + α [Rt+1 xt + γ xt xt+1ᵀ wt − xt xtᵀ wt]
       = wt + α [Rt+1 xt − xt (xt − γ xt+1)ᵀ wt]

§ For a given wt, the expected value of the new weight wt+1 can be written as (the expectation is over the different values of R and the x's)
  E[wt+1 | wt] = wt + α (b − A wt),
  where b = E[Rt+1 xt] ∈ Rᵈ and A = E[xt (xt − γ xt+1)ᵀ] ∈ Rᵈˣᵈ


Least Squares Solution


§ At the minimum, the update will mean no change, i.e.,
  b − A w_TD = 0  ⟹  b = A w_TD  ⟹  w_TD = A⁻¹ b
§ This quantity is called the TD fixed point (a small sketch of solving for it from samples follows this slide).


§ For non-linear function approximators, the TD fixed point is not obtained in closed form. However, gradient descent updates can be applied on a batch of training data for an iterative solution in such cases.
§ The traditional supervised least squares solution requires training data in the form
  D = {⟨x^(1), f(x^(1))⟩, ⟨x^(2), f(x^(2))⟩, ⋯, ⟨x^(T), f(x^(T))⟩}
  where the following loss is minimized,
  LS(w) = (1/2) E_D[f(x^(i)) − f̂(x^(i); w)]² = (1/2T) ∑_{i=1}^{T} [f(x^(i)) − f̂(x^(i); w)]²
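A sketch of estimating A and b from sample transitions and solving for the TD fixed point w_TD = A⁻¹b directly (this is the idea behind least-squares TD); the list of feature-vector transition tuples and the small ridge term are assumptions.

```python
import numpy as np

def td_fixed_point(transitions, d, gamma=1.0, eps=1e-6):
    """transitions: iterable of (x_t, R_{t+1}, x_{t+1}) feature tuples
    collected under the policy being evaluated (assumed input format)."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    n = 0
    for x_t, r_next, x_next in transitions:
        A += np.outer(x_t, x_t - gamma * x_next)   # estimate of E[x_t (x_t - gamma x_{t+1})^T]
        b += r_next * x_t                          # estimate of E[R_{t+1} x_t]
        n += 1
    A, b = A / n, b / n
    # A small ridge term keeps A invertible with limited data (an assumption, not from the slides).
    return np.linalg.solve(A + eps * np.eye(d), b)   # w_TD = A^{-1} b
```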

Least Squares Solution


§ The gradient of the loss is computed as
  (1/T) ∑_{i=1}^{T} [f(x^(i)) − f̂(x^(i); w)] ∇w f̂(x^(i); w)
§ This is pretty similar to the gradients used in the function approximation methods so far, but with an important difference.
§ They are not batch methods. The gradients are computed per step. Thus they are not sample efficient.
§ The idea is to form batches of training data from a few trajectories. This is called experience.
  D = {⟨s^(1), vπ(s^(1))⟩, ⟨s^(2), vπ(s^(2))⟩, ⋯, ⟨s^(T), vπ(s^(T))⟩}
§ And the gradient of the loss is
  (1/T) ∑_{i=1}^{T} [vπ(s^(i)) − v̂(s^(i); w)] ∇w v̂(s^(i); w)


Mini-batch Gradient Descent with Experience Replay

§ Given experience of the form
  D = {⟨s^(1), vπ(s^(1))⟩, ⟨s^(2), vπ(s^(2))⟩, ⋯, ⟨s^(T), vπ(s^(T))⟩}
§ Repeat:
  ▸ Sample states and values from experience: ⟨s^(i), vπ(s^(i))⟩ ∼ D
  ▸ Apply the minibatch gradient descent update
    w ← w + α ∑_i [vπ(s^(i)) − v̂(s^(i); w)] ∇w v̂(s^(i); w)
§ Converges to the least squares solution.
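A sketch of this loop for a linear v̂; the `experience` list of (features, target value) pairs, the batch size, and the step count are assumptions.

```python
import numpy as np

def replay_fit(experience, d, alpha=0.01, batch_size=32, steps=10000, seed=0):
    """experience: list of (x_i, v_i) pairs, where x_i are the features of s^(i)
    and v_i is the target value v_pi(s^(i)) (assumed input format)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(len(experience), size=batch_size)  # sample a minibatch from D
        grad_step = np.zeros(d)
        for i in idx:
            x_i, v_i = experience[i]
            grad_step += (v_i - w @ x_i) * x_i   # [v_pi(s) - v_hat(s; w)] * grad v_hat
        w += alpha * grad_step                   # w <- w + alpha * sum_i [...]
    return w
```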


Experience Replay in Deep Q-Networks (DQN)

DQN uses experience replay and fixed Q-targets


§ Take action at according to an ε-greedy policy.
§ Store the transition (st, at, Rt+1, st+1, at+1, ⋯) in replay memory D.
§ Sample a random mini-batch of transitions (s, a, r, s′) from D.
§ Compute Q-learning targets w.r.t. old, fixed parameters w⁻.
§ Optimise the MSE between the Q-network and the Q-learning targets,
  E_{(s,a,r,s′)∼D} [(r + γ max_{a′} Q(s′, a′; w⁻) − Q(s, a; wi))²]
§ using minibatch gradient descent.
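A minimal sketch of computing the fixed-target DQN loss for one sampled minibatch; the `q_net` and `target_net` callables (returning one Q-value per action) are assumptions, and in practice these are deep networks trained with an autodiff library rather than NumPy.

```python
import numpy as np

def dqn_loss(batch, q_net, target_net, gamma=0.99):
    """batch: list of (s, a, r, s_next, done) transitions sampled from D.
    q_net(s), target_net(s): arrays of Q-values, one per action (assumed interfaces)."""
    errors = []
    for s, a, r, s_next, done in batch:
        # Q-learning target computed with the old, fixed parameters w^-.
        target = r if done else r + gamma * np.max(target_net(s_next))
        errors.append(target - q_net(s)[a])
    return np.mean(np.square(errors))   # MSE between the targets and the Q-network
```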


DQN in Atari

End-to-end learning of values Q(s, a) from pixels s
Input state s is a stack of raw pixels from the last 4 frames
Output is Q(s, a) for 18 joystick/button positions
Reward is the change in score for that step
Network architecture and hyperparameters are fixed across all games


Slide credit: [David Silver: Deepmind]
