07 FA Methods
Abir Das
IIT Kharagpur
Agenda
Resources
Tabular Methods
§ In the last few lectures we have seen ‘tabular methods’ for solving RL
problems.
§ The state or action values are all represented by look-up tables.
Either every state s or every state-action pair (s, a) has an entry in
the form of V (s) or Q(s, a).
§ The approach, in general, updates old values in these tables towards a
target value in each iteration.
§ Let the notation $s \mapsto u$ denote an individual update, where s is the state updated and u is the target (a tabular TD(0) sketch follows this list).
I Monte Carlo update: $s_t \mapsto G_t$
I TD(0) update: $s_t \mapsto R_{t+1} + \gamma V(s_{t+1})$
I DP update: $s_t \mapsto \sum_{a \in \mathcal{A}} \pi(a \mid s_t) \left[ R(s_t, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s_t, a)\, v(s') \right]$
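Below is a minimal Python sketch of the tabular TD(0) update $s_t \mapsto R_{t+1} + \gamma V(s_{t+1})$; the `env` object (with `reset()`/`step(a)`) and the policy `pi` are hypothetical interfaces, not from the lecture.

```python
from collections import defaultdict

def tabular_td0(env, pi, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): move V(s_t) towards the target R_{t+1} + gamma * V(s_{t+1})."""
    V = defaultdict(float)                    # look-up table: one entry per state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = pi(s)                         # action from the fixed policy being evaluated
            s_next, r, done = env.step(a)     # sample one transition
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # update the old value towards the target
            s = s_next
    return V
```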
Types of Value Function Approximation
[Figure: three architectures — a state-value approximator v̂(s, w) taking s as input; an action-in action-value approximator q̂(s, a, w) taking s and a as input; and an action-out approximator producing q̂(s, a_1, w), …, q̂(s, a_m, w) from s alone. Each is parameterised by w.]
Figure credit: David Silver, DeepMind
§ Do we average?
§ In the classical supervised setting, the averaging by 1/N approximates the expected error by the empirical error.
§ With training data $\{(s^{(i)}, v_\pi(s^{(i)}))\}$ coming from a probability distribution $D(s, v_\pi(s))$,
I Expected error: $\int l(\hat{v}(s; w), v_\pi(s)) \, dD(s, v_\pi(s))$
I Empirical error: $\frac{1}{N} \sum_{i=1}^{N} l(\hat{v}(s^{(i)}; w), v_\pi(s^{(i)}))$
§ $\mu(s) > 0$, with $\sum_s \mu(s) = 1$, provides the probability of obtaining a state any time we sample an MRP (why an MRP?)
§ In a non-episodic task, the distribution is the stationary distribution
under π
§ In an episodic task, µ(s) is the fraction of time spent in s; a simple proof is given in [SB], Section 9.2. Under on-policy training, this is called the on-policy distribution.
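As an illustration (not from the lecture), the on-policy distribution can be estimated empirically as visit fractions under π; the `env`/`pi` interface below is the same hypothetical one used in the earlier sketch.

```python
from collections import Counter

def estimate_mu(env, pi, num_episodes=1000):
    """Estimate the on-policy distribution mu(s) as the fraction of time spent in each state."""
    counts = Counter()
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            counts[s] += 1
            s, _, done = env.step(pi(s))
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}
```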
§ This essentially tells us that for the states visited more often under the policy π, the approximation will be better.
§ For on-policy sampling, the following approximates the error measure described in the previous slide:
$$\frac{1}{N} \sum_{t=1}^{N} \left[ v_\pi(s_t) - \hat{v}(s_t; w) \right]^2$$
§ But what is the practical problem with it? Hint: a big assumption was made. Where will we get the target $v_\pi(s_t)$?
§ We will come back to this.
Data Preparation
§ Traditionally, supervised methods get data in the form $\langle x^{(i)}, f(x^{(i)}) \rangle$, from which a function $\hat{f}(x)$ is learned to predict for an unseen x.
§ In RL, data comes in the form $\langle s_t, a_t, R_{t+1}, s_{t+1}, a_{t+1}, R_{t+2}, \cdots \rangle$, from which the approximate targets have to be generated.
§ For MC targets, this is done as $\langle s_t, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \rangle$, $\langle s_{t+1}, R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \rangle$, etc.
§ For TD(0) targets, this is done as $\langle s_t, R_{t+1} + \gamma \hat{v}(s_{t+1}; w_k) \rangle$, $\langle s_{t+1}, R_{t+2} + \gamma \hat{v}(s_{t+2}; w_{k+1}) \rangle$, etc.
§ The last setting gives non-stationary training data, as the targets are generated using the function approximator itself.
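A minimal sketch of this target construction for one recorded episode; the lists `states` and `rewards` (with `rewards[t]` the reward following `states[t]`) and the current approximator `v_hat` are hypothetical names.

```python
def mc_targets(states, rewards, gamma=0.99):
    """Pair each state with its Monte Carlo return G_t = R_{t+1} + gamma*R_{t+2} + ..."""
    targets, G = [], 0.0
    for s, r in zip(reversed(states), reversed(rewards)):
        G = r + gamma * G                 # accumulate the return backwards through the episode
        targets.append((s, G))
    return list(reversed(targets))

def td0_targets(states, rewards, v_hat, gamma=0.99):
    """Pair each state with the bootstrapped target R_{t+1} + gamma * v_hat(s_{t+1})."""
    targets = []
    for t in range(len(rewards)):
        bootstrap = v_hat(states[t + 1]) if t + 1 < len(states) else 0.0
        targets.append((states[t], rewards[t] + gamma * bootstrap))
    return targets
```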
§ The key step in eqn. 1 relies on the target being independent of $w_t$.
§ An approximate way is to take into account the effect of changing $w_t$ on the estimate, but to ignore its effect on the target. Thus it is termed a semi-gradient descent method.
Semi-gradient TD(0)
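A minimal sketch of semi-gradient TD(0) with a linear approximator $\hat{v}(s; w) = x(s)^\top w$, assuming the same hypothetical `env`/`pi` interface as before and a feature map `x` returning a length-d vector; note that the gradient is taken only through the estimate, not through the bootstrapped target.

```python
import numpy as np

def semi_gradient_td0(env, pi, x, d, num_episodes=1000, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with linear value function v_hat(s; w) = x(s).w."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(pi(s))
            v_next = 0.0 if done else x(s_next) @ w    # bootstrap; gradient NOT taken through this
            delta = r + gamma * v_next - x(s) @ w      # TD error
            w += alpha * delta * x(s)                  # gradient of the linear v_hat w.r.t. w is x(s)
            s = s_next
    return w
```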
Feature Vectors
§ Feature vectors are representations of states. Corresponding to every state s, there is a real-valued vector $x(s) \in \mathbb{R}^d$, i.e., with the same number of components as w:
$$x(s) = \begin{pmatrix} x_1(s) \\ \vdots \\ x_d(s) \end{pmatrix}$$
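As an illustration (not from the lecture), two common ways to build x(s): one-hot features over a small discrete state space (which recovers the tabular case under linear FA) and a few hand-crafted features for a scalar state.

```python
import numpy as np

def one_hot_features(s, num_states):
    """One-hot x(s): linear FA with these features is exactly the tabular method."""
    x = np.zeros(num_states)
    x[s] = 1.0
    return x

def polynomial_features(s, degree=3):
    """Hand-crafted features for a scalar state, e.g. x(s) = (1, s, s^2, s^3)."""
    return np.array([float(s) ** k for k in range(degree + 1)])
```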
§ Let's say we get a reward at the end of some step. What the eligibility trace says is that the credit for the reward should trickle back, in decreasing proportion, all the way to the first state. The credit should be larger for the state-action pairs that were close to the rewarding step, and also for those state-action pairs that were visited frequently along the way.
§ Earlier we had eligibility values for all states. Now, we have two
options for eligibility traces in function approximation based methods.
§ Should the eligibility trace be on the features x(s) or the parameters
w?
§ Hint:
I What was eligibility doing in TD(λ) or SARSA(λ) algorithms? Was it
associated with something getting updated or not?
§ The update step updates the parameter values. So, eligibilities are
associated with the parameters.
§ From slide 16, the update rule without eligibility for linear models was $w_{k+1} = w_k + \alpha \delta_t \, x(s_t)$
§ So, for $Q(s_{t+1}, a_{t+1})$, $x^\top(s_{t+1}) \, w_{a_{t+1}}$ will be used, and for $Q(s_t, a_t)$, $x^\top(s_t) \, w_{a_t}$ will be used
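A minimal sketch of how an eligibility trace over the parameters fits into this linear update (backward-view TD(λ) for prediction, with an accumulating trace; the `env`/`pi`/`x` interface is the same hypothetical one as above).

```python
import numpy as np

def semi_gradient_td_lambda(env, pi, x, d, num_episodes=1000,
                            alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) with linear FA: the eligibility trace lives on the parameters."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        z = np.zeros(d)                     # eligibility trace, one entry per parameter
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(pi(s))
            v_next = 0.0 if done else x(s_next) @ w
            delta = r + gamma * v_next - x(s) @ w
            z = gamma * lam * z + x(s)      # accumulate: gradient of the linear v_hat is x(s)
            w += alpha * delta * z          # every parameter is updated through its eligibility
            s = s_next
    return w
```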
Afterstates
§ These options are manageable for a small number of actions.
§ What about a large number of actions, or continuous actions?
§ What about generalization?
§ The concept of ‘afterstates’ helps in this regard.
§ The afterstate concept is based on separating the deterministic effect of the agent’s move from the stochastic part of the state transition.
Afterstates
§ In such cases, the position-move pairs are different but produce the same “afterposition”.
§ A conventional action-value function would have to assess both pairs separately, whereas an afterstate value function would immediately assess both equally.
§ $S \times A \to S' \to S$: instead of learning action values over (s, a), a state-value function over afterstates is learned.
§ What are the advantages?
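A minimal sketch of TD(0) over afterstates, assuming a hypothetical deterministic `move(s, a)` that returns the afterstate, an `actions(s)` function listing legal moves, and an `env.step_from(after)` that resolves the environment's stochastic response (e.g., the opponent's reply in a board game).

```python
import random
from collections import defaultdict

def afterstate_td0(env, move, actions, num_episodes=1000,
                   alpha=0.1, gamma=1.0, epsilon=0.1):
    """TD(0) over afterstates: value the deterministic result of a move, not the (s, a) pair."""
    V = defaultdict(float)                       # one value per afterstate
    for _ in range(num_episodes):
        s = env.reset()
        done, pending = False, None              # pending = (afterstate, reward) awaiting its successor
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions(s))
            else:                                # greedy: compare afterstate values directly
                a = max(actions(s), key=lambda act: V[move(s, act)])
            after = move(s, a)                   # deterministic effect of the agent's move
            if pending is not None:              # update the previous afterstate towards r + gamma * V(after)
                prev_after, prev_r = pending
                V[prev_after] += alpha * (prev_r + gamma * V[after] - V[prev_after])
            s, r, done = env.step_from(after)    # stochastic environment response
            pending = (after, r)
        if pending is not None:                  # the terminal afterstate bootstraps on zero
            prev_after, prev_r = pending
            V[prev_after] += alpha * (prev_r - V[prev_after])
    return V
```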
Control with Value Function Approximation
[Figure: generalised policy iteration with action-value function approximation — approximate policy evaluation, $\hat{q}(\cdot, \cdot, w) \approx q_\pi$, alternated with ε-greedy policy improvement; starting from some w, the process converges towards $q_w \approx q_*$.]
Action-Value Function Approximation
$\hat{q}(S, A, w) \approx q_\pi(S, A)$
Incremental Control Algorithms
Like prediction, we must substitute a target for $q_\pi(S, A)$:
For MC, the target is the return $G_t$:
$$\Delta w = \alpha \left( G_t - \hat{q}(S_t, A_t, w) \right) \nabla_w \hat{q}(S_t, A_t, w)$$
For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w)$:
$$\Delta w = \alpha \left( R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S_t, A_t, w) \right) \nabla_w \hat{q}(S_t, A_t, w)$$
For forward-view TD(λ), the target is the action-value λ-return $q_t^\lambda$:
$$\Delta w = \alpha \left( q_t^\lambda - \hat{q}(S_t, A_t, w) \right) \nabla_w \hat{q}(S_t, A_t, w)$$
For backward-view TD(λ), the equivalent update is
$$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S_t, A_t, w), \qquad E_t = \gamma \lambda E_{t-1} + \nabla_w \hat{q}(S_t, A_t, w), \qquad \Delta w = \alpha \delta_t E_t$$
Slide credit: [David Silver: Deepmind]
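A minimal sketch of episodic semi-gradient SARSA (the TD(0) case above) with a linear action-value approximator $\hat{q}(s, a; w) = x_{sa}(s, a)^\top w$; the feature map `x_sa`, the list `actions`, and the `env` interface are assumptions, not part of the slide.

```python
import random
import numpy as np

def semi_gradient_sarsa(env, x_sa, actions, d, num_episodes=1000,
                        alpha=0.01, gamma=0.99, epsilon=0.1):
    """Episodic semi-gradient SARSA with linear q_hat(s, a; w) = x_sa(s, a).w."""
    w = np.zeros(d)
    q = lambda s, a: x_sa(s, a) @ w

    def policy(s):                                   # epsilon-greedy w.r.t. the current q_hat
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    for _ in range(num_episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                           # no bootstrap at the terminal step
            else:
                a_next = policy(s_next)
                target = r + gamma * q(s_next, a_next)
            w += alpha * (target - q(s, a)) * x_sa(s, a)   # semi-gradient update
            if not done:
                s, a = s_next, a_next
    return w
```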
§ We have also seen how this error function is minimized using gradient
descent updates.
§ But, if we look carefully, solving this is nothing but finding a least-squares solution, i.e., finding the parameter vector w minimising the sum-squared error between the target values and the predicted values.
§ We know that, for a linear function approximator (i.e., when $\hat{v}(s_t; w)$ is a linear function of w), the solution is exact and can be obtained in closed form.
§ For a given $w_t$, the expected value of the new weight $w_{t+1}$ can be written as (the expectation is over the different values of R and x)
$$\mathbb{E}[w_{t+1} \mid w_t] = w_t + \alpha (b - A w_t)$$
where, as in [SB], Section 9.4, $b = \mathbb{E}[R_{t+1} x_t]$ and $A = \mathbb{E}\!\left[ x_t (x_t - \gamma x_{t+1})^\top \right]$.
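At the fixed point the expected update is zero, giving the closed-form solution $w = A^{-1} b$ (the LSTD idea). A minimal sketch, assuming a list of already-featurized transitions `(x_t, r, x_next, done)`:

```python
import numpy as np

def lstd(transitions, d, gamma=0.99, reg=1e-6):
    """Least-squares TD: estimate A and b from data, then solve A w = b directly."""
    A = reg * np.eye(d)                             # small ridge term keeps A invertible
    b = np.zeros(d)
    for x_t, r, x_next, done in transitions:
        x_next = np.zeros(d) if done else x_next
        A += np.outer(x_t, x_t - gamma * x_next)    # A ~ E[x_t (x_t - gamma x_{t+1})^T]
        b += r * x_t                                # b ~ E[R_{t+1} x_t]
    return np.linalg.solve(A, b)                    # closed-form TD fixed point
```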
§ Repeat
I Sample state, value from experience: $\langle s^{(i)}, v_\pi(s^{(i)}) \rangle \sim \mathcal{D}$
I Apply minibatch gradient descent update:
$$w \leftarrow w + \alpha \sum_i \left[ v_\pi(s^{(i)}) - \hat{v}(s^{(i)}; w) \right] \nabla_w \hat{v}(s^{(i)}; w)$$
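A minimal sketch of this experience-replay loop with a linear approximator; `dataset` is a hypothetical list of (feature-vector, target) pairs built from experience.

```python
import random
import numpy as np

def replay_sgd(dataset, d, num_updates=10000, batch_size=32, alpha=0.01):
    """Repeatedly sample minibatches from stored experience and apply the SGD update."""
    w = np.zeros(d)
    for _ in range(num_updates):
        batch = random.sample(dataset, batch_size)   # sample <s, v_pi(s)> pairs from D
        grad = np.zeros(d)
        for x_s, target in batch:
            grad += (target - x_s @ w) * x_s         # [v_pi(s) - v_hat(s; w)] * grad of v_hat
        w += alpha * grad
    return w
```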
DQN in Atari
End-to-end learning of values Q(s, a) from pixels s
Input state s is stack of raw pixels from last 4 frames
Output is Q(s, a) for 18 joystick/button positions
Reward is change in score for that step
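As an illustration only (not the DeepMind implementation), the sketch below shows the flavour of a DQN-style minibatch update, using a linear stand-in for the convolutional Q-network; `phi`, the weight matrices `w` and `w_target` (a periodically-frozen copy of `w`), and the replay `buffer` of (s, a, r, s', done) tuples are all hypothetical names.

```python
import random
import numpy as np

def dqn_style_update(buffer, w, w_target, phi, batch_size=32, gamma=0.99, alpha=1e-4):
    """One DQN-style minibatch update with a linear Q-function Q(s, a) = phi(s) @ w[:, a].

    w and w_target are (d, num_actions) matrices; w_target is a frozen copy of w,
    refreshed only every so many updates to keep the bootstrapped target stable.
    """
    batch = random.sample(buffer, batch_size)
    for s, a, r, s_next, done in batch:
        # Q-learning target from the frozen copy: r + gamma * max_a' Q_target(s', a')
        target = r if done else r + gamma * np.max(phi(s_next) @ w_target)
        td_error = target - phi(s) @ w[:, a]
        w[:, a] += alpha * td_error * phi(s)     # semi-gradient step on the chosen action only
    return w
```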