Lecture 5: Value Function Approximation: Emma Brunskill

This document summarizes Emma Brunskill's lecture on value function approximation for reinforcement learning. The lecture discusses using linear function approximation to represent value functions instead of tabular representations for problems with large state spaces. Specifically, it covers using Monte Carlo and temporal-difference learning with linear function approximation for policy evaluation, and control methods that extend these ideas. Function approximation can reduce the memory, computation, and data needed for reinforcement learning problems.

Lecture 5: Value Function Approximation

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2021

The value function approximation structure for today closely follows much
of David Silver’s Lecture 6.

Refresh Your Knowledge 4

The basic idea of TD methods is to make (state, next state) pairs fit the constraints of the Bellman equation on average

(question by: Phil Thomas)

1 True
2 False
3 Not sure
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state randomly
selects an action, then (select all)

1 Q-learning will converge to the optimal Q-values


2 SARSA will converge to the optimal Q-values
3 Q-learning is learning off-policy
4 SARSA is learning off-policy
5 Not sure
A TD error > 0 can occur even if the current V (s) is correct ∀s: [select all]

1 False
2 True if the MDP has stochastic state transitions
3 True if the MDP has deterministic state transitions
4 Not sure
Refresh Your Knowledge 4

The basic idea of TD methods is to make (state, next state) pairs fit the constraints of the Bellman equation on average

(question by: Phil Thomas)

1 True (True)
2 False
3 Not sure
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state randomly selects an action, then (select all)

1 Q-learning will converge to the optimal Q-values (True)


2 SARSA will converge to the optimal Q-values (False)
3 Q-learning is learning off-policy (True)
4 SARSA is learning off-policy (False)
A TD error > 0 can occur even if the current V (s) is correct ∀s: [select all]

1 False
2 True if the MDP has stochastic state transitions (True)
3 True if the MDP has deterministic state transitions (False)
4 Not sure

Break

Class Structure

Last time: Control (making decisions) without a model of how the


world works
This time: Linear value function approximation
Next time: Deep reinforcement learning

Outline for Today

Value function approximation


Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation

Reinforcement Learning

Goal: Learn to select actions to maximize total expected future reward

Last Time: Tabular Representations for Model-free Control

Last time: how to learn a good policy from experience


So far, we have been assuming we can represent the value function or state-action value function as a vector/matrix
Tabular representation
Many real world problems have enormous state and/or action spaces
Tabular representation is insufficient

Motivation for Function Approximation

Don’t want to have to explicitly store or learn, for every single state, a:
Dynamics or reward model
Value
State-action value
Policy
Want more compact representation that generalizes across state or
states and actions

Benefits of Function Approximation

Reduce memory needed to store (P, R)/V/Q/π
Reduce computation needed to compute (P, R)/V/Q/π
Reduce experience needed to find a good (P, R)/V/Q/π

Value Function Approximation (VFA)

Represent a (state-action/state) value function with a parameterized


function instead of a table

[Diagram: a state s (and, for the state-action case, an action a) together with the parameter vector w is mapped by a parameterized function to V̂(s; w) or Q̂(s, a; w).]
Which function approximator?

Function Approximators

Many possible function approximators including


Linear combinations of features
Neural networks
Decision trees
Nearest neighbors
Fourier/wavelet bases
In this class we will focus on function approximators that are
differentiable (Why?)
Two very popular classes of differentiable function approximators
Linear feature representations (Today)
Neural networks (Next lecture)

Outline for the Rest of Today: Policy Evaluation to Control

Given known dynamics and reward models, and a tabular


representation
Discussed how to do policy evaluation and then control (value iteration
and policy iteration)
Given no models, and a tabular representation
Discussed how to do policy evaluation (MC/TD) and then control
(MC, SARSA, Q-learning)
Given no models, and function approximation
Today will discuss how to do policy evaluation and then control

Review: Gradient Descent

Consider a function J(w ) that is a differentiable function of a


parameter vector w
Goal is to find parameter w that minimizes J
The gradient of J(w) is ∇w J(w) = (∂J(w)/∂w1, ∂J(w)/∂w2, . . . , ∂J(w)/∂wn)^T
Adjust w in the direction of the negative gradient: ∆w = −(1/2) α ∇w J(w), where α is a step-size
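As a quick numeric illustration (a minimal sketch, not from the slides, assuming a hypothetical quadratic objective J(w) = ||w − w*||^2):

import numpy as np

# Hypothetical quadratic objective J(w) = ||w - w_star||^2, used only to
# illustrate the generic update  Delta w = -(1/2) * alpha * grad_w J(w)
w_star = np.array([1.0, -2.0, 0.5])

def grad_J(w):
    return 2.0 * (w - w_star)   # gradient of ||w - w_star||^2

w = np.zeros(3)
alpha = 0.5
for _ in range(50):
    w = w - 0.5 * alpha * grad_J(w)

print(w)   # approaches w_star = [1.0, -2.0, 0.5]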
Value Function Approximation for Policy Evaluation with
an Oracle

First assume we could query any state s and an oracle would return
the true value for V π (s)
Similar to supervised learning: assume given (s, V π (s)) pairs
The objective is to find the best approximate representation of V π
given a particular parameterized function V̂ (s; w )

Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function V π (s) and its approximation V̂ (s; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as

J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]

Can use gradient descent to find a local minimum


∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) uses a finite number of (often
one) samples to compute an approximate gradient:

In expectation, the SGD update equals the full gradient update


Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function V π (s) and its approximation V̂ (s; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Can use gradient descent to find a local minimum
∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) uses a finite number of (often
one) samples to compute an approximate gradient:
∇w J(w) = ∇w Eπ[(V^π(s) − V̂(s; w))^2]
= −2 Eπ[(V^π(s) − V̂(s; w)) ∇w V̂(s; w)]
so a single-sample SGD update is ∆w = α (V^π(s) − V̂(s; w)) ∇w V̂(s; w)
In expectation, the SGD update equals the full gradient update
Model Free VFA Policy Evaluation

Don’t actually have access to an oracle to tell true V π (s) for any
state s
Now consider how to do model-free value function approximation for
prediction / evaluation / policy evaluation without a model

Model Free VFA Prediction / Policy Evaluation

Recall model-free policy evaluation (Lecture 3)


Following a fixed policy π (or had access to prior data)
Goal is to estimate V π and/or Q π
Maintained a lookup table to store estimates V π and/or Q π
Updated these estimates after each episode (Monte Carlo methods)
or after each step (TD methods)
Now: in value function approximation, change the estimate
update step to include fitting the function approximator

Break

Outline for Today

Value function approximation


Monte Carlo policy evaluation with linear function
approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation

Feature Vectors

Use a feature vector to represent a state s:
x(s) = [x1(s), x2(s), . . . , xn(s)]^T
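For example (an illustrative sketch, not from the slides), for a one-dimensional continuous state one possible feature map is a polynomial basis:

import numpy as np

def x(s, n=4):
    # Hypothetical polynomial features for a scalar state s: [1, s, s^2, ..., s^(n-1)]
    return np.array([s ** j for j in range(n)])

print(x(0.5))   # [1.0, 0.5, 0.25, 0.125]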

Linear Value Function Approximation for Prediction With
An Oracle
Represent a value function (or state-action value function) for a
particular policy with a weighted linear combination of features
V̂(s; w) = Σ_{j=1}^{n} xj(s) wj = x(s)^T w

Objective function is

J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]

Recall weight update is


∆w = −(1/2) α ∇w J(w)
Update is:

Linear Value Function Approximation for Prediction With
An Oracle
Represent a value function (or state-action value function) for a
particular policy with a weighted linear combination of features
V̂(s; w) = Σ_{j=1}^{n} xj(s) wj = x(s)^T w

Objective function is
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Recall weight update is
∆w = −(1/2) α ∇w J(w)
Update is:

Update = step-size × prediction error × feature value
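Written out for the linear case, this is ∆w = α (V^π(s) − x(s)^T w) x(s). A minimal sketch of this oracle update (illustrative; it assumes the targets V^π(s) are given):

import numpy as np

def linear_vfa_oracle_update(w, x_s, v_pi_s, alpha):
    # Delta w = step-size * prediction error * feature value
    prediction_error = v_pi_s - x_s @ w    # V^pi(s) - x(s)^T w
    return w + alpha * prediction_error * x_s

w = np.zeros(4)
x_s = np.array([1.0, 0.0, 0.5, 2.0])       # hypothetical feature vector for state s
w = linear_vfa_oracle_update(w, x_s, v_pi_s=3.0, alpha=0.1)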


Monte Carlo Value Function Approximation

Return Gt is an unbiased but noisy sample of the true expected return


V π (st )
Therefore can reduce MC VFA to doing supervised learning on a set of (state, return) pairs: ⟨s1, G1⟩, ⟨s2, G2⟩, . . . , ⟨sT, GT⟩
Substitute Gt for the true V^π(st) when fitting the function approximator

Monte Carlo Value Function Approximation

Return Gt is an unbiased but noisy sample of the true expected return


V π (st )
Therefore can reduce MC VFA to doing supervised learning on a set of (state, return) pairs: ⟨s1, G1⟩, ⟨s2, G2⟩, . . . , ⟨sT, GT⟩
Substitute Gt for the true V^π(st) when fitting the function approximator
Concretely when using linear VFA for policy evaluation

∆w = α(Gt − V̂ (st ; w ))∇w V̂ (st ; w )


= α(Gt − V̂ (st ; w ))x(st )
= α(Gt − x(st )T w )x(st )

Note: Gt may be a very noisy estimate of true return

MC Linear Value Function Approximation for Policy
Evaluation

1: Initialize w = 0, k = 1
2: loop
3:    Sample k-th episode (sk,1, ak,1, rk,1, sk,2, . . . , sk,Lk) given π
4:    for t = 1, . . . , Lk do
5:       if first visit to (s) in episode k then
6:          Gt(s) = Σ_{j=t}^{Lk} rk,j
7:          Update weights: w = w + α(Gt(s) − x(s)^T w) x(s)
8:       end if
9:    end for
10:   k = k + 1
11: end loop
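A minimal Python sketch of this algorithm (illustrative; it assumes each episode is given as a list of (state, reward) pairs generated by following π, a feature function mapping states to vectors, and undiscounted returns as in the pseudocode above):

import numpy as np

def mc_linear_policy_evaluation(episodes, features, n_features, alpha=0.01):
    # episodes: list of episodes, each a list of (state, reward) pairs generated by pi
    # features: function mapping a state to its feature vector x(s)
    w = np.zeros(n_features)
    for episode in episodes:
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]
        for t, s in enumerate(states):
            if s in states[:t]:
                continue                       # first-visit check
            G_t = sum(rewards[t:])             # return from time t (undiscounted, as above)
            x_s = features(s)
            w = w + alpha * (G_t - x_s @ w) * x_s
    return w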

Baird (1995)-Like Example with MC Policy Evaluation

Feature vectors and initial weights:
x(s1)^T = [2 0 0 0 0 0 0 1]
x(s2)^T = [0 2 0 0 0 0 0 1]
. . .
x(s6)^T = [0 0 0 0 0 2 0 1]
x(s7)^T = [0 0 0 0 0 0 1 2]
w0 = [1 1 1 1 1 1 1 1]
Rewards are all 0, and s7 has a small probability of going to the terminal state.

MC update: ∆w = α(Gt − x(st)^T w) x(st)
Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation

Define the mean squared error of a linear value function


approximation for a particular policy π relative to the true value as
MSVE_µ(w) = Σ_{s∈S} µ(s) (V^π(s) − V̂^π(s; w))^2

where
µ(s): probability of visiting state s under policy π. Note Σ_s µ(s) = 1
V̂ π (s; w ) = x(s)T w , a linear value function approximation

Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation
Define the mean squared error of a linear value function
approximation for a particular policy π relative to the true value as
MSVE_µ(w) = Σ_{s∈S} µ(s) (V^π(s) − V̂^π(s; w))^2

where
µ(s): probability of visiting state s under policy π. Note Σ_s µ(s) = 1
V̂ π (s; w ) = x(s)T w , a linear value function approximation
Monte Carlo policy evaluation with VFA converges to the weights
wMC which has the minimum mean squared error possible with
respect to the distribution µ:
MSVE_µ(w_MC) = min_w Σ_{s∈S} µ(s) (V^π(s) − V̂^π(s; w))^2

Break

Today: Focus on Generalization using Linear Value
Function

Preliminaries
Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation

Recall: Temporal Difference Learning w/ Lookup Table

Uses bootstrapping and sampling to approximate V π


Updates V^π(s) after each transition (s, a, r, s′):
V^π(s) = V^π(s) + α(r + γ V^π(s′) − V^π(s))
Target is r + γ V^π(s′), a biased estimate of the true value V^π(s)


Represent value for each state with a separate table entry
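For reference, a minimal sketch of this tabular update (assuming transitions (s, r, s′) sampled while following π):

from collections import defaultdict

def tabular_td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # V: table with one value estimate per state
    td_target = r + gamma * V[s_next]
    V[s] = V[s] + alpha * (td_target - V[s])
    return V

V = defaultdict(float)
V = tabular_td0_update(V, s="s1", r=1.0, s_next="s2")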

Temporal Difference (TD(0)) Learning with Value
Function Approximation

Uses bootstrapping and sampling to approximate true V π


Updates estimate V^π(s) after each transition (s, a, r, s′):
V^π(s) = V^π(s) + α(r + γ V^π(s′) − V^π(s))
Target is r + γ V^π(s′), a biased estimate of the true value V^π(s)
In value function approximation, the target is r + γ V̂^π(s′; w), a biased and approximated estimate of the true value V^π(s)
3 forms of approximation:

Temporal Difference (TD(0)) Learning with Value
Function Approximation

Uses bootstrapping and sampling to approximate true V π


Updates estimate V^π(s) after each transition (s, a, r, s′):
V^π(s) = V^π(s) + α(r + γ V^π(s′) − V^π(s))
Target is r + γ V^π(s′), a biased estimate of the true value V^π(s)
In value function approximation, the target is r + γ V̂^π(s′; w), a biased and approximated estimate of the true value V^π(s)
3 forms of approximation:
1 Sampling
2 Bootstrapping
3 Value function approximation

Temporal Difference (TD(0)) Learning with Value
Function Approximation

In value function approximation, the target is r + γ V̂^π(s′; w), a biased and approximated estimate of the true value V^π(s)
Can reduce doing TD(0) learning with value function approximation to supervised learning on a set of data pairs:
⟨s1, r1 + γ V̂^π(s2; w)⟩, ⟨s2, r2 + γ V̂^π(s3; w)⟩, . . .
Find weights to minimize the mean squared error
J(w) = Eπ[(rj + γ V̂^π(sj+1; w) − V̂^π(sj; w))^2]

Temporal Difference (TD(0)) Learning with Value
Function Approximation

In value function approximation, the target is r + γ V̂^π(s′; w), a biased and approximated estimate of the true value V^π(s)
Supervised learning on a different set of data pairs: ⟨s1, r1 + γ V̂^π(s2; w)⟩, ⟨s2, r2 + γ V̂^π(s3; w)⟩, . . .
In linear TD(0):
∆w = α(r + γ V̂^π(s′; w) − V̂^π(s; w)) ∇w V̂^π(s; w)
   = α(r + γ V̂^π(s′; w) − V̂^π(s; w)) x(s)
   = α(r + γ x(s′)^T w − x(s)^T w) x(s)

TD(0) Linear Value Function Approximation for Policy
Evaluation

1: Initialize w = 0, k = 1
2: loop
3:    Sample tuple (sk, ak, rk, sk+1) given π
4:    Update weights:
         w = w + α(rk + γ x(sk+1)^T w − x(sk)^T w) x(sk)
5:    k = k + 1
6: end loop
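A minimal Python sketch of this algorithm (illustrative; it assumes transitions (s, r, s′, done) collected while following π and a feature function mapping states to vectors):

import numpy as np

def td0_linear_policy_evaluation(transitions, features, n_features,
                                 alpha=0.01, gamma=0.99):
    # transitions: iterable of (s, r, s_next, done) tuples collected while following pi
    # features: function mapping a state to its feature vector x(s)
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        x_s, x_next = features(s), features(s_next)
        target = r if done else r + gamma * (x_next @ w)   # bootstrap unless terminal
        w = w + alpha * (target - x_s @ w) * x_s
    return w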

Baird Example with TD(0) On-Policy Evaluation

Feature vectors and initial weights (figure from Sutton and Barto 2018):
x(s1)^T = [2 0 0 0 0 0 0 1]
x(s2)^T = [0 2 0 0 0 0 0 1]
. . .
x(s6)^T = [0 0 0 0 0 2 0 1]
x(s7)^T = [0 0 0 0 0 0 1 2]
w0 = [1 1 1 1 1 1 1 1]
Rewards are all 0, and s7 has a small probability of going to the terminal state.

TD update: ∆w = α(r + γ x(s′)^T w − x(s)^T w) x(s)
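A small sketch of one such TD update (feature vectors as listed above; the specific transition s1 → s7 and γ = 0.99 are assumed for illustration), showing how a single update changes the value estimates of every state through the shared weights:

import numpy as np

# Feature vectors x(s1)^T ... x(s7)^T from the example above (rows of X)
X = np.array([
    [2, 0, 0, 0, 0, 0, 0, 1],
    [0, 2, 0, 0, 0, 0, 0, 1],
    [0, 0, 2, 0, 0, 0, 0, 1],
    [0, 0, 0, 2, 0, 0, 0, 1],
    [0, 0, 0, 0, 2, 0, 0, 1],
    [0, 0, 0, 0, 0, 2, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 2],
], dtype=float)
w = np.ones(8)                 # w0 = [1 1 1 1 1 1 1 1]
alpha, gamma, r = 0.1, 0.99, 0.0

x_s, x_next = X[0], X[6]       # one observed transition s1 -> s7 (assumed)
w = w + alpha * (r + gamma * (x_next @ w) - x_s @ w) * x_s

print(X @ w)                   # value estimates of all seven states have changed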
Convergence Guarantees for TD Linear VFA for Policy
Evaluation: Preliminaries

For infinite horizon, the Markov chain defined by an MDP with a particular policy will eventually converge to a probability distribution over states d(s)
d(s) is called the stationary distribution over states of π
Σ_s d(s) = 1
d(s) satisfies the following balance equation:
d(s′) = Σ_s Σ_a π(a|s) p(s′|s, a) d(s)
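A small sketch of computing d numerically (illustrative; it assumes the policy-induced transition matrix P_π, with P_π[s, s′] = Σ_a π(a|s) p(s′|s, a), is available):

import numpy as np

def stationary_distribution(P_pi, n_iters=1000):
    # P_pi: (n_states, n_states) state-to-state transition matrix under policy pi
    n = P_pi.shape[0]
    d = np.ones(n) / n                  # start from the uniform distribution
    for _ in range(n_iters):
        d = d @ P_pi                    # d(s') = sum_s d(s) P_pi[s, s']
    return d

P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])           # hypothetical 2-state chain
print(stationary_distribution(P_pi))    # satisfies d = d P_pi and sums to 1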

Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation
Define the mean squared error of a linear value function
approximation for a particular policy π relative to the true value given
the distribution d as
MSVE_d(w) = Σ_{s∈S} d(s) (V^π(s) − V̂^π(s; w))^2

where
d(s): stationary distribution of π in the true decision process
V̂ π (s; w ) = x(s)T w , a linear value function approximation
TD(0) policy evaluation with VFA converges to weights wTD which is
within a constant factor of the min mean squared error possible given
distribution d:
MSVE_d(w_TD) ≤ (1/(1−γ)) min_w Σ_{s∈S} d(s) (V^π(s) − V̂^π(s; w))^2

Check Your Understanding: Poll

TD(0) policy evaluation with VFA converges to weights wTD which is


within a constant factor of the min mean squared error possible for
distribution d:
MSVE_d(w_TD) ≤ (1/(1−γ)) min_w Σ_{s∈S} d(s) (V^π(s) − V̂^π(s; w))^2

If the VFA is a tabular representation (one feature for each state),


what is the MSVEd for TD?
1 Depends on the problem
2 MSVE = 0 for TD
3 Not sure

Check Your Understanding: Poll

TD(0) policy evaluation with VFA converges to weights wTD which is


within a constant factor of the min mean squared error possible for
distribution d:
MSVE_d(w_TD) ≤ (1/(1−γ)) min_w Σ_{s∈S} d(s) (V^π(s) − V̂^π(s; w))^2

If the VFA is a tabular representation (one feature for each state),


what is the MSVEd for TD?
1 MSVE = 0 for TD (Answer)

Break

Outline for Today

Value function approximation


Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation

Control using Value Function Approximation

Use value function approximation to represent state-action values


Q̂^π(s, a; w) ≈ Q^π(s, a)
Interleave:
Approximate policy evaluation using value function approximation
Perform ε-greedy policy improvement
Can be unstable. Generally involves intersection of the following:
Function approximation
Bootstrapping
Off-policy learning

Action-Value Function Approximation with an Oracle

Q̂^π(s, a; w) ≈ Q^π(s, a)
Minimize the mean-squared error between the true action-value
function Q π (s, a) and the approximate action-value function:

J(w ) = Eπ [(Q π (s, a) − Q̂ π (s, a; w ))2 ]

Use stochastic gradient descent to find a local minimum


−(1/2) ∇w J(w) = E[(Q^π(s, a) − Q̂^π(s, a; w)) ∇w Q̂^π(s, a; w)]
∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) samples the gradient

Check Your Understanding: Predict Control Updates
The weight update for control for MC and TD-style methods will be
near identical to the policy evaluation steps. Try to see if you can
predict which are the right weight update equations for the different
methods (select all that are true)
(1) is the SARSA control update
(2) is the MC control update
(3) is the Q-learning control update
(4) is the MC control update
(5) is the Q-learning control update
∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)    (1)
∆w = α(Gt + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)    (2)
∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)    (3)
∆w = α(Gt − Q̂(st, at; w)) ∇w Q̂(st, at; w)    (4)
∆w = α(r + γ max_{s′} Q̂(s′, a; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)    (5)

Check Your Understanding: Answers

The weight update for control for MC and TD-style methods will be
near identical to the policy evaluation steps. Try to see if you can
predict which are the right weight update equations for the different
methods.
(1) is the SARSA control update
(3) is the Q-learning control update
(4) is the MC control update

∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)    (1)
∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)    (3)
∆w = α(Gt − Q̂(st, at; w)) ∇w Q̂(st, at; w)    (4)

Linear State Action Value Function Approximation with an
Oracle
Use features to represent both the state and action:
x(s, a) = [x1(s, a), x2(s, a), . . . , xn(s, a)]^T
Represent state-action value function with a weighted linear
combination of features
Q̂(s, a; w) = x(s, a)^T w = Σ_{j=1}^{n} xj(s, a) wj

Stochastic gradient descent update:

∇w J(w ) = ∇w Eπ [(Q π (s, a) − Q̂ π (s, a; w ))2 ]
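One common construction (an illustrative sketch, not prescribed by the slides) stacks a copy of the state features into a block for each discrete action:

import numpy as np

def x_sa(x_s, a, n_actions):
    # State-action features: x(s) placed in the block for action a, zeros elsewhere,
    # so Qhat(s, a; w) = x(s, a)^T w uses a separate weight block per action
    n = len(x_s)
    feats = np.zeros(n * n_actions)
    feats[a * n:(a + 1) * n] = x_s
    return feats

x_s = np.array([1.0, 0.5])
print(x_sa(x_s, a=1, n_actions=3))   # [0.  0.  1.  0.5 0.  0. ]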

Incremental Model-Free Control Approaches

Similar to policy evaluation, true state-action value function for a


state is unknown and so substitute a target value
In Monte Carlo methods, use a return Gt as a substitute target

∆w = α(Gt − Q̂(st , at ; w ))∇w Q̂(st , at ; w )

For SARSA instead use a TD target r + γ Q̂(s′, a′; w), which leverages the current function approximation value:

∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)

For Q-learning instead use a TD target r + γ max_{a′} Q̂(s′, a′; w), which leverages the max of the current function approximation value:

∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)
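A minimal Python sketch of these control updates with a linear Q̂ (illustrative; it assumes a state-action feature function x_sa(s, a), a discrete action set, and ε-greedy action selection for the improvement step):

import numpy as np

def q_hat(w, x_sa, s, a):
    return x_sa(s, a) @ w                       # linear Qhat(s, a; w) = x(s, a)^T w

def epsilon_greedy(w, x_sa, s, actions, epsilon=0.1):
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]
    return max(actions, key=lambda a: q_hat(w, x_sa, s, a))

def mc_control_update(w, x_sa, s, a, G_t, alpha=0.01):
    return w + alpha * (G_t - q_hat(w, x_sa, s, a)) * x_sa(s, a)

def sarsa_update(w, x_sa, s, a, r, s_next, a_next, alpha=0.01, gamma=0.99):
    target = r + gamma * q_hat(w, x_sa, s_next, a_next)
    return w + alpha * (target - q_hat(w, x_sa, s, a)) * x_sa(s, a)

def q_learning_update(w, x_sa, s, a, r, s_next, actions, alpha=0.01, gamma=0.99):
    target = r + gamma * max(q_hat(w, x_sa, s_next, a2) for a2 in actions)
    return w + alpha * (target - q_hat(w, x_sa, s, a)) * x_sa(s, a)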

Convergence of TD Methods with VFA

Informally, updates involve doing an (approximate) Bellman backup followed by trying to best fit the underlying value function with a particular feature representation
Bellman operators are contractions, but value function approximation
fitting can be an expansion

Challenges of Off Policy Control: Baird Example

Behavior policy and target policy are not identical


Value can diverge

Convergence of Control Methods with VFA

Algorithm Tabular Linear VFA Nonlinear VFA


Monte-Carlo Control
Sarsa
Q-learning

Important Open Area: Off Policy Learning with Function
Approximation

Extensive work in better TD-style algorithms with value function


approximation, some with convergence guarantees: see Chapter 11 of Sutton and Barto (2018)
This will come up again later in the course

Linear Value Function Approximation
[Figure from Sutton and Barto 2018]
What You Should Understand

Be able to implement TD(0) and MC on policy evaluation with linear


value function approximation
Be able to define what TD(0) and MC on policy evaluation with
linear VFA are converging to and when this solution has 0 error and
non-zero error.
Be able to implement Q-learning and SARSA and MC control
algorithms
List the 3 issues that can cause instability and describe the problems
qualitatively: function approximation, bootstrapping and off policy
learning

Class Structure

Last time: Control (making decisions) without a model of how the


world works
This time: Value function approximation
Next time: Deep reinforcement learning

