Lecture 5: Value Function Approximation
Emma Brunskill
CS234 Reinforcement Learning
Winter 2021
The value function approximation structure for today closely follows much
of David Silver’s Lecture 6.
Refresh Your Knowledge 4
The basic idea of TD methods is to make state-next state pairs fit the constraints of the Bellman equation on average
(question by: Phil Thomas)
1 True
2 False
3 Not sure
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state randomly
selects an action, then (select all)
1 Q-learning will converge to the optimal Q-values
2 SARSA will converge to the optimal Q-values
3 Q-learning is learning off-policy
4 SARSA is learning off-policy
5 Not sure
A TD error > 0 can occur even if the current V (s) is correct ∀s: [select all]
1 False
2 True if the MDP has stochastic state transitions
3 True if the MDP has deterministic state transitions
4 Not sure
Refresh Your Knowledge 4
The basic idea of TD methods is to make state-next state pairs fit the constraints of the Bellman equation on average
(question by: Phil Thomas)
1 True (True)
2 False
3 Not sure
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state randomly
selects an action, then (select all)
1 Q-learning will converge to the optimal Q-values (True)
2 SARSA will converge to the optimal Q-values (False)
3 Q-learning is learning off-policy (True)
4 SARSA is learning off-policy (False)
A TD error > 0 can occur even if the current V (s) is correct ∀s: [select all]
1 False
2 True if the MDP has stochastic state transitions (True)
3 True if the MDP has deterministic state transitions (False)
4 Not sure
Break
Class Structure
Last time: Control (making decisions) without a model of how the
world works
This time: Linear value function approximation
Next time: Deep reinforcement learning
Outline for Today
Value function approximation
Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
Reinforcement Learning
Goal: Learn to select actions to maximize total expected future reward
Last Time: Tabular Representations for Model-free Control
Last time: how to learn a good policy from experience
So far, have been assuming we can represent the value function or
state-action value function as a vector/matrix
Tabular representation
Many real world problems have enormous state and/or action spaces
Tabular representation is insufficient
Motivation for Function Approximation
Don’t want to have to explicitly store or learn for every single state a
Dynamics or reward model
Value
State-action value
Policy
Want a more compact representation that generalizes across states, or
across states and actions
Benefits of Function Approximation
Reduce memory needed to store (P, R) / V / Q / π
Reduce computation needed to compute (P, R) / V / Q / π
Reduce experience needed to find a good (P, R) / V / Q / π
Value Function Approximation (VFA)
Represent a (state-action/state) value function with a parameterized
function instead of a table
(Diagram: state s and weights w map to V̂(s; w); state s, action a, and weights w map to Q̂(s, a; w))
Which function approximator?
Function Approximators
Many possible function approximators including
Linear combinations of features
Neural networks
Decision trees
Nearest neighbors
Fourier / wavelet bases
In this class we will focus on function approximators that are
differentiable (Why?)
Two very popular classes of differentiable function approximators
Linear feature representations (Today)
Neural networks (Next lecture)
Outline for the Rest of Today: Policy Evaluation to Control
Given known dynamics and reward models, and a tabular
representation
Discussed how to do policy evaluation and then control (value iteration
and policy iteration)
Given no models, and a tabular representation
Discussed how to do policy evaluation (MC/TD) and then control
(MC, SARSA, Q-learning)
Given no models, and function approximation
Today will discuss how to do policy evaluation and then control
Review: Gradient Descent
Consider a function J(w ) that is a differentiable function of a
parameter vector w
Goal is to find parameter w that minimizes J
The gradient of J(w) is ∇w J(w) = [∂J(w)/∂w1, ∂J(w)/∂w2, . . . , ∂J(w)/∂wn]T
Value Function Approximation for Policy Evaluation with
an Oracle
First assume we could query any state s and an oracle would return
the true value for V π (s)
Similar to supervised learning: assume given (s, V π (s)) pairs
The objective is to find the best approximate representation of V π
given a particular parameterized function V̂ (s; w )
Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function V π (s) and its approximation V̂ (s; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Can use gradient descent to find a local minimum
∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) uses a finite number of (often
one) samples to compute an approximate gradient:
In expectation, the SGD update equals the full gradient update
Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function V π (s) and its approximation V̂ (s; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Can use gradient descent to find a local minimum
∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) uses a finite number of (often
one) samples to compute an approximate gradient:
∇w J(w) = ∇w Eπ[(V π(s) − V̂(s; w))2]
= −2 Eπ[(V π(s) − V̂(s; w)) ∇w V̂(s; w)]
Sampling this gradient at a single state s and applying ∆w = −(1/2) α ∇w J(w) gives ∆w = α(V π(s) − V̂(s; w)) ∇w V̂(s; w)
In expectation, the SGD update equals the full gradient update
Model Free VFA Policy Evaluation
Don’t actually have access to an oracle to tell true V π (s) for any
state s
Now consider how to do model-free value function approximation for
prediction / evaluation / policy evaluation without a model
Model Free VFA Prediction / Policy Evaluation
Recall model-free policy evaluation (Lecture 3)
Following a fixed policy π (or given access to prior data)
Goal is to estimate V π and/or Q π
Maintained a lookup table to store estimates V π and/or Q π
Updated these estimates after each episode (Monte Carlo methods)
or after each step (TD methods)
Now: in value function approximation, change the estimate
update step to include fitting the function approximator
Break
Outline for Today
Value function approximation
Monte Carlo policy evaluation with linear function
approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
Feature Vectors
Use a feature vector to represent a state s
x(s) = (x1(s), x2(s), . . . , xn(s))T
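For concreteness, a small hypothetical feature map in Python (the particular features, the raw value, its square, and a constant bias, are illustrative choices and not from the slides):

```python
import numpy as np

def feature_vector(s: float) -> np.ndarray:
    """Hypothetical feature map x(s) for a scalar state s:
    the raw value, its square, and a constant bias feature."""
    return np.array([s, s ** 2, 1.0])

# Example: feature_vector(0.5) -> array([0.5, 0.25, 1.0])
```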
Linear Value Function Approximation for Prediction With
An Oracle
Represent a value function (or state-action value function) for a
particular policy with a weighted linear combination of features
V̂(s; w) = Σ_{j=1}^{n} xj(s) wj = x(s)T w
Objective function is
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Recall weight update is
∆w = −(1/2) α ∇w J(w)
Update is:
Linear Value Function Approximation for Prediction With
An Oracle
Represent a value function (or state-action value function) for a
particular policy with a weighted linear combination of features
V̂(s; w) = Σ_{j=1}^{n} xj(s) wj = x(s)T w
Objective function is
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Recall weight update is
∆w = −(1/2) α ∇w J(w)
Update is:
∆w = α (V π(s) − V̂(s; w)) x(s), i.e., step-size × prediction error × feature value
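A minimal numpy sketch of this oracle-based SGD step (the array values, the step size, and the names x_s and v_pi_s are illustrative assumptions, not from the slides):

```python
import numpy as np

def sgd_update(w, x_s, v_pi_s, alpha=0.01):
    """One SGD step for linear VFA given an oracle target V^pi(s):
    step-size * prediction error * feature vector."""
    v_hat = x_s @ w                       # V_hat(s; w) = x(s)^T w
    return w + alpha * (v_pi_s - v_hat) * x_s

# Illustrative usage with made-up numbers
w = np.zeros(3)
w = sgd_update(w, x_s=np.array([1.0, 0.5, 2.0]), v_pi_s=4.0)
```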
Monte Carlo Value Function Approximation
Return Gt is an unbiased but noisy sample of the true expected return
V π (st )
Therefore can reduce MC VFA to doing supervised learning on a set
of (state, return) pairs: ⟨s1, G1⟩, ⟨s2, G2⟩, . . . , ⟨sT, GT⟩
Substitute Gt for the true V π(st) when fitting the function approximator
Monte Carlo Value Function Approximation
Return Gt is an unbiased but noisy sample of the true expected return
V π (st )
Therefore can reduce MC VFA to doing supervised learning on a set
of (state, return) pairs: ⟨s1, G1⟩, ⟨s2, G2⟩, . . . , ⟨sT, GT⟩
Substitute Gt for the true V π(st) when fitting the function approximator
Concretely when using linear VFA for policy evaluation
∆w = α(Gt − V̂ (st ; w ))∇w V̂ (st ; w )
= α(Gt − V̂ (st ; w ))x(st )
= α(Gt − x(st )T w )x(st )
Note: Gt may be a very noisy estimate of true return
MC Linear Value Function Approximation for Policy
Evaluation
1: Initialize w = 0, k = 1
2: loop
3: Sample k-th episode (sk,1 , ak,1 , rk,1 , sk,2 , . . . , sk,Lk ) given π
4: for t = 1, . . . , Lk do
5: if first visit to (s) in episode k then
6: Gt(s) = Σ_{j=t}^{Lk} rk,j
7: Update weights: w ← w + α(Gt(s) − x(s)T w) x(s)
8: end if
9: end for
10: k =k +1
11: end loop
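A sketch of this first-visit MC evaluation loop in Python, assuming a hypothetical sample_episode() helper that returns the states and rewards of one episode generated by following π, a feature map x(s), and hashable states:

```python
import numpy as np

def mc_linear_policy_evaluation(sample_episode, x, n_features,
                                num_episodes=1000, alpha=0.01, gamma=1.0):
    """First-visit Monte Carlo policy evaluation with linear VFA.
    sample_episode() is assumed to return (states, rewards) for one
    episode generated by following pi; x(s) returns the feature vector."""
    w = np.zeros(n_features)
    for _ in range(num_episodes):
        states, rewards = sample_episode()
        # Compute returns backwards: G_t = r_t + gamma * G_{t+1}
        G, returns = 0.0, [0.0] * len(states)
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            returns[t] = G
        visited = set()
        for t, s in enumerate(states):
            if s in visited:               # first-visit check
                continue
            visited.add(s)
            x_s = x(s)
            w += alpha * (returns[t] - x_s @ w) * x_s   # MC update
    return w
```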
Baird (1995)-Like Example with MC Policy Evaluation
Feature vectors (one per state) and initial weights:
x(s1)T = [2 0 0 0 0 0 0 1]
x(s2)T = [0 2 0 0 0 0 0 1]
...
x(s6)T = [0 0 0 0 0 2 0 1]
x(s7)T = [0 0 0 0 0 0 1 2]
w0 = [1 1 1 1 1 1 1 1]
Rewards are all 0; from s7 there is a small probability of transitioning to the terminal state.
MC update: ∆w = α(Gt − x(st)T w) x(st)
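A quick numeric check of one MC update on state s1 of this example (a sketch assuming the return Gt = 0, since rewards are all 0, and an illustrative step size α = 0.1):

```python
import numpy as np

# Feature vector for s1 and initial weights from the example above
x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1.0])
w = np.ones(8)
alpha, G = 0.1, 0.0          # all rewards are 0, so the return G_t is 0

# MC update: w <- w + alpha * (G_t - x(s_t)^T w) * x(s_t)
w = w + alpha * (G - x_s1 @ w) * x_s1
# x(s1)^T w_0 = 3, so w becomes [0.4, 1, 1, 1, 1, 1, 1, 0.7]
```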
Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation
Define the mean squared error of a linear value function
approximation for a particular policy π relative to the true value as
MSVEµ(w) = Σ_{s∈S} µ(s) (V π(s) − V̂ π(s; w))2
where
µ(s): probability of visiting state s under policy π. Note Σ_s µ(s) = 1
V̂ π (s; w ) = x(s)T w , a linear value function approximation
Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation
Define the mean squared error of a linear value function
approximation for a particular policy π relative to the true value as
MSVEµ(w) = Σ_{s∈S} µ(s) (V π(s) − V̂ π(s; w))2
where
µ(s): probability of visiting state s under policy π. Note Σ_s µ(s) = 1
V̂ π (s; w ) = x(s)T w , a linear value function approximation
Monte Carlo policy evaluation with VFA converges to the weights
wMC which has the minimum mean squared error possible with
respect to the distribution µ:
MSVEµ(wMC) = min_w Σ_{s∈S} µ(s) (V π(s) − V̂ π(s; w))2
Break
Today: Focus on Generalization using Linear Value
Function
Preliminaries
Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
Recall: Temporal Difference Learning w/ Lookup Table
Uses bootstrapping and sampling to approximate V π
Updates V π(s) after each transition (s, a, r, s′):
V π(s) = V π(s) + α(r + γ V π(s′) − V π(s))
Target is r + γ V π(s′), a biased estimate of the true value V π(s)
Represent value for each state with a separate table entry
Temporal Difference (TD(0)) Learning with Value
Function Approximation
Uses bootstrapping and sampling to approximate true V π
Updates estimate V π(s) after each transition (s, a, r, s′):
V π(s) = V π(s) + α(r + γ V π(s′) − V π(s))
Target is r + γ V π(s′), a biased estimate of the true value V π(s)
In value function approximation, the target is r + γ V̂ π(s′; w), a biased
and approximated estimate of the true value V π(s)
3 forms of approximation:
Temporal Difference (TD(0)) Learning with Value
Function Approximation
Uses bootstrapping and sampling to approximate true V π
Updates estimate V π(s) after each transition (s, a, r, s′):
V π(s) = V π(s) + α(r + γ V π(s′) − V π(s))
Target is r + γ V π(s′), a biased estimate of the true value V π(s)
In value function approximation, the target is r + γ V̂ π(s′; w), a biased
and approximated estimate of the true value V π(s)
3 forms of approximation:
1 Sampling
2 Bootstrapping
3 Value function approximation
Temporal Difference (TD(0)) Learning with Value
Function Approximation
In value function approximation, the target is r + γ V̂ π(s′; w), a biased
and approximated estimate of the true value V π(s)
Can reduce TD(0) learning with value function approximation
to supervised learning on a set of data pairs:
⟨s1, r1 + γ V̂ π(s2; w)⟩, ⟨s2, r2 + γ V̂ π(s3; w)⟩, . . .
Find weights to minimize mean squared error
J(w) = Eπ[(rj + γ V̂ π(sj+1; w) − V̂ π(sj; w))2]
Temporal Difference (TD(0)) Learning with Value
Function Approximation
In value function approximation, the target is r + γ V̂ π(s′; w), a biased
and approximated estimate of the true value V π(s)
Supervised learning on a different set of data pairs:
⟨s1, r1 + γ V̂ π(s2; w)⟩, ⟨s2, r2 + γ V̂ π(s3; w)⟩, . . .
In linear TD(0)
∆w = α(r + γ V̂ π(s′; w) − V̂ π(s; w)) ∇w V̂ π(s; w)
= α(r + γ V̂ π(s′; w) − V̂ π(s; w)) x(s)
= α(r + γ x(s′)T w − x(s)T w) x(s)
TD(0) Linear Value Function Approximation for Policy
Evaluation
1: Initialize w = 0, k = 1
2: loop
3: Sample tuple (sk, ak, rk, sk+1) given π
4: Update weights:
w ← w + α(rk + γ x(sk+1)T w − x(sk)T w) x(sk)
5: k = k + 1
6: end loop
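A sketch of this loop in Python, assuming a hypothetical environment interface where env.reset() returns a state and env.step(a) returns (next_state, reward, done), plus a policy pi(s) and feature map x(s):

```python
import numpy as np

def td0_linear_policy_evaluation(env, pi, x, n_features,
                                 num_steps=10000, alpha=0.01, gamma=0.99):
    """TD(0) policy evaluation with linear VFA:
    w <- w + alpha * (r + gamma * x(s')^T w - x(s)^T w) * x(s)."""
    w = np.zeros(n_features)
    s = env.reset()
    for _ in range(num_steps):
        a = pi(s)
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * (x(s_next) @ w)
        w += alpha * (target - x(s) @ w) * x(s)
        s = env.reset() if done else s_next
    return w
```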
Baird Example with TD(0) On-Policy Evaluation
Feature vectors (one per state) and initial weights:
x(s1)T = [2 0 0 0 0 0 0 1]
x(s2)T = [0 2 0 0 0 0 0 1]
...
x(s6)T = [0 0 0 0 0 2 0 1]
x(s7)T = [0 0 0 0 0 0 1 2]
w0 = [1 1 1 1 1 1 1 1]
Rewards are all 0; from s7 there is a small probability of transitioning to the terminal state.
TD update: ∆w = α(r + γ x(s′)T w − x(s)T w) x(s)
(Figure from Sutton and Barto 2018)
Convergence Guarantees for TD Linear VFA for Policy
Evaluation: Preliminaries
For infinite horizon problems, the Markov chain defined by an MDP with a
particular policy will eventually converge to a probability distribution
over states d(s)
d(s) is called the stationary distribution over states of π
Σ_s d(s) = 1
d(s) satisfies the following balance equation:
d(s′) = Σ_s Σ_a π(a|s) p(s′|s, a) d(s)
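For intuition, the stationary distribution of a small chain can be approximated by repeatedly applying the balance equation (a sketch with a made-up 3-state transition matrix P_pi, where P_pi[s, s′] = Σ_a π(a|s) p(s′|s, a)):

```python
import numpy as np

# Hypothetical 3-state transition matrix under policy pi
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.3, 0.0, 0.7]])

d = np.ones(3) / 3                 # start from a uniform distribution
for _ in range(1000):
    d = d @ P_pi                   # d(s') = sum_s d(s) P_pi(s, s')
# d now approximates the stationary distribution; d.sum() stays 1
```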
Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation
Define the mean squared error of a linear value function
approximation for a particular policy π relative to the true value given
the distribution d as
MSVEd(w) = Σ_{s∈S} d(s) (V π(s) − V̂ π(s; w))2
where
d(s): stationary distribution of π in the true decision process
V̂ π (s; w ) = x(s)T w , a linear value function approximation
TD(0) policy evaluation with VFA converges to weights wTD which is
within a constant factor of the min mean squared error possible given
distribution d:
MSVEd(wTD) ≤ (1 / (1 − γ)) min_w Σ_{s∈S} d(s) (V π(s) − V̂ π(s; w))2
Check Your Understanding: Poll
TD(0) policy evaluation with VFA converges to weights wTD which is
within a constant factor of the min mean squared error possible for
distribution d:
MSVEd(wTD) ≤ (1 / (1 − γ)) min_w Σ_{s∈S} d(s) (V π(s) − V̂ π(s; w))2
If the VFA is a tabular representation (one feature for each state),
what is the MSVEd for TD?
1 Depends on the problem
2 MSVE = 0 for TD
3 Not sure
Check Your Understanding: Poll
TD(0) policy evaluation with VFA converges to weights wTD which is
within a constant factor of the min mean squared error possible for
distribution d:
MSVEd(wTD) ≤ (1 / (1 − γ)) min_w Σ_{s∈S} d(s) (V π(s) − V̂ π(s; w))2
If the VFA is a tabular representation (one feature for each state),
what is the MSVEd for TD?
1 MSVE = 0 for TD (Answer)
Break
Outline for Today
Value function approximation
Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
Control using Value Function Approximation
Use value function approximation to represent state-action values:
Q̂ π(s, a; w) ≈ Q π(s, a)
Interleave
Approximate policy evaluation using value function approximation
Perform ε-greedy policy improvement (a short sketch of ε-greedy selection follows below)
Can be unstable. Generally involves intersection of the following:
Function approximation
Bootstrapping
Off-policy learning
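For reference, the ε-greedy selection step mentioned above is only a few lines (a sketch; q_values is assumed to hold Q̂(s, a; w) for each action):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick an action that maximizes the approximate Q-values."""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))
```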
Action-Value Function Approximation with an Oracle
Q̂ π(s, a; w) ≈ Q π(s, a)
Minimize the mean-squared error between the true action-value
function Q π (s, a) and the approximate action-value function:
J(w ) = Eπ [(Q π (s, a) − Q̂ π (s, a; w ))2 ]
Use stochastic gradient descent to find a local minimum
−(1/2) ∇w J(w) = Eπ[(Q π(s, a) − Q̂ π(s, a; w)) ∇w Q̂ π(s, a; w)]
∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) samples the gradient
Check Your Understanding: Predict Control Updates
The weight update for control for MC and TD-style methods will be
near identical to the policy evaluation steps. Try to see if you can
predict which are the right weight update equations for the different
methods (select all that are true)
(1) is the SARSA control update
(2) is the MC control update
(3) is the Q-learning control update
(4) is the MC control update
(5) is the Q-learning control update
∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (1)
∆w = α(Gt + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (2)
∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (3)
∆w = α(Gt − Q̂(st, at; w)) ∇w Q̂(st, at; w)   (4)
∆w = α(r + γ max_{s′} Q̂(s′, a; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (5)
Check Your Understanding: Answers
The weight update for control for MC and TD-style methods will be
near identical to the policy evaluation steps. Try to see if you can
predict which are the right weight update equations for the different
methods.
(1) is the SARSA control update
(3) is the Q-learning control update
(4) is the MC control update
∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (1)
∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (3)
∆w = α(Gt − Q̂(st, at; w)) ∇w Q̂(st, at; w)   (4)
Linear State Action Value Function Approximation with an
Oracle
Use features to represent both the state and action
x(s, a) = (x1(s, a), x2(s, a), . . . , xn(s, a))T
Represent state-action value function with a weighted linear
combination of features
Q̂(s, a; w) = x(s, a)T w = Σ_{j=1}^{n} xj(s, a) wj
Stochastic gradient descent update:
∇w J(w ) = ∇w Eπ [(Q π (s, a) − Q̂ π (s, a; w ))2 ]
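One common way (an assumption here, not prescribed by the slides) to build x(s, a) for a discrete action space is to stack one copy of the state features per action:

```python
import numpy as np

def x_sa(x_state, s, a, n_actions):
    """State-action features: copy x_state(s) into the block for action a,
    zeros elsewhere, so Q_hat(s, a; w) = x(s, a)^T w stays linear in w."""
    phi_s = np.asarray(x_state(s))
    phi = np.zeros(len(phi_s) * n_actions)
    phi[a * len(phi_s):(a + 1) * len(phi_s)] = phi_s
    return phi
```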
Incremental Model-Free Control Approaches
Similar to policy evaluation, the true state-action value Q π(s, a) is
unknown, so substitute a target value
In Monte Carlo methods, use a return Gt as a substitute target
∆w = α(Gt − Q̂(st, at; w)) ∇w Q̂(st, at; w)
For SARSA, instead use a TD target r + γ Q̂(s′, a′; w), which leverages
the current function approximation value
∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)
For Q-learning, instead use a TD target r + γ max_{a′} Q̂(s′, a′; w), which
leverages the max of the current function approximation value
∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)
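A sketch of an on-policy SARSA control loop with linear Q̂ that combines these pieces, assuming the same hypothetical environment interface and a state-action feature map x_sa(s, a) as in the earlier sketches; swapping the target for r + γ max_{a′} Q̂(s′, a′; w) would give the Q-learning variant:

```python
import numpy as np

def linear_sarsa(env, x_sa, n_features, n_actions,
                 num_episodes=500, alpha=0.01, gamma=0.99, epsilon=0.1):
    """On-policy SARSA control with linear Q-function approximation:
    w <- w + alpha * (r + gamma * Q_hat(s', a'; w) - Q_hat(s, a; w)) * x(s, a)."""
    w = np.zeros(n_features)

    def act(s):
        # epsilon-greedy over the current approximate Q-values
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([x_sa(s, b) @ w for b in range(n_actions)]))

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                a_next = act(s_next)
                target = r + gamma * (x_sa(s_next, a_next) @ w)
            feats = x_sa(s, a)
            w += alpha * (target - feats @ w) * feats   # SARSA update
            if not done:
                s, a = s_next, a_next
    return w
```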
Convergence of TD Methods with VFA
Informally, updates involve doing an (approximate) Bellman backup
followed by trying to best fit the underlying value function to a particular
feature representation
Bellman operators are contractions, but value function approximation
fitting can be an expansion
Challenges of Off-Policy Control: Baird Example
Behavior policy and target policy are not identical
Value can diverge
Convergence of Control Methods with VFA
Algorithm           | Tabular | Linear VFA | Nonlinear VFA
Monte-Carlo Control |         |            |
Sarsa               |         |            |
Q-learning          |         |            |
Important Open Area: Off Policy Learning with Function
Approximation
Extensive work on better TD-style algorithms with value function
approximation, some with convergence guarantees: see Chapter 11 of Sutton and Barto (2018)
Will come up further later in this course
Linear Value Function Approximation
(Figure from Sutton and Barto 2018)
What You Should Understand
Be able to implement TD(0) and MC on-policy evaluation with linear
value function approximation
Be able to define what TD(0) and MC on-policy evaluation with
linear VFA converge to, and when this solution has zero error versus
non-zero error
Be able to implement Q-learning and SARSA and MC control
algorithms
List the 3 issues that can cause instability and describe the problems
qualitatively: function approximation, bootstrapping, and off-policy
learning
Class Structure
Last time: Control (making decisions) without a model of how the
world works
This time: Value function approximation
Next time: Deep reinforcement learning