Lecture 5: Value Function Approximation
Emma Brunskill
CS234 Reinforcement Learning
Winter 2021
The value function approximation structure for today closely follows much
of David Silver’s Lecture 6.
Refresh Your Knowledge 4
The basic idea of TD methods is to make state-next state pairs fit the constraints of the Bellman equation on average
(question by: Phil Thomas)
1 True
2 False
3 Not sure
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state randomly
selects an action, then (select all)
1 Q-learning will converge to the optimal Q-values
2 SARSA will converge to the optimal Q-values
3 Q-learning is learning off-policy
4 SARSA is learning off-policy
5 Not sure
A TD error > 0 can occur even if the current V (s) is correct ∀s: [select all]
1 False
2 True if the MDP has stochastic state transitions
3 True if the MDP has deterministic state transitions
4 Not sure
Refresh Your Knowledge 4
The basic idea of TD methods is to make state-next state pairs fit the constraints of the Bellman equation on average
(question by: Phil Thomas)
1 True (True)
2 False
3 Not sure
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state randomly
selects an action, then (select all)
1 Q-learning will converge to the optimal Q-values (True)
2 SARSA will converge to the optimal Q-values (False)
3 Q-learning is learning off-policy (True)
4 SARSA is learning off-policy (False)
A TD error > 0 can occur even if the current V (s) is correct ∀s: [select all]
1 False
2 True if the MDP has stochastic state transitions (True)
3 True if the MDP has deterministic state transitions (False)
4 Not sure
Break
Class Structure
Last time: Control (making decisions) without a model of how the
world works
This time: Linear value function approximation
Next time: Deep reinforcement learning
Outline for Today
Value function approximation
Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
Reinforcement Learning
Goal: Learn to select actions to maximize total expected future reward
Last Time: Tabular Representations for Model-free Control
Last time: how to learn a good policy from experience
So far, have been assuming we can represent the value function or
state-action value function as a vector/matrix
Tabular representation
Many real world problems have enormous state and/or action spaces
Tabular representation is insufficient
Motivation for Function Approximation
Don’t want to have to explicitly store or learn for every single state a
Dynamics or reward model
Value
State-action value
Policy
Want a more compact representation that generalizes across states, or
across states and actions
Benefits of Function Approximation
Reduce memory needed to store (P, R) / V / Q / π
Reduce computation needed to compute (P, R) / V / Q / π
Reduce experience needed to find a good (P, R) / V / Q / π
Value Function Approximation (VFA)
Represent a (state-action/state) value function with a parameterized
function instead of a table
(Diagram: state s and weights w map to V̂(s; w); state s, action a, and weights w map to Q̂(s, a; w))
Which function approximator?
Function Approximators
Many possible function approximators including
Linear combinations of features
Neural networks
Decision trees
Nearest neighbors
Fourier / wavelet bases
In this class we will focus on function approximators that are
differentiable (Why?)
Two very popular classes of differentiable function approximators
Linear feature representations (Today)
Neural networks (Next lecture)
Outline for the Rest of Today: Policy Evaluation to Control
Given known dynamics and reward models, and a tabular
representation
Discussed how to do policy evaluation and then control (value iteration
and policy iteration)
Given no models, and a tabular representation
Discussed how to do policy evaluation (MC/TD) and then control
(MC, SARSA, Q-learning)
Given no models, and function approximation
Today will discuss how to do policy evaluation and then control
Review: Gradient Descent
Consider a function J(w ) that is a differentiable function of a
parameter vector w
Goal is to find parameter w that minimizes J
The gradient of J(w) is ∇w J(w) = [∂J(w)/∂w1, ∂J(w)/∂w2, . . . , ∂J(w)/∂wn]T
Value Function Approximation for Policy Evaluation with
an Oracle
First assume we could query any state s and an oracle would return
the true value for V π (s)
Similar to supervised learning: assume given (s, V π (s)) pairs
The objective is to find the best approximate representation of V π
given a particular parameterized function V̂ (s; w )
Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function V π (s) and its approximation V̂ (s; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Can use gradient descent to find a local minimum
∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) uses a finite number of (often
one) samples to compute an approximate gradient:
In expectation, the SGD update equals the full gradient update
Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function V π (s) and its approximation V̂ (s; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Can use gradient descent to find a local minimum
∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) uses a finite number of (often
one) samples to compute an approximate gradient:
∇w J(w) = ∇w Eπ[(V π(s) − V̂(s; w))2]
= −2 Eπ[(V π(s) − V̂(s; w)) ∇w V̂(s; w)]
Sampling this gradient at a single state s and applying ∆w = −(1/2) α ∇w J(w) gives ∆w = α(V π(s) − V̂(s; w)) ∇w V̂(s; w)
In expectation, the SGD update equals the full gradient update
Model Free VFA Policy Evaluation
Don’t actually have access to an oracle to tell true V π (s) for any
state s
Now consider how to do model-free value function approximation for
prediction / evaluation / policy evaluation without a model
Model Free VFA Prediction / Policy Evaluation
Recall model-free policy evaluation (Lecture 3)
Following a fixed policy π (or given access to prior data)
Goal is to estimate V π and/or Q π
Maintained a lookup table to store estimates V π and/or Q π
Updated these estimates after each episode (Monte Carlo methods)
or after each step (TD methods)
Now: in value function approximation, change the estimate
update step to include fitting the function approximator
Break
Outline for Today
Value function approximation
Monte Carlo policy evaluation with linear function
approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
Feature Vectors
Use a feature vector to represent a state s
x(s) = (x1(s), x2(s), . . . , xn(s))T
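For concreteness, a small hypothetical feature map in Python (the particular features, the raw value, its square, and a constant bias, are illustrative choices and not from the slides):

```python
import numpy as np

def feature_vector(s: float) -> np.ndarray:
    """Hypothetical feature map x(s) for a scalar state s:
    the raw value, its square, and a constant bias feature."""
    return np.array([s, s ** 2, 1.0])

# Example: feature_vector(0.5) -> array([0.5, 0.25, 1.0])
```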
Linear Value Function Approximation for Prediction With
An Oracle
Represent a value function (or state-action value function) for a
particular policy with a weighted linear combination of features
V̂(s; w) = Σ_{j=1}^{n} xj(s) wj = x(s)T w
Objective function is
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Recall weight update is
∆w = −(1/2) α ∇w J(w)
Update is:
Linear Value Function Approximation for Prediction With
An Oracle
Represent a value function (or state-action value function) for a
particular policy with a weighted linear combination of features
V̂(s; w) = Σ_{j=1}^{n} xj(s) wj = x(s)T w
Objective function is
J(w ) = Eπ [(V π (s) − V̂ (s; w ))2 ]
Recall weight update is
∆w = −(1/2) α ∇w J(w)
Update is:
∆w = α (V π(s) − V̂(s; w)) x(s), i.e., step-size × prediction error × feature value
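A minimal numpy sketch of this oracle-based SGD step (the array values, the step size, and the names x_s and v_pi_s are illustrative assumptions, not from the slides):

```python
import numpy as np

def sgd_update(w, x_s, v_pi_s, alpha=0.01):
    """One SGD step for linear VFA given an oracle target V^pi(s):
    step-size * prediction error * feature vector."""
    v_hat = x_s @ w                       # V_hat(s; w) = x(s)^T w
    return w + alpha * (v_pi_s - v_hat) * x_s

# Illustrative usage with made-up numbers
w = np.zeros(3)
w = sgd_update(w, x_s=np.array([1.0, 0.5, 2.0]), v_pi_s=4.0)
```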
Monte Carlo Value Function Approximation
Return Gt is an unbiased but noisy sample of the true expected return
V π (st )
Therefore can reduce MC VFA to doing supervised learning on a set
of (state, return) pairs: ⟨s1, G1⟩, ⟨s2, G2⟩, . . . , ⟨sT, GT⟩
Substitute Gt for the true V π(st) when fitting the function approximator
Monte Carlo Value Function Approximation
Return Gt is an unbiased but noisy sample of the true expected return
V π (st )
Therefore can reduce MC VFA to doing supervised learning on a set
of (state, return) pairs: ⟨s1, G1⟩, ⟨s2, G2⟩, . . . , ⟨sT, GT⟩
Substitute Gt for the true V π(st) when fitting the function approximator
Concretely when using linear VFA for policy evaluation
∆w = α(Gt − V̂ (st ; w ))∇w V̂ (st ; w )
= α(Gt − V̂ (st ; w ))x(st )
= α(Gt − x(st )T w )x(st )
Note: Gt may be a very noisy estimate of true return
MC Linear Value Function Approximation for Policy
Evaluation
1: Initialize w = 0, k = 1
2: loop
3: Sample k-th episode (sk,1 , ak,1 , rk,1 , sk,2 , . . . , sk,Lk ) given π
4: for t = 1, . . . , Lk do
5: if first visit to (s) in episode k then
6: Gt(s) = Σ_{j=t}^{Lk} rk,j
7: Update weights: w ← w + α(Gt(s) − x(s)T w) x(s)
8: end if
9: end for
10: k =k +1
11: end loop
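A sketch of this first-visit MC evaluation loop in Python, assuming a hypothetical sample_episode() helper that returns the states and rewards of one episode generated by following π, a feature map x(s), and hashable states:

```python
import numpy as np

def mc_linear_policy_evaluation(sample_episode, x, n_features,
                                num_episodes=1000, alpha=0.01, gamma=1.0):
    """First-visit Monte Carlo policy evaluation with linear VFA.
    sample_episode() is assumed to return (states, rewards) for one
    episode generated by following pi; x(s) returns the feature vector."""
    w = np.zeros(n_features)
    for _ in range(num_episodes):
        states, rewards = sample_episode()
        # Compute returns backwards: G_t = r_t + gamma * G_{t+1}
        G, returns = 0.0, [0.0] * len(states)
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            returns[t] = G
        visited = set()
        for t, s in enumerate(states):
            if s in visited:               # first-visit check
                continue
            visited.add(s)
            x_s = x(s)
            w += alpha * (returns[t] - x_s @ w) * x_s   # MC update
    return w
```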
Baird (1995)-Like Example with MC Policy Evaluation
Feature vectors (one per state) and initial weights:
x(s1)T = [2 0 0 0 0 0 0 1]
x(s2)T = [0 2 0 0 0 0 0 1]
...
x(s6)T = [0 0 0 0 0 2 0 1]
x(s7)T = [0 0 0 0 0 0 1 2]
w0 = [1 1 1 1 1 1 1 1]
Rewards are all 0; from s7 there is a small probability of transitioning to the terminal state.
MC update: ∆w = α(Gt − x(st)T w) x(st)
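A quick numeric check of one MC update on state s1 of this example (a sketch assuming the return Gt = 0, since rewards are all 0, and an illustrative step size α = 0.1):

```python
import numpy as np

# Feature vector for s1 and initial weights from the example above
x_s1 = np.array([2, 0, 0, 0, 0, 0, 0, 1.0])
w = np.ones(8)
alpha, G = 0.1, 0.0          # all rewards are 0, so the return G_t is 0

# MC update: w <- w + alpha * (G_t - x(s_t)^T w) * x(s_t)
w = w + alpha * (G - x_s1 @ w) * x_s1
# x(s1)^T w_0 = 3, so w becomes [0.4, 1, 1, 1, 1, 1, 1, 0.7]
```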
Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation
Define the mean squared error of a linear value function
approximation for a particular policy π relative to the true value as
MSVEµ(w) = Σ_{s∈S} µ(s) (V π(s) − V̂ π(s; w))2
where
µ(s): probability of visiting state s under policy π. Note Σ_s µ(s) = 1
V̂ π (s; w ) = x(s)T w , a linear value function approximation
Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation
Define the mean squared error of a linear value function
approximation for a particular policy π relative to the true value as
MSVEµ(w) = Σ_{s∈S} µ(s) (V π(s) − V̂ π(s; w))2
where
µ(s): probability of visiting state s under policy π. Note Σ_s µ(s) = 1
V̂ π (s; w ) = x(s)T w , a linear value function approximation
Monte Carlo policy evaluation with VFA converges to the weights
wMC which has the minimum mean squared error possible with
respect to the distribution µ:
MSVEµ(wMC) = min_w Σ_{s∈S} µ(s) (V π(s) − V̂ π(s; w))2
Break
Today: Focus on Generalization using Linear Value
Function
Preliminaries
Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
Recall: Temporal Difference Learning w/ Lookup Table
Uses bootstrapping and sampling to approximate V π
Updates V π(s) after each transition (s, a, r, s′):
V π(s) = V π(s) + α(r + γ V π(s′) − V π(s))
Target is r + γ V π(s′), a biased estimate of the true value V π(s)
Represent value for each state with a separate table entry
Temporal Difference (TD(0)) Learning with Value
Function Approximation
Uses bootstrapping and sampling to approximate true V π
Updates estimate V π(s) after each transition (s, a, r, s′):
V π(s) = V π(s) + α(r + γ V π(s′) − V π(s))
Target is r + γ V π(s′), a biased estimate of the true value V π(s)
In value function approximation, the target is r + γ V̂ π(s′; w), a biased
and approximated estimate of the true value V π(s)
3 forms of approximation:
Temporal Difference (TD(0)) Learning with Value
Function Approximation
Uses bootstrapping and sampling to approximate true V π
Updates estimate V π(s) after each transition (s, a, r, s′):
V π(s) = V π(s) + α(r + γ V π(s′) − V π(s))
Target is r + γ V π(s′), a biased estimate of the true value V π(s)
In value function approximation, the target is r + γ V̂ π(s′; w), a biased
and approximated estimate of the true value V π(s)
3 forms of approximation:
1 Sampling
2 Bootstrapping
3 Value function approximation
Temporal Difference (TD(0)) Learning with Value
Function Approximation
In value function approximation, the target is r + γ V̂ π(s′; w), a biased
and approximated estimate of the true value V π(s)
Can reduce TD(0) learning with value function approximation
to supervised learning on a set of data pairs:
⟨s1, r1 + γ V̂ π(s2; w)⟩, ⟨s2, r2 + γ V̂ π(s3; w)⟩, . . .
Find weights to minimize mean squared error
J(w) = Eπ[(rj + γ V̂ π(sj+1; w) − V̂ π(sj; w))2]
Temporal Difference (TD(0)) Learning with Value
Function Approximation
In value function approximation, the target is r + γ V̂ π(s′; w), a biased
and approximated estimate of the true value V π(s)
Supervised learning on a different set of data pairs:
⟨s1, r1 + γ V̂ π(s2; w)⟩, ⟨s2, r2 + γ V̂ π(s3; w)⟩, . . .
In linear TD(0)
∆w = α(r + γ V̂ π(s′; w) − V̂ π(s; w)) ∇w V̂ π(s; w)
= α(r + γ V̂ π(s′; w) − V̂ π(s; w)) x(s)
= α(r + γ x(s′)T w − x(s)T w) x(s)
TD(0) Linear Value Function Approximation for Policy
Evaluation
1: Initialize w = 0, k = 1
2: loop
3: Sample tuple (sk, ak, rk, sk+1) given π
4: Update weights:
w ← w + α(rk + γ x(sk+1)T w − x(sk)T w) x(sk)
5: k = k + 1
6: end loop
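A sketch of this loop in Python, assuming a hypothetical environment interface where env.reset() returns a state and env.step(a) returns (next_state, reward, done), plus a policy pi(s) and feature map x(s):

```python
import numpy as np

def td0_linear_policy_evaluation(env, pi, x, n_features,
                                 num_steps=10000, alpha=0.01, gamma=0.99):
    """TD(0) policy evaluation with linear VFA:
    w <- w + alpha * (r + gamma * x(s')^T w - x(s)^T w) * x(s)."""
    w = np.zeros(n_features)
    s = env.reset()
    for _ in range(num_steps):
        a = pi(s)
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * (x(s_next) @ w)
        w += alpha * (target - x(s) @ w) * x(s)
        s = env.reset() if done else s_next
    return w
```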
Baird Example with TD(0) On-Policy Evaluation
Feature vectors (one per state) and initial weights:
x(s1)T = [2 0 0 0 0 0 0 1]
x(s2)T = [0 2 0 0 0 0 0 1]
...
x(s6)T = [0 0 0 0 0 2 0 1]
x(s7)T = [0 0 0 0 0 0 1 2]
w0 = [1 1 1 1 1 1 1 1]
Rewards are all 0; from s7 there is a small probability of transitioning to the terminal state.
TD update: ∆w = α(r + γ x(s′)T w − x(s)T w) x(s)
(Figure from Sutton and Barto 2018)
Convergence Guarantees for TD Linear VFA for Policy
Evaluation: Preliminaries
For infinite horizon problems, the Markov chain defined by an MDP with a
particular policy will eventually converge to a probability distribution
over states d(s)
d(s) is called the stationary distribution over states of π
Σ_s d(s) = 1
d(s) satisfies the following balance equation:
d(s′) = Σ_s Σ_a π(a|s) p(s′|s, a) d(s)
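For intuition, the stationary distribution of a small chain can be approximated by repeatedly applying the balance equation (a sketch with a made-up 3-state transition matrix P_pi, where P_pi[s, s′] = Σ_a π(a|s) p(s′|s, a)):

```python
import numpy as np

# Hypothetical 3-state transition matrix under policy pi
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.3, 0.0, 0.7]])

d = np.ones(3) / 3                 # start from a uniform distribution
for _ in range(1000):
    d = d @ P_pi                   # d(s') = sum_s d(s) P_pi(s, s')
# d now approximates the stationary distribution; d.sum() stays 1
```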
Convergence Guarantees for Linear Value Function
Approximation for Policy Evaluation
Define the mean squared error of a linear value function
approximation for a particular policy π relative to the true value given
the distribution d as
MSVEd(w) = Σ_{s∈S} d(s) (V π(s) − V̂ π(s; w))2
where
d(s): stationary distribution of π in the true decision process
V̂ π (s; w ) = x(s)T w , a linear value function approximation
TD(0) policy evaluation with VFA converges to weights wTD which is
within a constant factor of the min mean squared error possible given
distribution d:
MSVEd(wTD) ≤ (1 / (1 − γ)) min_w Σ_{s∈S} d(s) (V π(s) − V̂ π(s; w))2
Check Your Understanding: Poll
TD(0) policy evaluation with VFA converges to weights wTD which is
within a constant factor of the min mean squared error possible for
distribution d:
MSVEd(wTD) ≤ (1 / (1 − γ)) min_w Σ_{s∈S} d(s) (V π(s) − V̂ π(s; w))2
If the VFA is a tabular representation (one feature for each state),
what is the MSVEd for TD?
1 Depends on the problem
2 MSVE = 0 for TD
3 Not sure
Check Your Understanding: Poll
TD(0) policy evaluation with VFA converges to weights wTD which is
within a constant factor of the min mean squared error possible for
distribution d:
MSVEd(wTD) ≤ (1 / (1 − γ)) min_w Σ_{s∈S} d(s) (V π(s) − V̂ π(s; w))2
If the VFA is a tabular representation (one feature for each state),
what is the MSVEd for TD?
1 MSVE = 0 for TD (Answer)
Break
Outline for Today
Value function approximation
Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
Control using Value Function Approximation
Use value function approximation to represent state-action values:
Q̂ π(s, a; w) ≈ Q π(s, a)
Interleave
Approximate policy evaluation using value function approximation
Perform ε-greedy policy improvement (a short sketch of ε-greedy selection follows below)
Can be unstable. Generally involves intersection of the following:
Function approximation
Bootstrapping
Off-policy learning
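For reference, the ε-greedy selection step mentioned above is only a few lines (a sketch; q_values is assumed to hold Q̂(s, a; w) for each action):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick an action that maximizes the approximate Q-values."""
    if np.random.random() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))
```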
Action-Value Function Approximation with an Oracle
Q̂ π(s, a; w) ≈ Q π(s, a)
Minimize the mean-squared error between the true action-value
function Q π (s, a) and the approximate action-value function:
J(w ) = Eπ [(Q π (s, a) − Q̂ π (s, a; w ))2 ]
Use stochastic gradient descent to find a local minimum
−(1/2) ∇w J(w) = Eπ[(Q π(s, a) − Q̂ π(s, a; w)) ∇w Q̂ π(s, a; w)]
∆w = −(1/2) α ∇w J(w)
Stochastic gradient descent (SGD) samples the gradient
Check Your Understanding: Predict Control Updates
The weight update for control for MC and TD-style methods will be
near identical to the policy evaluation steps. Try to see if you can
predict which are the right weight update equations for the different
methods (select all that are true)
(1) is the SARSA control update
(2) is the MC control update
(3) is the Q-learning control update
(4) is the MC control update
(5) is the Q-learning control update
∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (1)
∆w = α(Gt + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (2)
∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (3)
∆w = α(Gt − Q̂(st, at; w)) ∇w Q̂(st, at; w)   (4)
∆w = α(r + γ max_{s′} Q̂(s′, a; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (5)
Check Your Understanding: Answers
The weight update for control for MC and TD-style methods will be
near identical to the policy evaluation steps. Try to see if you can
predict which are the right weight update equations for the different
methods.
(1) is the SARSA control update
(3) is the Q-learning control update
(4) is the MC control update
∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (1)
∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)   (3)
∆w = α(Gt − Q̂(st, at; w)) ∇w Q̂(st, at; w)   (4)
Linear State Action Value Function Approximation with an
Oracle
Use features to represent both the state and action
x(s, a) = (x1(s, a), x2(s, a), . . . , xn(s, a))T
Represent state-action value function with a weighted linear
combination of features
Q̂(s, a; w) = x(s, a)T w = Σ_{j=1}^{n} xj(s, a) wj
Stochastic gradient descent update:
∇w J(w ) = ∇w Eπ [(Q π (s, a) − Q̂ π (s, a; w ))2 ]
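One common way (an assumption here, not prescribed by the slides) to build x(s, a) for a discrete action space is to stack one copy of the state features per action:

```python
import numpy as np

def x_sa(x_state, s, a, n_actions):
    """State-action features: copy x_state(s) into the block for action a,
    zeros elsewhere, so Q_hat(s, a; w) = x(s, a)^T w stays linear in w."""
    phi_s = np.asarray(x_state(s))
    phi = np.zeros(len(phi_s) * n_actions)
    phi[a * len(phi_s):(a + 1) * len(phi_s)] = phi_s
    return phi
```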
Incremental Model-Free Control Approaches
Similar to policy evaluation, the true state-action value Q π(s, a) is
unknown, so substitute a target value
In Monte Carlo methods, use a return Gt as a substitute target
∆w = α(Gt − Q̂(st, at; w)) ∇w Q̂(st, at; w)
For SARSA, instead use a TD target r + γ Q̂(s′, a′; w), which leverages
the current function approximation value
∆w = α(r + γ Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)
For Q-learning, instead use a TD target r + γ max_{a′} Q̂(s′, a′; w), which
leverages the max of the current function approximation value
∆w = α(r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇w Q̂(s, a; w)
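A sketch of an on-policy SARSA control loop with linear Q̂ that combines these pieces, assuming the same hypothetical environment interface and a state-action feature map x_sa(s, a) as in the earlier sketches; swapping the target for r + γ max_{a′} Q̂(s′, a′; w) would give the Q-learning variant:

```python
import numpy as np

def linear_sarsa(env, x_sa, n_features, n_actions,
                 num_episodes=500, alpha=0.01, gamma=0.99, epsilon=0.1):
    """On-policy SARSA control with linear Q-function approximation:
    w <- w + alpha * (r + gamma * Q_hat(s', a'; w) - Q_hat(s, a; w)) * x(s, a)."""
    w = np.zeros(n_features)

    def act(s):
        # epsilon-greedy over the current approximate Q-values
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([x_sa(s, b) @ w for b in range(n_actions)]))

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                a_next = act(s_next)
                target = r + gamma * (x_sa(s_next, a_next) @ w)
            feats = x_sa(s, a)
            w += alpha * (target - feats @ w) * feats   # SARSA update
            if not done:
                s, a = s_next, a_next
    return w
```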
Convergence of TD Methods with VFA
Informally, updates involve doing an (approximate) Bellman backup
followed by trying to best fit the underlying value function to a particular
feature representation
Bellman operators are contractions, but value function approximation
fitting can be an expansion
Challenges of Off-Policy Control: Baird Example
Behavior policy and target policy are not identical
Value can diverge
Convergence of Control Methods with VFA
Algorithm           | Tabular | Linear VFA | Nonlinear VFA
Monte-Carlo Control |         |            |
Sarsa               |         |            |
Q-learning          |         |            |
Important Open Area: Off Policy Learning with Function
Approximation
Extensive work on better TD-style algorithms with value function
approximation, some with convergence guarantees: see Chapter 11 of Sutton and Barto (2018)
Will come up further later in this course
Linear Value Function Approximation
(Figure from Sutton and Barto 2018)
What You Should Understand
Be able to implement TD(0) and MC on-policy evaluation with linear
value function approximation
Be able to define what TD(0) and MC on-policy evaluation with
linear VFA converge to, and when this solution has zero error versus
non-zero error
Be able to implement Q-learning and SARSA and MC control
algorithms
List the 3 issues that can cause instability and describe the problems
qualitatively: function approximation, bootstrapping, and off-policy
learning
Class Structure
Last time: Control (making decisions) without a model of how the
world works
This time: Value function approximation
Next time: Deep reinforcement learning