Lecture 5: Value Function Approximation
Emma Brunskill
Winter 2021
The value function approximation structure for today closely follows much
of David Silver’s Lecture 6.
Refresh Your Knowledge 4
The basic idea of TD methods are to make state-next state pairs fit the constraints of the Bellman equation on average
1 True (True)
2 False
3 Not sure
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state randomly
selects an action, then (select all)
1 False
2 True if the MDP has stochastic state transitions (True)
3 True if the MDP has deterministic state transitions (False)
4 Not sure
Don’t want to have to explicitly store or learn for every single state a
Dynamics or reward model
Value
State-action value
Policy
Want a more compact representation that generalizes across states, or across states and actions
Diagram: a function approximator maps state s with parameters w to V̂(s; w), or state-action pair (s, a) with parameters w to Q̂(s, a; w)
Which function approximator?
First assume we could query any state s and an oracle would return the true value V^π(s)
Similar to supervised learning: assume we are given (s, V^π(s)) pairs
The objective is to find the best approximate representation of V^π given a particular parameterized function V̂(s; w)
In practice we don't have access to an oracle that tells us the true V^π(s) for any state s
Now consider how to do model-free value function approximation for prediction / evaluation / policy evaluation without a model
Represent the state with a feature vector x(s) = (x_1(s), x_2(s), ..., x_n(s))^T
Objective function is
J(w) = E_π[(V^π(s) − V̂(s; w))^2]
Recall the weight update is
Δw = −(1/2) α ∇_w J(w)
Update is: Δw = α(V^π(s) − V̂(s; w)) ∇_w V̂(s; w); for linear VFA, ∇_w V̂(s; w) = x(s), so Δw = α(V^π(s) − x(s)^T w) x(s)
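As a concrete illustration, here is a minimal sketch of this stochastic gradient step for a linear V̂(s; w) = x(s)^T w, assuming a NumPy feature vector and an oracle-provided target value (the function and argument names are illustrative, not from the lecture):

```python
import numpy as np

def sgd_value_update(w, x_s, v_target, alpha=0.01):
    """One stochastic gradient step on J(w) = E_pi[(V^pi(s) - V_hat(s; w))^2].

    For linear VFA, V_hat(s; w) = x(s)^T w and grad_w V_hat(s; w) = x(s), so
    delta_w = alpha * (V^pi(s) - x(s)^T w) * x(s).

    x_s:      feature vector x(s) as a NumPy array (assumed)
    v_target: oracle value V^pi(s), or a sample-based substitute such as G_t
    """
    v_hat = x_s @ w                      # current estimate V_hat(s; w)
    return w + alpha * (v_target - v_hat) * x_s
```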
1: Initialize w = 0, k = 1
2: loop
3:   Sample k-th episode (s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, ..., s_{k,L_k}) given π
4:   for t = 1, ..., L_k do
5:     if first visit to (s) in episode k then
6:       G_t(s) = Σ_{j=t}^{L_k} r_{k,j}
7:       Update weights: Δw = α(G_t(s) − V̂(s; w)) ∇_w V̂(s; w)
8:     end if
9:   end for
10:  k = k + 1
11: end loop
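Below is a sketch of this Monte Carlo loop in Python/NumPy. It assumes episodes are given as lists of (state, action, reward) tuples generated by following π, that states are hashable, and that a feature map x(s) returning a length-n vector is available; these names and the γ parameter are illustrative additions, with γ = 1 matching the undiscounted return in the pseudocode above.

```python
import numpy as np

def mc_policy_evaluation_vfa(episodes, x, n_features, alpha=0.01, gamma=1.0):
    """First-visit Monte Carlo policy evaluation with linear VFA.

    episodes:   list of episodes, each a list of (state, action, reward) tuples
                generated by following pi (assumed format)
    x:          feature map, x(s) -> np.ndarray of length n_features (assumed)
    gamma=1.0:  matches the undiscounted return in the pseudocode above
    """
    w = np.zeros(n_features)
    for episode in episodes:
        # Compute the return G_t from each time step to the end of the episode.
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        visited = set()
        for (s, _, _), G_t in zip(episode, returns):
            if s in visited:             # first-visit only
                continue
            visited.add(s)
            x_s = x(s)
            v_hat = x_s @ w
            # Delta w = alpha * (G_t - V_hat(s; w)) * grad_w V_hat(s; w)
            #         = alpha * (G_t - x(s)^T w) * x(s)
            w += alpha * (G_t - v_hat) * x_s
    return w
```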
Example: w_0 = [1 1 1 1 1 1 1 1], rewards all 0, s_7 has a small probability of going to the terminal state
where
μ(s): probability of visiting state s under policy π. Note Σ_s μ(s) = 1
V̂^π(s; w) = x(s)^T w, a linear value function approximation
Monte Carlo policy evaluation with VFA converges to the weights w_MC that achieve the minimum mean squared error possible with respect to the distribution μ:
MSVE_μ(w_MC) = min_w Σ_{s∈S} μ(s) (V^π(s) − V̂^π(s; w))^2
Preliminaries
Monte Carlo policy evaluation with linear function approximation
TD policy evaluation with linear function approximation
Control methods with linear value function approximation
1: Initialize w = 0, k = 1
2: loop
3:   Sample tuple (s_k, a_k, r_k, s_{k+1}) given π
4:   Update weights: Δw = α(r_k + γ V̂(s_{k+1}; w) − V̂(s_k; w)) ∇_w V̂(s_k; w)
5:   k = k + 1
6: end loop
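A corresponding sketch of the TD(0) loop with linear VFA, assuming transitions are provided as (s, a, r, s', done) tuples sampled by following π and that the same kind of feature map x(s) is available (again, the names and the done flag are illustrative assumptions):

```python
import numpy as np

def td0_policy_evaluation_vfa(transitions, x, n_features, alpha=0.01, gamma=0.99):
    """TD(0) policy evaluation with linear VFA.

    transitions: iterable of (s, a, r, s_next, done) tuples sampled by
                 following pi (assumed format; the action is not used for
                 state-value prediction)
    x:           feature map, x(s) -> np.ndarray of length n_features (assumed)
    """
    w = np.zeros(n_features)
    for s, _a, r, s_next, done in transitions:
        x_s = x(s)
        v_hat = x_s @ w
        # TD target bootstraps from the current approximation at the next state;
        # terminal states are given value 0.
        target = r if done else r + gamma * (x(s_next) @ w)
        # Delta w = alpha * (target - V_hat(s; w)) * grad_w V_hat(s; w)
        #         = alpha * (target - x(s)^T w) * x(s)
        w += alpha * (target - v_hat) * x_s
    return w
```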
Example: w_0 = [1 1 1 1 1 1 1 1], rewards all 0, s_7 has a small probability of going to the terminal state (figure from Sutton and Barto 2018)
Convergence Guarantees for TD Linear VFA for Policy Evaluation: Preliminaries
where
d(s): stationary distribution of π in the true decision process
V̂^π(s; w) = x(s)^T w, a linear value function approximation
TD(0) policy evaluation with VFA converges to weights w_TD that are within a constant factor of the minimum mean squared error possible given distribution d:
MSVE_d(w_TD) ≤ (1 / (1 − γ)) min_w Σ_{s∈S} d(s) (V^π(s) − V̂^π(s; w))^2
Q̂^π(s, a; w) ≈ Q^π(s, a)
Minimize the mean-squared error between the true action-value function Q^π(s, a) and the approximate action-value function:
J(w) = E_π[(Q^π(s, a) − Q̂^π(s, a; w))^2]
The weight updates for control with MC and TD-style methods are nearly identical to the policy evaluation updates. Try to see if you can predict the right weight update equations for the different methods.
(1) is the SARSA control update
(3) is the Q-learning control update
(4) is the MC control update
Represent the state-action pair with a feature vector x(s, a) = (x_1(s, a), x_2(s, a), ..., x_n(s, a))^T
Represent the state-action value function with a weighted linear combination of features:
Q̂(s, a; w) = x(s, a)^T w = Σ_{j=1}^{n} x_j(s, a) w_j
Δw = α(r + γ max_{a'} Q̂(s', a'; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
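Putting the pieces together, here is a sketch of Q-learning with a linear Q̂(s, a; w) and ε-greedy exploration. The environment interface (reset()/step()), the feature map x(s, a), and the list of actions are assumptions made for illustration, not part of the lecture slides:

```python
import numpy as np

def q_learning_linear_vfa(env, x, n_features, actions, n_episodes=500,
                          alpha=0.01, gamma=0.99, epsilon=0.1):
    """Q-learning with a linear Q_hat(s, a; w) = x(s, a)^T w.

    env:     environment with reset() -> s and step(a) -> (s_next, r, done)
             (a gym-like interface, assumed for illustration)
    x:       feature map, x(s, a) -> np.ndarray of length n_features (assumed)
    actions: list of available actions
    """
    w = np.zeros(n_features)

    def q_hat(s, a):
        return x(s, a) @ w

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection from the current Q estimate.
            if np.random.rand() < epsilon:
                a = actions[np.random.randint(len(actions))]
            else:
                a = max(actions, key=lambda act: q_hat(s, act))
            s_next, r, done = env.step(a)
            # Q-learning target bootstraps from max_a' Q_hat(s', a'; w).
            target = r if done else r + gamma * max(q_hat(s_next, act) for act in actions)
            x_sa = x(s, a)
            # Delta w = alpha * (target - Q_hat(s, a; w)) * grad_w Q_hat(s, a; w)
            #         = alpha * (target - x(s, a)^T w) * x(s, a)
            w += alpha * (target - x_sa @ w) * x_sa
            s = s_next
    return w
```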
What You Should Understand