Lecture 4 - Bellman Equations and DP
B. Ravindran
Value Functions (Recall)
The value of a state s under a policy π is the expected
return when starting in s and following π thereafter.
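In the standard notation, with G_t the return from time t and γ the discount factor:

    v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
            = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]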
Bellman Equation for a Policy π
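Written in the standard notation, with p(s', r | s, a) denoting the one-step dynamics, the equation relates the value of a state to the values of its possible successor states:

    v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right],
    \qquad \text{for all } s \in \mathcal{S}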
An Example
❏ Actions: north, south, east, west (deterministic)
❏ If action would take agent off the grid: no move but
reward = –1
❏ Other actions produce reward = 0, except actions that
move agent out of special states A and B as shown.
State-value function for the equiprobable random policy; γ = 0.9
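A minimal Python sketch of this example. The positions of A and B, the states they jump to, and the +10 and +5 rewards are assumed from the standard textbook version of this gridworld (Sutton and Barto, Example 3.5), since the figure itself is not reproduced here; vπ is obtained by solving the Bellman equations as a linear system.

    import numpy as np

    N = 5
    A, A_prime, R_A = (0, 1), (4, 1), 10.0   # assumed layout of special state A
    B, B_prime, R_B = (0, 3), (2, 3), 5.0    # assumed layout of special state B
    gamma = 0.9
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east

    def step(state, action):
        """Deterministic dynamics: returns (next_state, reward)."""
        if state == A:                        # any action from A jumps to A'
            return A_prime, R_A
        if state == B:                        # any action from B jumps to B'
            return B_prime, R_B
        r, c = state[0] + action[0], state[1] + action[1]
        if not (0 <= r < N and 0 <= c < N):   # would leave the grid: no move, reward -1
            return state, -1.0
        return (r, c), 0.0

    states = [(i, j) for i in range(N) for j in range(N)]
    idx = {s: k for k, s in enumerate(states)}
    P = np.zeros((N * N, N * N))
    r_pi = np.zeros(N * N)
    for s in states:
        for a in actions:
            s2, rew = step(s, a)
            P[idx[s], idx[s2]] += 0.25        # equiprobable random policy
            r_pi[idx[s]] += 0.25 * rew

    # v_pi solves v = r_pi + gamma * P v  (the Bellman equation in matrix form)
    v = np.linalg.solve(np.eye(N * N) - gamma * P, r_pi)
    print(np.round(v.reshape(N, N), 1))       # with the assumed layout: about 8.8 at A, 5.3 at B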
Optimal Value Functions
❏ For finite MDPs, policies can be partially ordered:
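Concretely:

    \pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \ \text{for all } s \in \mathcal{S},
    \qquad v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)

For a finite MDP at least one policy is better than or equal to all others; any such policy is an optimal policy π*, and all optimal policies share the same value functions v* and q*.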
Example
[Figure: example grid, with W and E marking west and east]
Bellman Optimality Equation for v*
The value of a state under an optimal policy must equal
the expected return for the best action from that state:
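In symbols:

    v_*(s) = \max_a q_*(s, a)
           = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_*(s') \right]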
Bellman Optimality Equation for q*
The expected return for taking action a in state s and
thereafter following an optimal policy
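Written out:

    q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]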
Why Are Optimal State-Value Functions Useful?
❏ Any policy that is greedy with respect to v* is an
optimal policy.
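Greedy here means a one-step lookahead through the model:

    \pi_*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_*(s') \right]

Because v* already summarizes all future consequences of each action, this locally greedy choice is globally optimal.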
Why Are Optimal Action-Value Functions More Useful?
❏ Given q*, the agent does not even have to do a
one-step-ahead search.
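With q*, even the model is unnecessary; the agent simply selects

    \pi_*(s) = \arg\max_a q_*(s, a)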
Dynamic Programming
❏ DP is the solution method of choice for
MDPs
❏ Requires complete knowledge of system
dynamics (transition matrix and
rewards)
❏ Computationally expensive
❏ Curse of dimensionality
❏ Guaranteed to converge!
Policy Evaluation
❏ For a given policy π, compute the state value
function vπ
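The Bellman equation for vπ is a system of |S| linear equations in the |S| unknowns vπ(s). In matrix form, with r_π the expected one-step reward vector and P_π the state-transition matrix under π:

    v_\pi = r_\pi + \gamma P_\pi v_\pi
    \quad\Longrightarrow\quad
    v_\pi = (I - \gamma P_\pi)^{-1} r_\pi

Solving this directly is possible but costly for large state spaces; iterative policy evaluation, described next, computes vπ by repeated backups instead.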
Iterative Policy Evaluation
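A minimal Python sketch of the algorithm, under two assumed interfaces that are not from the slides: model(s, a) returns a list of (probability, next_state, reward) triples, and policy(s, a) returns π(a|s).

    def iterative_policy_evaluation(states, actions, model, policy,
                                    gamma=0.9, theta=1e-6):
        """Repeatedly sweep the state set, replacing V(s) by its Bellman
        backup under the given policy, until the largest change < theta."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(policy(s, a) * sum(p * (r + gamma * V[s2])
                                               for p, s2, r in model(s, a))
                            for a in actions)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V

This is the in-place variant: freshly updated values are used within the same sweep, which usually converges at least as fast as keeping two separate arrays.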
The Bellman Operator
❏ In the previous algorithm, the update to V(s) can
be interpreted as an operator acting on a
vector V
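Writing the operator out, T^π maps any value vector V to a new vector T^π V with components

    (T^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, V(s') \right]

T^π is a γ-contraction in the max norm, so applying it repeatedly converges to its unique fixed point vπ, which is exactly what iterative policy evaluation does.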
Example of Policy Evaluation
Policy Improvement
❏ Suppose we have computed vπ for an arbitrary
deterministic policy π
❏ Question: For a given state s, would it be better to
choose an action a ≠ π(s)?
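The policy improvement theorem answers this. For deterministic policies π and π':

    q_\pi(s, \pi'(s)) \ge v_\pi(s) \ \text{for all } s
    \quad\Longrightarrow\quad
    v_{\pi'}(s) \ge v_\pi(s) \ \text{for all } s

So if selecting a in s (and following π thereafter) is at least as good as following π from s, switching to a in s permanently cannot hurt.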
Policy Improvement Cont.
Do this for all states to get a new policy that is
greedy with respect to vπ
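That is,

    \pi'(s) = \arg\max_a q_\pi(s, a)
            = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]

By the policy improvement theorem, π' is at least as good as π.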
Policy Improvement Cont.
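If greedification produces no improvement, i.e. vπ' = vπ, then for all s

    v_{\pi'}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_{\pi'}(s') \right]

which is the Bellman optimality equation; hence π' (and π) is already optimal. This observation is what guarantees that policy iteration, described next, stops at an optimal policy.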
Policy Iteration
❏ Repeatedly alternate two steps: policy evaluation (compute vπ for the current policy) and policy improvement (greedification with respect to vπ).
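The resulting sequence of policies and value functions,

    \pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*

(E = evaluation, I = improvement) is monotonically improving, and since a finite MDP has only finitely many deterministic policies, the process terminates at an optimal policy in a finite number of iterations.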
Policy Iteration Algo.
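A minimal Python sketch, reusing the hypothetical model(s, a) interface and the iterative_policy_evaluation routine sketched above:

    def policy_iteration(states, actions, model, gamma=0.9):
        """Alternate full policy evaluation and greedy improvement until stable."""
        pi = {s: actions[0] for s in states}    # arbitrary initial deterministic policy
        while True:
            # Policy evaluation: V approximates v_pi for the current policy.
            V = iterative_policy_evaluation(
                states, actions, model,
                policy=lambda s, a: 1.0 if a == pi[s] else 0.0,
                gamma=gamma)
            # Policy improvement: greedify with respect to V.
            stable = True
            for s in states:
                q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in model(s, a))
                     for a in actions}
                best = max(q, key=q.get)
                if best != pi[s]:
                    pi[s] = best
                    stable = False
            if stable:
                return pi, V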
Value Iteration
❏ Policy evaluation step of policy iteration can be
truncated without losing convergence.
❏ If policy evaluation step is stopped after one
update of each state, we get value iteration
❏ Can also be interpreted as turning the Bellman
optimality equation into an update rule.
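The resulting update rule applies the Bellman optimality backup as an assignment:

    V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, V(s') \right]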
Value Iteration Algo.
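A minimal Python sketch with the same hypothetical model(s, a) interface; the greedy policy is read off once the values have converged:

    def value_iteration(states, actions, model, gamma=0.9, theta=1e-6):
        """Sweep all states with the Bellman optimality backup until the
        largest change < theta, then extract the greedy policy."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model(s, a))
                            for a in actions)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        pi = {s: max(actions,
                     key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in model(s, a)))
              for s in states}
        return pi, V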
Asynchronous DP
❏ A disadvantage of the algorithms discussed so far is that
we have to do the updates over the entire state set
❏ In asynchronous DP, the updates are not done over
the entire state set at each iteration
❏ Have to ensure that every state is visited sufficiently
often for convergence
❏ Gives flexibility to choose order of updates
❏ Can intertwine real time interaction with the
environment and DP updates
❏ Can focus updates on parts of state space relevant
to agent
Real-Time DP (RTDP)
❏ On-policy trajectory-sampling version of
value-iteration algorithm.
❏ Updates values of states visited in the actual
trajectory
1. Take an action in the current state according to the current policy π
2. Update V(s) for the visited state with a value-iteration (Bellman optimality) backup
3. Update π(·|s) to be greedy with respect to the updated V
Generalized Policy Iteration
❏ GPI refers to the idea of letting policy
evaluation and policy improvement interact,
independent of their granularity.
GPI
❏ Almost all RL methods can be viewed as GPI.