
Lecture 4: Bellman Equations and Dynamic Programming

B. Ravindran
Value Functions (Recall)
The value of a state s is the expected return when starting in s and following π thereafter:
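(The defining equation is missing from the extracted slide; in the usual Sutton and Barto notation it is presumably:)

v_\pi(s) \doteq \mathbb{E}_\pi\big[ G_t \mid S_t = s \big] = \mathbb{E}_\pi\Big[ \textstyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\Big|\; S_t = s \Big]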

Bellman Equation for a Policy π

❏ Linear equation in |S| variables
❏ A unique solution exists
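(The equation itself did not survive extraction; presumably it is the standard Bellman expectation equation:)

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \, v_\pi(s') \big] \qquad \forall s \in S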

An Example
❏ Actions: north, south, east, west (deterministic)
❏ If action would take agent off the grid: no move but reward = –1
❏ Other actions produce reward = 0, except actions that move agent out of special states A and B as shown.

State-value function for the equiprobable random policy; γ = 0.9 (figure not reproduced)
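As a rough sketch of how such a value function can be computed, the snippet below solves the |S| linear Bellman equations for the equiprobable random policy with NumPy. Since the figure is not reproduced here, the positions and rewards of the special states follow the standard Sutton and Barto gridworld (A teleports to A' with reward +10, B teleports to B' with reward +5); treat those specifics as assumptions rather than part of the lecture.

import numpy as np

N = 5                                          # 5x5 grid; state index s = row * N + col
gamma = 0.9
A, A_prime, A_reward = (0, 1), (4, 1), 10.0    # assumed from the textbook example
B, B_prime, B_reward = (0, 3), (2, 3), 5.0     # assumed from the textbook example
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east

P = np.zeros((N * N, N * N))   # state-to-state transition matrix under the policy
r = np.zeros(N * N)            # expected immediate reward under the policy

for row in range(N):
    for col in range(N):
        s = row * N + col
        for drow, dcol in actions:             # equiprobable policy: each action has prob. 1/4
            if (row, col) == A:
                (nrow, ncol), reward = A_prime, A_reward
            elif (row, col) == B:
                (nrow, ncol), reward = B_prime, B_reward
            else:
                nrow, ncol = row + drow, col + dcol
                reward = 0.0
                if not (0 <= nrow < N and 0 <= ncol < N):
                    nrow, ncol, reward = row, col, -1.0   # off the grid: stay put, reward -1
            P[s, nrow * N + ncol] += 0.25
            r[s] += 0.25 * reward

# Unique solution of the linear system  v = r + gamma * P v
v = np.linalg.solve(np.eye(N * N) - gamma * P, r)
print(np.round(v.reshape(N, N), 1))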

Optimal Value Functions
❏ For finite MDPs, policies can be partially ordered:

❏ There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all by π*
❏ Optimal policies share the same optimal state-value function:
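(The ordering and the optimal value function are not reproduced in the extract; presumably they are defined in the standard way:)

\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \ \ \forall s \in S

v_*(s) \doteq \max_\pi v_\pi(s) \quad \forall s \in S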

Example


Many optimal policies but only one optimal value function

Bellman Optimality Equation for v*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
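(Presumably the standard Bellman optimality equation, not shown in this extract:)

v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \, v_*(s') \big]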

Bellman Optimality Equation for q*
The expected return for taking action a in state s and thereafter following an optimal policy:
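(Presumably the standard Bellman optimality equation for q*, not shown in this extract:)

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \max_{a'} q_*(s', a') \big]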

Why Are Optimal State-Value Functions Useful?
❏ Any policy that is greedy with respect to v* is an optimal policy.
❏ Therefore, given v*, one-step-lookahead search produces the long-term optimal actions.
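(A hedged reconstruction of the greedy one-step lookahead referred to above:)

\pi_*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \, v_*(s') \big]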

E.g., back to the gridworld:

Why Are Optimal Action-Value Functions More Useful?
❏ Given q*, the agent does not even have to do a one-step-ahead search:
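(Presumably the point shown on the slide is that acting greedily on q* needs no model of the dynamics:)

\pi_*(s) = \arg\max_a q_*(s, a)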

Dynamic Programming

Dynamic Programming
❏ DP is the solution method of choice for MDPs
❏ Requires complete knowledge of system dynamics (transition matrix and rewards)
❏ Computationally expensive
❏ Curse of dimensionality
❏ Guaranteed to converge!

Policy Evaluation
❏ For a given policy π, compute the state-value function vπ

❏ Recall Bellman equation for vπ:

❏ A system of |S| simultaneous linear equations
❏ Solve iteratively

Iterative Policy Evaluation
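The algorithm box on this slide is not reproduced in the extract. Below is a minimal NumPy sketch of iterative policy evaluation, assuming tabular arrays P[s, a, s'] (transition probabilities), R[s, a] (expected rewards) and pi[s, a] (action probabilities); these names are illustrative, not from the lecture.

import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Sweep over all states, applying the Bellman expectation backup until the
    largest change in V falls below theta. P: (S, A, S), R: (S, A), pi: (S, A)."""
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        delta = 0.0
        for s in range(S):
            # one-step expected return of each action, then average under pi(.|s)
            q = R[s] + gamma * P[s] @ V          # shape (A,)
            v_new = pi[s] @ q
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                          # in-place (Gauss-Seidel style) update
        if delta < theta:
            return V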

The Bellman Operator
❏ In the previous algorithm, the update to V(s) can be interpreted as an operator acting on a vector V:
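(A hedged reconstruction of the operator, in the same notation as the earlier equations:)

(T^\pi V)(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \, V(s') \big]

Each sweep of the algorithm computes V ← T^π V, and vπ is the fixed point of this operator.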

Example of Policy Evaluation

Policy Improvement
❏ Suppose we have computed vπ for an arbitrary deterministic policy π
❏ Question: For a given state s, would it be better to choose an action a ≠ π(s)?

❏ The value of doing a in state s is:
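(Presumably the standard action-value expression, not shown in the extract:)

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \, v_\pi(s') \big]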

❏ It is better to switch to action a for state s if and only if:
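(Presumably the missing condition is:)

q_\pi(s, a) > v_\pi(s)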

Policy Improvement Cont.
Do this for all states to get a new policy that is greedy with respect to vπ:
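(A hedged reconstruction of the greedy policy:)

\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \, v_\pi(s') \big]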

Policy Improvement Cont.

Policy Iteration

Policy evaluation ⇄ policy improvement (greedification) (diagram not reproduced)

Policy Iteration Algo.
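The algorithm box is not reproduced here. The following is a minimal NumPy sketch of policy iteration under the same assumed P[s, a, s'] and R[s, a] arrays as the evaluation sketch above; solving the linear system exactly in the evaluation step is one valid instantiation, not necessarily the one on the slide.

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate full policy evaluation and greedy improvement until the policy is stable."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)              # deterministic policy: one action index per state
    while True:
        # policy evaluation: solve the linear system for the current deterministic policy
        P_pi = P[np.arange(S), policy]           # (S, S)
        r_pi = R[np.arange(S), policy]           # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # policy improvement (greedification)
        Q = R + gamma * P @ V                    # (S, A) one-step lookahead values
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # policy stable: it is optimal
            return policy, V
        policy = new_policy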

Value Iteration
❏ Policy evaluation step of policy iteration can be truncated without losing convergence.
❏ If the policy evaluation step is stopped after one update of each state, we get value iteration
❏ Can also be interpreted as turning the Bellman optimality equation into an update rule:
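(Presumably the update rule is the standard one:)

v_{k+1}(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma \, v_k(s') \big]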

Value iteration Algo.
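The algorithm box is not reproduced here; below is a minimal NumPy sketch of value iteration with the same assumed P[s, a, s'] and R[s, a] arrays as in the earlier sketches.

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Repeated Bellman optimality backups; returns V approximating v* and a greedy policy."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V                    # (S, A) one-step lookahead values
        V_new = Q.max(axis=1)                    # best action value at each state
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)       # near-optimal values and a greedy policy
        V = V_new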

Asynchronous DP
❏ A disadvantage of the algorithms discussed so far is that updates are done over the entire state set
❏ In asynchronous DP, the updates are not done over the entire state set at each iteration
❏ Have to ensure that every state is visited sufficiently often for convergence
❏ Gives flexibility to choose the order of updates
❏ Can intertwine real-time interaction with the environment and DP updates
❏ Can focus updates on parts of the state space relevant to the agent
Real-Time DP (RTDP)
❏ On-policy trajectory-sampling version of the value-iteration algorithm.
❏ Updates values of the states visited in the actual trajectory, as sketched below:
1. Take action according to π
2. Update Vπ(s)
3. Update π(a|s)

❏ Unlike asynchronous DP, no requirement to update every state infinitely often.
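A minimal sketch of the per-trajectory loop, assuming a model (P[s, a, s'], R[s, a]) for the backups and an environment object whose reset() returns a start state and whose step(a) returns (next_state, done); these interfaces are assumptions, not part of the lecture.

import numpy as np

def rtdp_episode(env, P, R, V, gamma=0.9):
    """Run one trajectory, backing up only the states actually visited."""
    s = env.reset()
    done = False
    while not done:
        q = R[s] + gamma * P[s] @ V    # full Bellman optimality backup at the visited state
        V[s] = q.max()                 # update V(s)
        a = int(q.argmax())            # act greedily w.r.t. the current V
        s, done = env.step(a)          # take the action, observe the next state
    return V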

Generalized Policy Iteration
❏ GPI refers to the idea of letting policy evaluation and policy improvement interact, independent of their granularity.

GPI
❏ Almost all RL methods can be viewed as GPI.

❏ Policy iteration has evaluation running to completion before improvement begins.
❏ In value iteration, only one step of evaluation is done before the improvement step.
❏ In asynchronous DP, the two are interleaved at a finer granularity.

