Lecture 4 - Bellman Equations and DP
B. Ravindran
Value Functions (Recall)
The value of a state s under a policy π is the expected
return when starting in s and following π thereafter.
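In the standard notation, with G_t the return from time t and γ the discount factor:

    v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
            = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]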
Bellman Equation for a Policy π
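Written in the standard notation, with p(s', r | s, a) denoting the one-step dynamics, the equation relates the value of a state to the values of its possible successor states:

    v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right],
    \qquad \text{for all } s \in \mathcal{S}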
An Example
❏ Actions: north, south, east, west (deterministic)
❏ If action would take agent off the grid: no move but
reward = –1
❏ Other actions produce reward = 0, except actions that
move agent out of special states A and B as shown.
State-value function for the equiprobable random policy; γ = 0.9
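A minimal Python sketch of this example. The positions of A and B, the states they jump to, and the +10 and +5 rewards are assumed from the standard textbook version of this gridworld (Sutton and Barto, Example 3.5), since the figure itself is not reproduced here; vπ is obtained by solving the Bellman equations as a linear system.

    import numpy as np

    N = 5
    A, A_prime, R_A = (0, 1), (4, 1), 10.0   # assumed layout of special state A
    B, B_prime, R_B = (0, 3), (2, 3), 5.0    # assumed layout of special state B
    gamma = 0.9
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east

    def step(state, action):
        """Deterministic dynamics: returns (next_state, reward)."""
        if state == A:                        # any action from A jumps to A'
            return A_prime, R_A
        if state == B:                        # any action from B jumps to B'
            return B_prime, R_B
        r, c = state[0] + action[0], state[1] + action[1]
        if not (0 <= r < N and 0 <= c < N):   # would leave the grid: no move, reward -1
            return state, -1.0
        return (r, c), 0.0

    states = [(i, j) for i in range(N) for j in range(N)]
    idx = {s: k for k, s in enumerate(states)}
    P = np.zeros((N * N, N * N))
    r_pi = np.zeros(N * N)
    for s in states:
        for a in actions:
            s2, rew = step(s, a)
            P[idx[s], idx[s2]] += 0.25        # equiprobable random policy
            r_pi[idx[s]] += 0.25 * rew

    # v_pi solves v = r_pi + gamma * P v  (the Bellman equation in matrix form)
    v = np.linalg.solve(np.eye(N * N) - gamma * P, r_pi)
    print(np.round(v.reshape(N, N), 1))       # with the assumed layout: about 8.8 at A, 5.3 at B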
Optimal Value Functions
❏ For finite MDPs, policies can be partially ordered:
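Concretely:

    \pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \ \text{for all } s \in \mathcal{S},
    \qquad v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)

For a finite MDP at least one policy is better than or equal to all others; any such policy is an optimal policy π*, and all optimal policies share the same value functions v* and q*.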
Example
[Figure: example grid, with W and E marking west and east]
Bellman Optimality Equation for v*
The value of a state under an optimal policy must equal
the expected return for the best action from that state:
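In symbols:

    v_*(s) = \max_a q_*(s, a)
           = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_*(s') \right]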
Bellman Optimality Equation for q*
The expected return for taking action a in state s and
thereafter following an optimal policy
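Written out:

    q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]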
Why Are Optimal State-Value Functions Useful?
❏ Any policy that is greedy with respect to v* is an
optimal policy.
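Greedy here means a one-step lookahead through the model:

    \pi_*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_*(s') \right]

Because v* already summarizes all future consequences of each action, this locally greedy choice is globally optimal.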
Why Are Optimal Action-Value Functions More Useful?
❏ Given q*, the agent does not even have to do a
one-step-ahead search.
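With q*, even the model is unnecessary; the agent simply selects

    \pi_*(s) = \arg\max_a q_*(s, a)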
Dynamic Programming
❏ DP is the solution method of choice for
MDPs
❏ Requires complete knowledge of system
dynamics (transition matrix and
rewards)
❏ Computationally expensive
❏ Curse of dimensionality
❏ Guaranteed to converge!
Policy Evaluation
❏ For a given policy π, compute the state value
function vπ
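The Bellman equation for vπ is a system of |S| linear equations in the |S| unknowns vπ(s). In matrix form, with r_π the expected one-step reward vector and P_π the state-transition matrix under π:

    v_\pi = r_\pi + \gamma P_\pi v_\pi
    \quad\Longrightarrow\quad
    v_\pi = (I - \gamma P_\pi)^{-1} r_\pi

Solving this directly is possible but costly for large state spaces; iterative policy evaluation, described next, computes vπ by repeated backups instead.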
Iterative Policy Evaluation
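A minimal Python sketch of the algorithm, under two assumed interfaces that are not from the slides: model(s, a) returns a list of (probability, next_state, reward) triples, and policy(s, a) returns π(a|s).

    def iterative_policy_evaluation(states, actions, model, policy,
                                    gamma=0.9, theta=1e-6):
        """Repeatedly sweep the state set, replacing V(s) by its Bellman
        backup under the given policy, until the largest change < theta."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(policy(s, a) * sum(p * (r + gamma * V[s2])
                                               for p, s2, r in model(s, a))
                            for a in actions)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V

This is the in-place variant: freshly updated values are used within the same sweep, which usually converges at least as fast as keeping two separate arrays.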
The Bellman Operator
❏ In the previous algorithm, the update to V(s) can
be interpreted as an operator acting on a
vector V
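Writing the operator out, T^π maps any value vector V to a new vector T^π V with components

    (T^\pi V)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, V(s') \right]

T^π is a γ-contraction in the max norm, so applying it repeatedly converges to its unique fixed point vπ, which is exactly what iterative policy evaluation does.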
Example of Policy Evaluation
Policy Improvement
❏ Suppose we have computed vπ for an arbitrary
deterministic policy π
❏ Question: For a given state s, would it be better to
choose an action a ≠ π(s)?
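The policy improvement theorem answers this. For deterministic policies π and π':

    q_\pi(s, \pi'(s)) \ge v_\pi(s) \ \text{for all } s
    \quad\Longrightarrow\quad
    v_{\pi'}(s) \ge v_\pi(s) \ \text{for all } s

So if selecting a in s (and following π thereafter) is at least as good as following π from s, switching to a in s permanently cannot hurt.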
Policy Improvement Cont.
Do this for all states to get a new policy that is
greedy with respect to vπ
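That is,

    \pi'(s) = \arg\max_a q_\pi(s, a)
            = \arg\max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]

By the policy improvement theorem, π' is at least as good as π.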
Policy Improvement Cont.
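If greedification produces no improvement, i.e. vπ' = vπ, then for all s

    v_{\pi'}(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_{\pi'}(s') \right]

which is the Bellman optimality equation; hence π' (and π) is already optimal. This observation is what guarantees that policy iteration, described next, stops at an optimal policy.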
Policy Iteration
❏ Repeatedly alternate two steps: policy evaluation (compute vπ for the current policy) and policy improvement (greedification with respect to vπ).
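The resulting sequence of policies and value functions,

    \pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*

(E = evaluation, I = improvement) is monotonically improving, and since a finite MDP has only finitely many deterministic policies, the process terminates at an optimal policy in a finite number of iterations.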
Policy Iteration Algo.
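A minimal Python sketch, reusing the hypothetical model(s, a) interface and the iterative_policy_evaluation routine sketched above:

    def policy_iteration(states, actions, model, gamma=0.9):
        """Alternate full policy evaluation and greedy improvement until stable."""
        pi = {s: actions[0] for s in states}    # arbitrary initial deterministic policy
        while True:
            # Policy evaluation: V approximates v_pi for the current policy.
            V = iterative_policy_evaluation(
                states, actions, model,
                policy=lambda s, a: 1.0 if a == pi[s] else 0.0,
                gamma=gamma)
            # Policy improvement: greedify with respect to V.
            stable = True
            for s in states:
                q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in model(s, a))
                     for a in actions}
                best = max(q, key=q.get)
                if best != pi[s]:
                    pi[s] = best
                    stable = False
            if stable:
                return pi, V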
Value Iteration
❏ Policy evaluation step of policy iteration can be
truncated without losing convergence.
❏ If policy evaluation step is stopped after one
update of each state, we get value iteration
❏ Can also be interpreted as turning the Bellman
optimality equation into an update rule.
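The resulting update rule applies the Bellman optimality backup as an assignment:

    V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, V(s') \right]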
Value Iteration Algo.
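A minimal Python sketch with the same hypothetical model(s, a) interface; the greedy policy is read off once the values have converged:

    def value_iteration(states, actions, model, gamma=0.9, theta=1e-6):
        """Sweep all states with the Bellman optimality backup until the
        largest change < theta, then extract the greedy policy."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model(s, a))
                            for a in actions)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        pi = {s: max(actions,
                     key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in model(s, a)))
              for s in states}
        return pi, V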
Asynchronous DP
❏ A disadvantage of the algorithms discussed so far is that
we have to do the updates over the entire state set
❏ In asynchronous DP, the updates are not done over
the entire state set at each iteration
❏ Have to ensure that every state is visited sufficiently
often for convergence
❏ Gives flexibility to choose order of updates
❏ Can intertwine real time interaction with the
environment and DP updates
❏ Can focus updates on parts of state space relevant
to agent
Real-Time DP (RTDP)
❏ On-policy trajectory-sampling version of
value-iteration algorithm.
❏ Updates values of states visited in the actual
trajectory
1. Take an action in the current state according to the current policy π
2. Update V(s) for the visited state with a value-iteration (Bellman optimality) backup
3. Update π(·|s) to be greedy with respect to the updated V
Generalized Policy Iteration
❏ GPI refers to the idea of letting policy
evaluation and policy improvement interact,
independent of their granularity.
GPI
❏ Almost all RL methods can be viewed as GPI.