Ar514 MDP
Radhe Shyam Sharma
Assistant Professor
Centre for Artificial Intelligence and Robotics
Indian Institute of Technology Mandi
Mandi, Himachal Pradesh - 175075, India
Dynamics of MDP
\[
\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s)
\]
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
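As a concrete illustration (not taken from the slides), the four-argument dynamics function p(s′, r | s, a) can be stored as a nested dictionary and the normalization above checked numerically; the two-state MDP below is purely a made-up example.

```python
# A made-up two-state MDP: p[s][a] maps (next_state, reward) -> probability.
p = {
    "s0": {
        "stay": {("s0", 0.0): 1.0},
        "go":   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    },
    "s1": {
        "stay": {("s1", 0.0): 1.0},
        "go":   {("s0", -1.0): 1.0},
    },
}

# Verify that p(., . | s, a) is a proper distribution for every state-action pair.
for s, actions in p.items():
    for a, outcomes in actions.items():
        total = sum(outcomes.values())
        assert abs(total - 1.0) < 1e-12, (s, a, total)
print("sum over (s', r) of p(s', r | s, a) equals 1 for every (s, a)")
```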
Policy
A policy π(a | s) gives the probability that the agent selects action a when in state s; it specifies how the agent behaves.
Value function
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\
         &= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \text{for all } s \in \mathcal{S}
\end{align*}
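To make the return concrete, here is a small sketch (not from the slides) that computes a finite-horizon discounted return Gt = Σk γ^k R_{t+k+1} from a list of sample rewards; the reward values are arbitrary.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```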
Action Value function
\begin{align*}
q_\pi(s, a) &= \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] \\
            &= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]
\end{align*}
Recursive Relationship between the Value of a State and the Values of its Successor States
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\
&= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \mid S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
&= \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[r + \gamma \mathbb{E}_\pi\left[G_{t+1} \mid S_{t+1} = s'\right]\right] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right], \quad \text{for all } s \in \mathcal{S}
\end{align*}
Bellman Equation
It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
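A minimal sketch of this backup for a single state, assuming (this representation is not from the slides) that policy[s] maps each action to π(a | s) and dynamics[(s, a)] is a list of (probability, next_state, reward) triples:

```python
def bellman_backup(s, V, policy, dynamics, gamma=0.9):
    """Right-hand side of the Bellman expectation equation for state s.

    policy[s]: dict mapping action -> pi(a | s).
    dynamics[(s, a)]: list of (probability, next_state, reward) triples.
    Both formats are assumptions made for this sketch.
    """
    value = 0.0
    for a, pi_a in policy[s].items():
        for prob, s_next, r in dynamics[(s, a)]:
            value += pi_a * prob * (r + gamma * V[s_next])
    return value
```

Iterative policy evaluation, described later, simply applies this backup repeatedly to every state.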
Optimal Policy and Value functions
A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states, i.e.,
\[
\pi \geq \pi' \quad \text{if and only if} \quad v_\pi(s) \geq v_{\pi'}(s), \ \text{for all } s \in \mathcal{S}.
\]
There is always at least one policy that is better than or equal to all other policies; this is an optimal policy (π∗). There may be more than one. They all share the same state-value function (v∗).
Backup Diagram for vπ
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
Backup Diagram for v∗
\[
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right], \quad \text{for all } s \in \mathcal{S}
\]
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
Policy Evaluation
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]
\end{align*}
Iterative Solution
The Bellman equation for vπ can be solved by successive approximation: starting from an arbitrary v0, each sweep applies the update
\[
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right]
\]
to every state until the values stop changing.
Iterative policy evaluation
For estimating V ≈ vπ

Input: π, the policy to be evaluated
Choose a small threshold θ > 0 determining the accuracy of estimation
Initialize V(s) arbitrarily for all s ∈ S⁺, except V(terminal) = 0

Loop:
    δ = 0
    Loop for each s ∈ S:
        v = V(s)
        V(s) = Σ_a π(a | s) Σ_{s′,r} p(s′, r | s, a) [r + γ V(s′)]
        δ = max(δ, |v − V(s)|)
until δ < θ
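The pseudocode above translates almost line by line into code. Below is a minimal sketch under the same assumed representation used earlier (policy[s] as a dict of action probabilities, dynamics[(s, a)] as (probability, next_state, reward) triples); terminal states are simply absent from V and contribute value 0.

```python
def policy_evaluation(states, policy, dynamics, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: returns V approximating v_pi."""
    V = {s: 0.0 for s in states}                  # arbitrary initialization
    while True:
        delta = 0.0
        for s in states:                          # one sweep over the state set
            v_old = V[s]
            V[s] = sum(pi_a * prob * (r + gamma * V.get(s_next, 0.0))
                       for a, pi_a in policy[s].items()
                       for prob, s_next, r in dynamics[(s, a)])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                         # stop once a sweep barely changes V
            return V
```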
Iterative Policy Evaluation: Gridworld Example
The reward is −1 for all transitions until the terminal state is reached. Any action that would take the agent off the grid leaves the state unchanged.

p(6, −1 | 5, right) = ?
p(7, −1 | 7, right) = ?
p(10, −1 | 5, right) = ?
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
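To answer the questions above, the gridworld dynamics can be written out explicitly. The sketch below assumes the standard 4×4 gridworld from Sutton and Barto, with cells numbered 0–15 row by row so that the non-terminal states are 1–14 and the two terminal corners are 0 and 15, and reads "rt" as the action "right".

```python
# 4x4 gridworld: cells 0..15 row-major, 0 and 15 terminal, reward -1 everywhere.
ACTIONS = {"up": -4, "down": 4, "left": -1, "right": 1}

def next_state(s, a):
    """Deterministic successor; moves off the grid leave the state unchanged."""
    row, col = divmod(s, 4)
    if (a == "up" and row == 0) or (a == "down" and row == 3):
        return s
    if (a == "left" and col == 0) or (a == "right" and col == 3):
        return s
    return s + ACTIONS[a]

def p(s_next, r, s, a):
    """p(s', r | s, a) for this deterministic gridworld."""
    return 1.0 if (s_next == next_state(s, a) and r == -1) else 0.0

print(p(6, -1, 5, "right"))    # 1.0: moving right from 5 lands in 6
print(p(7, -1, 7, "right"))    # 1.0: 7 is on the right edge, so the agent stays put
print(p(10, -1, 5, "right"))   # 0.0: 10 is not reachable from 5 in one right move
```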
Policy improvement
Let π and π′ be any pair of deterministic policies such that, for all s ∈ S,
\[
q_\pi(s, \pi'(s)) \geq v_\pi(s).
\]
Then the policy π′ must be as good as, or better than, π:
\[
v_{\pi'}(s) \geq v_\pi(s), \quad \text{for all } s \in \mathcal{S}.
\]
Proof
Starting from vπ(s) ≤ qπ(s, π′(s)) and repeatedly expanding qπ while applying the assumption,
\begin{align*}
v_\pi(s) &\leq q_\pi(s, \pi'(s)) \\
&\;\;\vdots \\
&\leq \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \,\middle|\, S_t = s\right] \\
&= v_{\pi'}(s).
\end{align*}
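The theorem suggests acting greedily with respect to vπ. A minimal sketch of that greedy improvement step, reusing the assumed dynamics format from the earlier sketches:

```python
def greedy_policy(states, actions, dynamics, V, gamma=0.9):
    """Deterministic policy that is greedy with respect to the value estimate V."""
    policy = {}
    for s in states:
        # argmax over a of sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
        policy[s] = max(
            actions,
            key=lambda a: sum(prob * (r + gamma * V.get(s_next, 0.0))
                              for prob, s_next, r in dynamics[(s, a)]),
        )
    return policy
```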
Policy Iteration
For estimating π ≈ π∗

Initialization
    V(s) ∈ ℝ and π(s) ∈ A(s), arbitrarily, for all s ∈ S

Policy Evaluation
    Choose a small threshold θ > 0 determining the accuracy of estimation
    Loop:
        δ = 0
        Loop for each s ∈ S:
            v = V(s)
            V(s) = Σ_{s′,r} p(s′, r | s, π(s)) [r + γ V(s′)]
            δ = max(δ, |v − V(s)|)
    until δ < θ

Policy Improvement
    policy-stable = true
    Loop for each s ∈ S:
        old-action = π(s)
        π(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [r + γ V(s′)]
        if old-action ≠ π(s): policy-stable = false
    if policy-stable: stop and return V, π
    else: go to Policy Evaluation
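Chaining the two steps gives policy iteration. The sketch below reuses the policy_evaluation and greedy_policy functions sketched earlier (both are assumed helpers, not code from the slides) and represents a deterministic policy as a one-hot action-probability dict so the evaluator can consume it:

```python
def policy_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    """Policy iteration for estimating pi ~ pi* and V ~ v*."""
    # Arbitrary initial deterministic policy: always take the first listed action.
    policy = {s: {actions[0]: 1.0} for s in states}
    while True:
        V = policy_evaluation(states, policy, dynamics, gamma, theta)   # evaluation
        improved = greedy_policy(states, actions, dynamics, V, gamma)   # improvement
        policy_stable = all(policy[s] == {improved[s]: 1.0} for s in states)
        policy = {s: {improved[s]: 1.0} for s in states}
        if policy_stable:
            return V, policy
```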
Value Iteration
For estimating π ≈ π∗

Initialize V(s) arbitrarily for all s ∈ S⁺, except V(terminal) = 0
Choose a small threshold θ > 0 determining the accuracy of estimation

Loop:
    δ = 0
    Loop for each s ∈ S:
        v = V(s)
        V(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ V(s′)]
        δ = max(δ, |v − V(s)|)
until δ < θ

Output a deterministic policy π ≈ π∗ such that
    π(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [r + γ V(s′)]
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
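The same loop in code: a minimal sketch of value iteration under the assumed representations used throughout these examples, with greedy_policy (from the policy improvement sketch) extracting the final policy.

```python
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    """Value iteration: returns V approximating v* and a greedy policy for it."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # Bellman optimality backup: best expected one-step target over actions.
            V[s] = max(sum(prob * (r + gamma * V.get(s_next, 0.0))
                           for prob, s_next, r in dynamics[(s, a)])
                       for a in actions)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return V, greedy_policy(states, actions, dynamics, V, gamma)
```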
Generalized Policy Iteration
Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement interact: evaluation makes the value function consistent with the current policy, improvement makes the policy greedy with respect to the current value function, and together they drive both toward optimality.
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
Monte Carlo Approach
Monte Carlo methods estimate value functions from experience alone: they average the sample returns observed after visits to each state over complete episodes, and so require no model of the environment's dynamics.
Example
The first-visit MC method for estimating vπ (s)
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
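The algorithm box is given in Sutton and Barto; below is a minimal sketch of first-visit Monte Carlo prediction, assuming (as an input format chosen for this example) that each episode is a list of (state, reward) pairs where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate v_pi(s) as the average return following the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards, accumulating G = R + gamma * G.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = reward + gamma * g
            # Record G only at the first visit to this state within the episode.
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```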
References
Sutton, Richard S., and Andrew G. Barto., “Reinforcement learning: An introduction” MIT
press,, 2018.
Thank you