
AR514: Vision and Learning based Control

Radhe Shyam Sharma

Assistant Professor
Centre for Artificial Intelligence and Robotics
Indian Institute of Technology Mandi
Mandi, Himachal Pradesh - 175075, India

Dynamics of MDP

p(s′, r | s, a) = Pr[St = s′, Rt = r | St−1 = s, At−1 = a],

for all s′, s ∈ S, r ∈ R, a ∈ A(s)
A(s): the set of all actions available in state s

Σ_{s′ ∈ S} Σ_{r ∈ R} p(s′, r | s, a) = 1,  for all s ∈ S, a ∈ A(s)
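As an illustrative sketch (an assumption, not from the slides), the four-argument dynamics of a small finite MDP can be stored as a nested dictionary and checked against the normalization condition above; the two-state MDP and its rewards below are made up for the example.

# A minimal sketch: the dynamics p(s', r | s, a) stored as a nested dict
# {state: {action: {(next_state, reward): probability}}}. The two-state MDP
# below is purely illustrative (not from the slides).
P = {
    "s0": {
        "go":   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
        "stay": {("s0", 0.0): 1.0},
    },
    "s1": {
        "go":   {("s0", -1.0): 1.0},
        "stay": {("s1", 0.0): 1.0},
    },
}

# Normalization check: for every (s, a), probabilities over (s', r) sum to 1.
for s, actions in P.items():
    for a, outcomes in actions.items():
        total = sum(outcomes.values())
        assert abs(total - 1.0) < 1e-9, f"p(., . | {s}, {a}) sums to {total}"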

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Policy

A policy is a mapping from states to probabilities of selecting each possible action, written π(a | s).
If the agent is following a policy π at time t, then π(a | s) is the probability that At = a if St = s.
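A tiny illustrative sketch (an assumption, not from the slides): a stochastic policy for the two-state toy MDP sketched earlier, stored as action probabilities and sampled with random.choices.

import random

# pi[s][a] = probability of selecting action a in state s (illustrative values).
pi = {
    "s0": {"go": 0.7, "stay": 0.3},
    "s1": {"go": 0.5, "stay": 0.5},
}

def sample_action(state):
    """Sample At ~ pi(. | St = state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))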

Value function

The state-value function of a state s under a policy π, vπ(s), is the expected return when starting in s and following π thereafter:

vπ(s) = Eπ[Gt | St = s] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s ],  for all s ∈ S
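As a small worked sketch (not from the slides), the return Gt for a finite reward sequence can be computed directly from its definition; the rewards and discount factor below are made-up values.

# Discounted return Gt = Σ_k γ^k R_{t+k+1} for a finite list of future rewards.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0]))  # 1 + 0.9**3 * 5 = 4.645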

Action Value function

The action-value function qπ(s, a) defines the value of taking action a in state s under a policy π: it is the expected return when starting from s, taking the action a, and following π thereafter:

qπ(s, a) = Eπ[Gt | St = s, At = a] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s, At = a ]

Recursive Relationship between the Value of a State and the Values of its Successor States

vπ(s) = Eπ[Gt | St = s]
      = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s ]
      = Eπ[ Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + … | St = s ]
      = Eπ[ Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + …) | St = s ]
      = Eπ[ Rt+1 + γGt+1 | St = s ]
      = Σ_a π(a|s) Σ_{s′} Σ_r p(s′, r | s, a) [ r + γ Eπ[Gt+1 | St+1 = s′] ]
      = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ],  for all s ∈ S
Bellman Equation

It gives a relationship between the value of a state and the values of its successor states:

vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ],  for all s ∈ S

It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
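A minimal sketch of this backup as code (an illustration, reusing the made-up nested-dict P and pi from the earlier sketches):

# One Bellman backup for a single state s:
#   Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')]
def bellman_backup(s, V, P, pi, gamma=0.9):
    total = 0.0
    for a, p_a in pi[s].items():
        for (s_next, r), p in P[s][a].items():
            total += p_a * p * (r + gamma * V[s_next])
    return total

V0 = {"s0": 0.0, "s1": 0.0}          # an arbitrary value estimate
print(bellman_backup("s0", V0, P, pi))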

Optimal Policy and Value functions
A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states, i.e.,

π ≥ π′  if and only if  vπ(s) ≥ vπ′(s) for all s ∈ S.

There is always at least one policy that is better than or equal to all other policies; this is an optimal policy (π∗). There may be more than one, but they all share the same state-value function (v∗):

v∗(s) = max_π vπ(s),  for all s ∈ S

Backup Diagram for vπ

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Backup Diagram for v∗

v∗(s) = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ v∗(s′) ],  for all s ∈ S

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Policy Evaluation

To compute the state-value function vπ for an arbitrary policy π.

vπ(s) = Eπ[Gt | St = s]
vπ(s) = Eπ[ Rt+1 + γGt+1 | St = s ]
vπ(s) = Eπ[ Rt+1 + γvπ(St+1) | St = s ]
vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]

If the environment's dynamics are completely known, then the equation above is a system of simultaneous linear equations with unknowns vπ(s), s ∈ S. In principle, its solution is straightforward.
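A minimal sketch of that direct solution (an illustration, not from the slides): for a fixed policy, collapse the dynamics into a state-transition matrix Pπ and an expected-reward vector rπ, then solve vπ = rπ + γ Pπ vπ with NumPy; the two-state numbers below are made up.

import numpy as np

gamma = 0.9
P_pi = np.array([[0.2, 0.8],      # P_pi[i, j] = Pr(next state j | state i) under pi
                 [1.0, 0.0]])
r_pi = np.array([0.8, -1.0])      # expected immediate reward from each state

# v_pi = r_pi + gamma * P_pi @ v_pi  <=>  (I - gamma * P_pi) v_pi = r_pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)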

Iterative Solution

We can obtain each successive approximation by using the Bellman equation for vπ as an update rule:

vk+1(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vk(s′) ]

Choose the initial approximation arbitrarily, except that the terminal state (if any) must be given value 0.
The sequence {vk} can be shown, in general, to converge to vπ as k → ∞.

Iterative policy evaluation
For estimating V ≈ vπ

Input: π, the policy to be evaluated
Choose a small threshold θ > 0 determining the accuracy of estimation
Initialize V(s) arbitrarily, for all s ∈ S+, except V(terminal) = 0

Loop:
    δ = 0
    Loop for each s ∈ S:
        v = V(s)
        V(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ V(s′) ]
        δ = max(δ, |v − V(s)|)
until δ < θ
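A compact Python sketch of this sweep, reusing the nested-dict dynamics P and policy pi from the earlier made-up examples (all names are illustrative):

# Iterative policy evaluation over the nested-dict MDP (P, pi) sketched above.
def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}    # arbitrary init; a terminal state would keep V = 0
    while True:
        delta = 0.0
        for s in P:
            v_old = V[s]
            V[s] = sum(p_a * p * (r + gamma * V[s_next])
                       for a, p_a in pi[s].items()
                       for (s_next, r), p in P[s][a].items())
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V

print(policy_evaluation(P, pi))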
Iterative Policy Evaluation

The reward is −1 for all transitions until the terminal state is reached.
An action that would take the agent off the grid leaves the state unchanged.

p(6, −1 | 5, right) = ?
p(7, −1 | 7, right) = ?
p(10, −1 | 5, right) = ?
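Assuming the slide refers to the standard 4×4 gridworld of Sutton and Barto (Example 4.1: nonterminal states numbered 1–14 row by row, terminal states at the two corners, deterministic moves), a short sketch of its dynamics answers the three queries:

# Sketch of the assumed 4x4 gridworld dynamics: deterministic moves, reward -1
# on every transition, and an off-grid move leaves the state unchanged.
ROWS = COLS = 4
TERMINALS = {(0, 0), (3, 3)}
CELLS = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in TERMINALS]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def p(s_next, reward, s, a):
    """Four-argument dynamics p(s', r | s, a); states numbered 1..14, 0 = terminal."""
    row, col = CELLS[s - 1]
    dr, dc = MOVES[a]
    nr, nc = row + dr, col + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS):      # off the grid: stay put
        nr, nc = row, col
    landed = 0 if (nr, nc) in TERMINALS else CELLS.index((nr, nc)) + 1
    return 1.0 if (reward == -1 and s_next == landed) else 0.0

print(p(6, -1, 5, "right"))    # 1.0: right from state 5 reaches state 6
print(p(7, -1, 7, "right"))    # 1.0: state 7 is on the right edge, so it stays put
print(p(10, -1, 5, "right"))   # 0.0: right from state 5 cannot reach state 10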

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Policy improvement

Let π and π′ be any pair of deterministic policies such that, for all s ∈ S,

qπ(s, π′(s)) ≥ vπ(s)

Then the policy π′ must be as good as, or better than, π, i.e.,

vπ′(s) ≥ vπ(s)

That is, π′ obtains greater or equal expected return from all states s ∈ S.
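A minimal sketch of greedy policy improvement over the nested-dict MDP used in the earlier sketches (illustrative, with qπ computed by one-step lookahead on a value estimate V):

# Greedy improvement: pi'(s) = argmax_a Σ_{s',r} p(s',r|s,a) [r + γ V(s')].
def greedy_policy(P, V, gamma=0.9):
    pi_new = {}
    for s in P:
        def q(a):
            return sum(p * (r + gamma * V[s_next])
                       for (s_next, r), p in P[s][a].items())
        best = max(P[s], key=q)
        pi_new[s] = {best: 1.0}       # deterministic policy stored as a dict
    return pi_new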

Proof

vπ(s) ≤ qπ(s, π′(s))
      = E[ Rt+1 + γ vπ(St+1) | St = s, At = π′(s) ]
      = Eπ′[ Rt+1 + γ vπ(St+1) | St = s ]
      ≤ Eπ′[ Rt+1 + γ qπ(St+1, π′(St+1)) | St = s ]
      = Eπ′[ Rt+1 + γ Eπ′[ Rt+2 + γ vπ(St+2) | St+1, At+1 = π′(St+1) ] | St = s ]
      = Eπ′[ Rt+1 + γ Rt+2 + γ² vπ(St+2) | St = s ]
      ≤ Eπ′[ Rt+1 + γ Rt+2 + γ² Rt+3 + γ³ vπ(St+3) | St = s ]
      ⋮
      ≤ Eπ′[ Rt+1 + γ Rt+2 + γ² Rt+3 + γ³ Rt+4 + … | St = s ]
      = vπ′(s)
Policy Iteration
For estimating π ≈ π∗

Initialization
    V(s) ∈ ℝ and π(s) ∈ A(s) arbitrarily, for all s ∈ S

Policy Evaluation
    Choose a small threshold θ > 0 determining the accuracy of estimation
    Loop:
        δ = 0
        Loop for each s ∈ S:
            v = V(s)
            V(s) = Σ_{s′,r} p(s′, r | s, π(s)) [ r + γ V(s′) ]
            δ = max(δ, |v − V(s)|)
    until δ < θ


Policy Iteration

Policy Improvement
    policy-stable = true
    Loop for each s ∈ S:
        old-action = π(s)
        π(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [ r + γ V(s′) ]
        if old-action ≠ π(s):
            policy-stable = false
    if policy-stable:
        stop and return V, π
    else:
        go to Policy Evaluation
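A compact sketch of the full loop, combining the policy_evaluation and greedy_policy sketches above (all names and the starting policy are illustrative assumptions):

# Policy iteration: alternate evaluation and greedy improvement until stable.
def policy_iteration(P, gamma=0.9, theta=1e-8):
    pi = {s: {next(iter(P[s])): 1.0} for s in P}   # arbitrary deterministic start
    while True:
        V = policy_evaluation(P, pi, gamma, theta)
        pi_new = greedy_policy(P, V, gamma)
        if pi_new == pi:                           # policy stable: done
            return V, pi
        pi = pi_new

print(policy_iteration(P))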

Value Iteration
For estimating π ≈ π∗

Initialization: Initialize V(s) arbitrarily, for all s ∈ S+, except V(terminal) = 0
Choose a small threshold θ > 0 determining the accuracy of estimation

Loop:
    δ = 0
    Loop for each s ∈ S:
        v = V(s)
        V(s) = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ V(s′) ]
        δ = max(δ, |v − V(s)|)
until δ < θ


Value Iteration

Output a deterministic policy π ≈ π∗ such that

π(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [ r + γ V(s′) ]
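A minimal value-iteration sketch over the same nested-dict MDP, reusing greedy_policy from the earlier sketch to extract the output policy (illustrative):

# Value iteration: sweep with the max-backup, then read off the greedy policy.
def value_iteration(P, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_old = V[s]
            V[s] = max(sum(p * (r + gamma * V[s_next])
                           for (s_next, r), p in P[s][a].items())
                       for a in P[s])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V, greedy_policy(P, V, gamma)

print(value_iteration(P))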

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Generalized Policy Iteration

Generalized policy iteration refers to the general idea of letting policy evaluation and policy improvement interact, each repeatedly driving the value function and the policy toward consistency with the other.

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Monte Carlo Approach

Visit to s: each occurrence of state s in an episode is called a visit to s.
The first-visit MC method estimates vπ(s) as the average of the returns following first visits to s.
The every-visit MC method averages the returns following all visits to s.

Example

The first-visit MC method for estimating vπ (s)

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
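A minimal Python sketch of first-visit MC prediction (an illustration, not the book's pseudocode verbatim), assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state:

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate v_pi(s) as the average return following the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        # Return following each time step, computed by scanning backwards.
        G = 0.0
        G_after = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            G_after[t] = G
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(G_after[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Made-up episodes over two states "a" and "b".
episodes = [[("a", 0.0), ("b", 1.0)], [("b", 1.0)]]
print(first_visit_mc(episodes, gamma=0.9))   # {'a': 0.9, 'b': 1.0}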
References

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Thank you

