Ar514 MDP
Radhe Shyam Sharma
Assistant Professor
Centre for Artificial Intelligence and Robotics
Indian Institute of Technology Mandi
Mandi, Himachal Pradesh - 175075, India
Dynamics of MDP
\[
\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s)
\]
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
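As a concrete illustration (not taken from the slides), the four-argument dynamics function p(s′, r | s, a) can be stored as a nested dictionary and the normalization above checked numerically; the two-state MDP below is purely a made-up example.

```python
# A made-up two-state MDP: p[s][a] maps (next_state, reward) -> probability.
p = {
    "s0": {
        "stay": {("s0", 0.0): 1.0},
        "go":   {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    },
    "s1": {
        "stay": {("s1", 0.0): 1.0},
        "go":   {("s0", -1.0): 1.0},
    },
}

# Verify that p(., . | s, a) is a proper distribution for every state-action pair.
for s, actions in p.items():
    for a, outcomes in actions.items():
        total = sum(outcomes.values())
        assert abs(total - 1.0) < 1e-12, (s, a, total)
print("sum over (s', r) of p(s', r | s, a) equals 1 for every (s, a)")
```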
Policy
A policy π(a | s) gives the probability that the agent selects action a when in state s; it specifies how the agent behaves.
Value function
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\
         &= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \text{for all } s \in \mathcal{S}
\end{align*}
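To make the return concrete, here is a small sketch (not from the slides) that computes a finite-horizon discounted return Gt = Σk γ^k R_{t+k+1} from a list of sample rewards; the reward values are arbitrary.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```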
Action Value function
\begin{align*}
q_\pi(s, a) &= \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] \\
            &= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]
\end{align*}
Recursive Relationship between the Value of a State and the Values of its Successor States
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\
&= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \mid S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
&= \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[r + \gamma \mathbb{E}_\pi\left[G_{t+1} \mid S_{t+1} = s'\right]\right] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right], \quad \text{for all } s \in \mathcal{S}
\end{align*}
Bellman Equation
It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
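A minimal sketch of this backup for a single state, assuming (this representation is not from the slides) that policy[s] maps each action to π(a | s) and dynamics[(s, a)] is a list of (probability, next_state, reward) triples:

```python
def bellman_backup(s, V, policy, dynamics, gamma=0.9):
    """Right-hand side of the Bellman expectation equation for state s.

    policy[s]: dict mapping action -> pi(a | s).
    dynamics[(s, a)]: list of (probability, next_state, reward) triples.
    Both formats are assumptions made for this sketch.
    """
    value = 0.0
    for a, pi_a in policy[s].items():
        for prob, s_next, r in dynamics[(s, a)]:
            value += pi_a * prob * (r + gamma * V[s_next])
    return value
```

Iterative policy evaluation, described later, simply applies this backup repeatedly to every state.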
Optimal Policy and Value functions
A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states, i.e.,
\[
\pi \geq \pi' \quad \text{if and only if} \quad v_\pi(s) \geq v_{\pi'}(s), \ \text{for all } s \in \mathcal{S}.
\]
There is always at least one policy that is better than or equal to all other policies; this is an optimal policy (π∗). There may be more than one. They all share the same state-value function (v∗).
Backup Diagram for vπ
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
Backup Diagram for v∗
\[
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right], \quad \text{for all } s \in \mathcal{S}
\]
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
Policy Evaluation
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
&= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]
\end{align*}
Iterative Solution
The Bellman equation for vπ can be solved by successive approximation: starting from an arbitrary v0, each sweep applies the update
\[
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_k(s')\right]
\]
to every state until the values stop changing.
Iterative policy evaluation
For estimating V ≈ vπ

Input: π, the policy to be evaluated
Choose a small threshold θ > 0 determining the accuracy of estimation
Initialize V(s) arbitrarily for all s ∈ S⁺, except V(terminal) = 0

Loop:
    δ = 0
    Loop for each s ∈ S:
        v = V(s)
        V(s) = Σ_a π(a | s) Σ_{s′,r} p(s′, r | s, a) [r + γ V(s′)]
        δ = max(δ, |v − V(s)|)
until δ < θ
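The pseudocode above translates almost line by line into code. Below is a minimal sketch under the same assumed representation used earlier (policy[s] as a dict of action probabilities, dynamics[(s, a)] as (probability, next_state, reward) triples); terminal states are simply absent from V and contribute value 0.

```python
def policy_evaluation(states, policy, dynamics, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: returns V approximating v_pi."""
    V = {s: 0.0 for s in states}                  # arbitrary initialization
    while True:
        delta = 0.0
        for s in states:                          # one sweep over the state set
            v_old = V[s]
            V[s] = sum(pi_a * prob * (r + gamma * V.get(s_next, 0.0))
                       for a, pi_a in policy[s].items()
                       for prob, s_next, r in dynamics[(s, a)])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                         # stop once a sweep barely changes V
            return V
```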
Iterative Policy Evaluation: Gridworld Example
The reward is −1 for all transitions until the terminal state is reached. Any action that would take the agent off the grid leaves the state unchanged.

p(6, −1 | 5, right) = ?
p(7, −1 | 7, right) = ?
p(10, −1 | 5, right) = ?
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
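To answer the questions above, the gridworld dynamics can be written out explicitly. The sketch below assumes the standard 4×4 gridworld from Sutton and Barto, with cells numbered 0–15 row by row so that the non-terminal states are 1–14 and the two terminal corners are 0 and 15, and reads "rt" as the action "right".

```python
# 4x4 gridworld: cells 0..15 row-major, 0 and 15 terminal, reward -1 everywhere.
ACTIONS = {"up": -4, "down": 4, "left": -1, "right": 1}

def next_state(s, a):
    """Deterministic successor; moves off the grid leave the state unchanged."""
    row, col = divmod(s, 4)
    if (a == "up" and row == 0) or (a == "down" and row == 3):
        return s
    if (a == "left" and col == 0) or (a == "right" and col == 3):
        return s
    return s + ACTIONS[a]

def p(s_next, r, s, a):
    """p(s', r | s, a) for this deterministic gridworld."""
    return 1.0 if (s_next == next_state(s, a) and r == -1) else 0.0

print(p(6, -1, 5, "right"))    # 1.0: moving right from 5 lands in 6
print(p(7, -1, 7, "right"))    # 1.0: 7 is on the right edge, so the agent stays put
print(p(10, -1, 5, "right"))   # 0.0: 10 is not reachable from 5 in one right move
```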
Policy improvement
Let π and π′ be any pair of deterministic policies such that, for all s ∈ S,
\[
q_\pi(s, \pi'(s)) \geq v_\pi(s).
\]
Then the policy π′ must be as good as, or better than, π:
\[
v_{\pi'}(s) \geq v_\pi(s), \quad \text{for all } s \in \mathcal{S}.
\]
Proof
Starting from vπ(s) ≤ qπ(s, π′(s)) and repeatedly expanding qπ while applying the assumption,
\begin{align*}
v_\pi(s) &\leq q_\pi(s, \pi'(s)) \\
&\;\;\vdots \\
&\leq \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \,\middle|\, S_t = s\right] \\
&= v_{\pi'}(s).
\end{align*}
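The theorem suggests acting greedily with respect to vπ. A minimal sketch of that greedy improvement step, reusing the assumed dynamics format from the earlier sketches:

```python
def greedy_policy(states, actions, dynamics, V, gamma=0.9):
    """Deterministic policy that is greedy with respect to the value estimate V."""
    policy = {}
    for s in states:
        # argmax over a of sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
        policy[s] = max(
            actions,
            key=lambda a: sum(prob * (r + gamma * V.get(s_next, 0.0))
                              for prob, s_next, r in dynamics[(s, a)]),
        )
    return policy
```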
Policy Iteration
For estimating π ≈ π∗

Initialization
    V(s) ∈ ℝ and π(s) ∈ A(s), arbitrarily, for all s ∈ S

Policy Evaluation
    Choose a small threshold θ > 0 determining the accuracy of estimation
    Loop:
        δ = 0
        Loop for each s ∈ S:
            v = V(s)
            V(s) = Σ_{s′,r} p(s′, r | s, π(s)) [r + γ V(s′)]
            δ = max(δ, |v − V(s)|)
    until δ < θ

Policy Improvement
    policy-stable = true
    Loop for each s ∈ S:
        old-action = π(s)
        π(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [r + γ V(s′)]
        if old-action ≠ π(s): policy-stable = false
    if policy-stable: stop and return V, π
    else: go to Policy Evaluation
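Chaining the two steps gives policy iteration. The sketch below reuses the policy_evaluation and greedy_policy functions sketched earlier (both are assumed helpers, not code from the slides) and represents a deterministic policy as a one-hot action-probability dict so the evaluator can consume it:

```python
def policy_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    """Policy iteration for estimating pi ~ pi* and V ~ v*."""
    # Arbitrary initial deterministic policy: always take the first listed action.
    policy = {s: {actions[0]: 1.0} for s in states}
    while True:
        V = policy_evaluation(states, policy, dynamics, gamma, theta)   # evaluation
        improved = greedy_policy(states, actions, dynamics, V, gamma)   # improvement
        policy_stable = all(policy[s] == {improved[s]: 1.0} for s in states)
        policy = {s: {improved[s]: 1.0} for s in states}
        if policy_stable:
            return V, policy
```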
Value Iteration
For estimating π ≈ π∗

Initialize V(s) arbitrarily for all s ∈ S⁺, except V(terminal) = 0
Choose a small threshold θ > 0 determining the accuracy of estimation

Loop:
    δ = 0
    Loop for each s ∈ S:
        v = V(s)
        V(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ V(s′)]
        δ = max(δ, |v − V(s)|)
until δ < θ

Output a deterministic policy π ≈ π∗ such that
    π(s) = argmax_a Σ_{s′,r} p(s′, r | s, a) [r + γ V(s′)]
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
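The same loop in code: a minimal sketch of value iteration under the assumed representations used throughout these examples, with greedy_policy (from the policy improvement sketch) extracting the final policy.

```python
def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
    """Value iteration: returns V approximating v* and a greedy policy for it."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # Bellman optimality backup: best expected one-step target over actions.
            V[s] = max(sum(prob * (r + gamma * V.get(s_next, 0.0))
                           for prob, s_next, r in dynamics[(s, a)])
                       for a in actions)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return V, greedy_policy(states, actions, dynamics, V, gamma)
```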
Generalized Policy Iteration
Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement interact: evaluation makes the value function consistent with the current policy, improvement makes the policy greedy with respect to the current value function, and together they drive both toward optimality.
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
Monte Carlo Approach
Monte Carlo methods estimate value functions from experience alone: they average the sample returns observed after visits to each state over complete episodes, and so require no model of the environment's dynamics.
Example
The first-visit MC method for estimating vπ (s)
Sutton, Richard S., and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
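The algorithm box is given in Sutton and Barto; below is a minimal sketch of first-visit Monte Carlo prediction, assuming (as an input format chosen for this example) that each episode is a list of (state, reward) pairs where the reward is the one received on leaving that state:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate v_pi(s) as the average return following the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards, accumulating G = R + gamma * G.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = reward + gamma * g
            # Record G only at the first visit to this state within the episode.
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```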
References
Sutton, Richard S., and Andrew G. Barto., “Reinforcement learning: An introduction” MIT
press,, 2018.
Thank you