Announcements
Assignments:
P3: Optimization; due today, 10 pm
HW7 (online): due 10/22 Tue, 10 pm
Recitations canceled on October 18 (Mid-Semester Break) and October 25 (Day for Community Engagement)
We will provide a recitation worksheet (reference for the midterm/final)
Piazza post for In-class Questions
AI: Representation and Problem Solving
Markov Decision Processes II
Instructors: Fei Fang & Pat Virtue
Slide credits: CMU AI and http://ai.berkeley.edu
Learning Objectives
• Write the Bellman equations for state values and Q-values, both for the optimal policy and for a given policy
• Describe and implement the value iteration algorithm (via Bellman updates) for solving MDPs
• Describe and implement the policy iteration algorithm (via policy evaluation and policy improvement) for solving MDPs
• Understand convergence of value iteration and policy iteration
• Understand the concepts of exploration, exploitation, and regret
MDP Notation
Standard expectimax:   V(s) = max_a Σ_{s'} P(s'|s,a) V(s')

Bellman equations:     V(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')]

Value iteration:       V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')], ∀s

Q-iteration:           Q_{k+1}(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q_k(s',a')], ∀s,a

Policy extraction:     π_V(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')], ∀s

Policy evaluation:     V^π_{k+1}(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')], ∀s

Policy improvement:    π_new(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_old}(s')], ∀s
Example: Grid World
Goal: maximize sum of (discounted) rewards
MDP Quantities
Markov decision processes:
  States S
  Actions A
  Transitions P(s'|s,a) (or T(s,a,s'))
  Rewards R(s,a,s') (and discount γ)
  Start state s0

MDP quantities:
  Policy = map of states to actions
  Utility = sum of (discounted) rewards
  (State) Value = expected utility starting from a state (max node)
  Q-Value = expected utility starting from a state-action pair, i.e., q-state (chance node)
MDP Optimal Quantities
The optimal policy:
  π*(s) = optimal action from state s

The (true) value (or utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally

The (true) value (or utility) of a q-state (s,a):
  Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

V*(s) < +∞ if γ < 1 and R(s,a,s') < ∞

Solve MDP: Find π*, V*, and/or Q*
[Demo: gridworld values (L9D1)]
Piazza Poll 1
Which ones are true about the optimal policy π*(s), true values V*(s), and true Q-values Q*(s,a)?

A: π*(s) = argmax_a V*(s'), where s' = argmax_{s''} V*(s,a,s'')
   (Here V*(s,a,s'') represents V*(s'') where s'' is reachable through (s,a); not a standard notation.)
B: π*(s) = argmax_a Q*(s,a)
C: V*(s) = max_a Q*(s,a)

Piazza Poll 1
A is false; B and C are true.
  π*(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]  ≠  argmax_a V*(s')
  π*(s) = argmax_a Q*(s,a)
Computing Optimal Policy from Values
Computing Optimal Policy from Values
Let's imagine we have the optimal values V*(s)

How should we act?
  We need to do a mini-expectimax (one step):
    π*(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Sometimes this is called policy extraction, since it gets the policy implied by the values
Computing Optimal Policy from Q-Values
Let's imagine we have the optimal q-values Q*(s,a)

How should we act?
  Completely trivial to decide: π*(s) = argmax_a Q*(s,a)

Important lesson: actions are easier to select from q-values than from values!
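To make the two extraction rules concrete, here is a minimal Python sketch. The model format (P[s][a] as a list of (next state, probability) pairs, a reward function R(s, a, s')) and the function names are illustrative assumptions, not the course starter code.

```python
# Minimal policy-extraction sketch (assumed model format, not course code):
# states, actions: iterables; P[s][a]: list of (s_next, prob) pairs;
# R(s, a, s_next): reward; gamma: discount; V, Q: dicts of computed values.

def extract_policy_from_values(states, actions, P, R, gamma, V):
    """One-step lookahead: pick the action with the best expected backup."""
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(prob * (R(s, a, s2) + gamma * V[s2])
                              for s2, prob in P[s][a]),
        )
    return policy

def extract_policy_from_q(states, actions, Q):
    """From Q-values, action selection is a plain argmax; no model needed."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```

Note how the V-based version needs the transition and reward model, while the Q-based version does not; that is the "important lesson" above.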
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
The Bellman Equations
Definition of "optimal utility" leads to the Bellman equations, which characterize the relationship among optimal utility values:

  V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Expectimax-like computation
Necessary and sufficient conditions for optimality
Solution is unique
The Bellman Equations
Definition of "optimal utility" leads to the Bellman equations, which characterize the relationship among optimal utility values:

  V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Expectimax-like computation, with one-step lookahead and a "perfect" heuristic at the leaf nodes
Necessary and sufficient conditions for optimality
Solution is unique
Solving Expectimax
Solving MDP
Limited Lookahead
Value Iteration
Demo Value Iteration
[Demo: value iteration (L8D6)]
Value Iteration
Start with V0(s) = 0: no time steps left means an expected reward sum of zero

Given a vector of Vk(s) values, apply the Bellman update once (do one ply of expectimax with R and γ from each state):
  V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')]

Repeat until convergence

Will this process converge? Yes!
Piazza Poll 2
What is the complexity of each iteration in Value Iteration?
S -- set of states; A -- set of actions
  I:   O(|S||A|)
  II:  O(|S|²|A|)
  III: O(|S||A|²)
  IV:  O(|S|²|A|²)
  V:   O(|S|²)
Piazza Poll 2
What is the complexity of each iteration in Value Iteration?
S -- set of states; A -- set of actions
Answer: II -- O(|S|²|A|). Each Bellman-update sweep loops over all |S| states, all |A| actions, and all |S| possible next states.
Value Iteration
function VALUE-ITERATION(MDP = (S, A, T, R, γ), threshold) returns a state value function
  for s in S
    V_0(s) ← 0
  k ← 0
  repeat
    δ ← 0
    for s in S
      V_{k+1}(s) ← −∞
      for a in A
        v ← 0
        for s' in S
          v ← v + T(s,a,s') (R(s,a,s') + γ V_k(s'))
        V_{k+1}(s) ← max{V_{k+1}(s), v}
      δ ← max{δ, |V_{k+1}(s) − V_k(s)|}
    k ← k + 1
  until δ < threshold
  return V_{k−1}

Do we really need to store the value of V_k for each k?
  No. Keep only two vectors: V = V_last and V' = V_current.

Does V_{k+1}(s) ≥ V_k(s) always hold?
  No. If, for every action a, T(s,a,s') = 1 and R(s,a,s') < 0, then V_1(s) = max_a R(s,a,s') < 0 = V_0(s).
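Below is a runnable Python sketch of the same loop, keeping only two value vectors. The dictionary-style model (P[s][a] as (next state, probability) pairs, reward function R) is an assumed format carried over from the policy-extraction sketch, not the course starter code.

```python
# Value iteration sketch (assumed model format, not course code):
# states, actions: iterables; P[s][a]: list of (s_next, prob) pairs;
# R(s, a, s_next): reward; gamma: discount in [0, 1).

def value_iteration(states, actions, P, R, gamma, threshold=1e-6):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    while True:
        V_new, delta = {}, 0.0
        for s in states:
            # Bellman update: max over actions of the expected one-step backup
            V_new[s] = max(
                sum(prob * (R(s, a, s2) + gamma * V[s2]) for s2, prob in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new                                     # only two vectors ever stored
        if delta < threshold:
            return V
```

Unlike the pseudocode above, this sketch returns the most recent vector; within the threshold the two differ negligibly.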
Bellman Equation vs Value Iteration vs Bellman Update
Bellman equations characterize the optimal values:
  V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Value iteration computes them by applying the Bellman update repeatedly
  Value iteration is a method for solving the Bellman equations
  V_k vectors are also interpretable as time-limited values

Value iteration finds the fixed point of the function
  f(V)(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V(s')]
Value Iteration Convergence
How do we know the V_k vectors are going to converge?

Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values

Case 2: If γ < 1 and R(s,a,s') ≤ R_max < ∞
  Intuition: for any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results (with R and γ) in nearly identical search trees
  The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  Each value at the last layer of the V_{k+1} tree is at most R_max in magnitude, and everything that far out is discounted by γ^k
    |V_1(s) − V_0(s)| = |V_1(s) − 0| ≤ R_max
    |V_{k+1}(s) − V_k(s)| ≤ γ^k R_max
  So |V_{k+1}(s) − V_k(s)| → 0 as k → ∞, and the values converge

If we initialized V_0(s) differently, what would happen?
  Value iteration still converges to V*(s) as long as |V_0(s)| < +∞, but possibly more slowly
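The same intuition can be packaged as a contraction argument; the Bellman-update operator B below is an added sketch, not notation from the slides.

```latex
\begin{align*}
(BV)(s) &= \max_a \textstyle\sum_{s'} P(s'\mid s,a)\,[R(s,a,s') + \gamma V(s')] \\
|(BV)(s) - (BV')(s)| &\le \gamma \max_{s''} |V(s'') - V'(s'')|
  && \text{(max and expectation are non-expansive)} \\
\|BV - BV'\|_\infty &\le \gamma \|V - V'\|_\infty
  && \text{(contraction, since } \gamma < 1\text{)} \\
\|V_k - V^*\|_\infty &\le \gamma^k \|V_0 - V^*\|_\infty \to 0
  && \text{(because } BV^* = V^*\text{), for any bounded } V_0
\end{align*}
```

The last line is why any bounded initialization still converges to V*.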
Other ways to solve Bellman Equation?
Treat 𝑉 ∗ 𝑠 as variables
Solve Bellman Equation through Linear Programming
Other ways to solve Bellman Equation?
Treat 𝑉 ∗ 𝑠 as variables
Solve Bellman Equation through Linear Programming
  min_{V*}  Σ_s V*(s)
  s.t.      V*(s) ≥ Σ_{s'} T(s,a,s') [R(s,a,s') + γ V*(s')],  ∀s, a
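As a sketch of how this LP could be set up with an off-the-shelf solver, here is a version using scipy.optimize.linprog; the dense array inputs T[s,a,s'], Rw[s,a,s'], and the function name are assumptions for illustration.

```python
# LP formulation of the Bellman equations (sketch; assumed dense-array inputs).
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, Rw, gamma):
    """T[s,a,s2]: transition probs, Rw[s,a,s2]: rewards, gamma: discount < 1."""
    n_states, n_actions, _ = T.shape
    c = np.ones(n_states)                    # objective: minimize sum_s V(s)
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # Constraint V(s) >= sum_s' T[s,a,s'] (R[s,a,s'] + gamma V(s')),
            # rewritten for linprog as:
            #   -V(s) + gamma * sum_s' T[s,a,s'] V(s')  <=  -sum_s' T[s,a,s'] R[s,a,s']
            row = gamma * T[s, a, :]
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-np.dot(T[s, a, :], Rw[s, a, :]))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)   # values may be negative
    return res.x                             # optimal state values V*(s)
```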
Policy Iteration for Solving MDPs
Policy Evaluation
Fixed Policies
Do the optimal action vs. do what π says to do
  [Two expectimax-style trees: one maxing over actions a at each state s, one following π(s) at each state]

Expectimax trees max over all actions to compute the optimal values
If we fixed some policy π(s), then the tree would be simpler: only one action per state
  ... though the tree's value would depend on which policy we fixed
Utilities for a Fixed Policy
Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π

Define the utility of a state s, under a fixed policy π:
  V^π(s) = expected total discounted rewards starting in s and following π

Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π(s')]
Compare
MDP Quantities
A policy π: map of states to actions
  The optimal policy π*: π*(s) = optimal action from state s

Value function of a policy, V^π(s): expected utility starting in s and acting according to π
  Optimal value function V*: V*(s) = V^{π*}(s)

Q function of a policy, Q^π(s,a): expected utility starting out having taken action a from state s and (thereafter) acting according to π
  Optimal Q function Q*: Q*(s,a) = Q^{π*}(s,a)

(s is a state; (s,a) is a state-action pair; (s,a,s') is a transition)

Solve MDP: Find π*, V*, and/or Q*
Example: Policy Evaluation
Always Go Right Always Go Forward
Policy Evaluation
How do we calculate the V's for a fixed policy π?

Idea 1: Turn the recursive Bellman equation into updates (like value iteration):
  V^π_{k+1}(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')]
Piazza Poll 3
What is the complexity of each iteration in Policy Evaluation?
S -- set of states; A -- set of actions
  I:   O(|S||A|)
  II:  O(|S|²|A|)
  III: O(|S||A|²)
  IV:  O(|S|²|A|²)
  V:   O(|S|²)
Piazza Poll 3
What is the complexity of each iteration in Policy Evaluation?
S -- set of states; A -- set of actions
Answer: V -- O(|S|²). With the action fixed to π(s), each sweep loops over all |S| states and all |S| possible next states; there is no max over actions.
Policy Evaluation
Idea 2: Bellman Equation w.r.t. a given policy 𝜋 defines a linear system
Solve with your favorite linear system solver
Treat 𝑉 𝜋 𝑠 as variables
How many variables?
How many constraints?
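Both ideas fit in a few lines of NumPy. The array model (T[s,a,s'], Rw[s,a,s'], a deterministic policy array pi[s]) and the function names are illustrative assumptions, not the course starter code.

```python
# Policy evaluation sketch, two ways (assumed dense-array model, not course code).
import numpy as np

def policy_evaluation_iterative(T, Rw, pi, gamma, threshold=1e-8):
    """Idea 1: repeated Bellman updates for the fixed policy (no max over actions)."""
    n = T.shape[0]
    V = np.zeros(n)
    while True:
        V_new = np.array([np.dot(T[s, pi[s]], Rw[s, pi[s]] + gamma * V)
                          for s in range(n)])
        if np.max(np.abs(V_new - V)) < threshold:
            return V_new
        V = V_new

def policy_evaluation_linear(T, Rw, pi, gamma):
    """Idea 2: solve the |S| x |S| linear system (I - gamma T_pi) V = r_pi directly."""
    n = T.shape[0]
    T_pi = np.array([T[s, pi[s]] for s in range(n)])                      # |S| x |S|
    r_pi = np.array([np.dot(T[s, pi[s]], Rw[s, pi[s]]) for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)                # needs gamma < 1
```

With |S| variables and |S| equations, the direct solve costs O(|S|³) once, versus O(|S|²) per sweep for the iterative version.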
Policy Iteration
Problems with Value Iteration
Value iteration repeats the Bellman updates:
  V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')], ∀s

Problem 1: It's slow -- O(|S|²|A|) per iteration
Problem 2: The "max" at each state rarely changes
Problem 3: The policy often converges long before the values
[Demo: value iteration (L9D2)]
[Value iteration demo frames for k = 0 through 12 and k = 100, on Gridworld with Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Iteration
Alternative approach for optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (may
not be optimal!) until convergence
Step 2: Policy improvement: update policy using one-step look-ahead
with resulting converged (but not optimal!) utilities as future values
Repeat steps until policy converges
This is policy iteration
It’s still optimal!
Can converge (much) faster under some conditions
Policy Iteration
Policy evaluation: for the fixed current policy π_i, find values w.r.t. that policy
  Iterate until values converge:
    V^{π_i}_{k+1}(s) = Σ_{s'} P(s'|s,π_i(s)) [R(s,π_i(s),s') + γ V^{π_i}_k(s')]

Policy improvement: for fixed values, get a better policy using one-step look-ahead:
    π_{i+1}(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_i}(s')]
  Similar to how you derive the optimal policy π* given the optimal value V*
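Putting the two steps together, here is a sketch of the full loop, reusing the assumed array model and the policy_evaluation_linear sketch from the policy evaluation section (illustrative names, not course code).

```python
# Policy iteration sketch: evaluate, improve, repeat until the policy is stable.
import numpy as np

def policy_iteration(T, Rw, gamma):
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)                  # arbitrary initial policy
    while True:
        V = policy_evaluation_linear(T, Rw, pi, gamma)  # Step 1: policy evaluation
        # Step 2: policy improvement via one-step lookahead on the evaluated values
        q = np.array([[np.dot(T[s, a], Rw[s, a] + gamma * V)
                       for a in range(n_actions)] for s in range(n_states)])
        pi_new = np.argmax(q, axis=1)
        if np.array_equal(pi_new, pi):                  # policy stable: we are done
            return pi, V
        pi = pi_new
```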
Piazza Poll 4
True/False: V^{π_{i+1}}(s) ≥ V^{π_i}(s), ∀s

Piazza Poll 4
True/False: V^{π_{i+1}}(s) ≥ V^{π_i}(s), ∀s   -- True

  V^{π_i}(s) = Σ_{s'} T(s,π_i(s),s') [R(s,π_i(s),s') + γ V^{π_i}(s')]

  If I take the first step according to π_{i+1} and then follow π_i, I get an expected utility of
    V_1(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V^{π_i}(s')]
  which is ≥ V^{π_i}(s)

  What if I take two steps according to π_{i+1}? Repeating the argument step by step shows the switch never hurts, so V^{π_{i+1}}(s) ≥ V^{π_i}(s)
Comparison
Both value iteration and policy iteration compute the same thing
(all optimal values)
In value iteration:
Every iteration updates both the values and (implicitly) the policy
We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
We do several passes that update utilities with fixed policy (each pass is fast because we
consider only one action, not all of them)
After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
The new policy will be better (or we’re done)
(Both are dynamic programs for solving MDPs)
Summary: MDP Algorithms
So you want to….
Turn values into a policy: use one-step lookahead
Compute optimal values: use value iteration or policy iteration
Compute values for a particular policy: use policy evaluation
These all look the same!
They basically are – they are all variations of Bellman updates
They all use one-step lookahead expectimax fragments
They differ only in whether we plug in a fixed policy or max over actions
MDP Notation
Standard expectimax:   V(s) = max_a Σ_{s'} P(s'|s,a) V(s')

Bellman equations:     V(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')]

Value iteration:       V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')], ∀s

Q-iteration:           Q_{k+1}(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q_k(s',a')], ∀s,a

Policy extraction:     π_V(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')], ∀s

Policy evaluation:     V^π_{k+1}(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')], ∀s

Policy improvement:    π_new(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_old}(s')], ∀s
Double Bandits
Double-Bandit MDP
  No discount; 100 time steps; both states (W and L) have the same value

  States: Win (W), Lose (L)
  Actions: Blue, Red
    Blue: receive $1 with probability 1.0
    Red: receive $2 with probability 0.75, $0 with probability 0.25

Actually a simple MDP where the current state does not impact transition or reward:
  P(s'|s,a) = P(s'|a) and R(s,a,s') = R(a,s')
Offline Planning
No discount; 100 time steps; both states have the same value

Solving MDPs is offline planning
  You determine all quantities through computation
  You need to know the details of the MDP
  You do not actually play the game!

Value of each fixed strategy over 100 steps:
  Play Red:  150
  Play Blue: 100
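The two values follow from the expected per-step rewards; a quick arithmetic check in Python (plain illustration, not course code):

```python
# Expected total reward over 100 steps, no discount, using the slide's probabilities.
play_red = 100 * (0.75 * 2 + 0.25 * 0)   # 150.0
play_blue = 100 * (1.0 * 1)              # 100.0
print(play_red, play_blue)               # 150.0 100.0
```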
Let’s Play!
$2 $2 $0 $2 $2
$2 $2 $0 $0 $0
Online Planning
Rules changed! Red’s win chance is different.
  Blue: receive $1 with probability 1.0
  Red: receive $2 with probability ??, $0 with probability ??
Let’s Play!
$0 $0 $0 $2 $0
$2 $0 $0 $0 $0
What Just Happened?
That wasn’t planning, it was learning!
Specifically, reinforcement learning
There was an MDP, but you couldn’t solve it with just computation
You needed to actually act to figure it out
Important ideas in reinforcement learning that came up
Exploration: you have to try unknown actions to get information
Exploitation: eventually, you have to use what you know
Regret: even if you learn intelligently, you make mistakes
Sampling: because of chance, you have to try things repeatedly
Difficulty: learning can be much harder than solving a known MDP
Next Time: Reinforcement Learning!