Announcements
Assignments:
P3: Optimization; due today, 10 pm
HW7 (online): due 10/22 Tue, 10 pm
Recitations canceled on October 18 (Mid-Semester Break) and October 25 (Day for Community Engagement)
We will provide a recitation worksheet (reference for the midterm/final)
Piazza post for In-class Questions
AI: Representation and Problem Solving
Markov Decision Processes II
Instructors: Fei Fang & Pat Virtue
Slide credits: CMU AI and http://ai.berkeley.edu
Learning Objectives
• Write the Bellman equations for state values and Q-values, both for the optimal policy and for a given policy
• Describe and implement the value iteration algorithm (via Bellman updates) for solving MDPs
• Describe and implement the policy iteration algorithm (via policy evaluation and policy improvement) for solving MDPs
• Understand convergence of value iteration and policy iteration
• Understand the concepts of exploration, exploitation, and regret
MDP Notation
Standard expectimax:   V(s) = max_a Σ_{s'} P(s'|s,a) V(s')

Bellman equations:     V(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')]

Value iteration:       V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')], ∀s

Q-iteration:           Q_{k+1}(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q_k(s',a')], ∀s,a

Policy extraction:     π_V(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')], ∀s

Policy evaluation:     V^π_{k+1}(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')], ∀s

Policy improvement:    π_new(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_old}(s')], ∀s
Example: Grid World
Goal: maximize sum of (discounted) rewards
MDP Quantities
Markov decision processes:
  States S
  Actions A
  Transitions P(s'|s,a) (or T(s,a,s'))
  Rewards R(s,a,s') (and discount γ)
  Start state s0

MDP quantities:
  Policy = map of states to actions
  Utility = sum of (discounted) rewards
  (State) Value = expected utility starting from a state (max node)
  Q-Value = expected utility starting from a state-action pair, i.e., q-state (chance node)
MDP Optimal Quantities
The optimal policy:
  π*(s) = optimal action from state s

The (true) value (or utility) of a state s:
  V*(s) = expected utility starting in s and acting optimally

The (true) value (or utility) of a q-state (s,a):
  Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

V*(s) < +∞ if γ < 1 and R(s,a,s') < ∞

Solve MDP: Find π*, V*, and/or Q*
[Demo: gridworld values (L9D1)]
Piazza Poll 1
Which ones are true about the optimal policy π*(s), true values V*(s), and true Q-values Q*(s,a)?

A: π*(s) = argmax_a V*(s'), where s' = argmax_{s''} V*(s,a,s'')
   (Here V*(s,a,s'') represents V*(s'') where s'' is reachable through (s,a); not a standard notation.)
B: π*(s) = argmax_a Q*(s,a)
C: V*(s) = max_a Q*(s,a)

Piazza Poll 1
A is false; B and C are true.
  π*(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]  ≠  argmax_a V*(s')
  π*(s) = argmax_a Q*(s,a)
Computing Optimal Policy from Values
Computing Optimal Policy from Values
Let's imagine we have the optimal values V*(s)

How should we act?
  We need to do a mini-expectimax (one step):
    π*(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Sometimes this is called policy extraction, since it gets the policy implied by the values
Computing Optimal Policy from Q-Values
Let's imagine we have the optimal q-values Q*(s,a)

How should we act?
  Completely trivial to decide: π*(s) = argmax_a Q*(s,a)

Important lesson: actions are easier to select from q-values than from values!
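To make the two extraction rules concrete, here is a minimal Python sketch. The model format (P[s][a] as a list of (next state, probability) pairs, a reward function R(s, a, s')) and the function names are illustrative assumptions, not the course starter code.

```python
# Minimal policy-extraction sketch (assumed model format, not course code):
# states, actions: iterables; P[s][a]: list of (s_next, prob) pairs;
# R(s, a, s_next): reward; gamma: discount; V, Q: dicts of computed values.

def extract_policy_from_values(states, actions, P, R, gamma, V):
    """One-step lookahead: pick the action with the best expected backup."""
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(prob * (R(s, a, s2) + gamma * V[s2])
                              for s2, prob in P[s][a]),
        )
    return policy

def extract_policy_from_q(states, actions, Q):
    """From Q-values, action selection is a plain argmax; no model needed."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```

Note how the V-based version needs the transition and reward model, while the Q-based version does not; that is the "important lesson" above.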
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
The Bellman Equations
Definition of "optimal utility" leads to the Bellman equations, which characterize the relationship among optimal utility values:

  V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Expectimax-like computation
Necessary and sufficient conditions for optimality
Solution is unique
The Bellman Equations
Definition of "optimal utility" leads to the Bellman equations, which characterize the relationship among optimal utility values:

  V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Expectimax-like computation, with one-step lookahead and a "perfect" heuristic at the leaf nodes
Necessary and sufficient conditions for optimality
Solution is unique
Solving Expectimax
Solving MDP
Limited Lookahead
Value Iteration
Demo Value Iteration
[Demo: value iteration (L8D6)]
Value Iteration
Start with V0(s) = 0: no time steps left means an expected reward sum of zero

Given a vector of Vk(s) values, apply the Bellman update once (do one ply of expectimax with R and γ from each state):
  V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')]

Repeat until convergence

Will this process converge? Yes!
Piazza Poll 2
What is the complexity of each iteration in Value Iteration?
S -- set of states; A -- set of actions
  I:   O(|S||A|)
  II:  O(|S|²|A|)
  III: O(|S||A|²)
  IV:  O(|S|²|A|²)
  V:   O(|S|²)
Piazza Poll 2
What is the complexity of each iteration in Value Iteration?
S -- set of states; A -- set of actions
Answer: II -- O(|S|²|A|). Each Bellman-update sweep loops over all |S| states, all |A| actions, and all |S| possible next states.
Value Iteration
function VALUE-ITERATION(MDP = (S, A, T, R, γ), threshold) returns a state value function
  for s in S
    V_0(s) ← 0
  k ← 0
  repeat
    δ ← 0
    for s in S
      V_{k+1}(s) ← −∞
      for a in A
        v ← 0
        for s' in S
          v ← v + T(s,a,s') (R(s,a,s') + γ V_k(s'))
        V_{k+1}(s) ← max{V_{k+1}(s), v}
      δ ← max{δ, |V_{k+1}(s) − V_k(s)|}
    k ← k + 1
  until δ < threshold
  return V_{k−1}

Do we really need to store the value of V_k for each k?
  No. Keep only two vectors: V = V_last and V' = V_current.

Does V_{k+1}(s) ≥ V_k(s) always hold?
  No. If, for every action a, T(s,a,s') = 1 and R(s,a,s') < 0, then V_1(s) = max_a R(s,a,s') < 0 = V_0(s).
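Below is a runnable Python sketch of the same loop, keeping only two value vectors. The dictionary-style model (P[s][a] as (next state, probability) pairs, reward function R) is an assumed format carried over from the policy-extraction sketch, not the course starter code.

```python
# Value iteration sketch (assumed model format, not course code):
# states, actions: iterables; P[s][a]: list of (s_next, prob) pairs;
# R(s, a, s_next): reward; gamma: discount in [0, 1).

def value_iteration(states, actions, P, R, gamma, threshold=1e-6):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    while True:
        V_new, delta = {}, 0.0
        for s in states:
            # Bellman update: max over actions of the expected one-step backup
            V_new[s] = max(
                sum(prob * (R(s, a, s2) + gamma * V[s2]) for s2, prob in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new                                     # only two vectors ever stored
        if delta < threshold:
            return V
```

Unlike the pseudocode above, this sketch returns the most recent vector; within the threshold the two differ negligibly.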
Bellman Equation vs Value Iteration vs Bellman Update
Bellman equations characterize the optimal values:
  V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Value iteration computes them by applying the Bellman update repeatedly
  Value iteration is a method for solving the Bellman equations
  V_k vectors are also interpretable as time-limited values

Value iteration finds the fixed point of the function
  f(V)(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V(s')]
Value Iteration Convergence
How do we know the V_k vectors are going to converge?

Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values

Case 2: If γ < 1 and R(s,a,s') ≤ R_max < ∞
  Intuition: for any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results (with R and γ) in nearly identical search trees
  The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
  Each value at the last layer of the V_{k+1} tree is at most R_max in magnitude, and everything that far out is discounted by γ^k
    |V_1(s) − V_0(s)| = |V_1(s) − 0| ≤ R_max
    |V_{k+1}(s) − V_k(s)| ≤ γ^k R_max
  So |V_{k+1}(s) − V_k(s)| → 0 as k → ∞, and the values converge

If we initialized V_0(s) differently, what would happen?
  Value iteration still converges to V*(s) as long as |V_0(s)| < +∞, but possibly more slowly
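The same intuition can be packaged as a contraction argument; the Bellman-update operator B below is an added sketch, not notation from the slides.

```latex
\begin{align*}
(BV)(s) &= \max_a \textstyle\sum_{s'} P(s'\mid s,a)\,[R(s,a,s') + \gamma V(s')] \\
|(BV)(s) - (BV')(s)| &\le \gamma \max_{s''} |V(s'') - V'(s'')|
  && \text{(max and expectation are non-expansive)} \\
\|BV - BV'\|_\infty &\le \gamma \|V - V'\|_\infty
  && \text{(contraction, since } \gamma < 1\text{)} \\
\|V_k - V^*\|_\infty &\le \gamma^k \|V_0 - V^*\|_\infty \to 0
  && \text{(because } BV^* = V^*\text{), for any bounded } V_0
\end{align*}
```

The last line is why any bounded initialization still converges to V*.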
Other ways to solve Bellman Equation?
Treat 𝑉 ∗ 𝑠 as variables
Solve Bellman Equation through Linear Programming
Other ways to solve Bellman Equation?
Treat 𝑉 ∗ 𝑠 as variables
Solve Bellman Equation through Linear Programming
  min_{V*}  Σ_s V*(s)
  s.t.      V*(s) ≥ Σ_{s'} T(s,a,s') [R(s,a,s') + γ V*(s')],  ∀s, a
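As a sketch of how this LP could be set up with an off-the-shelf solver, here is a version using scipy.optimize.linprog; the dense array inputs T[s,a,s'], Rw[s,a,s'], and the function name are assumptions for illustration.

```python
# LP formulation of the Bellman equations (sketch; assumed dense-array inputs).
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, Rw, gamma):
    """T[s,a,s2]: transition probs, Rw[s,a,s2]: rewards, gamma: discount < 1."""
    n_states, n_actions, _ = T.shape
    c = np.ones(n_states)                    # objective: minimize sum_s V(s)
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # Constraint V(s) >= sum_s' T[s,a,s'] (R[s,a,s'] + gamma V(s')),
            # rewritten for linprog as:
            #   -V(s) + gamma * sum_s' T[s,a,s'] V(s')  <=  -sum_s' T[s,a,s'] R[s,a,s']
            row = gamma * T[s, a, :]
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-np.dot(T[s, a, :], Rw[s, a, :]))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)   # values may be negative
    return res.x                             # optimal state values V*(s)
```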
Policy Iteration for Solving MDPs
Policy Evaluation
Fixed Policies
Do the optimal action vs. do what π says to do
  [Two expectimax-style trees: one maxing over actions a at each state s, one following π(s) at each state]

Expectimax trees max over all actions to compute the optimal values
If we fixed some policy π(s), then the tree would be simpler: only one action per state
  ... though the tree's value would depend on which policy we fixed
Utilities for a Fixed Policy
Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π

Define the utility of a state s, under a fixed policy π:
  V^π(s) = expected total discounted rewards starting in s and following π

Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π(s')]
Compare
MDP Quantities
A policy π: map of states to actions
  The optimal policy π*: π*(s) = optimal action from state s

Value function of a policy, V^π(s): expected utility starting in s and acting according to π
  Optimal value function V*: V*(s) = V^{π*}(s)

Q function of a policy, Q^π(s,a): expected utility starting out having taken action a from state s and (thereafter) acting according to π
  Optimal Q function Q*: Q*(s,a) = Q^{π*}(s,a)

(s is a state; (s,a) is a state-action pair; (s,a,s') is a transition)

Solve MDP: Find π*, V*, and/or Q*
Example: Policy Evaluation
Always Go Right Always Go Forward
Policy Evaluation
How do we calculate the V's for a fixed policy π?

Idea 1: Turn the recursive Bellman equation into updates (like value iteration):
  V^π_{k+1}(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')]
Piazza Poll 3
What is the complexity of each iteration in Policy Evaluation?
S -- set of states; A -- set of actions
  I:   O(|S||A|)
  II:  O(|S|²|A|)
  III: O(|S||A|²)
  IV:  O(|S|²|A|²)
  V:   O(|S|²)
Piazza Poll 3
What is the complexity of each iteration in Policy Evaluation?
S -- set of states; A -- set of actions
Answer: V -- O(|S|²). With the action fixed to π(s), each sweep loops over all |S| states and all |S| possible next states; there is no max over actions.
Policy Evaluation
Idea 2: Bellman Equation w.r.t. a given policy 𝜋 defines a linear system
Solve with your favorite linear system solver
Treat 𝑉 𝜋 𝑠 as variables
How many variables?
How many constraints?
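Both ideas fit in a few lines of NumPy. The array model (T[s,a,s'], Rw[s,a,s'], a deterministic policy array pi[s]) and the function names are illustrative assumptions, not the course starter code.

```python
# Policy evaluation sketch, two ways (assumed dense-array model, not course code).
import numpy as np

def policy_evaluation_iterative(T, Rw, pi, gamma, threshold=1e-8):
    """Idea 1: repeated Bellman updates for the fixed policy (no max over actions)."""
    n = T.shape[0]
    V = np.zeros(n)
    while True:
        V_new = np.array([np.dot(T[s, pi[s]], Rw[s, pi[s]] + gamma * V)
                          for s in range(n)])
        if np.max(np.abs(V_new - V)) < threshold:
            return V_new
        V = V_new

def policy_evaluation_linear(T, Rw, pi, gamma):
    """Idea 2: solve the |S| x |S| linear system (I - gamma T_pi) V = r_pi directly."""
    n = T.shape[0]
    T_pi = np.array([T[s, pi[s]] for s in range(n)])                      # |S| x |S|
    r_pi = np.array([np.dot(T[s, pi[s]], Rw[s, pi[s]]) for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)                # needs gamma < 1
```

With |S| variables and |S| equations, the direct solve costs O(|S|³) once, versus O(|S|²) per sweep for the iterative version.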
Policy Iteration
Problems with Value Iteration
Value iteration repeats the Bellman updates:
  V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')], ∀s

Problem 1: It's slow -- O(|S|²|A|) per iteration
Problem 2: The "max" at each state rarely changes
Problem 3: The policy often converges long before the values
[Demo: value iteration (L9D2)]
[Value iteration demo frames for k = 0 through 12 and k = 100, on Gridworld with Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Iteration
Alternative approach for optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (may
not be optimal!) until convergence
Step 2: Policy improvement: update policy using one-step look-ahead
with resulting converged (but not optimal!) utilities as future values
Repeat steps until policy converges
This is policy iteration
It’s still optimal!
Can converge (much) faster under some conditions
Policy Iteration
Policy evaluation: for the fixed current policy π_i, find values w.r.t. that policy
  Iterate until values converge:
    V^{π_i}_{k+1}(s) = Σ_{s'} P(s'|s,π_i(s)) [R(s,π_i(s),s') + γ V^{π_i}_k(s')]

Policy improvement: for fixed values, get a better policy using one-step look-ahead:
    π_{i+1}(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_i}(s')]
  Similar to how you derive the optimal policy π* given the optimal value V*
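Putting the two steps together, here is a sketch of the full loop, reusing the assumed array model and the policy_evaluation_linear sketch from the policy evaluation section (illustrative names, not course code).

```python
# Policy iteration sketch: evaluate, improve, repeat until the policy is stable.
import numpy as np

def policy_iteration(T, Rw, gamma):
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)                  # arbitrary initial policy
    while True:
        V = policy_evaluation_linear(T, Rw, pi, gamma)  # Step 1: policy evaluation
        # Step 2: policy improvement via one-step lookahead on the evaluated values
        q = np.array([[np.dot(T[s, a], Rw[s, a] + gamma * V)
                       for a in range(n_actions)] for s in range(n_states)])
        pi_new = np.argmax(q, axis=1)
        if np.array_equal(pi_new, pi):                  # policy stable: we are done
            return pi, V
        pi = pi_new
```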
Piazza Poll 4
True/False: V^{π_{i+1}}(s) ≥ V^{π_i}(s), ∀s

Piazza Poll 4
True/False: V^{π_{i+1}}(s) ≥ V^{π_i}(s), ∀s   -- True

  V^{π_i}(s) = Σ_{s'} T(s,π_i(s),s') [R(s,π_i(s),s') + γ V^{π_i}(s')]

  If I take the first step according to π_{i+1} and then follow π_i, I get an expected utility of
    V_1(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V^{π_i}(s')]
  which is ≥ V^{π_i}(s)

  What if I take two steps according to π_{i+1}? Repeating the argument step by step shows the switch never hurts, so V^{π_{i+1}}(s) ≥ V^{π_i}(s)
Comparison
Both value iteration and policy iteration compute the same thing
(all optimal values)
In value iteration:
Every iteration updates both the values and (implicitly) the policy
We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
We do several passes that update utilities with fixed policy (each pass is fast because we
consider only one action, not all of them)
After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
The new policy will be better (or we’re done)
(Both are dynamic programs for solving MDPs)
Summary: MDP Algorithms
So you want to….
Turn values into a policy: use one-step lookahead
Compute optimal values: use value iteration or policy iteration
Compute values for a particular policy: use policy evaluation
These all look the same!
They basically are – they are all variations of Bellman updates
They all use one-step lookahead expectimax fragments
They differ only in whether we plug in a fixed policy or max over actions
MDP Notation
Standard expectimax:   V(s) = max_a Σ_{s'} P(s'|s,a) V(s')

Bellman equations:     V(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')]

Value iteration:       V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V_k(s')], ∀s

Q-iteration:           Q_{k+1}(s,a) = Σ_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q_k(s',a')], ∀s,a

Policy extraction:     π_V(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V(s')], ∀s

Policy evaluation:     V^π_{k+1}(s) = Σ_{s'} P(s'|s,π(s)) [R(s,π(s),s') + γ V^π_k(s')], ∀s

Policy improvement:    π_new(s) = argmax_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V^{π_old}(s')], ∀s
Double Bandits
Double-Bandit MDP
  No discount; 100 time steps; both states (W and L) have the same value

  States: Win (W), Lose (L)
  Actions: Blue, Red
    Blue: receive $1 with probability 1.0
    Red: receive $2 with probability 0.75, $0 with probability 0.25

Actually a simple MDP where the current state does not impact transition or reward:
  P(s'|s,a) = P(s'|a) and R(s,a,s') = R(a,s')
Offline Planning
No discount; 100 time steps; both states have the same value

Solving MDPs is offline planning
  You determine all quantities through computation
  You need to know the details of the MDP
  You do not actually play the game!

Value of each fixed strategy over 100 steps:
  Play Red:  150
  Play Blue: 100
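The two values follow from the expected per-step rewards; a quick arithmetic check in Python (plain illustration, not course code):

```python
# Expected total reward over 100 steps, no discount, using the slide's probabilities.
play_red = 100 * (0.75 * 2 + 0.25 * 0)   # 150.0
play_blue = 100 * (1.0 * 1)              # 100.0
print(play_red, play_blue)               # 150.0 100.0
```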
Let’s Play!
$2 $2 $0 $2 $2
$2 $2 $0 $0 $0
Online Planning
Rules changed! Red’s win chance is different.
  Blue: receive $1 with probability 1.0
  Red: receive $2 with probability ??, $0 with probability ??
Let’s Play!
$0 $0 $0 $2 $0
$2 $0 $0 $0 $0
What Just Happened?
That wasn’t planning, it was learning!
Specifically, reinforcement learning
There was an MDP, but you couldn’t solve it with just computation
You needed to actually act to figure it out
Important ideas in reinforcement learning that came up
Exploration: you have to try unknown actions to get information
Exploitation: eventually, you have to use what you know
Regret: even if you learn intelligently, you make mistakes
Sampling: because of chance, you have to try things repeatedly
Difficulty: learning can be much harder than solving a known MDP
Next Time: Reinforcement Learning!