SP14 CS188 Lecture 9 - MDPs II
Instructors: Dan Klein and Pieter Abbeel --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Recap: MDPs
Markov decision processes:
States S
Actions A
Transitions P(s'|s,a) (or T(s,a,s'))
Rewards R(s,a,s') (and discount γ)
Start state s0
[Diagram: expectimax-style search tree with state s, action a, q-state (s,a), and transition (s,a,s') leading to s']
Quantities:
Policy = map of states to actions
Utility = sum of (discounted) rewards
Values = expected future utility from a state (max node)
Q-Values = expected future utility from a q-state (chance node)
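To make the recap concrete, here is a minimal sketch of how such an MDP could be encoded in Python. The "racing" style states, actions, probabilities, and rewards are illustrative assumptions, not from the slides; transitions and rewards are bundled as (next_state, probability, reward) triples.

```python
# A minimal sketch of an MDP as plain Python data (all names illustrative).
# T[s][a] is a list of (next_state, probability, reward) triples,
# i.e. it bundles the transition model T(s,a,s') with the rewards R(s,a,s').

GAMMA = 0.9            # discount
START_STATE = "cool"   # s0

T = {
    "cool": {
        "slow": [("cool", 1.0, 1.0)],
        "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
    },
    "warm": {
        "slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
        "fast": [("overheated", 1.0, -10.0)],
    },
    "overheated": {},  # terminal state: no actions available
}

STATES = list(T)

def actions(s):
    """Actions A available in state s."""
    return list(T[s])
```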
Optimal Quantities
The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally
The value (utility) of a q-state (s, a): Q*(s,a) = expected utility starting out having taken action a from state s and thereafter acting optimally
The optimal policy: π*(s) = optimal action from state s
[Diagram: search tree annotated so that s is a state, (s, a) is a q-state, and (s,a,s') is a transition]
Gridworld Values V*
[Figure: gridworld showing the optimal state values V*(s)]
Gridworld: Q*
[Figure: gridworld showing the optimal q-values Q*(s,a)]
Value Iteration
Bellman equations characterize the optimal values:
$V^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]$
Value iteration computes them by repeatedly applying this update, starting from $V_0(s) = 0$:
$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]$
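As a concrete sketch of this update (not from the slides, and reusing the illustrative toy MDP encoding from above, repeated here so the snippet is self-contained), value iteration in Python might look like this:

```python
# A sketch of value iteration on a toy MDP (names and numbers are illustrative).
# T[s][a] is a list of (next_state, probability, reward) triples.

T = {
    "cool": {"slow": [("cool", 1.0, 1.0)],
             "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)]},
    "warm": {"slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
             "fast": [("overheated", 1.0, -10.0)]},
    "overheated": {},          # terminal: no actions, value stays 0
}
GAMMA = 0.9

def value_iteration(T, gamma, tol=1e-6):
    V = {s: 0.0 for s in T}                    # V_0 = 0 everywhere
    while True:
        V_new = {}
        for s in T:
            if not T[s]:                       # terminal state
                V_new[s] = 0.0
                continue
            # Bellman update: max over actions of expected reward + discounted next value
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a])
                for a in T[s]
            )
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new

print(value_iteration(T, GAMMA))
```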
Convergence*
How do we know the V_k vectors are going to converge?
Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values
Case 2: If the discount γ is less than 1
Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) expectimax results in nearly identical search trees
The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
That last layer is at best all R_MAX; it is at worst all R_MIN
But everything is discounted by γ^k that far out
So V_k and V_{k+1} are at most γ^k max|R| different
So as k increases, the values converge
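The bound in this sketch is easy to check numerically. Below is a small, self-contained check (same illustrative toy MDP as above, not from the slides) that prints the sup-norm difference between successive value-iteration iterates next to the γ^k · max|R| bound:

```python
# Numerical sanity check of the convergence sketch (toy MDP, illustrative numbers).
# With V_0 = 0, successive iterates should satisfy  max_s |V_{k+1}(s) - V_k(s)| <= gamma^k * max|R|.

T = {
    "cool": {"slow": [("cool", 1.0, 1.0)],
             "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)]},
    "warm": {"slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
             "fast": [("overheated", 1.0, -10.0)]},
    "overheated": {},
}
GAMMA = 0.9
MAX_ABS_R = max(abs(r) for s in T for a in T[s] for _, _, r in T[s][a])

V = {s: 0.0 for s in T}
for k in range(15):
    # One Bellman backup for every state (terminal states stay at 0)
    V_new = {s: (max(sum(p * (r + GAMMA * V[s2]) for s2, p, r in T[s][a])
                     for a in T[s]) if T[s] else 0.0)
             for s in T}
    diff = max(abs(V_new[s] - V[s]) for s in T)
    print(f"k={k:2d}  diff={diff:8.4f}  bound={GAMMA**k * MAX_ABS_R:8.4f}")
    V = V_new
```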
Policy Methods
Policy Evaluation
Fixed Policies
Do the optimal action
Do what π says to do
[Diagram: left, the full expectimax tree over states s, actions a, and transitions (s,a,s'); right, the simpler tree for a fixed policy, with only the action π(s) at each state s and transitions (s, π(s), s')]
Expectimax trees max over all actions to compute the optimal values
If we fixed some policy π(s), then the tree would be simpler: only one action per state
... though the tree's value would depend on which policy we fixed
Always Go Forward
Policy Evaluation
How do we calculate the V's for a fixed policy π? Turn the recursive Bellman equations into updates (like value iteration):
$V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^{\pi}_k(s') \right]$
[Diagram: tree for a fixed policy, with state s, action π(s), and transitions (s, π(s), s')]
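A minimal sketch of this fixed-policy update in Python (same illustrative toy MDP as before; the particular policy chosen here is arbitrary, just to show that the max over actions disappears):

```python
# A sketch of iterative policy evaluation for a fixed policy pi (toy MDP, illustrative).
# Update: V_pi[k+1](s) <- sum_{s'} T(s, pi(s), s') * ( R(s, pi(s), s') + gamma * V_pi[k](s') )

T = {
    "cool": {"slow": [("cool", 1.0, 1.0)],
             "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)]},
    "warm": {"slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
             "fast": [("overheated", 1.0, -10.0)]},
    "overheated": {},
}
GAMMA = 0.9
pi = {"cool": "fast", "warm": "slow"}      # a fixed (not necessarily optimal) policy

def policy_evaluation(T, pi, gamma, tol=1e-6):
    V = {s: 0.0 for s in T}
    while True:
        V_new = {}
        for s in T:
            if not T[s]:                   # terminal state
                V_new[s] = 0.0
                continue
            a = pi[s]                      # only the policy's action, no max over actions
            V_new[s] = sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a])
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new

print(policy_evaluation(T, pi, GAMMA))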
Policy Extraction
Policy Iteration
[Gridworld snapshots of the k-step values for k = 0 through 12 and k = 100; Noise = 0.2, Discount = 0.9, Living reward = 0]
Policy Iteration
Alternative approach for optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
Repeat steps until policy converges
Policy Iteration
Evaluation: For fixed current policy π, find values with policy evaluation. Iterate until values converge:
$V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} T(s,\pi_i(s),s') \left[ R(s,\pi_i(s),s') + \gamma V^{\pi_i}_k(s') \right]$
Improvement: For fixed values, get a better policy using one-step look-ahead (policy extraction):
$\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi_i}(s') \right]$
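Putting the two steps together, here is a sketch of policy iteration on the same illustrative toy MDP (names and numbers are assumptions, not from the slides): evaluate the current policy, then greedily improve it with a one-step look-ahead, and stop when the policy no longer changes.

```python
# A sketch of policy iteration: alternate policy evaluation with greedy improvement.
# Toy MDP encoding as before (illustrative): T[s][a] = [(next_state, prob, reward), ...]

T = {
    "cool": {"slow": [("cool", 1.0, 1.0)],
             "fast": [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)]},
    "warm": {"slow": [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
             "fast": [("overheated", 1.0, -10.0)]},
    "overheated": {},
}
GAMMA = 0.9

def q_value(T, V, s, a, gamma):
    """Expected reward plus discounted value of taking action a in state s."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a])

def evaluate(T, pi, gamma, tol=1e-6):
    """Iterative policy evaluation for a fixed policy pi."""
    V = {s: 0.0 for s in T}
    while True:
        V_new = {s: (q_value(T, V, s, pi[s], gamma) if T[s] else 0.0) for s in T}
        if max(abs(V_new[s] - V[s]) for s in T) < tol:
            return V_new
        V = V_new

def policy_iteration(T, gamma):
    pi = {s: next(iter(T[s])) for s in T if T[s]}          # arbitrary initial policy
    while True:
        V = evaluate(T, pi, gamma)                         # Step 1: policy evaluation
        new_pi = {s: max(T[s], key=lambda a: q_value(T, V, s, a, gamma))
                  for s in T if T[s]}                      # Step 2: one-step improvement
        if new_pi == pi:                                   # policy converged: done
            return pi, V
        pi = new_pi

print(policy_iteration(T, GAMMA))
```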
Comparison
Both value iteration and policy iteration compute the same thing (all optimal values)
In value iteration:
Every iteration updates both the values and (implicitly) the policy
We don't track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
The new policy will be better (or we're done)
Double Bandits
Double-Bandit MDP
Actions: Blue, Red
States: Win, Lose
[MDP diagram: two states, W and L; the Blue action pays $1 with probability 1.0, and the Red action pays $2 with probability 0.75 and $0 with probability 0.25]
No discount
100 time steps
Both states have the same value
Offline Planning
No discount
100 time steps
Both states have the same value
Value: Play Red = 150, Play Blue = 100
[MDP diagram as above: Blue pays $1 with probability 1.0; Red pays $2 with probability 0.75 and $0 with probability 0.25]
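A quick check of the Play Red = 150 and Play Blue = 100 values, using the payoffs shown in the diagram:

$\mathbb{E}[\text{red pull}] = 0.75 \cdot \$2 + 0.25 \cdot \$0 = \$1.50 \;\Rightarrow\; 100 \text{ pulls} \rightarrow 100 \cdot 1.50 = \$150$
$\mathbb{E}[\text{blue pull}] = 1.0 \cdot \$1 = \$1.00 \;\Rightarrow\; 100 \text{ pulls} \rightarrow 100 \cdot 1.00 = \$100$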
Let's Play!
$2  $2  $2  $2  $0  $0  $2  $0  $2  $0
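For comparison with a sample run like the one above, here is a tiny simulation sketch (assuming the payoff probabilities from the earlier diagram: red pays $2 with probability 0.75, else $0; blue always pays $1):

```python
# A sketch of simulating 100 pulls of each bandit (payoff probabilities assumed from the diagram).
import random

def pull(arm):
    """Sample one payout from the named arm."""
    if arm == "blue":
        return 1.0                                       # blue always pays $1
    return 2.0 if random.random() < 0.75 else 0.0        # red: $2 w.p. 0.75, else $0

random.seed(0)  # fixed seed so the sketch is reproducible
for arm in ("red", "blue"):
    total = sum(pull(arm) for _ in range(100))
    print(f"{arm}: total winnings over 100 pulls = ${total:.0f}")
```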
Online Planning
Rules changed! Red's win chance is different.
[MDP diagram as before, but Red's payout probabilities are now unknown, shown as ??; Blue still pays $1 with probability 1.0]
Let's Play!
$0  $2  $0  $0  $0  $0  $2  $0  $0  $0