06 MDP
2015 AlphaGo
[Deepmind]
Andrey Markov
(1856-1922)
Rewards and Utility
Discounted Rewards and Utility
Stochastic World
o In a stochastic environment, we define the utility of a policy as the expected sum of (discounted) rewards obtained by following it.
o Under a given policy, the utility of a state can also be written as its expected immediate reward plus the discounted utility of its successor state, where the successor is drawn from the transition probability distribution.
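o In symbols (a sketch, written with the transition model T(s,a,s'), reward R(s,a,s'), and discount γ that appear in the MDP recap later in the deck):

    V^{\pi}(s) \;=\; \sum_{s'} T\bigl(s, \pi(s), s'\bigr)\,\bigl[\, R\bigl(s, \pi(s), s'\bigr) \;+\; \gamma\, V^{\pi}(s') \,\bigr]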
Example: Racing
o A robot car wants to travel far, quickly
o Three states: Cool, Warm, Overheated
o Two actions: Slow, Fast
o Going faster gets double reward
[Figure: Racing MDP transition diagram — states Cool, Warm, Overheated (terminal); actions Slow, Fast; transition probabilities 0.5 and 1.0; rewards +1 (Slow), +2 (Fast), and −10 on overheating]
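o For concreteness, a minimal Python sketch encoding this MDP as a transition table; the exact probabilities and the Warm + Fast → Overheated transition follow the usual version of this example, so treat them as assumptions wherever the diagram above is ambiguous:

    # Racing MDP: (state, action) -> list of (probability, next_state, reward)
    RACING_MDP = {
        ("cool", "slow"): [(1.0, "cool", +1)],
        ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
        ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
        ("warm", "fast"): [(1.0, "overheated", -10)],   # overheating ends the episode
    }
    RACING_STATES = ["cool", "warm", "overheated"]

    def actions(state):
        """Actions available in a state; empty for the terminal 'overheated' state."""
        return [a for (s, a) in RACING_MDP if s == state]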
Racing Search Tree
MDP Search Trees
o Each MDP state projects an expectimax-like search tree
o s is a state
o (s, a) is a q-state
o (s, a, s’) is called a transition
o Why discount?
o Reward now is better than later
o Can also think of it as a (1 − γ) chance of ending the process at every step
o Also helps our algorithms converge
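o One way to see the convergence point: with 0 < γ < 1 and per-step rewards bounded in magnitude by R_max (a bound assumed here, not stated on the slide), the discounted return is bounded by a geometric series:

    \Bigl|\sum_{t=0}^{\infty} \gamma^{t} r_{t}\Bigr| \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}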
o Quiz 2: For γ = 0.1, what is the optimal policy? (← ← →)
o Quiz 3: For which γ are West and East equally good when in state d?
(answer: 10γ³ = 1γ, i.e., γ = √(1/10) ≈ 0.32)
Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite rewards?
▪ Solutions:
▪ Finite horizon: (similar to depth-limited search)
▪ Terminate episodes after a fixed T steps (e.g., life)
▪ Gives nonstationary policies (π depends on the time left)
▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “overheated” for racing)
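▪ Either way the return stays finite; e.g., with a fixed horizon of T steps and rewards bounded by R_max (symbols assumed here):

    \Bigl|\sum_{t=0}^{T-1} r_{t}\Bigr| \;\le\; T\, R_{\max} \;<\; \infty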
Recap: Defining MDPs
o Markov decision processes:
o Set of states S
o Start state s0
o Set of actions A
o Transitions P(s’|s,a) (or T(s,a,s’))
o Rewards R(s,a,s’) (and discount γ)
Racing Search Tree
o We’re doing way too much work with expectimax!
Gridworld Q* Values
[Figure: Gridworld Q* values; Noise = 0.2, Discount = 0.9, Living reward = 0]
Values of States (Bellman Eqns)
o Recursive definition of value, computed over the s → (s, a) → s’ layers of the expectimax tree:
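o In symbols, with transition model T(s,a,s'), reward R(s,a,s'), and discount γ, the recursion can be written as (a standard statement of the Bellman equations):

    V^{*}(s) \;=\; \max_{a}\, Q^{*}(s,a)
    Q^{*}(s,a) \;=\; \sum_{s'} T(s,a,s')\,\bigl[\, R(s,a,s') \;+\; \gamma\, V^{*}(s') \,\bigr]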
Policy Evaluation
Policy Improvement
k = 0, 1, 2, …, 12, and k = 100
[Figure series: Gridworld values after k iterations of value iteration; Noise = 0.2, Discount = 0.9, Living reward = 0]
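o A minimal value-iteration sketch in Python, using the same hypothetical (probability, next_state, reward) transition-table format as the racing example above (the gridworld itself is not reproduced here):

    def value_iteration(mdp, states, actions, gamma=0.9, iters=100):
        """Repeated Bellman backups: V_{k+1}(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma V_k(s')]."""
        V = {s: 0.0 for s in states}  # V_0 is zero everywhere
        for _ in range(iters):
            V = {
                s: max(
                    (sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)])
                     for a in actions(s)),
                    default=0.0,  # terminal states keep value 0
                )
                for s in states
            }
        return V

o For example, value_iteration(RACING_MDP, RACING_STATES, actions, gamma=0.9, iters=12) gives the k = 12 values for the racing MDP sketched earlier.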
Policy Iteration
o In policy iteration, we iteratively alternate policy evaluation and policy improvement. In policy evaluation, we keep the policy fixed and update the utility estimate under that policy.
o The whole process of policy iteration is then: start with an arbitrary policy π₁, obtain its utility v₁ by policy evaluation, obtain a new policy π₂ from v₁ by policy improvement, obtain the utility v₂ of π₂ by policy evaluation, … until we converge on the optimal policy π*.
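o A compact Python sketch of this loop, in the same hypothetical transition-table format as above; the evaluation step here is a fixed number of iterative sweeps rather than an exact linear solve:

    def policy_evaluation(mdp, states, policy, gamma=0.9, sweeps=100):
        """Fixed-policy backups: V(s) <- sum_s' T(s, pi(s), s') [R + gamma V(s')]."""
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            V = {
                s: (sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, policy[s])])
                    if s in policy else 0.0)   # terminal states have no action
                for s in states
            }
        return V

    def policy_improvement(mdp, states, actions, V, gamma=0.9):
        """Greedy one-step lookahead with respect to V."""
        return {
            s: max(actions(s),
                   key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[(s, a)]))
            for s in states if actions(s)
        }

    def policy_iteration(mdp, states, actions, gamma=0.9):
        """Alternate evaluation and improvement until the policy stops changing."""
        policy = {s: actions(s)[0] for s in states if actions(s)}  # arbitrary pi_1
        while True:
            V = policy_evaluation(mdp, states, policy, gamma)
            new_policy = policy_improvement(mdp, states, actions, V, gamma)
            if new_policy == policy:
                return policy, V   # converged (up to evaluation error)
            policy = new_policy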
Time-Limited Values
o Key idea: time-limited values
o Define V_k(s) as the optimal value of s when the game ends in k more time steps (i.e., the result of a depth-k expectimax from s)
o Proof Sketch:
o For any state s, V_k(s) and V_{k+1}(s) can be viewed as the results of depth-(k+1) expectimax on nearly identical search trees
o The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
o That last layer is at best all R_max
o It is at worst R_min
o But everything on that layer is discounted by γ^k
o So V_k and V_{k+1} differ by at most γ^k max|R|
o So as k increases, the values converge