
Markov Decision Process II

CSE 4711: Artificial Intelligence

Md. Bakhtiar Hasan


Assistant Professor
Department of Computer Science and Engineering
Islamic University of Technology
Problems with Value Iteration

[Gridworld demo: value-iteration snapshots for k = 0, 1, 2, ..., 12, and 100;
Noise = 0.2, Discount = 0.9, Living reward = 0]

Value iteration repeats the Bellman updates:

V_{k+1}(s) ← max_a ∑_{s′} T(s, a, s′) [R(s, a, s′) + γ V_k(s′)]

Problem 1: It’s slow, O(S²A) per iteration
Problem 2: The “max” at each state rarely changes
Problem 3: The policy often converges long before the values
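
To make the per-iteration cost concrete, here is a minimal value-iteration sketch in Python. It is only a sketch: the tabular-dictionary layout and all names (states, actions, T, R, value_iteration) are illustrative assumptions, not the lecture's code. Each backup loops over every action and every successor state, which is where the O(S²A) cost per sweep comes from.

```python
# Minimal value-iteration sketch for a small tabular MDP (illustrative names).
# T[(s, a)] is a list of (next_state, probability) pairs;
# R[(s, a, next_state)] is the reward; actions[s] lists the legal actions in s.

def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}                     # V_0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in states:
            if not actions[s]:                       # terminal state: value 0
                V_new[s] = 0.0
                continue
            # Bellman update: max over actions of the expected one-step return
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions[s]
            )
        V = V_new
    return V

# Tiny usage example: a two-state chain where "go" moves A to B with some noise.
states = ["A", "B"]
actions = {"A": ["go"], "B": []}
T = {("A", "go"): [("B", 0.8), ("A", 0.2)]}
R = {("A", "go", "B"): 1.0, ("A", "go", "A"): 0.0}
print(value_iteration(states, actions, T, R))        # V(A) ≈ 0.98, V(B) = 0
```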
Policy Methods

Policy Evaluation

Fixed Policies

Expectimax trees max over all actions to compute the optimal values
If we fixed some policy π(s), the tree would be simpler: only one action per state
• ...though the tree’s value would depend on which policy we fixed
Utilities for a Fixed Policy

Compute the utility of a state s under a fixed (generally non-optimal) policy
Define the utility of a state s under a fixed policy π:
V^π(s) = expected total discounted rewards starting in s and following π
Recursive relation (one-step look-ahead / Bellman equation):

V^π(s) = ∑_{s′} T(s, π(s), s′) [R(s, π(s), s′) + γ V^π(s′)]
Example: Policy Evaluation

Policy Evaluation

How do we calculate the V’s for a fixed policy π?

Idea 1: Turn the recursive Bellman equations into updates (like value iteration)

V_0^π(s) = 0
V_{k+1}^π(s) ← ∑_{s′} T(s, π(s), s′) [R(s, π(s), s′) + γ V_k^π(s′)]

• Efficiency: O(S²) per iteration

Idea 2: Without the maxes, the Bellman equations are just a linear system
• Solve with Matlab (or your favorite linear system solver); both ideas are sketched in code below
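
Both ideas fit in a few lines of NumPy. The sketch below is illustrative rather than the lecture's reference code: it assumes the fixed policy has already been tabulated into a transition matrix P_pi (with P_pi[s, s2] = T(s, π(s), s2)) and a vector R_pi of expected one-step rewards under π; both names are mine.

```python
import numpy as np

# Idea 1: iterate the fixed-policy Bellman update until the values settle.
# P_pi[s, s2] = T(s, pi(s), s2);  R_pi[s] = expected one-step reward under pi.
def evaluate_policy_iterative(P_pi, R_pi, gamma=0.9, tol=1e-8):
    V = np.zeros(len(R_pi))                     # V_0^pi(s) = 0
    while True:
        V_new = R_pi + gamma * P_pi @ V         # one O(S^2) sweep, no max
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Idea 2: with no max, the equations are linear: (I - gamma * P_pi) V = R_pi.
def evaluate_policy_exact(P_pi, R_pi, gamma=0.9):
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```

Since γ < 1 makes (I - γ P^π) invertible, the direct solve is exact; the iterative version is what scales when S is too large to factor an S×S matrix.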
Policy Extraction

Computing Actions from Values

Let’s imagine we have the optimal values V*(s)
How should we act?
• It’s not obvious!
We need to do a mini-expectimax (one step):

π*(s) = argmax_a ∑_{s′} T(s, a, s′) [R(s, a, s′) + γ V*(s′)]

This is called policy extraction, since it gets the policy implied by the values
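
In code, policy extraction is a single argmax over the same one-step backup used in value iteration. The sketch below reuses the illustrative dictionary layout assumed in the earlier value-iteration snippet (T, R, actions, and V are assumptions, not the lecture's code).

```python
# Policy extraction sketch: one-step look-ahead from given values V.
# Same illustrative dictionary layout as the value-iteration sketch above.
def extract_policy(states, actions, T, R, V, gamma=0.9):
    pi = {}
    for s in states:
        if not actions[s]:                       # terminal state: no action
            pi[s] = None
            continue
        # argmax over actions of the expected one-step backup
        pi[s] = max(
            actions[s],
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                              for s2, p in T[(s, a)]),
        )
    return pi
```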
Computing Actions from Q-Values

Let’s imagine we have the optimal q-values
How should we act?
• Completely trivial to decide!

π*(s) = argmax_a Q*(s, a)

Important: actions are easier to select from q-values than from values!
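
The difference shows up directly in code: selecting actions from Q-values needs no transition model at all, only an argmax per state. A one-function sketch, assuming Q is a dict keyed by (state, action):

```python
# With Q-values, action selection needs no model: just an argmax per state.
def policy_from_q(Q, states, actions):
    return {s: max(actions[s], key=lambda a: Q[(s, a)])
            for s in states if actions[s]}
```

Compare this with the previous sketch, which needed T and R to evaluate each candidate action.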
Policy Iteration

Alternative approach for computing optimal values:
• Step 1: Policy Evaluation → calculate utilities for some fixed policy (not optimal utilities!) until convergence
• Step 2: Policy Improvement → update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
• Repeat until the policy converges
This is policy iteration
• Still optimal
• Can converge (much) faster under some conditions
Policy Iteration

Evaluation: For the fixed current policy π_i, find values with policy evaluation:
• Iterate until values converge:

V_{k+1}^{π_i}(s) ← ∑_{s′} T(s, π_i(s), s′) [R(s, π_i(s), s′) + γ V_k^{π_i}(s′)]

Improvement: For fixed values, get a better policy using policy extraction
• One-step look-ahead:

π_{i+1}(s) = argmax_a ∑_{s′} T(s, a, s′) [R(s, a, s′) + γ V^{π_i}(s′)]
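
Putting evaluation and improvement together gives the loop sketched below, under the same illustrative tabular assumptions as the earlier snippets (none of these names come from the slides). The evaluation sweeps are cheap because the fixed policy removes the max over actions; the improvement step is the only pass that considers every action, and the loop stops as soon as the policy stops changing.

```python
# Policy-iteration sketch, reusing the illustrative tabular layout from above.
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-8):
    # Start from an arbitrary policy: the first legal action in each state
    pi = {s: (actions[s][0] if actions[s] else None) for s in states}
    while True:
        # Step 1: policy evaluation, iterating the fixed-policy Bellman update
        V = {s: 0.0 for s in states}
        while True:
            V_new = {
                s: 0.0 if pi[s] is None else
                   sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                       for s2, p in T[(s, pi[s])])
                for s in states
            }
            converged = max(abs(V_new[s] - V[s]) for s in states) < eval_tol
            V = V_new
            if converged:
                break
        # Step 2: policy improvement via one-step look-ahead (policy extraction)
        new_pi = {
            s: None if not actions[s] else
               max(actions[s],
                   key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                     for s2, p in T[(s, a)]))
            for s in states
        }
        if new_pi == pi:              # policy unchanged: it is optimal, stop
            return pi, V
        pi = new_pi
```

On the tiny two-state chain from the first sketch, policy_iteration(states, actions, T, R) terminates after a single improvement step, since each state has at most one legal action.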
Comparison

Both value iteration and policy iteration compute the same thing (all optimal values)
In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
• We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
• The new policy will be better (or we’re done)
Both are dynamic programming approaches for solving MDPs
Double Bandits

Actions: Blue, Red
States: Win, Lose
Offline Planning
Solving MDPs is offline planning
• You determine all quantities through computation
• You need to know the details of the MDP
• You do not actually play the game!

Let’s Play!

$2 $2 $0 $2 $2
$2 $2 $0 $0 $0
Online Planning
Rules changed! Red’s win chance is different

Let’s Play Again!

$0 $0 $0 $2 $0
$2 $0 $0 $0 $0
What Just Happened?

That wasn’t planning, it was learning!
• Specifically, reinforcement learning
• There was an MDP, but you couldn’t solve it with just computation
• You needed to actually act to figure it out
Important ideas in reinforcement learning that came up:
• Exploration: you have to try unknown actions to get information
• Exploitation: eventually, you have to use what you know
• Regret: even if you learn intelligently, you make mistakes
• Sampling: because of chance, you have to try things repeatedly
• Difficulty: learning can be much harder than solving a known MDP
Suggested Reading

Russell & Norvig: Chapters 17.1-17.3
Sutton & Barto: Chapters 3-4
