
Markov Decision Process II

CSE 4711: Artificial Intelligence

Md. Bakhtiar Hasan


Assistant Professor
Department of Computer Science and Engineering
Islamic University of Technology
Problems with Value Iteration

[Gridworld demo: value-iteration snapshots for k = 0, 1, 2, ..., 12, and 100;
Noise = 0.2, Discount = 0.9, Living reward = 0]

Value iteration repeats the Bellman updates:

V_{k+1}(s) ← max_a ∑_{s′} T(s, a, s′) [R(s, a, s′) + γ V_k(s′)]

Problem 1: It’s slow, O(S²A) per iteration
Problem 2: The “max” at each state rarely changes
Problem 3: The policy often converges long before the values
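
To make the per-iteration cost concrete, here is a minimal value-iteration sketch in Python. It is only a sketch: the tabular-dictionary layout and all names (states, actions, T, R, value_iteration) are illustrative assumptions, not the lecture's code. Each backup loops over every action and every successor state, which is where the O(S²A) cost per sweep comes from.

```python
# Minimal value-iteration sketch for a small tabular MDP (illustrative names).
# T[(s, a)] is a list of (next_state, probability) pairs;
# R[(s, a, next_state)] is the reward; actions[s] lists the legal actions in s.

def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in states}                     # V_0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in states:
            if not actions[s]:                       # terminal state: value 0
                V_new[s] = 0.0
                continue
            # Bellman update: max over actions of the expected one-step return
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions[s]
            )
        V = V_new
    return V

# Tiny usage example: a two-state chain where "go" moves A to B with some noise.
states = ["A", "B"]
actions = {"A": ["go"], "B": []}
T = {("A", "go"): [("B", 0.8), ("A", 0.2)]}
R = {("A", "go", "B"): 1.0, ("A", "go", "A"): 0.0}
print(value_iteration(states, actions, T, R))        # V(A) ≈ 0.98, V(B) = 0
```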
Policy Methods

Policy Evaluation

Fixed Policies

Expectimax trees max over all actions to compute the optimal values
If we fixed some policy π(s), the tree would be simpler: only one action per state
• ...though the tree’s value would depend on which policy we fixed
Utilities for a Fixed Policy

Compute the utility of a state s under a fixed (generally non-optimal) policy
Define the utility of a state s under a fixed policy π:
V^π(s) = expected total discounted rewards starting in s and following π
Recursive relation (one-step look-ahead / Bellman equation):

V^π(s) = ∑_{s′} T(s, π(s), s′) [R(s, π(s), s′) + γ V^π(s′)]
Example: Policy Evaluation

Policy Evaluation

How do we calculate the V’s for a fixed policy π?

Idea 1: Turn the recursive Bellman equations into updates (like value iteration)

V_0^π(s) = 0
V_{k+1}^π(s) ← ∑_{s′} T(s, π(s), s′) [R(s, π(s), s′) + γ V_k^π(s′)]

• Efficiency: O(S²) per iteration

Idea 2: Without the maxes, the Bellman equations are just a linear system
• Solve with Matlab (or your favorite linear system solver); both ideas are sketched in code below
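
Both ideas fit in a few lines of NumPy. The sketch below is illustrative rather than the lecture's reference code: it assumes the fixed policy has already been tabulated into a transition matrix P_pi (with P_pi[s, s2] = T(s, π(s), s2)) and a vector R_pi of expected one-step rewards under π; both names are mine.

```python
import numpy as np

# Idea 1: iterate the fixed-policy Bellman update until the values settle.
# P_pi[s, s2] = T(s, pi(s), s2);  R_pi[s] = expected one-step reward under pi.
def evaluate_policy_iterative(P_pi, R_pi, gamma=0.9, tol=1e-8):
    V = np.zeros(len(R_pi))                     # V_0^pi(s) = 0
    while True:
        V_new = R_pi + gamma * P_pi @ V         # one O(S^2) sweep, no max
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Idea 2: with no max, the equations are linear: (I - gamma * P_pi) V = R_pi.
def evaluate_policy_exact(P_pi, R_pi, gamma=0.9):
    n = len(R_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```

Since γ < 1 makes (I - γ P^π) invertible, the direct solve is exact; the iterative version is what scales when S is too large to factor an S×S matrix.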
Policy Extraction

Computing Actions from Values

Let’s imagine we have the optimal values V*(s)
How should we act?
• It’s not obvious!
We need to do a mini-expectimax (one step):

π*(s) = argmax_a ∑_{s′} T(s, a, s′) [R(s, a, s′) + γ V*(s′)]

This is called policy extraction, since it gets the policy implied by the values
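
In code, policy extraction is a single argmax over the same one-step backup used in value iteration. The sketch below reuses the illustrative dictionary layout assumed in the earlier value-iteration snippet (T, R, actions, and V are assumptions, not the lecture's code).

```python
# Policy extraction sketch: one-step look-ahead from given values V.
# Same illustrative dictionary layout as the value-iteration sketch above.
def extract_policy(states, actions, T, R, V, gamma=0.9):
    pi = {}
    for s in states:
        if not actions[s]:                       # terminal state: no action
            pi[s] = None
            continue
        # argmax over actions of the expected one-step backup
        pi[s] = max(
            actions[s],
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                              for s2, p in T[(s, a)]),
        )
    return pi
```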
Computing Actions from Q-Values

Let’s imagine we have the optimal q-values
How should we act?
• Completely trivial to decide!

π*(s) = argmax_a Q*(s, a)

Important: actions are easier to select from q-values than from values!
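
The difference shows up directly in code: selecting actions from Q-values needs no transition model at all, only an argmax per state. A one-function sketch, assuming Q is a dict keyed by (state, action):

```python
# With Q-values, action selection needs no model: just an argmax per state.
def policy_from_q(Q, states, actions):
    return {s: max(actions[s], key=lambda a: Q[(s, a)])
            for s in states if actions[s]}
```

Compare this with the previous sketch, which needed T and R to evaluate each candidate action.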
Policy Iteration

Alternative approach for computing optimal values:
• Step 1: Policy Evaluation → calculate utilities for some fixed policy (not optimal utilities!) until convergence
• Step 2: Policy Improvement → update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
• Repeat until the policy converges
This is policy iteration
• Still optimal
• Can converge (much) faster under some conditions
Policy Iteration

Evaluation: For the fixed current policy π_i, find values with policy evaluation:
• Iterate until values converge:

V_{k+1}^{π_i}(s) ← ∑_{s′} T(s, π_i(s), s′) [R(s, π_i(s), s′) + γ V_k^{π_i}(s′)]

Improvement: For fixed values, get a better policy using policy extraction
• One-step look-ahead:

π_{i+1}(s) = argmax_a ∑_{s′} T(s, a, s′) [R(s, a, s′) + γ V^{π_i}(s′)]
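
Putting evaluation and improvement together gives the loop sketched below, under the same illustrative tabular assumptions as the earlier snippets (none of these names come from the slides). The evaluation sweeps are cheap because the fixed policy removes the max over actions; the improvement step is the only pass that considers every action, and the loop stops as soon as the policy stops changing.

```python
# Policy-iteration sketch, reusing the illustrative tabular layout from above.
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-8):
    # Start from an arbitrary policy: the first legal action in each state
    pi = {s: (actions[s][0] if actions[s] else None) for s in states}
    while True:
        # Step 1: policy evaluation, iterating the fixed-policy Bellman update
        V = {s: 0.0 for s in states}
        while True:
            V_new = {
                s: 0.0 if pi[s] is None else
                   sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                       for s2, p in T[(s, pi[s])])
                for s in states
            }
            converged = max(abs(V_new[s] - V[s]) for s in states) < eval_tol
            V = V_new
            if converged:
                break
        # Step 2: policy improvement via one-step look-ahead (policy extraction)
        new_pi = {
            s: None if not actions[s] else
               max(actions[s],
                   key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                     for s2, p in T[(s, a)]))
            for s in states
        }
        if new_pi == pi:              # policy unchanged: it is optimal, stop
            return pi, V
        pi = new_pi
```

On the tiny two-state chain from the first sketch, policy_iteration(states, actions, T, R) terminates after a single improvement step, since each state has at most one legal action.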
Comparison

Both value iteration and policy iteration compute the same thing (all optimal values)
In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
• We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
• The new policy will be better (or we’re done)
Both are dynamic programming approaches for solving MDPs
Double Bandits

Actions: Blue, Red
States: Win, Lose
Offline Planning
Solving MDPs is offline planning
• You determine all quantities through computation
• You need to know the details of the MDP
• You do not actually play the game!

Let’s Play!

$2 $2 $0 $2 $2
$2 $2 $0 $0 $0
Online Planning
Rules changed! Red’s win chance is different

Let’s Play Again!

$0 $0 $0 $2 $0
$2 $0 $0 $0 $0
What Just Happened?

That wasn’t planning, it was learning!
• Specifically, reinforcement learning
• There was an MDP, but you couldn’t solve it with just computation
• You needed to actually act to figure it out
Important ideas in reinforcement learning that came up:
• Exploration: you have to try unknown actions to get information
• Exploitation: eventually, you have to use what you know
• Regret: even if you learn intelligently, you make mistakes
• Sampling: because of chance, you have to try things repeatedly
• Difficulty: learning can be much harder than solving a known MDP
Suggested Reading

Russell & Norvig: Chapters 17.1-17.3
Sutton & Barto: Chapters 3-4
