DRL #4-5 - Introducing MDP and Dynamic Programming Solution
Session #4:
Markov Decision Processes
Agenda for the class
Announcement!
We have our Teaching Assistants now! You will see their names on the course home page.
Acknowledgements: Some of the slides were adapted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
Agent-Environment Interface
Grid World Example
● A maze-like problem
○ The agent lives in a grid
○ Walls block the agent’s path
● Noisy movement: actions do not always go as planned (a small sketch of this transition noise follows the list)
○ 80% of the time, the action North takes the agent North (if there is no wall there)
○ 10% of the time, North takes the agent West; 10% East
○ If there is a wall in the direction the agent would have been taken, the agent stays put
● The agent receives a reward each time step
○ -0.1 per step (battery loss)
○ +1 for arriving at (4,3); -1 for arriving at (4,2); 1 for arriving at (2,2)
● Goal: maximize accumulated rewards
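A minimal sketch of the noisy transition model described above; the names are hypothetical and not part of the slides, and wall handling is left to the environment.

```python
import random

# The intended action is executed 80% of the time; 10% of the time the agent
# slips to each of the two perpendicular directions.
SLIP = {
    "N": ("W", "E"), "S": ("E", "W"),
    "E": ("N", "S"), "W": ("S", "N"),
}

def sample_actual_direction(intended, rng=random):
    """Return the direction the agent actually moves (walls handled elsewhere)."""
    r = rng.random()
    if r < 0.8:
        return intended           # intended direction
    elif r < 0.9:
        return SLIP[intended][0]  # slip to one perpendicular direction
    else:
        return SLIP[intended][1]  # slip to the other perpendicular direction
```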
Markov Decision Processes
● An MDP is defined by (a minimal data-layout sketch follows this list)
○ A set of states S
○ A set of actions A
○ State-transition probabilities P(s′ | s, a)
■ The probability of arriving at s′ after performing a at s
■ Also called the model dynamics
○ A reward function R(s, a, s′)
■ The utility gained from arriving at s′ after performing a at s
■ Sometimes just R(s, a) or even R(s′)
○ A start state
○ Maybe a terminal state
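As referenced above, one possible way to lay out such an MDP specification in code; the field names are illustrative assumptions, not a required API.

```python
from dataclasses import dataclass, field

# A minimal, illustrative container for an MDP specification.
@dataclass
class MDP:
    states: list                   # set of states S
    actions: dict                  # state -> list of available actions A(s)
    transitions: dict              # (s, a) -> list of (next_state, probability)
    rewards: dict                  # (s, a, next_state) -> scalar reward
    start_state: object = None
    terminal_states: set = field(default_factory=set)
    gamma: float = 0.9             # discount factor
```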
Markov Decision Processes
Model Dynamics
State-transition probabilities
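For reference, the standard definition of the model dynamics in Sutton & Barto notation: the joint probability of the next state and reward given the current state and action.

```latex
p(s', r \mid s, a) \doteq \Pr\{\, S_t = s',\; R_t = r \mid S_{t-1} = s,\; A_{t-1} = a \,\}
```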
Markov Decision Processes - Discussion
MDP Formalization : Video Games
● State:
○ raw pixels
● Actions:
○ game controls
● Reward:
○ change in score
● State-transition probabilities:
○ defined by stochasticity in game evolution
Ref: "Playing Atari with Deep Reinforcement Learning", Mnih et al., 2013
MDP Formalization : Traffic Signal Control
● State:
○ Current signal assignment (green, yellow,
and red assignment for each phase)
○ For each lane: number of approaching
vehicles, accumulated waiting time,
number of stopped vehicles, and average
speed of approaching vehicles
● Actions:
○ signal assignment
● Reward:
○ Reduction in traffic delay
● State-transition probabilities:
○ defined by stochasticity in approaching demand
Ref: "Learning an Interpretable Traffic Signal Control Policy", Ault et al., 2020
MDP Formalization : Recycling Robot (Detailed Ex.)
● Robot has
○ sensors for detecting cans
○ an arm and gripper that can pick up cans and place them in an onboard bin
● Runs on a rechargeable battery
● Its control system has components for interpreting sensory
information, for navigating, and for controlling the arm and
gripper
● Task for the RL Agent: Make high-level decisions about how
to search for cans based on the current charge level of the
battery
MDP Formalization : Recycling Robot (Detailed Ex.)
● State:
○ Assume that only two charge levels can be distinguished
○ S = {high, low}
● Actions:
○ A(high) = {search, wait}
○ A(low) = {search, wait, recharge}
● Reward:
○ Zero most of the time, except when securing a can
○ Cans are secured by searching and waiting, but r_search > r_wait
● State-transition probabilities:
○ [Next Slide]
MDP Formalization : Recycling Robot (Detailed Ex.)
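For reference, the transition dynamics referred to above, written in the standard Sutton & Barto (Example 3.3) form; here α and β are the probabilities that the battery stays at its current level while searching, and -3 is the penalty for running the battery flat and needing rescue. Treat this as the textbook formulation rather than a verbatim copy of the slide.

```latex
p(\text{high} \mid \text{high}, \text{search})   = \alpha,   \qquad r = r_{\text{search}} \\
p(\text{low}  \mid \text{high}, \text{search})   = 1-\alpha, \qquad r = r_{\text{search}} \\
p(\text{low}  \mid \text{low},  \text{search})   = \beta,    \qquad r = r_{\text{search}} \\
p(\text{high} \mid \text{low},  \text{search})   = 1-\beta,  \qquad r = -3 \;\;(\text{battery depleted, robot rescued}) \\
p(\text{high} \mid \text{high}, \text{wait})     = 1,        \qquad r = r_{\text{wait}} \\
p(\text{low}  \mid \text{low},  \text{wait})     = 1,        \qquad r = r_{\text{wait}} \\
p(\text{high} \mid \text{low},  \text{recharge}) = 1,        \qquad r = 0
```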
Note on Goals & Rewards
● Reward Hypothesis:
All of what we mean by goals and purposes can be well thought of as
the maximization of the expected value of the cumulative sum of a
received scalar signal (called reward).
● The reward we set up should indicate what we want accomplished,
○ not how we want it accomplished; it is not the place to impart prior knowledge about the solution method
● Ex: Chess Playing Agent
○ If the agent is rewarded for taking the opponent's pieces, it might fall for the opponent's trap
● Ex: Vacuum Cleaner Agent
○ If the agent is rewarded for each unit of dirt it sucks up, it can repeatedly deposit and re-collect the same dirt for a larger reward
Returns & Episodes
● The goal is to maximize the expected return
● The return (G_t) is defined as some specific function of the reward sequence
● Episodic tasks vs. continuing tasks
● When there is a notion of a final time step, say T, the return can be defined as G_t = R_{t+1} + R_{t+2} + ... + R_T
Returns & Episodes
● Generally T = ∞
○ What if the agent receives a reward of +1 for each timestep?
○ Discounted Return: G_t = R_{t+1} + 𝛾R_{t+2} + 𝛾²R_{t+3} + ... = Σ_{k=0}^{∞} 𝛾^k R_{t+k+1}, where 0 ≤ 𝛾 ≤ 1 is the discount rate
Note: with a constant reward of +1 at every step, G_t = Σ_{k=0}^{∞} 𝛾^k = 1/(1 − 𝛾), which stays finite as long as 𝛾 < 1
Returns & Episodes
● What if 𝛾 is 0?
● What if 𝛾 is 1?
● Computing discounted returns incrementally: G_t = R_{t+1} + 𝛾G_{t+1} (see the sketch below)
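A minimal sketch of the incremental computation above; the reward list and function name are illustrative.

```python
# Compute discounted returns backwards in time using G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_returns(rewards, gamma=0.9):
    """rewards[t] is the reward received after step t; returns G_0 .. G_{T-1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: three steps of -0.1 living reward followed by a +1 terminal reward.
print(discounted_returns([-0.1, -0.1, -0.1, 1.0], gamma=0.9))
```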
Returns & Episodes
➔ Objective: To apply forces to a cart
moving along a track so as to keep a
pole hinged to the cart from falling over
➔ Discuss:
➔ Consider the task as episodic, i.e., try to maintain balance until failure. What could be the reward function?
➔ Repeat the previous question assuming the task is continuing.
Policy
Defining Value Functions
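For reference, the standard definitions these slides build on (Sutton & Barto notation): a policy π(a | s) gives the probability of taking action a in state s, and the state- and action-value functions are the expected return when following π.

```latex
v_\pi(s)    \doteq \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s \,\right] \\
q_\pi(s, a) \doteq \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s,\; A_t = a \,\right]
```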
May skip to the next slide!
Bellman Equation for V𝝅
● Dynamic programming equation associated with discrete-time optimization problems
○ Expresses V𝝅 recursively, i.e., relates V𝝅(s) to V𝝅(s′) for all s′ ∈ succ(s) (see the equation below)
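For reference, the standard form of this recursion (Sutton & Barto notation):

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
```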
Bellman Equation for V𝝅
Backup Diagram
Understanding V𝝅(s) with Gridworld
Reward:
○ -1 if an action takes the agent off the grid
○ Exceptional reward from A and B for all actions, which take the agent to A′ and B′ respectively
○ 0 everywhere else
Verify V𝝅(s) using the Bellman equation for this state with 𝛾 = 0.9 and the equiprobable random policy
Understanding V𝝅(s) with Gridworld
Ex-1
Recall the reward function used for Gridworld:
○ -1 if an action takes the agent off the grid
○ Exceptional reward from A and B for all actions, which take the agent to A′ and B′ respectively
○ 0 everywhere else
Let us add a constant c (say 10) to the rewards of all the actions. Will it change anything? (See the worked note below.)
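A short worked note on the exercise (a derivation, not part of the slides): for a continuing discounted task, adding a constant c to every reward adds the same offset c/(1 − 𝛾) to every return, so all state values shift equally and the relative ordering of actions, and hence the greedy policy, is unchanged.

```latex
G_t' = \sum_{k=0}^{\infty} \gamma^{k}\bigl( R_{t+k+1} + c \bigr)
     = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} + \frac{c}{1-\gamma}
     = G_t + \frac{c}{1-\gamma}
```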
Optimal Policies and Optimal Value Functions
Optimal Policies and Optimal Value Functions
Bellman optimality equation - expresses that the value of a state under
an optimal policy must equal the expected return for the best action
from that state
Bellman optimality equation for V* (shown below)
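For reference, the standard form of the Bellman optimality equation for V* (Sutton & Barto notation):

```latex
v_*(s) = \max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr]
```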
Notation
Race car example
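The race car MDP below, written in a plain dictionary layout, is a reconstruction based on the acknowledged CS188-style example (states cool/warm/overheated, actions slow/fast) and chosen to be consistent with the value-iteration numbers shown a few slides later with 𝛾 = 0.9; treat the exact probabilities and rewards as assumptions rather than the slide's authoritative figures.

```python
# (state, action) -> list of (next_state, probability, reward)
RACECAR = {
    "states": ["cool", "warm", "overheated"],
    "actions": {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "overheated": []},
    "transitions": {
        ("cool", "slow"): [("cool", 1.0, 1)],
        ("cool", "fast"): [("cool", 0.5, 2), ("warm", 0.5, 2)],
        ("warm", "slow"): [("cool", 0.5, 1), ("warm", 0.5, 1)],
        ("warm", "fast"): [("overheated", 1.0, -10)],
    },
    "gamma": 0.9,
}
```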
Value Iteration
V_0:  0     0     0
V_1:  2     1     0
V_2:  3.35  2.35  0
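A minimal value-iteration sketch over the RACECAR dictionary above (itself a reconstruction); with 𝛾 = 0.9 it reproduces the table's rows, where the columns correspond to the cool, warm, and overheated states under that assumption.

```python
def value_iteration(mdp, iterations=2):
    # Start from V_0 = 0 for every state, then apply the Bellman optimality backup.
    V = {s: 0.0 for s in mdp["states"]}
    for _ in range(iterations):
        V_new = {}
        for s in mdp["states"]:
            actions = mdp["actions"][s]
            if not actions:                 # terminal state keeps value 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (r + mdp["gamma"] * V[s2])
                    for s2, p, r in mdp["transitions"][(s, a)])
                for a in actions
            )
        V = V_new
    return V

print(value_iteration(RACECAR, iterations=2))  # {'cool': 3.35, 'warm': 2.35, 'overheated': 0.0}
```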
Value iteration snapshots for k = 0 through k = 12 and k = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0)
Problems with Value Iteration
Solutions (briefly, more later…)
• Asynchronous value iteration
• In value iteration, we update every state in each iteration
• Actually, any sequence of Bellman updates will converge if every state is visited infinitely often, regardless of the visitation order
• Idea: prioritize states whose value we expect to change significantly
Asynchronous Value Iteration
• Which states should be prioritized for an update?
A single update per iteration
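A minimal sketch of one way to prioritize updates, assuming the MDP dictionary layout used earlier: the state with the largest Bellman error is backed up first instead of sweeping every state each pass. The priority recomputation here is deliberately naive.

```python
import heapq

def prioritized_value_iteration(mdp, max_updates=1000, theta=1e-6):
    V = {s: 0.0 for s in mdp["states"]}

    def backup(s):
        acts = mdp["actions"][s]
        if not acts:
            return 0.0
        return max(sum(p * (r + mdp["gamma"] * V[s2])
                       for s2, p, r in mdp["transitions"][(s, a)])
                   for a in acts)

    for _ in range(max_updates):
        # Recompute priorities (magnitude of the Bellman error) for all states.
        heap = [(-abs(backup(s) - V[s]), s) for s in mdp["states"]]
        heapq.heapify(heap)
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:        # no state would change much: stop
            break
        V[s] = backup(s)            # update only the highest-priority state
    return V
```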
Double the work?
Issue 2: A policy cannot be easily extracted
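For reference, extracting a greedy policy from state values requires a one-step lookahead through the model, which is one reason action values Q(s, a) are convenient:

```latex
\pi(s) = \arg\max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v(s') \,\bigr]
```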
Q-learning
Q-learning as value iteration
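For reference, the value-iteration update written over action values (the standard Q-value iteration form):

```latex
Q_{k+1}(s, a) \leftarrow \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} Q_k(s', a') \,\bigr]
```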
Issue 3: The policy often converges long
before the values
Policy Iteration
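A minimal policy-iteration sketch, again assuming the MDP dictionary layout used earlier: repeated (approximate) evaluation of the current policy followed by greedy improvement, stopping when the policy is stable. The helper names are illustrative.

```python
def policy_iteration(mdp, eval_sweeps=100):
    states, gamma = mdp["states"], mdp["gamma"]
    # Start with an arbitrary policy (first available action in each state).
    policy = {s: (mdp["actions"][s][0] if mdp["actions"][s] else None) for s in states}
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for the fixed policy.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = 0.0 if a is None else sum(
                    p * (r + gamma * V[s2])
                    for s2, p, r in mdp["transitions"][(s, a)])
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            if policy[s] is None:
                continue
            best = max(mdp["actions"][s], key=lambda a: sum(
                p * (r + gamma * V[s2])
                for s2, p, r in mdp["transitions"][(s, a)]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```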
Policy Evaluation
Policy value as a Linear program
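For reference, evaluating a fixed policy is a linear problem: the Bellman expectation equation is linear in V𝝅, so it can be solved in closed form (or posed as a linear program), with P𝝅 and R𝝅 the transition matrix and expected reward vector under 𝝅.

```latex
V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}
\quad\Longrightarrow\quad
V^{\pi} = \bigl( I - \gamma P^{\pi} \bigr)^{-1} R^{\pi}
```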
Policy iteration
Comparison
• Both value iteration and policy iteration compute the same thing (optimal
state values)
• In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don't track the policy, but taking the max over actions implicitly defines it
• In policy iteration:
• We do several passes that update utilities with fixed policies (each pass is fast
because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
• The new policy will be better (or we’re done)
Issue 4: requires knowing the model and the
reward function
Issue 6: infeasible in large (or continuous) state spaces
• Most real-life problems contain very large state spaces (practically
infinite)
• It is infeasible to learn and store a value for every state
• Moreover, doing so is not useful as the chance of encountering a state
more than once is very small
• We will learn to generalize our learning to apply to unseen states
• We will use value function approximators that can generalize the
acquired knowledge and provide a value to any state (even if it was
not previously seen)
Notation
Required Readings
Thank you