DRL #4-5 - Introducing MDP and Dynamic Programming Solution

Deep Reinforcement Learning

2023-24 Second Semester, M.Tech (AIML)

Session #4:
Markov Decision Processes

DRL Course Instructors

1
Agenda for the class

• Agent-Environment Interface (Sequential Decision Problem)


• MDP
Defining MDP,
Rewards,
Returns, Policy & Value Function,
Optimal Policy and Value Functions
• Approaches to solve MDP

Announcement !!!
We have our Teaching Assistants now !!! You will see their names on the course home
page.

Acknowledgements: Some of the slides were adapted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon 2
Agent-Environment Interface

● Agent - the learner and decision maker
● Environment - everything outside the agent
● Interaction:
○ The agent performs an action
○ The environment responds by
■ presenting a new situation (a change in state)
■ presenting a numerical reward
● Objective (of the interaction):
○ Maximize the return (cumulative reward) over time

Note:
● Interaction occurs in discrete time steps

3
Grid World Example

● A maze-like problem
○ The agent lives in a grid
○ Walls block the agent’s path
● Noisy movement: actions do not always go as planned (see the
transition-model sketch after this slide)
○ 80% of the time, the action North takes the agent North
(if there is no wall there)
○ 10% of the time, North takes the agent West; 10% East
○ If there is a wall in the direction the agent would have
been taken, the agent stays put
● The agent receives a reward each time step
○ -0.1 per step (battery loss)
○ +1 if arriving at (4,3); -1 for arriving at (4,2); 1 for arriving at (2,2)
● Goal: maximize accumulated rewards
4
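The noisy movement model above can be written down directly. The following is a minimal sketch, assuming a 4x3 grid indexed (column, row) from (1,1); the grid size, wall set, and function names are illustrative choices, not taken from the slides.

    # Sketch of the noisy movement model described above. Assumptions (not from
    # the slides): a 4x3 grid indexed (column, row) starting at (1, 1); wall cells
    # go in WALLS. 80% of the time the intended direction is taken, 10% each for
    # the two perpendicular directions; hitting a wall or the edge means staying put.

    WALLS = set()                      # fill in the maze's wall cells for a specific layout
    WIDTH, HEIGHT = 4, 3

    MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
    PERPENDICULAR = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

    def step(state, direction):
        """Deterministic move; stay put if the target cell is a wall or off the grid."""
        x, y = state
        dx, dy = MOVES[direction]
        nx, ny = x + dx, y + dy
        if (nx, ny) in WALLS or not (1 <= nx <= WIDTH and 1 <= ny <= HEIGHT):
            return state
        return (nx, ny)

    def transition_distribution(state, action):
        """Return {next_state: probability} under the noisy action model."""
        side_a, side_b = PERPENDICULAR[action]
        dist = {}
        for direction, p in ((action, 0.8), (side_a, 0.1), (side_b, 0.1)):
            s_next = step(state, direction)
            dist[s_next] = dist.get(s_next, 0.0) + p
        return dist

    print(transition_distribution((1, 1), "N"))   # e.g. {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}

Because a slipped move can bounce off a wall or the grid edge, several noisy outcomes may land on the same next state, which is why the probabilities are accumulated rather than assigned.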
Markov Decision Processes

● An MDP is defined by
○ A set of states
○ A set of actions
○ State-transition probabilities
■ The probability of arriving at state s' after performing action a in state s
■ Also called the model dynamics
○ A reward function
■ The utility gained from arriving at s' after performing a in s
■ Sometimes written just as R(s, a), or even R(s)
○ A start state
○ Maybe a terminal state (a concrete sketch follows below)

5
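To make the five components above concrete, here is a minimal container for a finite MDP. It is a sketch only; the class and field names are ours, not from the slides or any RL library.

    # A minimal container for a finite MDP, matching the components listed above.
    # Names and structure are illustrative (not from the slides or any library).
    from dataclasses import dataclass, field

    @dataclass
    class FiniteMDP:
        states: list                 # set of states S
        actions: dict                # A(s): actions available in each state
        transitions: dict            # (s, a) -> {s': P(s' | s, a)}
        rewards: dict                # (s, a, s') -> expected reward r(s, a, s')
        start_state: object
        terminal_states: set = field(default_factory=set)
        gamma: float = 0.9           # discount rate

        def successors(self, s, a):
            """Iterate over (s', probability, reward) triples for taking a in s."""
            for s_next, p in self.transitions[(s, a)].items():
                yield s_next, p, self.rewards.get((s, a, s_next), 0.0)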
Markov Decision Processes
Model Dynamics

State-transition probabilities:
p(s' | s, a) = Pr{ S_t = s' | S_{t-1} = s, A_{t-1} = a }

Expected rewards for state-action-next-state triples:
r(s, a, s') = E[ R_t | S_{t-1} = s, A_{t-1} = a, S_t = s' ]

6
Markov Decision Processes - Discussion

● MDP framework is abstract and flexible


○ Time steps need not refer to fixed intervals of real time
○ The actions can be
■ at low-level controls or high-level decisions
■ totally mental or computational
○ States can take a wide variety of forms
■ Determined by low-level sensations or by high-level, abstract representations (e.g.,
symbolic descriptions of objects in a room)
● The agent–environment boundary represents the limit of the agent’s absolute
control, not of its knowledge.
○ The boundary can be located at different places for different purposes

7
Markov Decision Processes - Discussion

● MDP framework is a considerable abstraction of the problem of goal-directed


learning from interaction.
● It proposes that whatever the details of the sensory, memory, and control
apparatus, and whatever objective one is trying to achieve, any problem of
learning goal-directed behavior can be reduced to three signals passing back
and forth between an agent and its environment:
○ one signal to represent the choices made by the agent (the actions)
○ one signal to represent the basis on which the choices are made (the
states),
○ and one signal to define the agent’s goal (the rewards).

8
MDP Formalization : Video Games

● State:
○ raw pixels
● Actions:
○ game controls
● Reward:
○ change in score
● State-transition probabilities:
○ defined by stochasticity in game evolution

Ref: "Playing Atari with Deep Reinforcement Learning", Mnih et al., 2013 9
MDP Formalization : Traffic Signal Control

● State:
○ Current signal assignment (green, yellow,
and red assignment for each phase)
○ For each lane: number of approaching
vehicles, accumulated waiting time,
number of stopped vehicles, and average
speed of approaching vehicles
● Actions:
○ signal assignment
● Reward:
○ Reduction in traffic delay
● State-transition probabilities:
○ defined by stochasticity in approaching
demand
Ref: "Learning an Interpretable Traffic Signal Control Policy", Ault et al., 2020 10
MDP Formalization : Recycling Robot (Detailed Ex.)
● Robot has
○ sensors for detecting cans
○ an arm and gripper that can pick up cans and place them in an
onboard bin
● Runs on a rechargeable battery
● Its control system has components for interpreting sensory
information, for navigating, and for controlling the arm and
gripper
● Task for the RL Agent: Make high-level decisions about how
to search for cans based on the current charge level of the
battery

11
MDP Formalization : Recycling Robot (Detailed Ex.)

● State:
○ Assume that only two charge levels can be distinguished
○ S = {high, low}
● Actions:
○ A(high) = {search, wait}
○ A(low) = {search, wait, recharge}
● Reward:
○ Zero most of the time, except when securing a can
○ Cans are secured by searching and by waiting, but r_search > r_wait
● State-transition probabilities:
○ [Next Slide]

12
MDP Formalization : Recycling Robot (Detailed Ex.)

● State-transition probabilities (contd…):

13
MDP Formalization : Recycling Robot (Detailed Ex.)

● State-transition probabilities (contd…):

14
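The transition tables on the two preceding slides follow Example 3.3 of Sutton & Barto: while searching, the battery stays high with probability α and stays low with probability β, and a depleted battery (probability 1−β when searching from low) means the robot must be rescued, with reward −3. A sketch of those dynamics as a table; the numeric values of α, β, r_search, and r_wait below are placeholders, not from the slides.

    # Recycling-robot dynamics as in Example 3.3 of Sutton & Barto (2nd ed.).
    # ALPHA, BETA and the reward values below are placeholders, not from the slides.
    ALPHA, BETA = 0.9, 0.6          # P(stay high | search from high), P(stay low | search from low)
    R_SEARCH, R_WAIT = 2.0, 1.0     # r_search > r_wait, as stated on the earlier slide

    # (state, action) -> list of (next_state, probability, reward)
    P = {
        ("high", "search"):  [("high", ALPHA, R_SEARCH), ("low", 1 - ALPHA, R_SEARCH)],
        ("high", "wait"):    [("high", 1.0, R_WAIT)],
        ("low", "search"):   [("low", BETA, R_SEARCH), ("high", 1 - BETA, -3.0)],  # -3: depleted, rescued
        ("low", "wait"):     [("low", 1.0, R_WAIT)],
        ("low", "recharge"): [("high", 1.0, 0.0)],
    }

    # Sanity check: probabilities for each (state, action) pair sum to 1.
    for sa, outcomes in P.items():
        assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9, sa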
Note on Goals & Rewards
● Reward Hypothesis:
All of what we mean by goals and purposes can be well thought of as
the maximization of the expected value of the cumulative sum of a
received scalar signal (called reward).
● The rewards we set up should indicate what we want accomplished,
○ not impart prior knowledge about how we want it accomplished
● Ex: Chess Playing Agent
○ If the agent is rewarded for taking the opponent's pieces, it might fall
for the opponent's traps.
● Ex: Vacuum Cleaner Agent
○ If the agent is rewarded for each unit of dirt it sucks up, it can repeatedly
deposit and re-collect the same dirt for a larger reward.
15
Returns & Episodes
● Goal is to maximize the expected return
● The return (G_t) is defined as some specific function of the reward
sequence
● Episodic tasks vs. Continuing tasks
● When there is a notion of a final time step, say T, the return can be
G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
○ Applicable when the agent-environment interaction breaks naturally into
episodes
○ Ex: plays of a game, trips through a maze, etc. [called episodic tasks]

16
Returns & Episodes
● For continuing tasks, T = ∞
○ What if the agent receives a reward of +1 for each time step? The
undiscounted sum grows without bound.
○ Discounted return:
G_t = R_{t+1} + 𝛾 R_{t+2} + 𝛾² R_{t+3} + … = Σ_{k=0}^{∞} 𝛾^k R_{t+k+1},  with 0 ≤ 𝛾 ≤ 1

Note:
○ The discount rate 𝛾 determines the present
value of future rewards

17
Returns & Episodes
● What if 𝛾 is 0?
● What if 𝛾 is 1?
● Computing discounted returns incrementally:
G_t = R_{t+1} + 𝛾 G_{t+1}
● Although the return is a sum of an infinite number of terms, it is still finite if the reward
is nonzero and constant and 𝛾 < 1.
● Ex: for a constant reward of +1, G_t = Σ_{k=0}^{∞} 𝛾^k = 1/(1−𝛾)  (see the sketch below)

18
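A short sketch of both computations discussed above: summing 𝛾^k R_{t+k+1} directly, and building returns back-to-front with G_t = R_{t+1} + 𝛾 G_{t+1}. For a constant reward of +1 over a long horizon, both approach 1/(1−𝛾). Function names are ours, for illustration only.

    # Computing discounted returns, as discussed above. Illustrative only.
    def discounted_return(rewards, gamma):
        """Direct sum: G = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    def returns_backwards(rewards, gamma):
        """Incremental form G_t = R_{t+1} + gamma * G_{t+1}, computed back-to-front."""
        G = 0.0
        out = []
        for r in reversed(rewards):
            G = r + gamma * G
            out.append(G)
        return list(reversed(out))      # out[t] is the return from time step t

    gamma = 0.9
    rewards = [1.0] * 200               # constant +1 reward over a long (near-infinite) horizon
    print(discounted_return(rewards, gamma))      # approaches 1 / (1 - gamma) = 10
    print(returns_backwards(rewards, gamma)[0])   # same value via the incremental form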
Returns & Episodes
➔ Objective: To apply forces to a cart
moving along a track so as to keep a
pole hinged to the cart from falling over
➔ Discuss:
➔ Consider the task as episodic, i.e., the agent
tries to maintain balance until failure.
What could be the reward function?
➔ Repeat the previous question assuming the
task is continuing.

19
Policy

● A mapping from states to


probabilities of selecting each
possible action.
○ 𝛑(a|s) is the probability that A_t = a
if S_t = s (see the sketch below)
● The purpose of learning is to
improve the agent's policy with its
experience

20
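A policy, as defined above, is just a conditional distribution over actions. A minimal sketch (the state names reuse the recycling-robot example; the probabilities are made up):

    # A stochastic policy pi(a | s) as a table, plus sampling from it. Illustrative only.
    import random

    policy = {
        "high": {"search": 0.7, "wait": 0.3},
        "low":  {"search": 0.2, "wait": 0.3, "recharge": 0.5},
    }

    def pi(a, s):
        """Probability that A_t = a given S_t = s."""
        return policy[s].get(a, 0.0)

    def sample_action(s):
        """Draw an action according to pi(. | s)."""
        actions, probs = zip(*policy[s].items())
        return random.choices(actions, weights=probs, k=1)[0]

    print(pi("search", "high"), sample_action("low"))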
Defining Value Functions

State-value function for policy 𝝿:
v𝝿(s) = E𝝿[ G_t | S_t = s ]

Action-value function for policy 𝝿:
q𝝿(s, a) = E𝝿[ G_t | S_t = s, A_t = a ]
21
Defining Value Functions

State-value function in terms of the action-value function for policy 𝝿:
v𝝿(s) = Σ_a 𝝿(a|s) q𝝿(s, a)

Action-value function in terms of the state-value function for policy 𝝿:
q𝝿(s, a) = Σ_{s'} p(s'|s, a) [ r(s, a, s') + 𝛾 v𝝿(s') ]   (sketched in code below)
22
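A sketch of the two relationships on this slide in code. The dictionary conventions (P[(s, a)] as a list of (next state, probability, reward) triples; policy[s] mapping actions to 𝝿(a|s)) are our own choices, not from the slides.

    # v_pi from q_pi, and q_pi from v_pi, for a finite MDP. Illustrative only.
    # P[(s, a)] is a list of (s_next, probability, reward); policy[s] maps actions to pi(a|s).

    def v_from_q(s, q, policy):
        """v_pi(s) = sum_a pi(a|s) * q_pi(s, a)"""
        return sum(p_a * q[(s, a)] for a, p_a in policy[s].items())

    def q_from_v(s, a, v, P, gamma):
        """q_pi(s, a) = sum_{s'} p(s'|s,a) * (r + gamma * v_pi(s'))"""
        return sum(p * (r + gamma * v[s_next]) for s_next, p, r in P[(s, a)])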
May skip to the next slide !
Bellman Equation for V𝝅
● Dynamic programming equation associated with discrete-time optimization
problems
○ Expressing V𝝅 recursively, i.e., relating V𝝅(s) to V𝝅(s') for all s' ∈ succ(s):
V𝝅(s) = Σ_a 𝝿(a|s) Σ_{s'} p(s'|s, a) [ r(s, a, s') + 𝛾 V𝝅(s') ]

23
Bellman Equation for V𝝅

Value of the start state must equal


(1) the (discounted) value of the expected next state,
plus
(2) the reward expected along the way

Backup Diagram
24
Understanding V𝝅(s) with Gridworld
Reward:
○ -1 if an action takes the agent off the grid
○ Exceptional rewards from A and B: all actions take the agent to A' and B' resp.
○ 0, everywhere else

[Figure: exceptional reward dynamics]
[Figure: state-value function for the equiprobable random policy with 𝛾 = 0.9]

25
Understanding V𝝅(s) with Gridworld

Verify V𝝅(s) using the Bellman equation for this state, with 𝛾 = 0.9
and the equiprobable random policy

26
Understanding V𝝅(s) with Gridworld

27
Ex-1
Recollect the reward function used for Gridworld as below:
○ -1 if an action takes the agent off the grid
○ Exceptional rewards from A and B: all actions take the agent to A' and B' resp.
○ 0, everywhere else
Let us add a constant c (say 10) to the rewards of all the actions. Will it change
anything?

28
Optimal Policies and Optimal Value Functions

● 𝝿 ≥ 𝝿' if and only if v𝝿(s) ≥ v𝝿'(s) for all s ∊ S


● There is always at least one policy that is better than or
equal to all other policies → optimal policy (denoted as 𝝿*)
○ There could be more than one optimal policy !!!
Optimal state-value function:   v*(s) = max_𝝿 v𝝿(s)

Optimal action-value function:  q*(s, a) = max_𝝿 q𝝿(s, a)

29
Optimal Policies and Optimal Value Functions
Bellman optimality equation - expresses that the value of a state under
an optimal policy must equal the expected return for the best action
from that state
Bellman optimality equation for V*:
v*(s) = max_a Σ_{s'} p(s'|s, a) [ r(s, a, s') + 𝛾 v*(s') ]

30
Optimal Policies and Optimal Value Functions
Bellman optimality equation - expresses that the value of a state under
an optimal policy must equal the expected return for the best action
from that state

Bellman optimality equation for q*:
q*(s, a) = Σ_{s'} p(s'|s, a) [ r(s, a, s') + 𝛾 max_{a'} q*(s', a') ]

31
Optimal Policies and Optimal Value Functions
Bellman optimality equation - expresses that the value of a state under
an optimal policy must equal the expected return for the best action
from that state

Backup diagrams for v* and q* 32


Optimal solutions to the gridworld example

33


MDP - Objective

34
Notation

35
Race car example

36
Race car example

42
Value iteration

43
Value Iteration

V_0:   0       0       0
V_1:   2       1       0
V_2:   3.35    2.35    0

Check this computation on paper (a code sketch follows below).


44
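A minimal value iteration sketch. The race-car MDP below (states cool/warm/overheated, two actions slow/fast, the listed rewards and probabilities) is an assumed reconstruction of the example this table appears to come from; with 𝛾 = 0.9 it reproduces the three rows above (up to floating-point rounding).

    # Value iteration sketch. The race-car MDP below is an assumed reconstruction;
    # with gamma = 0.9 it gives V_0 = (0, 0, 0), V_1 = (2, 1, 0), V_2 = (3.35, 2.35, 0).
    GAMMA = 0.9
    STATES = ["cool", "warm", "overheated"]

    # (state, action) -> list of (next_state, probability, reward)
    P = {
        ("cool", "slow"): [("cool", 1.0, 1.0)],
        ("cool", "fast"): [("cool", 0.5, 2.0), ("warm", 0.5, 2.0)],
        ("warm", "slow"): [("cool", 0.5, 1.0), ("warm", 0.5, 1.0)],
        ("warm", "fast"): [("overheated", 1.0, -10.0)],
    }
    ACTIONS = {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "overheated": []}

    def value_iteration_step(V):
        """One Bellman optimality backup: V_{k+1}(s) = max_a sum_s' p * (r + gamma * V_k(s'))."""
        return {
            s: max((sum(p * (r + GAMMA * V[s2]) for s2, p, r in P[(s, a)])
                    for a in ACTIONS[s]), default=0.0)   # terminal states keep value 0
            for s in STATES
        }

    V = {s: 0.0 for s in STATES}
    for k in range(3):
        print(k, V)
        V = value_iteration_step(V)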
Example: Grid World
▪ A maze-like problem
▪ The agent lives in a grid
▪ Walls block the agent’s path

▪ Noisy movement: actions do not always go as planned


▪ 80% of the time, the action North takes the agent North
▪ 10% of the time, North takes the agent West; 10% East
▪ If there is a wall in the direction the agent would have
been taken, the agent stays put

▪ The agent receives rewards each time step


▪ Small negative reward each step (battery drain)
▪ Big rewards come at the end (good or bad)

▪ Goal: maximize sum of (discounted) rewards

45
k=0

Noise = 0.2
Discount = 0.9
Living reward = 0

46
k=1

Noise = 0.2
Discount = 0.9
Living reward = 0

47
k=2

Noise = 0.2
Discount = 0.9
Living reward = 0

48
k=3

Noise = 0.2
Discount = 0.9
Living reward = 0

49
k=4

Noise = 0.2
Discount = 0.9
Living reward = 0

50
k=5

Noise = 0.2
Discount = 0.9
Living reward = 0

51
k=6

Noise = 0.2
Discount = 0.9
Living reward = 0

52
k=7

Noise = 0.2
Discount = 0.9
Living reward = 0

53
k=8

Noise = 0.2
Discount = 0.9
Living reward = 0

54
k=9

Noise = 0.2
Discount = 0.9
Living reward = 0

55
k=10

Noise = 0.2
Discount = 0.9
Living reward = 0

56
k=11

Noise = 0.2
Discount = 0.9
Living reward = 0

57
k=12

Noise = 0.2
Discount = 0.9
Living reward = 0

58
k=100

Noise = 0.2
Discount = 0.9
Living reward = 0

59
Problems with Value Iteration

60
Solutions (briefly, more later…)

61
• Asynchronous value iteration
• In value iteration, we update every state in each iteration
• Actually, any sequence of Bellman updates will converge if every state is visited
infinitely often, regardless of the visitation order
• Idea: prioritize states whose value we expect to change significantly

62
Asynchronous Value Iteration
• Which states should be prioritized for an update? (a sketch follows below)

A single
update per
iteration

63
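A sketch of the idea: keep the value table in place and, on each iteration, update only the single state whose Bellman error is currently largest. The helper names and the dictionary-style model (P, actions, gamma) follow the earlier value iteration sketch and are assumptions, not the course code.

    # Asynchronous (in-place) value iteration: one state update per iteration,
    # prioritized by the size of its Bellman error. Illustrative sketch only.
    def bellman_backup(s, V, P, actions, gamma):
        """Bellman optimality backup for a single state."""
        return max((sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
                    for a in actions[s]), default=0.0)

    def async_value_iteration(states, P, actions, gamma, max_updates=1000, tol=1e-6):
        V = {s: 0.0 for s in states}
        for _ in range(max_updates):
            # Prioritize the state whose value would change the most if updated now.
            errors = {s: abs(bellman_backup(s, V, P, actions, gamma) - V[s]) for s in states}
            s = max(errors, key=errors.get)
            if errors[s] < tol:
                break                                        # all states (near) consistent: converged
            V[s] = bellman_backup(s, V, P, actions, gamma)   # in-place update of one state
        return V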
Double the work?

64
Issue 2: A policy cannot be easily extracted

65
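Extracting a policy from V* requires a one-step lookahead through the model, which is part of why this is listed as an issue. A minimal sketch, using the same assumed dictionary-style model as the earlier sketches:

    # Extracting a greedy policy from state values via one-step lookahead.
    # Requires the model P and reward function; illustrative sketch only.
    def extract_policy(V, states, P, actions, gamma):
        policy = {}
        for s in states:
            if not actions[s]:
                continue                    # terminal state: no action to choose
            policy[s] = max(
                actions[s],
                key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)]),
            )
        return policy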
Q-learning

66
Q-learning as value iteration

67
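As a hedged sketch of the connection: Q-value iteration performs the full model-based backup over every (s, a) pair, while Q-learning estimates the same backup from single sampled transitions with a learning rate. The dictionary conventions below are ours, not the course code.

    # Q-value iteration (model-based) next to the sample-based Q-learning update.
    def q_value_iteration_step(Q, states, P, actions, gamma):
        """Q_{k+1}(s,a) = sum_{s'} p(s'|s,a) * (r + gamma * max_{a'} Q_k(s', a'))."""
        new_Q = {}
        for s in states:
            for a in actions[s]:
                new_Q[(s, a)] = sum(
                    p * (r + gamma * max((Q[(s2, a2)] for a2 in actions[s2]), default=0.0))
                    for s2, p, r in P[(s, a)]
                )
        return new_Q

    def q_learning_update(Q, s, a, r, s2, actions, gamma, alpha=0.1):
        """The same backup, estimated from one observed transition (s, a, r, s')."""
        target = r + gamma * max((Q.get((s2, a2), 0.0) for a2 in actions[s2]), default=0.0)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
        return Q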
Issue 3: The policy often converges long
before the values

68
Policy Iteration

69
Policy Evaluation

70
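A sketch of iterative policy evaluation: repeatedly apply the Bellman expectation backup for a fixed policy 𝝿 until the values stop changing. Conventions follow the earlier sketches (policy[s] maps actions to probabilities; P[(s, a)] lists (next state, probability, reward) triples) and are assumptions, not the course code.

    # Iterative policy evaluation for a fixed policy pi. Illustrative sketch only.
    def policy_evaluation(policy, states, P, gamma, tol=1e-8):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                if s not in policy:
                    continue               # terminal state: value stays 0
                v_new = sum(
                    p_a * sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
                    for a, p_a in policy[s].items()
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                return V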
Policy value as a Linear program

71
Policy iteration

[Figure: policy iteration worked example with intermediate state values]

72
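A compact, self-contained sketch of the full policy iteration loop (evaluate the current policy, then improve it greedily; stop when the greedy policy no longer changes). The model format is the same assumed dictionary convention as in the sketches above.

    # Policy iteration: evaluate the current policy, then improve it greedily;
    # stop when the greedy policy no longer changes. Illustrative sketch only.
    def policy_iteration(states, P, actions, gamma, eval_tol=1e-8):
        # Start from an arbitrary deterministic policy.
        policy = {s: actions[s][0] for s in states if actions[s]}
        while True:
            # --- Policy evaluation: iterate the Bellman expectation backup for the fixed policy ---
            V = {s: 0.0 for s in states}
            while True:
                delta = 0.0
                for s, a in policy.items():
                    v_new = sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)])
                    delta = max(delta, abs(v_new - V[s]))
                    V[s] = v_new
                if delta < eval_tol:
                    break
            # --- Policy improvement: greedy one-step lookahead on the evaluated values ---
            new_policy = {
                s: max(actions[s],
                       key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[(s, a)]))
                for s in states if actions[s]
            }
            if new_policy == policy:
                return policy, V        # policy is stable, hence no further improvement is possible
            policy = new_policy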
Comparison
• Both value iteration and policy iteration compute the same thing (optimal
state values)
• In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don't track the policy explicitly, but taking the max over actions implicitly defines it
• In policy iteration:
• We do several passes that update utilities with fixed policies (each pass is fast
because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
• The new policy will be better (or we’re done)

73
Issue 4: requires knowing the model and the
reward function

Offline optimization vs. Online learning
74


Issue 5: requires discrete (finite) set of
actions
• We will explore policy gradient approaches that are suitable for
continuous actions, e.g., throttle and steering for a vehicle
• Can such approaches be relevant for discrete action spaces?
• Yes! We can always define a continuous action space as a distribution over the
discrete actions (e.g., using the softmax function, as sketched below)
• Can we combine value-based approaches and policy gradient
approaches and get the best of both?
• Yes! Actor-critic methods

75
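As noted above, the softmax function turns arbitrary action preferences into a probability distribution over a discrete action set. A tiny sketch (the preference values are made up):

    # Turning arbitrary action preferences (logits) into a probability distribution
    # over discrete actions with the softmax function. Illustrative only.
    import math

    def softmax(prefs):
        m = max(prefs)                                  # subtract the max for numerical stability
        exps = [math.exp(p - m) for p in prefs]
        z = sum(exps)
        return [e / z for e in exps]

    print(softmax([2.0, 1.0, -1.0]))   # e.g. preferences for three discrete actions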
Issue 6: infeasible in large (or continuous)
state spaces
• Most real-life problems contain very large state spaces (practically
infinite)
• It is infeasible to learn and store a value for every state
• Moreover, doing so is not useful as the chance of encountering a state
more than once is very small
• We will learn to generalize our learning to apply to unseen states
• We will use value function approximators that can generalize the
acquired knowledge and provide a value to any state (even if it was
not previously seen)

76
Notation

77
Required Readings

1. Chapters 3 and 4 of "Reinforcement Learning: An Introduction", 2nd Ed.,
Sutton & Barto

78
Thank you

79
