DRL #4-5 - Introducing MDP and Dynamic Programming Solution
Session #4:
Markov Decision Processes
Agenda for the class
Announcement!
We have our Teaching Assistants now! You will see their names on the course home page.
Acknowledgements: Some of the slides were adapted with permission from the course CSCE-689 (Texas A&M University) by Prof. Guni Sharon
Agent-Environment Interface
Grid World Example
● A maze-like problem
○ The agent lives in a grid
○ Walls block the agent’s path
● Noisy movement: actions do not always go as planned (a small sketch of this transition noise follows the list)
○ 80% of the time, the action North takes the agent North (if there is no wall there)
○ 10% of the time, North takes the agent West; 10% East
○ If there is a wall in the direction the agent would have been taken, the agent stays put
● The agent receives a reward each time step
○ -0.1 per step (battery loss)
○ +1 for arriving at (4,3); -1 for arriving at (4,2); 1 for arriving at (2,2)
● Goal: maximize accumulated rewards
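A minimal sketch of the noisy transition model described above; the names are hypothetical and not part of the slides, and wall handling is left to the environment.

```python
import random

# The intended action is executed 80% of the time; 10% of the time the agent
# slips to each of the two perpendicular directions.
SLIP = {
    "N": ("W", "E"), "S": ("E", "W"),
    "E": ("N", "S"), "W": ("S", "N"),
}

def sample_actual_direction(intended, rng=random):
    """Return the direction the agent actually moves (walls handled elsewhere)."""
    r = rng.random()
    if r < 0.8:
        return intended           # intended direction
    elif r < 0.9:
        return SLIP[intended][0]  # slip to one perpendicular direction
    else:
        return SLIP[intended][1]  # slip to the other perpendicular direction
```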
Markov Decision Processes
● An MDP is defined by (a minimal data-layout sketch follows this list)
○ A set of states S
○ A set of actions A
○ State-transition probabilities P(s′ | s, a)
■ The probability of arriving at s′ after performing a at s
■ Also called the model dynamics
○ A reward function R(s, a, s′)
■ The utility gained from arriving at s′ after performing a at s
■ Sometimes just R(s, a) or even R(s′)
○ A start state
○ Maybe a terminal state
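As referenced above, one possible way to lay out such an MDP specification in code; the field names are illustrative assumptions, not a required API.

```python
from dataclasses import dataclass, field

# A minimal, illustrative container for an MDP specification.
@dataclass
class MDP:
    states: list                   # set of states S
    actions: dict                  # state -> list of available actions A(s)
    transitions: dict              # (s, a) -> list of (next_state, probability)
    rewards: dict                  # (s, a, next_state) -> scalar reward
    start_state: object = None
    terminal_states: set = field(default_factory=set)
    gamma: float = 0.9             # discount factor
```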
Markov Decision Processes
Model Dynamics
State-transition probabilities
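For reference, the standard definition of the model dynamics in Sutton & Barto notation: the joint probability of the next state and reward given the current state and action.

```latex
p(s', r \mid s, a) \doteq \Pr\{\, S_t = s',\; R_t = r \mid S_{t-1} = s,\; A_{t-1} = a \,\}
```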
Markov Decision Processes - Discussion
MDP Formalization : Video Games
● State:
○ raw pixels
● Actions:
○ game controls
● Reward:
○ change in score
● State-transition probabilities:
○ defined by stochasticity in game evolution
Ref: "Playing Atari with Deep Reinforcement Learning", Mnih et al., 2013
MDP Formalization : Traffic Signal Control
● State:
○ Current signal assignment (green, yellow,
and red assignment for each phase)
○ For each lane: number of approaching
vehicles, accumulated waiting time,
number of stopped vehicles, and average
speed of approaching vehicles
● Actions:
○ signal assignment
● Reward:
○ Reduction in traffic delay
● State-transition probabilities:
○ defined by stochasticity in approaching demand
Ref: "Learning an Interpretable Traffic Signal Control Policy", Ault et al., 2020
MDP Formalization : Recycling Robot (Detailed Ex.)
● Robot has
○ sensors for detecting cans
○ an arm and gripper that can pick up cans and place them in an onboard bin
● Runs on a rechargeable battery
● Its control system has components for interpreting sensory
information, for navigating, and for controlling the arm and
gripper
● Task for the RL Agent: Make high-level decisions about how
to search for cans based on the current charge level of the
battery
MDP Formalization : Recycling Robot (Detailed Ex.)
● State:
○ Assume that only two charge levels can be distinguished
○ S = {high, low}
● Actions:
○ A(high) = {search, wait}
○ A(low) = {search, wait, recharge}
● Reward:
○ Zero most of the time, except when securing a can
○ Cans are secured by searching and waiting, but r_search > r_wait
● State-transition probabilities:
○ [Next Slide]
MDP Formalization : Recycling Robot (Detailed Ex.)
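For reference, the transition dynamics referred to above, written in the standard Sutton & Barto (Example 3.3) form; here α and β are the probabilities that the battery stays at its current level while searching, and -3 is the penalty for running the battery flat and needing rescue. Treat this as the textbook formulation rather than a verbatim copy of the slide.

```latex
p(\text{high} \mid \text{high}, \text{search})   = \alpha,   \qquad r = r_{\text{search}} \\
p(\text{low}  \mid \text{high}, \text{search})   = 1-\alpha, \qquad r = r_{\text{search}} \\
p(\text{low}  \mid \text{low},  \text{search})   = \beta,    \qquad r = r_{\text{search}} \\
p(\text{high} \mid \text{low},  \text{search})   = 1-\beta,  \qquad r = -3 \;\;(\text{battery depleted, robot rescued}) \\
p(\text{high} \mid \text{high}, \text{wait})     = 1,        \qquad r = r_{\text{wait}} \\
p(\text{low}  \mid \text{low},  \text{wait})     = 1,        \qquad r = r_{\text{wait}} \\
p(\text{high} \mid \text{low},  \text{recharge}) = 1,        \qquad r = 0
```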
Note on Goals & Rewards
● Reward Hypothesis:
All of what we mean by goals and purposes can be well thought of as
the maximization of the expected value of the cumulative sum of a
received scalar signal (called reward).
● The reward we set up should indicate what we want accomplished,
○ not how we want it accomplished; it is not the place to impart prior knowledge about the solution method
● Ex: Chess Playing Agent
○ If the agent is rewarded for taking the opponent's pieces, it might fall for the opponent's trap
● Ex: Vacuum Cleaner Agent
○ If the agent is rewarded for each unit of dirt it sucks up, it can repeatedly deposit and re-collect the same dirt for a larger reward
Returns & Episodes
● The goal is to maximize the expected return
● The return (G_t) is defined as some specific function of the reward sequence
● Episodic tasks vs. continuing tasks
● When there is a notion of a final time step, say T, the return can be defined as G_t = R_{t+1} + R_{t+2} + ... + R_T
Returns & Episodes
● Generally T = ∞
○ What if the agent receives a reward of +1 for each timestep?
○ Discounted Return: G_t = R_{t+1} + 𝛾R_{t+2} + 𝛾²R_{t+3} + ... = Σ_{k=0}^{∞} 𝛾^k R_{t+k+1}, where 0 ≤ 𝛾 ≤ 1 is the discount rate
Note: with a constant reward of +1 at every step, G_t = Σ_{k=0}^{∞} 𝛾^k = 1/(1 − 𝛾), which stays finite as long as 𝛾 < 1
Returns & Episodes
● What if 𝛾 is 0?
● What if 𝛾 is 1?
● Computing discounted returns incrementally: G_t = R_{t+1} + 𝛾G_{t+1} (see the sketch below)
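A minimal sketch of the incremental computation above; the reward list and function name are illustrative.

```python
# Compute discounted returns backwards in time using G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_returns(rewards, gamma=0.9):
    """rewards[t] is the reward received after step t; returns G_0 .. G_{T-1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: three steps of -0.1 living reward followed by a +1 terminal reward.
print(discounted_returns([-0.1, -0.1, -0.1, 1.0], gamma=0.9))
```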
Returns & Episodes
➔ Objective: To apply forces to a cart
moving along a track so as to keep a
pole hinged to the cart from falling over
➔ Discuss:
➔ Consider the task as episodic, i.e., try to maintain balance until failure. What could be the reward function?
➔ Repeat the previous question assuming the task is continuing.
Policy
Defining Value Functions
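For reference, the standard definitions these slides build on (Sutton & Barto notation): a policy π(a | s) gives the probability of taking action a in state s, and the state- and action-value functions are the expected return when following π.

```latex
v_\pi(s)    \doteq \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s \,\right] \\
q_\pi(s, a) \doteq \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s,\; A_t = a \,\right]
```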
May skip to the next slide!
Bellman Equation for V𝝅
● Dynamic programming equation associated with discrete-time optimization problems
○ Expresses V𝝅 recursively, i.e., relates V𝝅(s) to V𝝅(s′) for all s′ ∈ succ(s) (see the equation below)
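For reference, the standard form of this recursion (Sutton & Barto notation):

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
```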
Bellman Equation for V𝝅
Backup Diagram
Understanding V𝝅(s) with Gridworld
Reward:
○ -1 if an action takes the agent off the grid
○ Exceptional reward from A and B for all actions, which take the agent to A′ and B′ respectively
○ 0 everywhere else
Verify V𝝅(s) using the Bellman equation for this state with 𝛾 = 0.9 and the equiprobable random policy
Understanding V𝝅(s) with Gridworld
Ex-1
Recall the reward function used for Gridworld:
○ -1 if an action takes the agent off the grid
○ Exceptional reward from A and B for all actions, which take the agent to A′ and B′ respectively
○ 0 everywhere else
Let us add a constant c (say 10) to the rewards of all the actions. Will it change anything? (See the worked note below.)
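A short worked note on the exercise (a derivation, not part of the slides): for a continuing discounted task, adding a constant c to every reward adds the same offset c/(1 − 𝛾) to every return, so all state values shift equally and the relative ordering of actions, and hence the greedy policy, is unchanged.

```latex
G_t' = \sum_{k=0}^{\infty} \gamma^{k}\bigl( R_{t+k+1} + c \bigr)
     = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} + \frac{c}{1-\gamma}
     = G_t + \frac{c}{1-\gamma}
```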
Optimal Policies and Optimal Value Functions
Optimal Policies and Optimal Value Functions
Bellman optimality equation - expresses that the value of a state under
an optimal policy must equal the expected return for the best action
from that state
Bellman optimality equation for V* (shown below)
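For reference, the standard form of the Bellman optimality equation for V* (Sutton & Barto notation):

```latex
v_*(s) = \max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_*(s') \,\bigr]
```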
Notation
Race car example
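The race car MDP below, written in a plain dictionary layout, is a reconstruction based on the acknowledged CS188-style example (states cool/warm/overheated, actions slow/fast) and chosen to be consistent with the value-iteration numbers shown a few slides later with 𝛾 = 0.9; treat the exact probabilities and rewards as assumptions rather than the slide's authoritative figures.

```python
# (state, action) -> list of (next_state, probability, reward)
RACECAR = {
    "states": ["cool", "warm", "overheated"],
    "actions": {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "overheated": []},
    "transitions": {
        ("cool", "slow"): [("cool", 1.0, 1)],
        ("cool", "fast"): [("cool", 0.5, 2), ("warm", 0.5, 2)],
        ("warm", "slow"): [("cool", 0.5, 1), ("warm", 0.5, 1)],
        ("warm", "fast"): [("overheated", 1.0, -10)],
    },
    "gamma": 0.9,
}
```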
Value Iteration
V_0:  0     0     0
V_1:  2     1     0
V_2:  3.35  2.35  0
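A minimal value-iteration sketch over the RACECAR dictionary above (itself a reconstruction); with 𝛾 = 0.9 it reproduces the table's rows, where the columns correspond to the cool, warm, and overheated states under that assumption.

```python
def value_iteration(mdp, iterations=2):
    # Start from V_0 = 0 for every state, then apply the Bellman optimality backup.
    V = {s: 0.0 for s in mdp["states"]}
    for _ in range(iterations):
        V_new = {}
        for s in mdp["states"]:
            actions = mdp["actions"][s]
            if not actions:                 # terminal state keeps value 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (r + mdp["gamma"] * V[s2])
                    for s2, p, r in mdp["transitions"][(s, a)])
                for a in actions
            )
        V = V_new
    return V

print(value_iteration(RACECAR, iterations=2))  # {'cool': 3.35, 'warm': 2.35, 'overheated': 0.0}
```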
Value iteration snapshots for k = 0 through k = 12 and k = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0)
Problems with Value Iteration
Solutions (briefly, more later…)
• Asynchronous value iteration
• In value iteration, we update every state in each iteration
• Actually, any sequence of Bellman updates will converge if every state is visited infinitely often, regardless of the visitation order
• Idea: prioritize states whose value we expect to change significantly
Asynchronous Value Iteration
• Which states should be prioritized for an update?
A single update per iteration
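A minimal sketch of one way to prioritize updates, assuming the MDP dictionary layout used earlier: the state with the largest Bellman error is backed up first instead of sweeping every state each pass. The priority recomputation here is deliberately naive.

```python
import heapq

def prioritized_value_iteration(mdp, max_updates=1000, theta=1e-6):
    V = {s: 0.0 for s in mdp["states"]}

    def backup(s):
        acts = mdp["actions"][s]
        if not acts:
            return 0.0
        return max(sum(p * (r + mdp["gamma"] * V[s2])
                       for s2, p, r in mdp["transitions"][(s, a)])
                   for a in acts)

    for _ in range(max_updates):
        # Recompute priorities (magnitude of the Bellman error) for all states.
        heap = [(-abs(backup(s) - V[s]), s) for s in mdp["states"]]
        heapq.heapify(heap)
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:        # no state would change much: stop
            break
        V[s] = backup(s)            # update only the highest-priority state
    return V
```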
Double the work?
Issue 2: A policy cannot be easily extracted
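For reference, extracting a greedy policy from state values requires a one-step lookahead through the model, which is one reason action values Q(s, a) are convenient:

```latex
\pi(s) = \arg\max_{a} \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v(s') \,\bigr]
```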
Q-learning
Q-learning as value iteration
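For reference, the value-iteration update written over action values (the standard Q-value iteration form):

```latex
Q_{k+1}(s, a) \leftarrow \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} Q_k(s', a') \,\bigr]
```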
Issue 3: The policy often converges long
before the values
Policy Iteration
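A minimal policy-iteration sketch, again assuming the MDP dictionary layout used earlier: repeated (approximate) evaluation of the current policy followed by greedy improvement, stopping when the policy is stable. The helper names are illustrative.

```python
def policy_iteration(mdp, eval_sweeps=100):
    states, gamma = mdp["states"], mdp["gamma"]
    # Start with an arbitrary policy (first available action in each state).
    policy = {s: (mdp["actions"][s][0] if mdp["actions"][s] else None) for s in states}
    while True:
        # Policy evaluation: iterate the Bellman expectation backup for the fixed policy.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = 0.0 if a is None else sum(
                    p * (r + gamma * V[s2])
                    for s2, p, r in mdp["transitions"][(s, a)])
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            if policy[s] is None:
                continue
            best = max(mdp["actions"][s], key=lambda a: sum(
                p * (r + gamma * V[s2])
                for s2, p, r in mdp["transitions"][(s, a)]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```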
Policy Evaluation
Policy value as a Linear program
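For reference, evaluating a fixed policy is a linear problem: the Bellman expectation equation is linear in V𝝅, so it can be solved in closed form (or posed as a linear program), with P𝝅 and R𝝅 the transition matrix and expected reward vector under 𝝅.

```latex
V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}
\quad\Longrightarrow\quad
V^{\pi} = \bigl( I - \gamma P^{\pi} \bigr)^{-1} R^{\pi}
```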
Policy iteration
Comparison
• Both value iteration and policy iteration compute the same thing (optimal
state values)
• In value iteration:
• Every iteration updates both the values and (implicitly) the policy
• We don't track the policy, but taking the max over actions implicitly defines it
• In policy iteration:
• We do several passes that update utilities with fixed policies (each pass is fast
because we consider only one action, not all of them)
• After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
• The new policy will be better (or we’re done)
Issue 4: requires knowing the model and the
reward function
Issue 6: infeasible in large (or continuous) state spaces
• Most real-life problems contain very large state spaces (practically
infinite)
• It is infeasible to learn and store a value for every state
• Moreover, doing so is not useful as the chance of encountering a state
more than once is very small
• We will learn to generalize our learning to apply to unseen states
• We will use value function approximators that can generalize the
acquired knowledge and provide a value to any state (even if it was
not previously seen)
Notation
Required Readings
Thank you