
Markov Decision Process

Easwar Subramanian
TCS Innovation Labs, Hyderabad

Email : [email protected]

August 10, 2024


Administrivia

▶ Please consult Prof. Konda Reddy for all queries related to registration and other
administrative issues
▶ If need be, register for CS 5500 instead of AI 3000 (relevant for MDS / CS students)
▶ The Piazza course page is ready; enrollments are yet to be done
▶ The tentative schedule for assignments and exams is in the Google sheet



Overview

1 Review

2 Mathematical Framework for Decision Making

3 Markov Chains

4 Markov Reward Process

5 Markov Decision Process



Review



Types of Learning : Summary

Figure Source: Saggie


Characteristics of Reinforcement Learning

▶ Observations are non-i.i.d. and sequential in nature

▶ Agent's actions (may) affect the subsequent observations seen
▶ There is no supervisor; only a reward signal (feedback)
▶ Reward or feedback can be delayed



Reinforcement Learning : History

Slide Credit: RL Course, Abir Das


Course Setup



Mathematical Framework for Decision Making



RL Framework : Notations

Figure Source: Sutton and Barto


Markov Decision Process

▶ A Markov Decision Process (MDP) provides a mathematical framework for modeling
sequential decision making

▶ Can formally describe the working of the environment and agent in the RL setting

▶ Can handle a huge variety of interesting settings

⋆ Multi-armed bandits - single-state MDPs
⋆ Optimal control - continuous MDPs

▶ The core problem in solving an MDP is to find an 'optimal' policy (or behaviour) for the
decision maker (agent) in order to maximize the total future reward



Markov Chains



Random Variables and Stochastic Process

Random Variable (Non-mathematical definition)


A random variable is a variable whose value depends on the outcome of a random
phenomenon
▶ Outcome of a coin toss
▶ Outcome of the roll of a die

Stochastic Process
A stochastic or random process, denoted by {st }t∈T , can be defined as a collection of
random variables that is indexed by some mathematical set T
▶ Index set has the interpretation of time
▶ The set T is, typically, N or R



Notations

▶ Typically, in optimal control problems, the index set is continuous (say R)


▶ Throughout this course (RL), the index set is always discrete (say N)
▶ Let {st }t∈T be a stochastic process
▶ Let st be the state at time t of the stochastic process {st }t∈T



Markov Property

Markov Property
A state st of a stochastic process {st }t∈T is said to have the Markov property if

P (st+1 |st ) = P (st+1 |s1 , · · · , st )

The state st at time t captures all relevant information from history and is a sufficient
statistic of the future



Transition Probability

State Transition Probability


For a Markov state s and a successor state s′ , the state transition probability is defined by

Pss′ = P (st+1 = s′ |st = s)

State transition matrix P then denotes the transition probabilities from all states s to all
successor states s′ (with each row summing to 1)
 
P = [ P11  P12  · · ·  P1n ]
    [  ⋮                ⋮  ]
    [ Pn1  Pn2  · · ·  Pnn ]
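
As an aside (not part of the original slides), a transition matrix is conveniently stored as a row-stochastic NumPy array; the hypothetical helper below simply checks the row-sum property mentioned above.

```python
import numpy as np

def is_valid_transition_matrix(P, tol=1e-8):
    """Check that P is square, non-negative, and every row sums to 1."""
    P = np.asarray(P, dtype=float)
    return (P.ndim == 2 and P.shape[0] == P.shape[1]
            and np.all(P >= 0)
            and np.allclose(P.sum(axis=1), 1.0, atol=tol))

print(is_valid_transition_matrix([[0.8, 0.2], [0.7, 0.3]]))  # True
```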



Markov Chain
A stochastic process {st }t∈T is a Markov process or Markov chain if the sequence of
random states satisfies the Markov property. It is represented by the tuple < S, P > where S
denotes the set of states and P denotes the state transition probability
Example 1 : Simple Two State Markov Chain

▶ State S = {Sunny, Rainy}


▶ Transition Probability Matrix
 
P = [ 0.8  0.2 ]
    [ 0.7  0.3 ]
(rows and columns ordered as Sunny, Rainy)
Figure Source: https://bookdown.org/probability
Markov Chain : Example Revisited

State S = {Sunny, Rainy} and Transition Probability Matrix


 
P = [ 0.8  0.2 ]
    [ 0.7  0.3 ]

▶ Probability that tomorrow will be ’Rainy’ given today is ’Sunny’ = 0.2

Figure Source: https://bookdown.org/probability
Multi-Step Transitions

▶ The probability that the day after tomorrow will be 'Rainy' given today is 'Sunny' is given
by 0.2 * 0.3 + 0.8 * 0.2 = 0.22
In general, if the one-step transition matrix is given by

P = [ Pss  Psr ]
    [ Prs  Prr ]

then the two-step transition matrix is given by

P(2) = [ Pss ∗ Pss + Psr ∗ Prs    Pss ∗ Psr + Psr ∗ Prr ] = P²
       [ Prs ∗ Pss + Prr ∗ Prs    Prs ∗ Psr + Prr ∗ Prr ]
Figure Source: https://bookdown.org/probability
Multi-Step Transitions

In general, the n-step transition matrix is given by

P(n) = Pⁿ

Assumption
We made an important assumption in arriving at the above expression: that the one-step
transition matrix stays constant through time, i.e., is independent of time

▶ Markov chains generated using such transition matrices are called homogeneous
Markov chains
▶ For much of this course, we will consider homogeneous Markov chains, for which the
transition probabilities depend on the length of the time interval [t1 , t2 ] but not on the
exact time instants
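
A minimal NumPy sketch (illustrative, not from the slides) that computes P(n) = Pⁿ for the two-state Sunny/Rainy chain introduced earlier and reproduces the two-step probability 0.22 worked out above:

```python
import numpy as np

# One-step transition matrix of the Sunny/Rainy chain
# (rows and columns ordered as Sunny, Rainy)
P = np.array([[0.8, 0.2],
              [0.7, 0.3]])

def n_step_transition(P, n):
    """n-step transition matrix of a homogeneous Markov chain: P(n) = P^n."""
    return np.linalg.matrix_power(P, n)

P2 = n_step_transition(P, 2)
print(P2[0, 1])  # P(Rainy the day after tomorrow | Sunny today) ≈ 0.22
```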



Markov Chains : Examples
Example 2 : One dimensional random walk
A walker flips a coin every time slot to decide which ’way’ to go.
st+1 =  st + 1   with probability p
        st − 1   with probability 1 − p
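
A short simulation sketch of this walk; the function name, the seed, and the choice p = 0.5 are illustrative assumptions:

```python
import numpy as np

def simulate_random_walk(s0=0, p=0.5, n_steps=10, seed=0):
    """Simulate s_{t+1} = s_t + 1 with probability p, s_t - 1 otherwise."""
    rng = np.random.default_rng(seed)
    states = [s0]
    for _ in range(n_steps):
        step = 1 if rng.random() < p else -1
        states.append(states[-1] + step)
    return states

print(simulate_random_walk())  # a list of 11 states starting at s0 = 0
```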



Example 3 : Simple Grid World

▶ S = {s1 , s2 , s3 , s4 , s5 , s6 , s7 }
▶ P as shown above
▶ Example Markov Chains with s2 as start state
⋆ {s2 , s3 , s2 , s1 , s2 , · · · }
⋆ {s2 , s2 , s3 , s4 , s3 , · · · }

Slide Credit: Emma Brunskill, CS234 Stanford
Markov Chains : Examples

Example 4 : Dice roll experiment


Let {st }t∈T model the stochastic process representing the cumulative sum of rolls of a fair
six-sided die

Example 5 : Natural Language Processing


Let {st }t∈T model the stochastic process that keeps track of the chain of letters in a
sentence. Consider an example
Tomorrow is a sunny day

▶ We normally don't ask the question: what is the probability of the character 'a' appearing
given that the previous character is 'd'
▶ Sentence formation is typically non-Markovian



Notion of Absorbing State

Absorbing State
A state s ∈ S is called an absorbing state if it is impossible to leave the state. That is,

Pss′ =  1,  if s′ = s
        0,  otherwise
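
A small illustrative helper (assuming a row-stochastic NumPy matrix P) that finds absorbing states directly from this definition:

```python
import numpy as np

def absorbing_states(P, tol=1e-8):
    """Indices of states s with P[s, s] = 1, i.e. states that cannot be left."""
    P = np.asarray(P, dtype=float)
    return [s for s in range(P.shape[0]) if np.isclose(P[s, s], 1.0, atol=tol)]
```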



Markov Reward Process



Markov Reward Process

Markov Reward Process


A Markov reward process is a tuple < S, P, R, γ >; it is a Markov chain with values
▶ S : (Finite) set of states
▶ P : State transition probability
▶ R : Reward for being in state st is given by a deterministic function R

rt+1 = R(st )

▶ γ : Discount factor such that γ ∈ [0, 1]
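
To make the definition concrete, here is a hedged sketch of sampling a state/reward trajectory from an MRP; it assumes states are indexed 0..n-1, P is a row-stochastic NumPy matrix, and R is a function from a state index to its reward (all names are illustrative):

```python
import numpy as np

def sample_mrp_trajectory(P, R, s0, n_steps, seed=None):
    """Sample states s_0..s_T and rewards r_1..r_T, with r_{t+1} = R(s_t)."""
    rng = np.random.default_rng(seed)
    states, rewards = [s0], []
    for _ in range(n_steps):
        s = states[-1]
        rewards.append(R(s))                        # reward for being in state s
        states.append(rng.choice(len(P), p=P[s]))   # next state drawn from row P[s]
    return states, rewards
```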



Simple Grid World : Revisited

▶ For the Markov chain {s2 , s3 , s2 , s1 , s2 , · · · } the corresponding reward sequence is
{−1, 0, −1, −6, −1, · · · }
▶ Note that there is no notion of action in an MRP

Slide Credit: Emma Brunskill, CS234 Stanford
Example : Snakes and Ladders



Example : Snakes and Ladders

▶ States S : {s1 , s2 , · · · , s100 }


▶ Transition Probability P :
⋆ What is the probability to move from state 2 to 6 in one step ?
⋆ What are the states that can be visited in one-step from state 2 ?
⋆ What is the probability to move from state 2 to 4 ?
⋆ Can we transition from state 15 to 7 in one step ?
Question : Is the transition matrix independent of time ?

Question : Can we formulate the game of Snakes and Ladders as an MRP ?

We need to define a suitable reward function and discount factor



On Rewards : Total Return

▶ At each time step t, there is a reward rt+1 associated with being in state st
▶ Ideally, we would like the agent to pick trajectories along which the cumulative
reward is high

Question : How can we formalize this ?

Answer : If the reward sequence is given by {rt+1 , rt+2 , rt+3 , · · · }, then, we want to
maximize the sum
rt+1 + rt+2 + rt+3 + · · ·
Define Gt to be

Gt = rt+1 + rt+2 + rt+3 + · · · = Σ_{k=0}^{∞} rt+k+1

The goal of the agent is to pick paths that maximize Gt



Total (Discounted) Return
Recall that

Gt = rt+1 + rt+2 + rt+3 + · · · = Σ_{k=0}^{∞} rt+k+1

▶ In case the underlying stochastic process has infinitely many terms, the above
summation could diverge
Therefore, we introduce a discount factor γ ∈ [0, 1] and redefine Gt as

Gt = rt+1 + γrt+2 + γ²rt+3 + · · · = Σ_{k=0}^{∞} γ^k rt+k+1

▶ Gt is the total discounted return starting from time t


▶ If γ < 1 then the infinite sum has a finite value if the reward sequence is bounded
▶ With γ close to 0, the agent is concerned only with immediate reward(s) (myopic)
▶ With γ close to 1, the agent considers future rewards more strongly (far-sighted)
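
A tiny sketch computing the discounted return of a finite reward sequence; the function name is illustrative, and the example sequence is the grid-world reward sequence seen earlier:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k>=0} gamma^k * r_{t+k+1} for a finite reward sequence."""
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r
    return G

# Reward sequence from the grid-world example, discounted with gamma = 0.9
print(discounted_return([-1, 0, -1, -6, -1], gamma=0.9))
```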
Few Remarks on Discounting

▶ Mathematically convenient to discount rewards

▶ Avoids infinite returns in cyclic and infinite horizon settings
▶ The discount rate determines the present value of future rewards
▶ Offers a trade-off between being 'myopic' and 'far-sighted'
▶ In finite MDPs, it is sometimes possible to use undiscounted reward (i.e. γ = 1), for
example, if all sequences terminate



Snakes and Ladders : Revisited

Question : What can be a suitable reward function and discount factor to describe
'Snakes and Ladders' as a Markov reward process ?
▶ Goal : From any given state, reach s100 in as few steps as possible
▶ Reward R : R(s) = −1 for s ∈ {s1 , · · · , s99 } and R(s100 ) = 0
▶ Discount Factor γ = 1
Snakes and Ladders : Revisited

Question : Are all intermediate states equally 'valuable' just because they have equal
reward ?



Value Function

The value function V (s) gives the long-term value of state s ∈ S



V (s) = E (Gt | st = s) = E ( Σ_{k=0}^{∞} γ^k rt+k+1 | st = s )

▶ Value function V (s) determines the value of being in state s


▶ V (s) measures the potential future rewards we may get from being in state s
▶ V (s) is independent of t
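
One way to make the expectation concrete is a Monte Carlo estimate that averages sampled discounted returns. The sketch below is illustrative (truncating episodes at a fixed horizon is an approximation) and assumes the same row-stochastic matrix P and reward function R as in the earlier sampling sketch:

```python
import numpy as np

def mc_value_estimate(P, R, s, gamma, n_episodes=1000, horizon=200, seed=0):
    """Estimate V(s) = E[G_t | s_t = s] by averaging truncated discounted returns."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        state, G, discount = s, 0.0, 1.0
        for _ in range(horizon):
            G += discount * R(state)                # r_{t+1} = R(s_t)
            discount *= gamma
            state = rng.choice(len(P), p=P[state])  # follow the chain
        total += G
    return total / n_episodes
```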



Value Function Computation : Example
Consider the following MRP. Assume γ = 1

▶ V (s1 ) = 6.8
▶ V (s2 ) = 1 + γ ∗ 6 = 7
▶ V (s3 ) = 3 + γ ∗ 6 = 9
▶ V (s4 ) = 6
Example : Snakes and Ladders

Question : How can we evaluate the value of each state in a large MRP such as ’Snakes
and Ladders ’ ?



Decomposition of Value Function

Let s be the state at time step t and s′ its successor at time step t + 1. The value function
can be decomposed into the sum of two parts:
▶ the immediate reward rt+1
▶ the discounted value of the next state s′ (i.e. γV (s′ ))


V (s) = E (Gt | st = s) = E ( Σ_{k=0}^{∞} γ^k rt+k+1 | st = s )
       = E (rt+1 + γV (st+1 ) | st = s)



Decomposition of Value Function
Recall that

Gt = rt+1 + γrt+2 + γ²rt+3 + · · · = Σ_{k=0}^{∞} γ^k rt+k+1

V (s) = E (Gt | st = s) = E ( Σ_{k=0}^{∞} γ^k rt+k+1 | st = s )
      = E ( rt+1 + γrt+2 + γ²rt+3 + · · · | st = s )
      = E(rt+1 | st = s) + Σ_{k=1}^{∞} γ^k E (rt+k+1 | st = s)
      = E(rt+1 | st = s) + γ Σ_{s′∈S} P (s′ |s) Σ_{k=0}^{∞} γ^k E (rt+k+2 | st = s, st+1 = s′ )
      = E(rt+1 | st = s) + γ Σ_{s′∈S} P (s′ |s) Σ_{k=0}^{∞} γ^k E (rt+k+2 | st+1 = s′ )   (Markov property)
      = E(rt+1 | st = s) + γ Σ_{s′∈S} P (s′ |s) V (s′ )
      = E(rt+1 + γV (st+1 ) | st = s)
Value Function : Evaluation
We have
V (s) = E(rt+1 + γV (st+1 )|st = s)

V (s) = R(s) + γ [ Pss′a V (s′a ) + Pss′b V (s′b ) + Pss′c V (s′c ) + Pss′d V (s′d ) ]



Value Function Computation : Example
Consider the following MRP. Assume γ = 1

▶ V (s4 ) = 6
▶ V (s3 ) = 3 + γ ∗ 6 = 9
▶ V (s2 ) = 1 + γ ∗ 6 = 7
▶ V (s1 ) = − 1 + γ ∗ (0.6 ∗ 7 + 0.4 ∗ 9) = 6.8
Bellman Equation for Markov Reward Process

V (s) = E(rt+1 + γV (st+1 )|st = s)



For any successor state s′ ∈ S of s with transition probability Pss′ , we can rewrite the
above equation as (using the definition of expectation)

V (s) = E(rt+1 |st = s) + γ Σ_{s′∈S} Pss′ V (s′ )

This is the Bellman Equation for value functions



Snakes and Ladders
Question : How can we evaluate the value of (all) states using the value function
decomposition ?

V (s) = E(rt+1 |st = s) + γ Σ_{s′∈S} Pss′ V (s′ )



Bellman Equation in Matrix Form

Let S = {1, 2, · · · , n} and P be known. Then one can write the Bellman equation as

V = R + γPV

where

[ V (1) ]   [ R(1) ]       [ P11  P12  · · ·  P1n ]   [ V (1) ]
[ V (2) ] = [ R(2) ]  + γ  [ P21  P22  · · ·  P2n ] × [ V (2) ]
[   ⋮   ]   [   ⋮   ]      [  ⋮                ⋮  ]   [   ⋮   ]
[ V (n) ]   [ R(n) ]       [ Pn1  Pn2  · · ·  Pnn ]   [ V (n) ]

Solving for V , we get

V = (I − γP)⁻¹ R

The discount factor should be γ < 1 for the inverse to exist
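
A minimal NumPy sketch of this matrix-form solution; using a linear solve of (I − γP)V = R is numerically preferable to forming the inverse explicitly. The 2-state example numbers are made up for illustration:

```python
import numpy as np

def solve_mrp_value(P, R, gamma):
    """Solve V = R + gamma * P V, i.e. (I - gamma * P) V = R."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Illustrative 2-state MRP (numbers are made up, not from the slides)
P = np.array([[0.8, 0.2],
              [0.7, 0.3]])
R = np.array([1.0, -1.0])
print(solve_mrp_value(P, R, gamma=0.9))
```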



Example : Snakes and Ladders

▶ We can now compute the value of states in such a 'large' MRP using the matrix form of
the Bellman equation
▶ The value computed for a particular state gives (the negative of) the expected number of
plays needed to reach the goal state s100 from that state
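
As a hedged illustration of this computation, the sketch below builds a transition matrix for a simplified board (a fair six-sided die, a roll that overshoots the last square leaves the player in place, and the snake/ladder positions are invented since the actual board is not given on the slides) and then solves the Bellman equation over the non-goal states with R(s) = −1 and γ = 1:

```python
import numpy as np

def snakes_and_ladders_P(n_squares=100, jumps=None):
    """One-step transition matrix for a simplified Snakes and Ladders board."""
    jumps = jumps or {}               # maps a square to where its snake/ladder sends you
    goal = n_squares - 1
    P = np.zeros((n_squares, n_squares))
    for s in range(n_squares):
        if s == goal:
            P[s, s] = 1.0             # absorbing goal state (s100)
            continue
        for d in range(1, 7):         # die outcomes 1..6
            nxt = s + d
            if nxt > goal:
                nxt = s               # overshoot: stay put (assumed house rule)
            nxt = jumps.get(nxt, nxt) # apply snake or ladder, if any
            P[s, nxt] += 1.0 / 6.0
    return P

# R(s) = -1 for non-goal states, R(s100) = 0, gamma = 1. Since the goal is
# absorbing, solve the Bellman equation over the 99 transient states only.
P = snakes_and_ladders_P(jumps={3: 21, 16: 6})   # one invented ladder and snake
R = -np.ones(99)
V = np.linalg.solve(np.eye(99) - P[:99, :99], R)
print(-V[1])   # expected number of plays starting from square 2
```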
Markov Decision Process



Markov Decision Process

A Markov decision process is a tuple < S, A, P, R, γ > where


▶ S : (Finite) set of states
▶ A : (Finite) set of actions
▶ P : State transition probability
Pᵃss′ = P(st+1 = s′ |st = s, at = a),   at ∈ A

▶ R : Reward for taking action at at state st and transitioning to state st+1 is given by
the deterministic function R

rt+1 = R(st , at , st+1 )

▶ γ : Discount factor such that γ ∈ [0, 1]
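
One possible way to hold this tuple in code, as an illustrative sketch only (the array layout, names, and reward signature R(s, a, s′) are assumptions, not the course's notation):

```python
import numpy as np

class MDP:
    """Minimal container for the tuple <S, A, P, R, gamma>, with S and A as index sets."""

    def __init__(self, P, R, gamma):
        self.P = np.asarray(P)   # shape (|S|, |A|, |S|); each row P[s, a, :] sums to 1
        self.R = R               # callable R(s, a, s_next) -> reward
        self.gamma = gamma

    def step(self, s, a, rng=None):
        """Sample s' ~ P(.|s, a) and return (s', r) with r = R(s, a, s')."""
        rng = rng or np.random.default_rng()
        s_next = rng.choice(self.P.shape[2], p=self.P[s, a])
        return s_next, self.R(s, a, s_next)
```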



Wealth Management Problem

▶ States S : Current value of the portfolio and current valuation of instruments in the
portfolio
▶ Actions A : Buy / Sell instruments of the portfolio
▶ Reward R : Return on portfolio compared to previous decision epoch



Navigation Problem

▶ States S : Squares of the grid


▶ Actions A : Any of the four directions possible
▶ Reward R : -1 for every move made until reaching goal state
Example : Atari Games

▶ States S : Possible set of all (Atari) images


▶ Actions A : Move the paddle up or down
▶ Reward R : +1 for making the opponent miss the ball; -1 if the agent misses the ball; 0
otherwise
Flow Diagram

▶ The goal is to choose a sequence of actions such that the expected total discounted
future reward E(Gt |st = s) is maximized where

Gt = Σ_{k=0}^{∞} γ^k rt+k+1



Windy Grid World : Stochastic Environment
Recall that given an MDP < S, A, P, R, γ >, we have the state transition probability P defined
as
Pᵃss′ = P(st+1 = s′ |st = s, at = a),   at ∈ A

▶ In general, note that even after choosing action a at state s (as prescribed by the
policy) the next state s′ need not be a fixed state



Finite and Infinite Horizon MDPs

▶ If T is fixed and finite, the resultant MDP is a finite horizon MDP


⋆ Wealth management problem
▶ If T is infinite, the resultant MDP is infinite horizon MDP
⋆ Certain Atari games
▶ When |S| is finite, the MDP is called a finite-state MDP



Grid World Example

Question : Is the grid world a finite or infinite horizon problem ? Why ?

(Stochastic shortest path MDPs)


▶ For finite horizon MDPs and stochastic shortest path MDPs, one can use γ = 1

