Lecture2-MRP (RL IITH)
Easwar Subramanian
TCS Innovation Labs, Hyderabad
Email : [email protected]
▶ Please consult Prof. Konda Reddy for all queries related to registration and other
administrative issues
▶ If need be, register for CS 5500 instead of AI 3000 (relevant for MDS / CS students)
▶ The Piazza course page is ready; enrollments are yet to be done
▶ The tentative schedule for assignments and exams is in the Google sheet
1 Review
3 Markov Chains
▶ An MDP can formally describe the working of the environment and the agent in the RL setting
▶ The core problem in solving an MDP is to find an 'optimal' policy (or behaviour) for the
decision maker (agent) in order to maximize the total future reward
Stochastic Process
A stochastic or random process, denoted by $\{s_t\}_{t \in T}$, can be defined as a collection of
random variables indexed by some mathematical set $T$
▶ The index set $T$ has the interpretation of time
▶ The set $T$ is typically $\mathbb{N}$ or $\mathbb{R}$
Markov Property
A state $s_t$ of a stochastic process $\{s_t\}_{t \in T}$ is said to have the Markov property if
$$P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, s_2, \ldots, s_t)$$
The state $s_t$ at time $t$ captures all relevant information from the history and is a sufficient
statistic of the future
The state transition matrix $P$ then denotes the transition probabilities from all states $s$ to all
successor states $s'$ (with each row summing to 1):
$$P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ \vdots & & & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix}$$
Figure Source: https://bookdown.org/probability
Multi-Step Transitions
The $n$-step transition probability matrix $P^{(n)}$ satisfies
$$P^{(n)} = P^{n}$$
Assumption
We made an important assumption in arriving at the above expression: that the one-step
transition matrix stays constant through time, i.e., is independent of time
▶ Markov chains generated using such transition matrices are called homogeneous
Markov chains
▶ For much of this course, we will consider homogeneous Markov chains, for which the
transition probabilities depend only on the length of the time interval $[t_1, t_2]$ and not on the
exact time instants (see the sketch below)
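To make the relation $P^{(n)} = P^{n}$ concrete, here is a minimal sketch for a homogeneous chain; the transition matrix values below are made up for illustration and are not the chain from the lecture's figure.

```python
import numpy as np

# Illustrative one-step transition matrix of a homogeneous 3-state Markov chain
# (values assumed for this sketch, not taken from the lecture's example).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])
assert np.allclose(P.sum(axis=1), 1.0)   # every row is a probability distribution

# For a homogeneous chain, the n-step transition matrix is the n-th matrix power.
n = 4
P_n = np.linalg.matrix_power(P, n)

print(P_n)               # P_n[i, j] = probability of being in state j after n steps from i
print(P_n.sum(axis=1))   # rows of P^n still sum to 1
```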
▶ S = {s1 , s2 , s3 , s4 , s5 , s6 , s7 }
▶ P as shown above
▶ Example realisations of the Markov chain with s2 as the start state
⋆ {s2 , s3 , s2 , s1 , s2 , · · · }
⋆ {s2 , s2 , s3 , s4 , s3 , · · · }
▶ We normally do not ask what the probability of the character 'a' appearing is, given only
that the previous character is 'd'
▶ Sentence formation is typically non-Markovian: the next character depends on more than
just the previous one
Absorbing State
A state s ∈ S is called an absorbing state if it is impossible to leave it. That is,
$$P_{ss'} = \begin{cases} 1, & \text{if } s' = s \\ 0, & \text{otherwise} \end{cases}$$
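As an aside, here is a minimal sketch of sampling trajectories of a Markov chain, stopping once an absorbing state is entered. The 4-state transition matrix (with the last state absorbing) is an assumed toy example, not the chain shown in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-state chain (assumed values); state 3 is absorbing, i.e. P[3, 3] = 1.
P = np.array([
    [0.0, 0.6, 0.4, 0.0],
    [0.2, 0.0, 0.5, 0.3],
    [0.0, 0.3, 0.0, 0.7],
    [0.0, 0.0, 0.0, 1.0],
])

def sample_trajectory(P, start, max_steps=50):
    """Sample one realisation of the chain, stopping at an absorbing state."""
    trajectory = [start]
    s = start
    for _ in range(max_steps):
        if P[s, s] == 1.0:                    # absorbing: impossible to leave
            break
        s = int(rng.choice(len(P), p=P[s]))   # draw the successor state
        trajectory.append(s)
    return trajectory

# States are 0-indexed here, so start=1 plays the role of "s2" above.
print(sample_trajectory(P, start=1))
```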
$$r_{t+1} = R(s_t)$$
▶ At each time step $t$, there is a reward $r_{t+1}$ associated with being in state $s_t$
▶ Ideally, we would like the agent to follow trajectories along which the accumulated
cumulative reward is high
Answer : If the reward sequence is given by $\{r_{t+1}, r_{t+2}, r_{t+3}, \ldots\}$, then we want to
maximize the sum
$$r_{t+1} + r_{t+2} + r_{t+3} + \cdots$$
Define $G_t$ to be
$$G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots = \sum_{k=0}^{\infty} r_{t+k+1}$$
▶ In the case that the underlying stochastic process has infinitely many terms, the above
summation could diverge
Therefore, we introduce a discount factor $\gamma \in [0, 1]$ and redefine $G_t$ as
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
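A small sketch of computing the discounted return $G_t$ for a given reward sequence; the rewards and the value of $\gamma$ below are arbitrary and only for illustration.

```python
# G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
def discounted_return(rewards, gamma):
    """Discounted return of a (finite prefix of a) reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]           # assumed reward sequence
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9 + 0.81 + ... = 4.0951

# With a constant reward r and gamma < 1, the infinite sum converges to the
# geometric-series limit r / (1 - gamma); here that would be 1 / (1 - 0.9) = 10.
```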
Question : What can be a suitable reward function and discount factor to describe
’Snake and Ladders’ as a Markov reward process ?
▶ Goal : From any given state reach s100 in as few steps as possible
▶ Reward R : R(s) = −1 for s ∈ {s1 , · · · , s99 } and R(s100 ) = 0
▶ Discount factor γ = 1 (a simulation sketch of this formulation follows below)
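A Monte Carlo sketch of this formulation. The board layout (the positions of the snakes and ladders) and the rule that overshooting rolls are simply ignored are assumptions made only for illustration; the lecture does not fix a particular board.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical snakes (down) and ladders (up): landing square -> destination square.
JUMPS = {3: 22, 8: 30, 28: 84, 17: 4, 54: 34, 62: 19, 95: 75}

def play_once(start=1):
    """Roll a fair die until s100 is reached; rolls that overshoot 100 are ignored
    (one common rule variant, assumed here). Returns the undiscounted return with
    R(s) = -1 for every non-terminal state and gamma = 1."""
    s, total_reward = start, 0
    while s != 100:
        total_reward -= 1                 # reward -1 for being in a non-terminal state
        nxt = s + rng.integers(1, 7)      # fair six-sided die
        if nxt <= 100:
            s = JUMPS.get(nxt, nxt)       # apply a snake/ladder if we landed on one
    return total_reward

returns = [play_once() for _ in range(20_000)]
# Estimate of V(s1); its magnitude approximates the expected number of rolls.
print(np.mean(returns))
```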
Snakes and Ladders : Revisited
Question : Are all intermediate states equally ’valuable ’ just because they have equal
reward ?
▶ V (s1 ) = 6.8
▶ V (s2 ) = 1 + γ ∗ 6 = 7
▶ V (s3 ) = 3 + γ ∗ 6 = 9
▶ V (s4 ) = 6
Example : Snakes and Ladders
Question : How can we evaluate the value of each state in a large MRP such as ’Snakes
and Ladders ’ ?
Let $s$ and $s'$ be the states at time steps $t$ and $t+1$ respectively; the value function can then be
decomposed into the sum of two parts:
▶ the immediate reward $r_{t+1}$
▶ the discounted value of the next state $s'$ (i.e. $\gamma V(s')$)
$$V(s) = E(G_t \mid s_t = s) = E\!\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right) = E\big(r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s\big)$$
$$\begin{aligned}
V(s) = E(G_t \mid s_t = s) &= E\!\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right) \\
&= E\big(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s\big) \\
&= E(r_{t+1} \mid s_t = s) + \sum_{k=1}^{\infty} \gamma^k \, E(r_{t+k+1} \mid s_t = s) \\
&= E(r_{t+1} \mid s_t = s) + \gamma \sum_{s' \in S} P(s' \mid s) \sum_{k=0}^{\infty} \gamma^k \, E(r_{t+k+2} \mid s_t = s, s_{t+1} = s') \\
&= E(r_{t+1} \mid s_t = s) + \gamma \sum_{s' \in S} P(s' \mid s) \sum_{k=0}^{\infty} \gamma^k \, E(r_{t+k+2} \mid s_{t+1} = s') \quad \text{(Markov property)} \\
&= E\big(r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s\big)
\end{aligned}$$
Value Function : Evaluation
We have
$$V(s) = E\big(r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s\big)$$
For a state $s$ whose successor states are $s'_a, s'_b, s'_c, s'_d$ (as in the example figure), this reads
$$V(s) = R(s) + \gamma \left[\, P_{s s'_a} V(s'_a) + P_{s s'_b} V(s'_b) + P_{s s'_c} V(s'_c) + P_{s s'_d} V(s'_d) \,\right]$$
▶ V (s4 ) = 6
▶ V (s3 ) = 3 + γ ∗ 6 = 9
▶ V (s2 ) = 1 + γ ∗ 6 = 7
▶ V (s1 ) = − 1 + γ ∗ (0.6 ∗ 7 + 0.4 ∗ 9) = 6.8
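A minimal sketch that reproduces the numbers above. The structure is inferred from the arithmetic on the slide (an assumption, since the figure is not reproduced here): s1 moves to s2 with probability 0.6 and to s3 with probability 0.4, s2 and s3 both move to s4, s4 is terminal, γ = 1, and the rewards are R = (−1, 1, 3, 6).

```python
import numpy as np

gamma = 1.0
R = np.array([-1.0, 1.0, 3.0, 6.0])      # rewards for s1, s2, s3, s4 (assumed)
P = np.array([
    [0.0, 0.6, 0.4, 0.0],                # s1 -> s2 (0.6), s1 -> s3 (0.4)
    [0.0, 0.0, 0.0, 1.0],                # s2 -> s4
    [0.0, 0.0, 0.0, 1.0],                # s3 -> s4
    [0.0, 0.0, 0.0, 0.0],                # s4 terminal (no successors by convention)
])

# One-step lookahead V(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s'),
# evaluated backwards from the terminal state.
V = np.zeros(4)
V[3] = R[3]                              # V(s4) = 6
V[2] = R[2] + gamma * P[2] @ V           # V(s3) = 3 + 6 = 9
V[1] = R[1] + gamma * P[1] @ V           # V(s2) = 1 + 6 = 7
V[0] = R[0] + gamma * P[0] @ V           # V(s1) = -1 + 0.6*7 + 0.4*9 = 6.8
print(V)                                 # [6.8, 7.0, 9.0, 6.0]
```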
Bellman Equation for Markov Reward Process
Let S = {1, 2, · · · , n} and let P be known. Then one can write the Bellman equation as
V = R + γPV
where
$$\begin{bmatrix} V(1) \\ V(2) \\ \vdots \\ V(n) \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(n) \end{bmatrix} + \gamma \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1n} \\ P_{21} & P_{22} & \cdots & P_{2n} \\ \vdots & \vdots & & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{nn} \end{bmatrix} \begin{bmatrix} V(1) \\ V(2) \\ \vdots \\ V(n) \end{bmatrix}$$
Solving for $V$, we get
$$V = (I - \gamma P)^{-1} R$$
The discount factor should satisfy $\gamma < 1$ for the inverse to be guaranteed to exist
▶ We can now compute the value of the states in such a 'large' MRP using the matrix form of
the Bellman equation (see the sketch below)
▶ The value computed for a particular state is the negative of the expected number of
plays needed to reach the goal state s100 from that state
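A minimal sketch of solving the Bellman equation in matrix form. The 3-state MRP below uses made-up numbers (not the Snakes and Ladders board), with γ = 0.9 so that $(I - \gamma P)$ is invertible; np.linalg.solve is used instead of forming the inverse explicitly.

```python
import numpy as np

gamma = 0.9
P = np.array([                       # assumed one-step transition matrix
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.0, 0.5, 0.5],
])
R = np.array([-1.0, 2.0, 0.5])       # assumed per-state rewards

# Closed-form solution of V = R + gamma * P V, i.e. V = (I - gamma P)^{-1} R.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

print(V)
print(np.allclose(V, R + gamma * P @ V))   # True: V satisfies the Bellman equation
```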
Markov Decision Process
▶ R : The reward for taking action $a_t$ at state $s_t$ and transitioning to state $s_{t+1}$ is given by
the deterministic function R
▶ States S : Current value of the portfolio and current valuation of instruments in the
portfolio
▶ Actions A : Buy / Sell instruments of the portfolio
▶ Reward R : Return on portfolio compared to previous decision epoch
▶ The goal is to choose a sequence of actions such that the expected total discounted
future reward $E(G_t \mid s_t = s)$ is maximized, where
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
▶ In general, note that even after choosing action a at state s (as prescribed by the
policy), the next state s′ need not be a fixed state; the transition to s′ is itself stochastic
(see the sketch below)
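Policies are treated formally later in the course, but as a rough illustration of the goal above: once a fixed policy π is chosen, the MDP reduces to an MRP, whose value can be computed with the same matrix-form Bellman equation. All numbers below are hypothetical, and the simpler reward form R(a, s) is used for brevity instead of a reward that also depends on the next state.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2-state, 2-action MDP.
# P[a, s, s2] = probability of moving to state s2 after taking action a in state s.
P = np.array([
    [[0.8, 0.2],       # action 0: rows are current states, columns are next states
     [0.3, 0.7]],
    [[0.5, 0.5],       # action 1
     [0.1, 0.9]],
])
R = np.array([
    [1.0, 0.0],        # R[a, s]: reward for taking action a in state s
    [0.5, 2.0],
])

# A fixed stochastic policy pi[s, a] = probability of taking action a in state s.
pi = np.array([
    [0.6, 0.4],
    [0.2, 0.8],
])

# The policy induces an MRP: average the dynamics and rewards over the actions.
P_pi = np.einsum('sa,ast->st', pi, P)   # P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
R_pi = np.einsum('sa,as->s', pi, R)     # R_pi[s]     = sum_a pi(a|s) R(s, a)

# Expected discounted return E(G_t | s_t = s) under this policy, for every state.
V_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(V_pi)
```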