
Lecture 5: Markov Decision Processes

Akshay Krishnamurthy
[email protected]
February 21, 2022

1 Recap and Introduction


So far, we have been discussing simple exploration problems in various “bandit” formulations. In all of these
problems the agent chooses actions which influence the immediate reward but, in the bandit protocols, these actions
have no further influence on future interactions. This means that our actions have no long-term consequences, or
equivalently, credit assignment is relatively straightforward. However, many sequential decision making scenarios
cannot be effectively modeled as bandit problems because actions do have long-term consequences. Developing
algorithms for these settings requires introducing new interaction protocols/models.
While we have seen relatively sophisticated algorithms for bandit problems, such as LinUCB and SquareCB, to
incorporate credit assignment, we’ll have to take a step back and simplify considerably. In particular, today we’ll
think only about the credit assignment capability, in the absence of both exploration and generalization.

2 Markov Decision Processes


A Markov decision process formalizes a decision making problem with state that evolves as a consequence of the agent's actions. The schematic is displayed in Figure 1.

Figure 1: A schematic of a Markov decision process, showing states s_0, s_1, s_2, s_3, the actions a_0, a_1, a_2 taken in them, and the resulting rewards r_0, r_1, r_2.

Here the basic objects are:


• A state space S, which could be finite or infinite. For now let us think of this as finite and relatively small.
• An action space A, which could also be finite or infinite. Again let’s assume this is small for now.
• A reward function R : S × A → ∆([0, 1]) that associates a distribution over rewards to each state-action pair.
We’ll say that r ∼ R(s, a) and also that r(s, a) := E[r | s, a] which is a slight abuse of notation.
• A transition operator P : S × A → ∆(S) that associates a distribution over next states to each state-action
pair. As with the reward, we'll say that s' ∼ P(s, a) to denote a sample drawn from this operator.
• An initial state distribution µ ∈ ∆(S) that describes how the initial state s_0 will be chosen.
A key property is that the dynamics are Markovian, meaning that the distribution of the next state s' and reward r depend only on the most recent state s and action a. With these basic objects there are many formulations.
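To make these objects concrete, here is a minimal sketch of how a small tabular MDP might be stored in code (Python with numpy); the names S, A, P, R, mu and the helper step are illustrative, not from the notes.

import numpy as np

# A minimal, illustrative tabular MDP.
# P[s, a] is a distribution over next states, R[s, a] is the mean reward r(s, a),
# and mu is the initial state distribution.
S, A = 3, 2
rng = np.random.default_rng(0)

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)    # normalize each P[s, a] to sum to 1
R = rng.random((S, A))               # mean rewards in [0, 1]
mu = np.ones(S) / S                  # uniform initial state distribution

def step(s, a):
    """Sample one transition: r ~ R(s, a) (here Bernoulli) and s' ~ P(s, a)."""
    r = float(rng.random() < R[s, a])
    s_next = int(rng.choice(S, p=P[s, a]))
    return r, s_next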

3 Finite horizon, episodic setting
Let us start with the simpler finite horizon setting. In this setting, there is also a horizon H ∈ N that describes how
long an episode lasts. With horizon H, an episode produces a trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . . , s_{H−1}, a_{H−1}, r_{H−1}) in the following manner: s_0 ∼ µ, s_h ∼ P(s_{h−1}, a_{h−1}) and r_h ∼ R(s_h, a_h) for each h, and all actions a_{0:H−1} are chosen by the agent. The Markov property simply means that (s_h, r_h) ⊥ (s_{0:h−2}, a_{0:h−2}, r_{0:h−2}) | (s_{h−1}, a_{h−1}), so the entire past is summarized in the current state. Note however that we can choose actions in a non-Markovian manner.

3.1 Policies, Value functions, and Objective


To set up the objective, we must introduce the concept of a policy. In general, a policy is a decision making strategy that makes a (possibly randomized) decision based on the current trajectory, that is π : ℋ → ∆(A) where ℋ is the set of all partial trajectories. A non-stationary Markov policy instead compresses the history to the current state and time-step only, that is π : S × [H] → ∆(A). For such policies, we use the notation π_h(s) ∈ ∆(A) to denote the action distribution on state s and time step h. If the policy is deterministic, then the range is just A.

Objective. Every policy π has a value, which is the total reward we expect to accumulate if we deploy policy π.
This is defined as
"H−1 #
X
J(π) := E rh | s0 ∼ µ, a0:H−1 ∼ π
h=0

The objective in an MDP is to find a policy π that maximizes J(π).
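The definition of J(π) suggests a simple (if inefficient) way to estimate it: roll out episodes and average the total reward. A hedged sketch, reusing the illustrative tabular MDP above and assuming policy[h][s] is a distribution over actions:

def estimate_J(policy, H, n_episodes=10_000):
    """Monte Carlo estimate of J(pi): average total reward over sampled episodes."""
    total = 0.0
    for _ in range(n_episodes):
        s = int(rng.choice(S, p=mu))
        for h in range(H):
            a = int(rng.choice(A, p=policy[h][s]))
            r, s = step(s, a)
            total += r
    return total / n_episodes

For instance, the uniformly random policy corresponds to policy = [np.full((S, A), 1 / A) for _ in range(H)].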

Value functions. For a Markov policy π, we can consider a more-refined notion: how much reward would we
get if we started in state s and time h and executed policy π until the end of the episode? This defines the value
function or the state-value function
" H−1 #
X
π
Vh (s) := E rh0 | sh = s, ah:H−1 ∼ π .
h0 =h

The state-action value function or just action-value function is similar. It describes how much reward we would get
if we started in state s at time h, took action a first, then followed π for the subsequent steps. This is defined as
" H−1 #
X
π
Qh (s, a) := E rh0 | sh = s, ah = a, ah+1:H−1 ∼ π .
h0 =h

It is worth making a couple of observations. First, we can write V^π_h(s) = E_{a_h ∼ π_h(s)}[Q^π_h(s, a_h)] to relate the two types of value functions. A related and more fundamental property is that these functions satisfy a recursive relationship, known as Bellman equations (or Bellman equations for policy evaluation):

V^π_h(s) = E[ r_h + V^π_{h+1}(s_{h+1}) | s_h = s, a_h ∼ π ]
Q^π_h(s, a) = E[ r_h + Q^π_{h+1}(s_{h+1}, a_{h+1}) | s_h = s, a_h = a, a_{h+1} ∼ π ]     (1)

Intuitively, the reward we expect to get by following π from (s, h) is the immediate reward r_h plus the reward we expect to get by following π from the next state and the next time step.
Next, if π is Markov, then J(π) = E_{s_0 ∼ µ}[V^π_0(s_0)], which relates the MDP objective to the value functions.
Finally, for non-Markov policies, we cannot really define these functions since the policy’s actions may depend on
information from the past that is not available. However, we will next see that the optimal policy is Markovian,
which justifies our effort in setting up these definitions.
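Because the Bellman equations (1) express V^π_h and Q^π_h in terms of quantities at step h + 1, a Markov policy can be evaluated exactly by a single backward pass. A sketch for the illustrative tabular MDP above (the function name is ours):

def policy_evaluation(policy, H):
    """Exact policy evaluation via the Bellman equations (1), backward in h.

    policy[h] is an S x A array of action probabilities; returns V[h, s] and Q[h, s, a]."""
    V = np.zeros((H + 1, S))                    # V^pi_H = 0: the episode has ended
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = R + P @ V[h + 1]                 # Q^pi_h(s,a) = r(s,a) + E_{s'~P(s,a)}[V^pi_{h+1}(s')]
        V[h] = (policy[h] * Q[h]).sum(axis=1)   # V^pi_h(s) = E_{a~pi_h(s)}[Q^pi_h(s,a)]
    return V, Q

With this, J(π) = mu @ V[0], matching the identity J(π) = E_{s_0 ∼ µ}[V^π_0(s_0)] above.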

3.2 Optimality and Bellman equations
A fundamental result in the theory of Markov decision processes is that the optimal policy is Markovian and the optimal value functions satisfy an elegant recursive definition. This is captured in the following theorem.

Theorem 1. Define the value functions (V*_0, . . . , V*_H) recursively as

V*_H ≡ 0,     ∀(s, h) : V*_h(s) = max_a { r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] }.     (2)

Then sup_π J(π) = E_{s_0 ∼ µ}[V*_0(s_0)], where the supremum is taken over all, possibly non-Markovian, policies. Additionally, define the policy π* := (π*_0, . . . , π*_{H−1}) as

∀(s, h) : π*_h(s) = argmax_a { r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] }.

Then π* achieves value V*_h(s) for all (s, h) pairs and hence π* is an optimal policy.
The two important aspects of this theorem are: (a) we need not consider non-Markovian policies since the
optimal policy is Markov, (b) there is a simple recursive formula for computing the optimal value function and the
optimal policy can be concisely described in terms of this value function.
An analogous result holds for the action-value functions if we define them as

Q*_h(s, a) = r(s, a) + E_{s' ∼ P(s,a)}[ max_{a'} Q*_{h+1}(s', a') ].     (3)

Then we can simply define the optimal policy as π*_h : s ↦ argmax_a Q*_h(s, a). Both (2) and (3) are referred to as Bellman optimality equations. Note that these equations show us how to address credit assignment in some sense. Indeed it could be that r(s, a) ≪ Q*_h(s, a), since taking action a in state s leads to some very favorable conditions later in the episode. So defining π*_h to maximize Q*_h leads to “non-myopic” behavior which is essential for decision making with a long horizon.
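For a quick illustration of this point, consider a toy example: suppose H = 2 and that in state s action a gives r(s, a) = 0 but leads deterministically to a state where a reward of 1 is available, while action b gives r(s, b) = 1/2 but leads to a state with no reward. Then Q*_0(s, a) = 1 > Q*_0(s, b) = 1/2, so the greedy-in-Q* policy takes a even though its immediate reward is smaller.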
Proof. The proof is based on the dynamic programming principle or an induction argument. The key fact is that π* is optimal pointwise, meaning from every state-time pair. This means that we can ignore how we arrive at a distribution over states (i.e., the past) when optimizing for the future.
As the base case, consider time step H. There are no further actions and no further rewards, so all policies accumulate 0 reward from here on. This means that V*_H ≡ 0 correctly describes the optimal reward achievable at the H-th time step.
For the induction, assume that V*_{h+1} satisfies the following two properties:

• The Markovian policy (π*_{h+1}, . . . , π*_{H−1}) achieves value V*_{h+1}(s) for each s ∈ S at time h + 1.

• For all possibly non-Markovian policies π̃ we have E[ Σ_{h'=h+1}^{H−1} r_{h'} | π̃ ] ≤ E[ V*_{h+1}(s_{h+1}) | π̃ ].

(Note that these two properties hold for V*_H trivially.) The second property allows us to study non-Markovian policies, since we can upper bound their future value via our function V*_{h+1}.
We establish these two properties for the value function V*_h defined via (2). With π*_h(s) = argmax_a { r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] }, the first property holds simply by the definition and the first inductive hypothesis.
For the second property consider some possibly non-Markovian policy π̃. We have

E[ Σ_{h'=h}^{H−1} r_{h'} | π̃ ] = E[ r_h + Σ_{h'=h+1}^{H−1} r_{h'} | π̃ ]
  ≤ E[ r_h + V*_{h+1}(s_{h+1}) | π̃ ]     (Inductive hypothesis)
  = E[ E[ r(s, a) + V*_{h+1}(s_{h+1}) | s_h = s, a_h = a ] | π̃ ]     (Iterated expectation)
  = E[ E[ r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] | s_h = s, a_h = a ] | π̃ ]     (Markov property of s_{h+1})
  ≤ E[ E[ max_{a∈A} { r(s, a) + E_{s' ∼ P(s,a)}[V*_{h+1}(s')] } | s_h = s ] | π̃ ]
  = E[ V*_h(s_h) | π̃ ]     (Definition of V*_h)

Concluding the induction, we have value functions (V*_0, . . . , V*_H) and a policy (π*_0, . . . , π*_{H−1}) that achieves the value function at every (s, h) pair. Additionally, for any π̃:

J(π̃) = E[ Σ_{h=0}^{H−1} r_h | π̃ ] ≤ E[ V*_0(s_0) | π̃ ] = J(π*),

since s_0 ∼ µ does not depend on the choice of policy. This proves the theorem.

3.3 Planning algorithms for the episodic setting


Theorem 1 directly motivates one strategy for computing the optimal value function and policy. This is known as
value iteration. The algorithm is to simply apply (2) or (3) from time step H down to time step 0. Then we can
directly obtain the optimal policy from the value functions we compute.
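A sketch of this backward recursion for the illustrative tabular MDP from Section 2, applying (3) from h = H − 1 down to h = 0 and reading off the greedy policy (the function name is ours):

def value_iteration_finite(H):
    """Finite-horizon value iteration via the Bellman optimality equations (2)/(3)."""
    V = np.zeros((H + 1, S))                 # V*_H = 0
    Q = np.zeros((H, S, A))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q[h] = R + P @ V[h + 1]              # Q*_h(s,a) = r(s,a) + E_{s'~P(s,a)}[V*_{h+1}(s')]
        pi[h] = Q[h].argmax(axis=1)          # greedy (optimal) action at step h
        V[h] = Q[h].max(axis=1)              # V*_h(s) = max_a Q*_h(s,a)
    return V, Q, pi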
Another algorithm is called policy iteration. Starting with any non-stationary Markov policy π^(0), in the t-th iteration we update via

1. Compute Q^{π^(t−1)} via (1).

2. Update π^(t) to be the greedy policy with respect to Q^{π^(t−1)}, that is π^(t)_h(s) := argmax_a Q^{π^(t−1)}_h(s, a).

It can be easily seen that this algorithm converges in H iterations. Indeed, observe that even though π^(0) is arbitrary, we have that π^(1)_{H−1} = π*_{H−1}, since Q^π_{H−1} actually does not depend on π and is equal to Q*_{H−1}.
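A matching sketch of policy iteration, reusing the policy_evaluation routine sketched in Section 3.1 (again, the names are illustrative):

def policy_iteration(H):
    """Alternate exact evaluation via (1) with greedy improvement; converges within H iterations."""
    # Start from an arbitrary deterministic policy, encoded as one-hot action distributions.
    policy = [np.eye(A)[np.zeros(S, dtype=int)] for _ in range(H)]
    for _ in range(H):
        _, Q = policy_evaluation(policy, H)
        policy = [np.eye(A)[Q[h].argmax(axis=1)] for h in range(H)]
    return policy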

4 Infinite horizon discounted setting


Let us shift gears and consider a different formulation of reinforcement learning in an MDP. This is referred to as
the discounted setting, or infinite horizon discounted setting. Here, instead of a horizon H, we have a discount
factor γ ∈ (0, 1) that captures how much we prefer immediate rewards over future rewards. Here trajectories are
infinitely long, τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . .), but follow the same probabilistic model as before. For any policy π we can define the objective and value functions as

J(π) := E[ Σ_{h=0}^∞ γ^h r_h | π ],     V^π(s) = E[ Σ_{h=0}^∞ γ^h r_h | s_0 = s, π ],     Q^π(s, a) = E[ Σ_{h=0}^∞ γ^h r_h | s_0 = s, a_0 = a, π ].

(Note that we are now defining value functions for all policies, including non-Markovian ones.)
Roughly speaking, all of the results for the episodic setting carry over here, with appropriate modifications. For example, the optimal policy is Markov and, in fact, stationary (meaning it does not depend on the time step), and the optimal value function is the unique solution to the following fixed point equation:

V*(s) = max_a { r(s, a) + γ E_{s' ∼ P(s,a)}[V*(s')] }.     (4)

This should be viewed as the infinite horizon version of (2), but note that we are not time-indexing V* and it
appears on both sides of the equation. That said, an analogous result to Theorem 1 holds in this setting, although
it is a bit more subtle.

4.1 Planning algorithms


Both value iteration and policy iteration can be modified for the discounted setting. To describe these algorithms,
and for use in subsequent lectures, it is helpful to define the Bellman operator T as
T f : (s, a) ↦ r(s, a) + γ E_{s' ∼ P(s,a)}[ max_{a'} f(s', a') ].

Note that this operator takes Q-functions as input and produces a Q-function as output. Then the Bellman optimality equation (the analog of (3)) is simply that Q* = T Q*.
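For a tabular MDP the operator T is a one-line computation. The sketch below reuses the illustrative P and R from Section 2 and assumes a discount factor gamma (our own name):

gamma = 0.9    # illustrative discount factor

def bellman_operator(f):
    """(T f)(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)}[ max_{a'} f(s', a') ]."""
    return R + gamma * (P @ f.max(axis=1))

Iterating f ← bellman_operator(f) starting from f = 0 is exactly the value iteration procedure described next.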

Value iteration. Motivated by the fixed point relationship for Q*, value iteration simply iterates the operation f^(t+1) ← T f^(t), starting from an arbitrary initial Q-function f^(0). The key to the convergence of this algorithm is a certain contraction property for the Bellman operator.
Lemma 2 (Contraction). For any two Q-functions f, f', we have ‖T f − T f'‖_∞ ≤ γ‖f − f'‖_∞.
Proof. Considering any (s, a) pair we have

|(T f)(s, a) − (T f')(s, a)| = γ | E_{s' ∼ P(s,a)}[ max_{a'} f(s', a') − max_{a''} f'(s', a'') ] |
  ≤ γ E_{s' ∼ P(s,a)}[ max_{a'} |f(s', a') − f'(s', a')| ]
  ≤ γ ‖f − f'‖_∞.

The first inequality follows since, if max_{a'} f(s', a') ≥ max_{a''} f'(s', a''), then letting a' ∈ argmax_a f(s', a),

max_{a'} f(s', a') − max_{a''} f'(s', a'') ≤ f(s', a') − f'(s', a') ≤ max_a |f(s', a) − f'(s', a)|,

with a similar calculation for the other case.
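As a quick, purely illustrative numerical sanity check of Lemma 2, one can apply the bellman_operator sketch above to two random Q-functions and compare the sup-norm distances:

f1, f2 = rng.random((S, A)), rng.random((S, A))
lhs = np.abs(bellman_operator(f1) - bellman_operator(f2)).max()   # ||T f - T f'||_inf
rhs = gamma * np.abs(f1 - f2).max()                               # gamma * ||f - f'||_inf
assert lhs <= rhs + 1e-12                                         # contraction holds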

This will immediately give us geometric convergence to Q* by iterating the Bellman operator, but we need to translate error in Q* to policy performance. This is given in the next lemma.
The notation here is a bit confusing. We will use f to denote any function of the type S × A → R, which has the same type as the Q-functions. For a policy π, Q^π is the true action value function for π in the MDP. The confusing part is that if π_f : s ↦ argmax_a f(s, a) is the greedy policy with respect to function f, then Q^{π_f} will in general be distinct from f. This is why we are using f to denote general functions with this type.

Lemma 3 (Policy error). For any function f we have J(π*) ≤ J(π_f) + (2 / (1 − γ)) ‖f − Q*‖_∞.

Proof. Consider state s and let a = π_f(s) = argmax_{a'} f(s, a'). Then

V*(s) − V^{π_f}(s) = Q*(s, π*(s)) − Q*(s, a) + Q*(s, a) − Q^{π_f}(s, a)
  ≤ Q*(s, π*(s)) − f(s, π*(s)) + f(s, a) − Q*(s, a) + Q*(s, a) − Q^{π_f}(s, a)
  ≤ 2‖Q* − f‖_∞ + γ E_{s' ∼ P(s,a)}[V*(s') − V^{π_f}(s')]
  ≤ 2‖Q* − f‖_∞ + γ‖V* − V^{π_f}‖_∞.

Re-arranging this inequality actually proves a stronger statement, namely that ‖V* − V^{π_f}‖_∞ ≤ (2 / (1 − γ)) ‖Q* − f‖_∞, so V* and V^{π_f} are close at every state, which implies that J(π*) and J(π_f) are close.
Taking these two together, we immediately have an iteration complexity bound for value iteration.

Theorem 4. Set f^(0) = 0, run value iteration for T iterations, and define π̂ = π_{f^(T)}. Then

J(π*) − J(π̂) ≤ (2 γ^T ‖Q*‖_∞) / (1 − γ).
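Since rewards lie in [0, 1], we have ‖Q*‖_∞ ≤ 1/(1 − γ), so the bound reads J(π*) − J(π̂) ≤ 2γ^T/(1 − γ)^2. As a rough calibration: to guarantee J(π*) − J(π̂) ≤ ε it suffices that γ^T ≤ ε(1 − γ)^2/2, i.e. T ≥ log(2/((1 − γ)^2 ε))/log(1/γ), and since log(1/γ) ≥ 1 − γ, taking T = ⌈(1/(1 − γ)) · log(2/((1 − γ)^2 ε))⌉ iterations is enough.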
