Machine Learning
Akshay Krishnamurthy
[email protected]
February 21, 2022
[Figure: an episode as an interleaved sequence of states $s_0, s_1, s_2, s_3$, actions $a_0, a_1, a_2$, and rewards $r_0, r_1, r_2$.]
3 Finite horizon, episodic setting
Let us start with the simpler finite horizon setting. In this setting, there is also a horizon $H \in \mathbb{N}$ that describes how long an episode lasts. With horizon $H$, an episode produces a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{H-1}, a_{H-1}, r_{H-1})$ in the following manner: $s_0 \sim \mu$, $s_h \sim P(s_{h-1}, a_{h-1})$ and $r_h \sim R(s_h, a_h)$ for each $h$, and all actions $a_{0:H-1}$ are chosen by the agent. The Markov property simply means that $(s_h, r_h) \perp s_{0:h-2} \mid (s_{h-1}, a_{h-1})$, so the entire past is summarized in the current state. Note, however, that we can choose actions in a non-Markovian manner.
Objective. Every policy π has a value, which is the total reward we expect to accumulate if we deploy policy π.
This is defined as
"H−1 #
X
J(π) := E rh | s0 ∼ µ, a0:H−1 ∼ π
h=0
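To make the episodic protocol concrete, here is a minimal Python sketch (not part of the original notes) that rolls out trajectories from a tabular finite-horizon MDP and estimates $J(\pi)$ by Monte Carlo. The arrays `mu`, `P`, `R` and the function `policy` are hypothetical stand-ins for $\mu$, $P$, $R$, and a Markov policy $\pi$; the Gaussian reward noise is just an illustrative choice of reward distribution.

```python
import numpy as np

def rollout(mu, P, R, policy, H, rng):
    """Sample one trajectory (s_0, a_0, r_0, ..., s_{H-1}, a_{H-1}, r_{H-1}) and return its total reward."""
    # mu: (S,) initial distribution; P: (S, A, S) transition kernel; R: (S, A) mean rewards.
    s = rng.choice(len(mu), p=mu)                 # s_0 ~ mu
    total = 0.0
    for h in range(H):
        a = policy(s, h)                          # a_h chosen by the agent (here a Markov policy)
        total += R[s, a] + rng.standard_normal()  # r_h ~ R(s_h, a_h), illustrative noise model
        if h + 1 < H:
            s = rng.choice(P.shape[2], p=P[s, a])  # s_{h+1} ~ P(s_h, a_h)
    return total

def estimate_J(mu, P, R, policy, H, n_episodes=1000, seed=0):
    """Monte Carlo estimate of J(pi) = E[sum_h r_h]."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout(mu, P, R, policy, H, rng) for _ in range(n_episodes)])
```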
Value functions. For a Markov policy $\pi$, we can consider a more refined notion: how much reward would we get if we started in state $s$ at time $h$ and executed policy $\pi$ until the end of the episode? This defines the value function or state-value function
" H−1 #
X
π
Vh (s) := E rh0 | sh = s, ah:H−1 ∼ π .
h0 =h
The state-action value function or just action-value function is similar. It describes how much reward we would get
if we started in state s at time h, took action a first, then followed π for the subsequent steps. This is defined as
" H−1 #
X
π
Qh (s, a) := E rh0 | sh = s, ah = a, ah+1:H−1 ∼ π .
h0 =h
It is worth making a couple of observations. First, we can write $V_h^\pi(s) = \mathbb{E}_{a_h \sim \pi_h(s)}[Q_h^\pi(s, a_h)]$ to relate the two types of value functions. A related and more fundamental property is that these functions satisfy a recursive relationship, known as the Bellman equations (or Bellman equations for policy evaluation):
\[ Q_h^\pi(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s,a)}\left[ V_{h+1}^\pi(s') \right], \qquad V_H^\pi \equiv 0. \]
Intuitively, the reward we expect to get by following $\pi$ from $(s, h)$ is the immediate reward $r_h$ plus the reward we expect to get by following $\pi$ from the next state at the next time step.
Next, if $\pi$ is Markov, then $J(\pi) = \mathbb{E}_{s_0 \sim \mu}[V_0^\pi(s_0)]$, which relates the MDP objective to the value functions. Finally, for non-Markov policies, we cannot really define these functions, since the policy's actions may depend on information from the past that is not available from the current state alone. However, we will see next that the optimal policy is Markovian, which justifies our effort in setting up these definitions.
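In the tabular case, these definitions can be computed exactly by running the Bellman equations for policy evaluation backwards in time. The following is a rough sketch under assumed inputs: `P` is an $(S, A, S)$ transition table, `r` an $(S, A)$ table of expected rewards, and `pi` an $(H, S, A)$ array with `pi[h, s, a]` the probability that $\pi_h$ plays $a$ in state $s$.

```python
import numpy as np

def policy_evaluation(P, r, pi, H):
    """Compute V_h^pi and Q_h^pi for h = H-1, ..., 0 by backward recursion.

    P:  (S, A, S) transition probabilities P(s' | s, a)
    r:  (S, A)    expected immediate rewards r(s, a)
    pi: (H, S, A) action distributions of a Markov policy
    """
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))          # V[H] = 0 terminates the recursion
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        # Q_h^pi(s, a) = r(s, a) + E_{s' ~ P(s, a)}[ V_{h+1}^pi(s') ]
        Q[h] = r + P @ V[h + 1]
        # V_h^pi(s) = E_{a ~ pi_h(s)}[ Q_h^pi(s, a) ]
        V[h] = np.sum(pi[h] * Q[h], axis=1)
    return V[:H], Q
```

The objective is then recovered as `mu @ V[0]`, matching $J(\pi) = \mathbb{E}_{s_0 \sim \mu}[V_0^\pi(s_0)]$.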
3.2 Optimality and Bellman equations
A fundamental result in the theory of Markov decision processes is that the optimal policy is Markovian and the optimal value functions satisfy an elegant recursive definition. This is captured in the following theorem.
Theorem 1. Define the value functions $(V_0^\star, \ldots, V_H^\star)$ recursively as
\[ V_H^\star \equiv 0, \qquad \forall (s, h): \; V_h^\star(s) = \max_{a} \left\{ r(s, a) + \mathbb{E}_{s' \sim P(s,a)}\left[ V_{h+1}^\star(s') \right] \right\}. \tag{2} \]
Then $\sup_\pi J(\pi) = \mathbb{E}_{s_0 \sim \mu}[V_0^\star(s_0)]$, where the supremum is taken over all, possibly non-Markovian, policies. Additionally, define the policy $\pi^\star := (\pi_0^\star, \ldots, \pi_{H-1}^\star)$ as
\[ \pi_h^\star(s) := \operatorname*{argmax}_{a} \left\{ r(s, a) + \mathbb{E}_{s' \sim P(s,a)}\left[ V_{h+1}^\star(s') \right] \right\}. \]
Then $\pi^\star$ is a Markov policy and achieves the optimal value: $J(\pi^\star) = \sup_\pi J(\pi)$.

Proof. We proceed by backward induction over $h = H, H-1, \ldots, 0$, maintaining two properties of the functions defined in (2): first, that $\pi^\star$ achieves them, i.e., $V_h^{\pi^\star}(s) = V_h^\star(s)$ for all $s$; and second, that they upper bound the value of any policy, i.e., for any, possibly non-Markovian, policy $\tilde{\pi}$ we have $\mathbb{E}\left[\sum_{h'=h}^{H-1} r_{h'} \mid \tilde{\pi}\right] \le \mathbb{E}\left[V_h^\star(s_h) \mid \tilde{\pi}\right]$.
(Note that these two properties hold for $V_H^\star$ trivially.) The second property allows us to study non-Markovian policies, since we can upper bound their future value via our function $V_{h+1}^\star$.
We establish these two properties for the value function $V_h^\star$ defined via (2). With $\pi_h^\star(s) = \operatorname*{argmax}_{a} \left\{ r(s, a) + \mathbb{E}_{s' \sim P(s,a)}\left[ V_{h+1}^\star(s') \right] \right\}$, the first property holds simply by the definition and the first inductive hypothesis.
For the second property, consider some possibly non-Markovian policy $\tilde{\pi}$. We have
\begin{align*}
\mathbb{E}\left[\sum_{h'=h}^{H-1} r_{h'} \,\middle|\, \tilde{\pi}\right]
&= \mathbb{E}\left[ r_h + \sum_{h'=h+1}^{H-1} r_{h'} \,\middle|\, \tilde{\pi}\right] \\
&\le \mathbb{E}\left[ r_h + V_{h+1}^\star(s_{h+1}) \,\middle|\, \tilde{\pi}\right] && \text{(inductive hypothesis)} \\
&= \mathbb{E}\left[ \mathbb{E}\left[ r(s, a) + V_{h+1}^\star(s_{h+1}) \,\middle|\, s_h = s, a_h = a\right] \,\middle|\, \tilde{\pi}\right] && \text{(iterated expectation)} \\
&= \mathbb{E}\left[ \mathbb{E}\left[ r(s, a) + \mathbb{E}_{s' \sim P(s,a)}\left[V_{h+1}^\star(s')\right] \,\middle|\, s_h = s, a_h = a\right] \,\middle|\, \tilde{\pi}\right] && \text{(Markov property of $s_{h+1}$)} \\
&\le \mathbb{E}\left[ \mathbb{E}\left[ \max_{a \in \mathcal{A}} \left\{ r(s, a) + \mathbb{E}_{s' \sim P(s,a)}\left[V_{h+1}^\star(s')\right] \right\} \,\middle|\, s_h = s\right] \,\middle|\, \tilde{\pi}\right] \\
&= \mathbb{E}\left[ V_h^\star(s_h) \,\middle|\, \tilde{\pi}\right],
\end{align*}
where the last equality is exactly the definition (2).
Concluding the induction, we have value functions $(V_0^\star, \ldots, V_H^\star)$ and a policy $(\pi_0^\star, \ldots, \pi_{H-1}^\star)$ that achieves the value function at every $(s, h)$ pair. Additionally, for any $\tilde{\pi}$:
\[ J(\tilde{\pi}) = \mathbb{E}\left[\sum_{h=0}^{H-1} r_h \,\middle|\, \tilde{\pi}\right] \le \mathbb{E}\left[ V_0^\star(s_0) \,\middle|\, \tilde{\pi}\right] = J(\pi^\star), \]
since $s_0 \sim \mu$ does not depend on the choice of policy. This proves the theorem.
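The recursion (2) is exactly a dynamic program, and the proof above is constructive. As a sketch (with the same assumed tabular inputs as before), backward induction computes $(V_0^\star, \ldots, V_H^\star)$ together with the greedy Markov policy $\pi^\star$:

```python
import numpy as np

def finite_horizon_value_iteration(P, r, H):
    """Backward induction for equation (2).

    Returns V_star with shape (H+1, S) (V_star[H] = 0) and the greedy
    policy pi_star with shape (H, S).
    """
    S, A, _ = P.shape
    V_star = np.zeros((H + 1, S))
    pi_star = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Q_h(s, a) = r(s, a) + E_{s' ~ P(s, a)}[ V_{h+1}^star(s') ]
        Q_h = r + P @ V_star[h + 1]
        V_star[h] = Q_h.max(axis=1)          # maximize over actions, as in (2)
        pi_star[h] = Q_h.argmax(axis=1)      # greedy (Markov) policy
    return V_star, pi_star
```

The optimal value of the MDP is then `mu @ V_star[0]`, matching $\sup_\pi J(\pi) = \mathbb{E}_{s_0 \sim \mu}[V_0^\star(s_0)]$.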
We now turn to the infinite horizon, discounted setting, where an episode continues indefinitely and rewards are discounted by a factor $\gamma \in [0, 1)$, so that $J(\pi) := \mathbb{E}\left[\sum_{h=0}^{\infty} \gamma^h r_h \mid s_0 \sim \mu, \pi\right]$, with value functions $V^\pi$ and $Q^\pi$ defined analogously. (Note that we are now defining value functions for all policies, including non-Markovian ones.)
Roughly speaking, all of the results for the episodic setting carry over here, with appropriate modifications. For example, the optimal policy is Markov and, in fact, stationary (meaning it does not depend on the time step), and the optimal value function is the unique solution to the following fixed point equation:
\[ \forall s: \quad V^\star(s) = \max_{a} \left\{ r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(s,a)}\left[ V^\star(s') \right] \right\}. \tag{3} \]
This should be viewed as the infinite horizon version of (2), but note that we are not time-indexing $V^\star$ and it appears on both sides of the equation. That said, an analogous result to Theorem 1 holds in this setting, although it is a bit more subtle.
The Bellman optimality operator $T$ is defined by
\[ (T f)(s, a) := r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(s,a)}\left[ \max_{a'} f(s', a') \right]. \]
Note that this operator takes Q-functions as input and produces a Q-function as output. Then the Bellman optimality equation (the analog of (3)) is simply that $Q^\star = T Q^\star$.
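In the tabular case, applying $T$ is one line of array arithmetic. A minimal sketch, again assuming an $(S, A, S)$ transition table `P`, an $(S, A)$ expected-reward table `r`, and a discount factor `gamma`:

```python
import numpy as np

def bellman_operator(f, P, r, gamma):
    """Apply T to a Q-function f of shape (S, A):
    (T f)(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)}[ max_{a'} f(s', a') ].
    """
    return r + gamma * (P @ f.max(axis=1))
```

A fixed point of this map is exactly a solution of $Q^\star = T Q^\star$.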
Value iteration. Motivated by the fixed point relationship for $Q^\star$, value iteration simply iterates the operation $f^{(t+1)} \leftarrow T f^{(t)}$, starting from an arbitrary initial Q-function $f^{(0)}$. The key to the convergence of this algorithm is a certain contraction property of the Bellman operator.
Lemma 2 (Contraction). For any two Q-functions $f, f'$, we have $\|T f - T f'\|_\infty \le \gamma \|f - f'\|_\infty$.
Proof. Considering any $(s, a)$ pair, we have
\begin{align*}
|(T f)(s, a) - (T f')(s, a)|
&= \left| \gamma\, \mathbb{E}_{s' \sim P(s,a)}\left[ \max_{a'} f(s', a') - \max_{a''} f'(s', a'') \right] \right| \\
&\le \gamma\, \mathbb{E}_{s' \sim P(s,a)}\left[ \max_{a'} \left| f(s', a') - f'(s', a') \right| \right] \\
&\le \gamma \|f - f'\|_\infty.
\end{align*}
The first inequality follows since, if $\max_{a'} f(s', a') \ge \max_{a''} f'(s', a'')$, then letting $a'$ denote the maximizer of $f(s', \cdot)$,
\[ f(s', a') - \max_{a''} f'(s', a'') \le f(s', a') - f'(s', a') \le \max_{a} |f(s', a) - f'(s', a)|, \]
and the other case is symmetric.
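The lemma is also easy to check numerically: on a random tabular MDP, $\|T f - T f'\|_\infty$ should never exceed $\gamma \|f - f'\|_\infty$ for random Q-functions. A small self-contained sketch, where `T` is a hypothetical tabular implementation of the operator:

```python
import numpy as np

def T(f, P, r, gamma):
    # (T f)(s, a) = r(s, a) + gamma * E_{s'}[ max_{a'} f(s', a') ]
    return r + gamma * (P @ f.max(axis=1))

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)  # random transition kernel
r = rng.random((S, A))

for _ in range(1000):
    f, g = rng.normal(size=(S, A)), rng.normal(size=(S, A))
    # Lemma 2: ||T f - T g||_inf <= gamma * ||f - g||_inf
    assert np.abs(T(f, P, r, gamma) - T(g, P, r, gamma)).max() <= gamma * np.abs(f - g).max() + 1e-12
```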
This contraction immediately gives us geometric convergence to $Q^\star$ when we iterate the Bellman operator, but we still need to translate error with respect to $Q^\star$ into policy performance. This is given in the next lemma.
The notation here is a bit confusing. We will use $f$ to denote any function of type $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which has the same type as the Q-functions. For a policy $\pi$, $Q^\pi$ is the true action-value function for $\pi$ in the MDP. The confusing part is that if $\pi_f : s \mapsto \operatorname{argmax}_a f(s, a)$ is the greedy policy with respect to the function $f$, then $Q^{\pi_f}$ will in general be distinct from $f$. This is why we use $f$ to denote general functions of this type.
Lemma 3 (Policy error). For any function $f$, we have $J(\pi^\star) \le J(\pi_f) + \frac{2}{1-\gamma}\|f - Q^\star\|_\infty$.
Re-arranging this inequality actually proves a stronger statement, namely that $V^\star$ and $V^{\pi_f}$ are close, which implies that $J(\pi^\star)$ and $J(\pi_f)$ are close.
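As a numerical illustration of the lemma (not from the notes), one can build a random tabular MDP, perturb an approximate $Q^\star$ into some $f$, and compare the performance gap of the greedy policies against the bound $\frac{2}{1-\gamma}\|f - Q^\star\|_\infty$. Everything here (the random MDP, the perturbation scale, the helper `exact_J`) is an assumed setup for illustration:

```python
import numpy as np

def exact_J(pi, P, r, gamma, mu):
    """Exact discounted return of a deterministic stationary policy pi,
    by solving the linear system V^pi = r_pi + gamma * P_pi V^pi."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]          # (S, S): row s is P(. | s, pi(s))
    r_pi = r[np.arange(S), pi]          # (S,)
    return mu @ np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

rng = np.random.default_rng(1)
S, A, gamma = 6, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
mu = np.ones(S) / S

# Approximate Q* by iterating the Bellman operator many times.
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = r + gamma * (P @ Q_star.max(axis=1))

f = Q_star + 0.05 * rng.normal(size=(S, A))     # a perturbed Q-function
gap = exact_J(Q_star.argmax(axis=1), P, r, gamma, mu) - exact_J(f.argmax(axis=1), P, r, gamma, mu)
bound = 2 / (1 - gamma) * np.abs(f - Q_star).max()
assert gap <= bound + 1e-8                       # Lemma 3
```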
Taking these two together, we immediately have an iteration complexity bound for value iteration.
Theorem 4. Set $f^{(0)} = 0$, run value iteration for $T$ iterations, and define $\hat{\pi} = \pi_{f^{(T)}}$. Then
\[ J(\pi^\star) - J(\hat{\pi}) \le \frac{2 \gamma^T \|Q^\star\|_\infty}{1 - \gamma}. \]
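Putting the pieces together, here is a minimal tabular sketch of the procedure analyzed in Theorem 4: start from $f^{(0)} = 0$, apply $T$ for the given number of iterations, and return the greedy policy $\hat{\pi} = \pi_{f^{(T)}}$. The function name and the tabular inputs are assumptions for illustration.

```python
import numpy as np

def value_iteration(P, r, gamma, num_iters):
    """Iterate f <- T f from f^(0) = 0 and return (f^(T), greedy policy pi_hat).

    P: (S, A, S) transitions, r: (S, A) expected rewards, gamma in [0, 1).
    By Lemma 2, ||f^(T) - Q*||_inf <= gamma^T ||Q*||_inf, and by Lemma 3 the
    greedy policy pi_hat then satisfies the bound in Theorem 4.
    """
    S, A, _ = P.shape
    f = np.zeros((S, A))                       # f^(0) = 0
    for _ in range(num_iters):
        f = r + gamma * (P @ f.max(axis=1))    # f^(t+1) = T f^(t)
    pi_hat = f.argmax(axis=1)                  # pi_hat(s) = argmax_a f^(T)(s, a)
    return f, pi_hat
```

Inverting the bound in Theorem 4, roughly $\log\!\big(2\|Q^\star\|_\infty / ((1-\gamma)\varepsilon)\big) / \log(1/\gamma)$ iterations suffice to guarantee $J(\pi^\star) - J(\hat{\pi}) \le \varepsilon$.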