3 Markov Decision Processes
Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB
2023.01.20
Markov Decision Processes and Dynamic Programming
I The agent and the environment interact at each of a sequence of discrete time steps.
I Let {S_t}_{t ∈ N_0} be a sequence of random states indexed by time; for all t ∈ N_0, S_t ∈ S.
I A state s has the Markov property if, for all states s' ∈ S and all rewards r ∈ R,
p(R_{t+1} = r, S_{t+1} = s' | S_t = s) = p(R_{t+1} = r, S_{t+1} = s' | S_1, . . . , S_{t−1}, S_t = s).
6 November 1910
My dear Alexander Alexandrovich:
Of course I was also surprised by your reference to [Ernst Heinrich] Bruns whom I consider a negligible
quantity.
I can judge all work only from a strictly mathematical point of view and from this viewpoint it is clear to me that
neither Bruns nor [Pavel Alekseevich] Nekrasov nor [Karl] Pearson has done anything worthy of note. You
speak about some kinds of most general constructions, but I cannot find these constructions in their work.
Meanwhile I do find highly general theorems from authors whom you have entirely forgotten: [Aleksandr
Mikhailovich] Liapunov and A. A. Markov. The unique service of P. A. Nekrasov, in my opinion, is namely this:
he brings out sharply his delusion, shared, I believe, by many, that independence is a necessary condition for
the law of large numbers. This circumstance prompted me to explain, in a series of articles, that the law of
large numbers and Laplace’s formula can apply also to dependent variables. In this way a construction of a
highly general character was actually arrived at, which P. A. Nekrasov can not even dream about.
I considered variables connected in a simple chain and from this came the idea of the possibility of extending
the limit theorems of the calculus of probability also to a complex chain.
Independence is not required for the application of these theorems, but on the other hand it is necessary
to assume existence of certain constant quantities. This existence is already assumed by the theory and
therefore it is impossible to deduce this from the theory. And so I will stick to my opinion that your reference
to Bruns and Nekrasov is wrong, as long as you do not cite for me their general constructions.
With complete respect,
A. Markov
Convention
p(s' | s, a) := P[S_t = s' | S_{t−1} = s, A_{t−1} = a] = ∑_{r ∈ R} p(s', r | s, a),

r(s, a) := E[R_t | S_{t−1} = s, A_{t−1} = a] = ∑_{r ∈ R} r ∑_{s' ∈ S} p(s', r | s, a),

r(s, a, s') := E[R_t | S_{t−1} = s, A_{t−1} = a, S_t = s'] = ∑_{r ∈ R} r p(s', r | s, a) / p(s' | s, a).
I A mobile robot has the job of collecting empty soda cans in an office environment.
I High-level decisions about how to search for cans are made by an RL agent based on
the current charge level of the battery.
I There are two charge levels (two states), S = {high, low}.
I In each of these states, the agent can decide whether to
1. actively search for a can for a certain period of time,
2. remain stationary and wait for someone to bring it a can, or
3. head back to its home base to recharge its battery.
(These are the actions.)
I When the energy level is high, recharging would always be foolish, so we do not
include it in the action set for this state.
I The action sets are then
I A(high) = {search, wait} and
I A(low) = {search, wait, recharge}.
I The best way to find cans is to actively search for them, but this runs down the robot’s
battery, whereas waiting does not.
I Whenever the robot is searching, the possibility exists that its battery will become
depleted. In this case the robot must shut down and wait to be rescued (producing a
low reward).
I If the energy level is high, then a period of active search can always be completed
without a risk of depleting the battery.
I A period of searching that begins with a high energy level leaves the energy level
high with probability α and reduces it to low with probability 1 − α. A period of
searching undertaken when the energy level is low leaves it low with probability β and
depletes the battery with probability 1 − β.
I In the latter case the robot must be rescued, and the battery is then recharged back to
high.
I The rewards are zero most of the time, but become positive when the robot secures an
empty can, or large and negative if the battery runs all the way down.
I Each can collected by the robot counts as a unit reward, whereas a reward of −3
results whenever the robot has to be rescued.
I Let r_search and r_wait, with r_search > r_wait, denote the expected number of cans the robot will
collect (and hence the expected reward) while searching and while waiting, respectively.
I No cans can be collected during a run home for recharging, and no cans can be
collected on a step in which the battery is depleted.
s      a         s'     p(s' | s, a)   r(s, a, s')
high   search    high   α              r_search
high   search    low    1 − α          r_search
low    search    high   1 − β          −3
low    search    low    β              r_search
high   wait      high   1              r_wait
high   wait      low    0              r_wait
low    wait      high   0              r_wait
low    wait      low    1              r_wait
low    recharge  high   1              0
low    recharge  low    0              0
Note that there is a row in the table for each possible combination of current state s, action
a ∈ A(s), and next state s'.
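The transition table translates directly into code. The following is a minimal sketch (not from the source; the helper name robot_dynamics is illustrative) that encodes the table as a dictionary mapping state–action pairs to lists of (next state, reward, probability) triples; probability-0 rows are omitted, and r_search and r_wait remain symbolic parameters.

```python
# A minimal sketch: the recycling-robot dynamics p(s' | s, a) and r(s, a, s') as a dictionary.
def robot_dynamics(alpha, beta, r_search, r_wait):
    """Map (s, a) to a list of (s_next, reward, probability); probability-0 rows are omitted."""
    return {
        ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
        ("low",  "search"):   [("high", -3.0, 1 - beta),  ("low", r_search, beta)],
        ("high", "wait"):     [("high", r_wait, 1.0)],
        ("low",  "wait"):     [("low",  r_wait, 1.0)],
        ("low",  "recharge"): [("high", 1.0 * 0.0, 1.0) if False else ("high", 0.0, 1.0)],
    }
```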
s      a         s'     r          p(s', r | s, a)
high   search    high   r_search   α
high   search    low    r_search   1 − α
low    search    high   −3         1 − β
low    search    low    r_search   β
high   wait      high   r_wait     1
high   wait      low    r_wait     0
low    wait      high   r_wait     0
low    wait      low    r_wait     1
low    recharge  high   0          1
low    recharge  low    0          0
1 http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html
Examples of rewards
I To make a robot learn to find and collect empty soda cans for recycling, one might give
it a reward of zero most of the time, and then a reward of +1 for each can collected.
One might also give the robot negative rewards when it bumps into things or when
somebody shouts at it.
I For an agent to learn to play checkers or chess, the natural rewards are +1 for winning,
-1 for losing, and 0 for drawing and for all nonterminal positions.
I The reward signal is not the place to impart to the agent prior knowledge about how to
achieve what we want it to do.
I Better places for imparting this kind of prior knowledge are the initial policy or initial
value function, or in influences on these.
I For example, a chess-playing agent should be rewarded only for actually winning, not
for achieving subgoals such as taking its opponent’s pieces or gaining control of the
centre of the board.
I If achieving these sorts of subgoals were rewarded, then the agent might find a way to
achieve them without achieving the real goal. For example, it might find a way to take
the opponent’s pieces even at the cost of losing the game.
I The reward signal is your way of communicating to the robot what you want it to
achieve, not how you want it achieved.
I The sequence of rewards received after time step t is R_{t+1}, R_{t+2}, R_{t+3}, . . . .
I For an episodic task, with states S_1, S_2, . . . , S_T and final time step T, the (undiscounted) return is
G_t := R_{t+1} + R_{t+2} + . . . + R_T.
Discounted return
I The discounted return is given by
G_t := R_{t+1} + γ R_{t+2} + γ² R_{t+3} + . . . = ∑_{k=0}^{∞} γ^k R_{t+k+1},
where 0 ≤ γ ≤ 1 is the discount rate.
I Returns at successive time steps are related recursively:
G_t := R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ R_{t+4} + . . .
     = R_{t+1} + γ (R_{t+2} + γ R_{t+3} + γ² R_{t+4} + . . .)
     = R_{t+1} + γ G_{t+1}.
I Although the return is a sum of an infinite number of terms, it is still finite if the reward
is nonzero and constant — if γ < 1.
I For example, if the reward is a constant +1, then the return is
G_t = ∑_{k=0}^{∞} γ^k = 1 / (1 − γ).
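A short numerical check of the discounted return and of the constant-reward example above (a sketch; the function name discounted_return is illustrative, not from the source):

```python
def discounted_return(rewards, gamma):
    """Truncated discounted return: sum of gamma**k * R_{t+k+1} over the given rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With a constant reward of +1 and gamma = 0.9, the truncated sum approaches
# 1 / (1 - 0.9) = 10.
print(discounted_return([1.0] * 1000, 0.9))  # ~10.0
```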
Why discount?
The objective in this task is to apply forces to a cart moving along a track so as to keep a pole
hinged to the cart from falling over: A failure is said to occur if the pole falls past a given angle
from vertical or if the cart runs off the track. The pole is reset to vertical after each failure.
The task could be treated as episodic, where the natural episodes are the repeated attempts
to balance the pole. The reward in this case could be +1 for every time step on which failure
did not occur, so that the return at each time would be the number of steps until failure. In
this case, successful balancing forever would mean a return of infinity. Alternatively, we could
treat pole-balancing as a continuing task, using discounting. In this case the reward would
be −1 on each failure and zero at all other times. The return at each time would then be
related to −γ^K, where K is the number of time steps before failure. In either case, the return
is maximized by keeping the pole balanced for as long as possible.
Policy
I The state-value function of a state s under a policy π is the expected return when starting in s and following π thereafter:
v_π(s) := E_π[G_t | S_t = s] = E_π[∑_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s],
for all s ∈ S, where E_π denotes the expected value of a random variable given that
the agent follows policy π, and t is any time step.
I Note that the value of a terminal state, if any, is always zero.
I Similarly, we define the action-value function of taking action a in state s under a
policy π as the expected return starting from s, taking the action a, and thereafter
following policy π :
q_π(s, a) := E_π[G_t | S_t = s, A_t = a] = E_π[∑_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a].
v_π(s) = ∑_{a ∈ A} π(a | s) q_π(s, a),
q_π(s, a) = ∑_{s' ∈ S} ∑_{r ∈ R} p(s', r | s, a) [r + γ v_π(s')].
I The Bellman equation for v_π relates the value of a state to the values of its successor states:
v_π(s) := E_π[G_t | S_t = s]
        = E_π[R_{t+1} + γ G_{t+1} | S_t = s]
        = ∑_a π(a | s) ∑_{s'} ∑_r p(s', r | s, a) [r + γ E_π[G_{t+1} | S_{t+1} = s']]
        = ∑_a π(a | s) ∑_{s', r} p(s', r | s, a) [r + γ v_π(s')],   for all s ∈ S,
where it is implicit that the actions, a, are taken from the set A(s), that the next states,
s', are taken from the set S (or from S^+ in the case of an episodic problem), and that
the rewards, r, are taken from the set R.
I We have merged the two sums ∑_{s'} ∑_r into one sum ∑_{s', r} over all the possible values
of both s' and r. We use this kind of merged sum to simplify formulas.
I The final expression is an expected value: it is a sum over all values of the three
variables, a, s', and r. For each triple, we compute its probability
π(a | s) p(s', r | s, a), weight the quantity in brackets by that probability, then sum over
all possibilities to get an expected value.
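This expected-update reading translates directly into code. Below is a minimal sketch (not from the source) of a single Bellman expectation backup for one state; the interfaces policy(s) and dynamics(s, a) are assumptions for illustration, the latter returning (s_next, reward, probability) triples as in the earlier robot sketch.

```python
def bellman_expectation(s, policy, dynamics, v, gamma):
    """One Bellman expectation backup for v_pi at state s.

    Assumed (hypothetical) interfaces:
      policy(s)      -> dict mapping each action a to pi(a | s)
      dynamics(s, a) -> iterable of (s_next, reward, probability) triples
      v              -> dict mapping states to current value estimates
    """
    return sum(
        pi_a * sum(p * (r + gamma * v[s_next]) for s_next, r, p in dynamics(s, a))
        for a, pi_a in policy(s).items()
    )
```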
I Each open circle represents a state and each solid circle represents a state-action pair.
I Starting from state s, the root node at the top, the agent could take any of some set of
actions (three are shown in the diagram) based on its policy π. From each of these, the
environment could respond with one of several next states, s' (two are shown in the figure), along
with a reward, r, depending on its dynamics given by the function p.
I The Bellman equation averages over all the possibilities, weighting each by its probability of
occurring. It states that the value of the start state must equal the (discounted) value of the
expected next state, plus the reward expected along the way.
We can derive the corresponding Bellman equation for state–action values, that is, for qπ :
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
          = E_π[R_{t+1} + γ G_{t+1} | S_t = s, A_t = a]
          = ∑_{s'} ∑_r p(s', r | s, a) [r + γ E_π[G_{t+1} | S_{t+1} = s']]
          = ∑_{s', r} p(s', r | s, a) [r + γ ∑_{a'} π(a' | s') E_π[G_{t+1} | S_{t+1} = s', A_{t+1} = a']]
          = ∑_{s', r} p(s', r | s, a) [r + γ ∑_{a'} π(a' | s') q_π(s', a')].
Richard E. Bellman (1920–1984)
An interesting question is, ‘Where did the name, dynamic programming, come from?’ The
1950s were not good years for mathematical research. We had a very interesting gentleman in
Washington named Wilson. He was Secretary of Defense, and he actually had a pathological
fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely.
His face would suffuse, he would turn red, and he would get violent if people used the term,
research, in his presence. You can imagine how he felt, then, about the term, mathematical.
The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its
boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from
the fact that I was really doing mathematics inside the RAND Corporation. What title, what
name, could I choose? In the first place I was interested in planning, in decision making, in
thinking. But planning, is not a good word for various reasons. I decided therefore to use
the word, ‘programming’. I wanted to get across the idea that this was dynamic, this was
multistage, this was time-varying—I thought, let’s kill two birds with one stone. Let’s take
a word that has an absolutely precise meaning, namely dynamic, in the classical physical
sense. It also has a very interesting property as an adjective, and that is it’s impossible to use
the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly
give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a
good name. It was something not even a Congressman could object to. So I used it as an
umbrella for my activities. [Bel84]
Example: Gridworld
Exercise
I The Bellman equation must hold for each state for the value function vπ shown above.
I Show numerically that this equation holds for the centre state, valued at +0.7, with
respect to its four neighbouring states, valued at +2.3, +0.4, -0.4, +0.7.
I These values are accurate only to one decimal place.
Solution
v_π(s) = ∑_a π(a | s) ∑_{s', r} p(s', r | s, a) [r + γ v_π(s')],   for all s ∈ S.
I The possible actions from s = centre are A(s ) = {north, east, south, west}.
I We are considering the equiprobable policy π , and so
π(north | centre) = π(east | centre) = π(south | centre) = π(west | centre) = 1/4.
I In this example, the state-action pair (s, a) determines a unique (s', r) pair, and so each
p(s', r | s, a) appearing in the sum equals 1, and the Bellman equation becomes
v_π(s) = 1/4 × 1 × (0 + 0.9 × 2.3) + 1/4 × 1 × (0 + 0.9 × 0.4)
       + 1/4 × 1 × (0 + 0.9 × (−0.4)) + 1/4 × 1 × (0 + 0.9 × 0.7)
       = 1/4 × 0.9 × (2.3 + 0.4 − 0.4 + 0.7) = 1/4 × 0.9 × 3.0 = 0.675 ≈ 0.7.
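The same arithmetic can be checked in a couple of lines of Python (a throwaway sketch; the variable names are illustrative):

```python
gamma = 0.9
neighbour_values = [2.3, 0.4, -0.4, 0.7]   # values of the four neighbouring states
# Equiprobable policy (probability 1/4 per action), deterministic transitions, zero reward.
v_centre = sum(0.25 * 1.0 * (0.0 + gamma * v) for v in neighbour_values)
print(round(v_centre, 3))                  # 0.675, which rounds to the tabulated 0.7
```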
I The optimal policies π∗ share the same state-value function, called the optimal
state-value function, denoted v∗ , and defined as
v_∗(s) := max_π v_π(s),   for all s ∈ S.
I Optimal policies also share the same optimal action-value function, denoted q∗ , and
defined as
q_∗(s, a) := max_π q_π(s, a),   for all s ∈ S and all a ∈ A(s).
I Because v_∗ is the value function of a policy, it must satisfy the self-consistency condition
given by the Bellman equation for state values; because it is the optimal value function, this
condition can be written in a special form without reference to any specific policy.
I This form is the Bellman equation for v∗ , or the Bellman optimality equation for v∗ .
I Intuitively, the Bellman optimality equation expresses the fact that the value of a state
under an optimal policy must equal the expected return for the best action from that
state:
v_∗(s) = max_{a ∈ A(s)} q_{π_∗}(s, a)
       = max_a ∑_{s', r} p(s', r | s, a) [r + γ v_∗(s')].
Backup diagrams
I The backup diagrams above show graphically the spans of future states and actions
considered in the Bellman optimality equations for (a) v∗ and (b) q∗ .
I These are the same as the backup diagrams for vπ and qπ given earlier except that
arcs have been added at the agent’s choice points to represent that the maximum over
that choice is taken rather than the expected value given some policy.
I For finite MDPs, the Bellman optimality equation for v∗ has a unique solution
independent of the policy.
I The Bellman optimality equation is actually a system of equations, one for each state,
so if there are n states, then there are n equations in n unknowns.
I If the dynamics p of the environment are known, then in principle one can solve this
system of equations for v∗ using any one of a variety of methods for solving systems of
nonlinear equations.
I One can solve a related set of equations for q∗ .
Dynamic programming
I The term dynamic programming (DP) refers to a collection of algorithms that can be
used to compute optimal policies given a perfect model of the environment as a
Markov decision process.
I Classical DP algorithms are of limited utility in reinforcement learning both because of
their assumption of a perfect model and because of their great computational expense,
but they are still important theoretically.
I DP provides an essential foundation for understanding RL methods.
I In fact, all RL methods can be viewed as attempts to achieve much the same effect as
DP, only with less computation and without assuming a perfect model of the
environment.
Key idea
I The key idea of DP, and of reinforcement learning generally, is the use of value
functions to organise and structure the search for good policies.
I We shall show how DP can be used to compute the value functions we have given in
the previous examples.
I First, we consider how to compute the state-value function vπ for an arbitrary policy π .
I This is called policy evaluation in the DP literature.
I We also refer to it as the prediction problem.
I Recall the Bellman equation for vπ :
v_π(s) = ∑_a π(a | s) ∑_{s', r} p(s', r | s, a) [r + γ v_π(s')],   for all s ∈ S.
I If the environment’s dynamics are completely known, then this is a system of |S|
simultaneous linear equations in |S| unknowns (the vπ (s ), s ∈ S ).
I In principle, its solution is a straightforward, if tedious, computation.
I For our purposes, iterative solution methods are most suitable.
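For small state spaces the linear system can indeed be solved directly. Below is a minimal sketch, assuming the policy-induced transition matrix P_π and expected one-step reward vector r_π have already been computed; the function name is illustrative, not from the source.

```python
import numpy as np

def evaluate_policy_directly(P_pi, r_pi, gamma):
    """Solve the linear Bellman system v = r_pi + gamma * P_pi v exactly.

    P_pi[s, s'] = sum_a pi(a | s) p(s' | s, a)   -- state-transition matrix under pi
    r_pi[s]     = sum_a pi(a | s) r(s, a)        -- expected one-step reward under pi
    """
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```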
I Consider a sequence of approximate value functions v_0, v_1, v_2, . . . , with v_0 chosen
arbitrarily, generated by the update rule
v_{k+1}(s) := ∑_a π(a | s) ∑_{s', r} p(s', r | s, a) [r + γ v_k(s')],   for all s ∈ S.
I Clearly, vk = vπ is a fixed point for this update rule because the Bellman equation for
vπ assures us of equality in this case.
I Indeed, the sequence {vk } can be shown in general to converge to vπ as k → ∞
under the same conditions that guarantee the existence of vπ .
I This algorithm is called iterative policy evaluation.
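A minimal sketch of the two-array version in Python, under the same assumed interfaces as the earlier sketches (policy(s) returns a dict of action probabilities, dynamics(s, a) returns (s_next, reward, probability) triples); theta is an illustrative stopping threshold.

```python
def iterative_policy_evaluation(states, policy, dynamics, gamma, theta=1e-8):
    """Two-array iterative policy evaluation: v_{k+1} is computed from an unchanged copy of v_k."""
    v = {s: 0.0 for s in states}
    while True:
        v_new = {
            s: sum(
                pi_a * sum(p * (r + gamma * v[s_next]) for s_next, r, p in dynamics(s, a))
                for a, pi_a in policy(s).items()
            )
            for s in states
        }
        if max(abs(v_new[s] - v[s]) for s in states) < theta:
            return v_new
        v = v_new
```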
Implementation variants
I The update can be implemented with two arrays (each v_{k+1}(s) is computed from an
unchanged copy of v_k) or in place (each value is overwritten as soon as it is updated);
both variants converge to v_π.
Example: Gridworld
Exercise
I Suppose the agent follows the equiprobable random policy (all actions equally likely).
I Use the two-array version of the iterative policy evaluation algorithm to compute vπ for
π = the equiprobable random policy.
Solution
[Figure: the sequence of value-function estimates v_k (left column) and the corresponding
greedy policies (right column).]
I The left column is the sequence of approximations of the state-value function for the
random policy (all actions equally likely).
I The right column is the sequence of greedy policies corresponding to the value
function estimates (arrows are shown for all actions achieving the maximum, and the
numbers shown are rounded to two significant digits).
I The last policy is guaranteed only to be an improvement over the random policy, but in
this case it, and all policies after the third iteration, are optimal.
Policy improvement
I One reason for computing the value function for a policy is to help find better policies.
I Suppose we have determined the value function vπ for an arbitrary deterministic policy
π.
I For some state s we would like to know whether or not we should change the policy to
deterministically choose an action a ≠ π(s).
I We know how good it is to follow the current policy from s — that is vπ (s ) — but would
it be better or worse to change to the new policy?
I One way to answer this question is to consider selecting a in s and thereafter following
the existing policy π .
I The value of this way of behaving is
q_π(s, a) = ∑_{s', r} p(s', r | s, a) [r + γ v_π(s')].
I (Policy improvement theorem) Let π and π' be any pair of deterministic policies such that, for all s ∈ S,
q_π(s, π'(s)) ≥ v_π(s).   (1)
I Then the policy π' must be as good as, or better than, π. That is, it must obtain greater
or equal expected return from all states s ∈ S:
v_{π'}(s) ≥ v_π(s).   (2)
I Moreover, if the inequality in (1) is strict at any state, then there must be strict
inequality in (2) at at least one state.
I This result applies in particular to the two policies that we have just considered: an
original policy, π, and a changed policy, π', that is identical to π except that
π'(s) = a ≠ π(s). Obviously (1) holds at all states other than s.
Greedy policy
I So far we have seen how, given a policy and its value function, we can easily evaluate
a change in the policy at a single state to a particular action.
I It is a natural extension to consider changes to all states and to all possible actions,
selecting at each state the action that appears best according to qπ (s , a ).
I In other words, we consider the new greedy policy, π', given by
π'(s) := argmax_a q_π(s, a),
where argmax_a denotes the value of a at which the expression is maximised (with ties
broken arbitrarily).
I The greedy policy takes the action that looks best in the short term (after a one-step
lookahead) according to v_π.
I By construction, the greedy policy meets the conditions of the policy improvement
theorem, so we know that it is as good as, or better than, the original policy.
I The process of making a new policy that improves on an original policy, by making it
greedy with respect to the value function of the original policy, is called policy
improvement.
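A minimal sketch of this greedy improvement step, assuming (as before) a hypothetical dynamics(s, a) returning (s_next, reward, probability) triples and an actions(s) callable listing the available actions:

```python
def greedy_policy(states, actions, dynamics, v, gamma):
    """Return the deterministic policy that is greedy with respect to v (ties broken arbitrarily)."""
    def q(s, a):
        # One-step lookahead: expected reward plus discounted value of the successor state.
        return sum(p * (r + gamma * v[s_next]) for s_next, r, p in dynamics(s, a))
    return {s: max(actions(s), key=lambda a: q(s, a)) for s in states}
```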
Policy iteration
I Once a policy, π, has been improved using v_π to yield a better policy, π', we can then
compute v_{π'} and improve it again to yield an even better π''.
I We can thus obtain a sequence of monotonically improving policies and value
functions:
π_0 →(E) v_{π_0} →(I) π_1 →(E) v_{π_1} →(I) . . . →(I) π_∗ →(E) v_∗,
where →(E) denotes a policy evaluation and →(I) denotes a policy improvement.
I Each policy is guaranteed to be a strict improvement over the previous one (unless it is
already optimal).
I Because a finite MDP has only a finite number of policies, this process must converge
to an optimal policy and optimal value function in a finite number of iterations.
I This way of finding an optimal policy is called policy iteration.
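Putting the two previous sketches together gives a compact, illustrative policy iteration loop; it reuses the hypothetical iterative_policy_evaluation and greedy_policy functions from above.

```python
def policy_iteration(states, actions, dynamics, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable (a sketch)."""
    pi = {s: next(iter(actions(s))) for s in states}           # arbitrary initial deterministic policy
    while True:
        v = iterative_policy_evaluation(
            states, lambda s: {pi[s]: 1.0}, dynamics, gamma)   # deterministic pi as a distribution
        pi_new = greedy_policy(states, actions, dynamics, v, gamma)
        if pi_new == pi:
            return pi, v
        pi = pi_new
```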
Value iteration
I One drawback to policy iteration is that each of its iterations involves policy evaluation,
which may itself be a protracted iterative computation requiring multiple sweeps
through the state set.
I If policy evaluation is done iteratively, then convergence exactly to vπ occurs only in
the limit.
I Must we wait for exact convergence, or can we stop short of that?
I The policy evaluation step of policy iteration can be truncated in several ways without
losing the convergence guarantees of policy iteration.
I One important special case is when policy evaluation is stopped after just one sweep
(one update of each state).
I This algorithm is called value iteration.
I It can be written as a particularly simple update operation that combines the policy
improvement and truncated policy evaluation steps:
v_{k+1}(s) := max_a E[R_{t+1} + γ v_k(S_{t+1}) | S_t = s, A_t = a]
            = max_a ∑_{s', r} p(s', r | s, a) [r + γ v_k(s')],
for all s ∈ S.
I For arbitrary v0 , the sequence {vk } can be shown to converge to v∗ under the same
conditions that guarantee the existence of v∗ .
I Another way of understanding value iteration is by reference to the Bellman optimality
equation.
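A minimal in-place sketch of value iteration under the same assumed interfaces (actions(s) and dynamics(s, a)); theta is an illustrative stopping threshold on the largest change in a sweep.

```python
def value_iteration(states, actions, dynamics, gamma, theta=1e-8):
    """Value iteration: one max-backup per state per sweep, updating the estimates in place."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (r + gamma * v[s_next]) for s_next, r, p in dynamics(s, a))
                for a in actions(s)
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v
```

A greedy policy with respect to the returned estimates (as in the earlier sketch) then recovers an approximately optimal policy.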
Bibliography
Richard Bellman.
An introduction to the theory of dynamic programming.
Technical report, RAND Corporation, 1953.
Richard Bellman.
Dynamic Programming.
Princeton University Press, NJ, 1957.
Richard Bellman.
Eye of the Hurricane: An Autobiography.
World Scientific, 1984.
Dimitri P. Bertsekas.
Dynamic programming and optimal control, Volume I.
Athena Scientific, Belmont, MA, 2001.
Dimitri P. Bertsekas.
Dynamic programming and optimal control, Volume II.
Athena Scientific, Belmont, MA, 2005.
Gely P. Basharin, Amy N. Langville, and Valeriy A. Naumov.
Numerical Solution of Markov Chains, chapter The Life and Work of A. A. Markov,
pages 1–22.
CRC Press, 1991.
Nicole Bäuerle and Ulrich Rieder.
Markov Decision Processes with Applications to Finance.
Springer, 2011.
Martin L. Puterman.
Markov decision processes: discrete stochastic dynamic programming.
John Wiley & Sons, New York, 1994.
Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning: An Introduction.
MIT Press, 2nd edition, 2018.
Lloyd Stowell Shapley.
Stochastic games.
Proceedings of the National Academy of Sciences of the United States of America,
39(10):1095–1100, October 1953.
David Silver.
Lectures on reinforcement learning.
url: https://www.davidsilver.uk/teaching/, 2015.
Csaba Szepesvári.
Algorithms for Reinforcement Learning.
Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan &
Claypool, 2010.
Hado van Hasselt.
Lectures on reinforcement learning.
url: https://hadovanhasselt.com/2016/01/12/ucl-course/, 2016.