3 Markov Decision Processes


Markov Decision Processes and Dynamic Programming

Markov Decision Processes and


Dynamic Programming

Paul Alexander Bilokon, PhD

Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB

2023.01.20
Markov Decision Processes and Dynamic Programming

Primary sources

This lecture is based on the following primary sources:


I Chapters 3 (Finite Markov Decision Processes) and 4 (Dynamic Programming)
of [SB18] by the ‘fathers’ of RL.
I Chapter 1 (Markov Decision Processes) of [Sze10].
I Lectures on MDPs and Dynamic Programming by David Silver [Sil15] and Hado
van Hasselt [vH16] of DeepMind.
Markov Decision Processes and Dynamic Programming

Books on Markov decision processes and dynamic programming


I Lloyd Stowell Shapley. Stochastic Games. Proceedings of the National Academy of Sciences of
the United States of America, October 1, 1953, 39 (10), 1095–1100 [Sha53].
I Richard Bellman. Dynamic Programming. Princeton University Press, NJ 1957 [Bel57].
I Ronald A. Howard. Dynamic programming and Markov processes. The Technology Press of
M.I.T., Cambridge, Mass. 1960 [How60].
I Dimitri P. Bertsekas and Steven E. Shreve. Stochastic optimal control. Academic Press, New
York, 1978 [BS78].
I Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John
Wiley & Sons, New York, 1994 [Put94].
I Onesimo Hernández-Lerma and Jean B. Lasserre. Discrete-time Markov control processes.
Springer-Verlag, New York, 1996 [HLL96].
I Dimitri P. Bertsekas. Dynamic programming and optimal control, Volume I. Athena Scientific,
Belmont, MA, 2001 [Ber01].
I Dimitri P. Bertsekas. Dynamic programming and optimal control, Volume II. Athena Scientific,
Belmont, MA, 2005 [Ber05].
I Eugene A. Feinberg and Adam Shwartz. Handbook of Markov decision processes. Kluwer
Academic Publishers, Boston, MA, 2002 [FS02].
I Warren B. Powell. Approximate dynamic programming. Wiley-Interscience, Hoboken, NJ,
2007 [Pow07].
I Nicole Bäuerle and Ulrich Rieder. Markov Decision Processes with Applications to Finance.
Springer, 2011 [BR11].
I Alekh Agarwal, Nan Jiang, Sham M. Kakade, Wen Sun. Reinforcement Learning: Theory and
Algorithms. A draft is available at https://rltheorybook.github.io/
Markov Decision Processes and Dynamic Programming

Increasing the complexity (and realism)


I Bandits constitute a simplified setting: multiple actions but one single state. Moreover,
the agent has no model of the environment.
I We shall now consider models that involve evaluative feedback (as in bandits) but
also an associative aspect—choosing different actions in different situations.
I We shall consider sequential decision making, where actions influence not just the
immediate rewards, but also subsequent situations, or states, and, through those,
future rewards.
I Since the reward may be delayed, there may be a need to trade off immediate and
delayed reward.
I Whereas in bandit problems we estimated the value q (a ) of each action a, in this new
setting we estimate the value q∗ (s , a ) for each action a in each state s, or we
estimate the value v∗ (s ) of each state given optimal action selections.
I Note that these quantities, q∗ (s , a ) and v∗ (s ), are state-dependent.
I Markov decision processes (MDPs) (we’ll introduce them shortly) are a
mathematically idealized form of the reinforcement learning problem for which precise
theoretical statements can be made.
I We try to convey the wide range of applications that can be formulated as finite MDPs.
I As in all of ML/AI, there is a tension between breadth of applicability and mathematical
tractability.
Markov Decision Processes and Dynamic Programming

The Markov property

I The agent and the environment interact at each of a sequence of discrete time steps.
I Let {St}t∈N0 be a sequence of random states indexed by time; for all t ∈ N0, St ∈ S.
I A state s has the Markov property if, for all states s′ ∈ S and all rewards r ∈ R,

p(Rt+1 = r, St+1 = s′ | St = s) = p(Rt+1 = r, St+1 = s′ | S1, . . . , St−1, St = s)

for all possible histories S1, . . . , St−1.


I In other words:
I The future is independent of the past given the present.
I The state captures all relevant information from the history.
I Once the state is known, the history may be thrown away.
I The state is a sufficient statistic of the past.
Markov Decision Processes and Dynamic Programming

Andrey Andreyevich Markov

Andrey Andreyevich Markov was a Russian mathe-


matician best known for his work on stochastic pro-
cesses. A primary subject of his research later be-
came known as Markov chains and Markov processes.

Andrey Andreyevich Markov


(1856–1922)
Markov Decision Processes and Dynamic Programming

Letter from Markov to Chuprov dated 6 November, 1910 [BLN91]

6 November 1910
My dear Alexander Alexandrovich:
Of course I was also surprised by your reference to [Ernst Heinrich] Bruns whom I consider a negligible
quantity.
I can judge all work only from a strictly mathematical point of view and from this viewpoint it is clear to me that
neither Bruns nor [Pavel Alekseevich] Nekrasov nor [Karl] Pearson has done anything worthy of note. You
speak about some kinds of most general constructions, but I cannot find these constructions in their work.
Meanwhile I do find highly general theorems from authors whom you have entirely forgotten: [Aleksandr
Mikhailovich] Liapunov and A. A. Markov. The unique service of P. A. Nekrasov, in my opinion, is namely this:
he brings out sharply his delusion, shared, I believe, by many, that independence is a necessary condition for
the law of large numbers. This circumstance prompted me to explain, in a series of articles, that the law of
large numbers and Laplace’s formula can apply also to dependent variables. In this way a construction of a
highly general character was actually arrived at, which P. A. Nekrasov can not even dream about.
I considered variables connected in a simple chain and from this came the idea of the possibility of extending
the limit theorems of the calculus of probability also to a complex chain.
Independence is not required for the application of these theorems, but on the other hand it is necessary
to assume existence of certain constant quantities. This existence is already assumed by the theory and
therefore it is impossible to deduce this from the theory. And so I will stick to my opinion that your reference
to Bruns and Nekrasov is wrong, as long as you do not cite for me their general constructions.
With complete respect,
A. Markov

...and so Markov chains and Markov decision processes were born.


Markov Decision Processes and Dynamic Programming

Convention

I We use Rt +1 (instead of Rt ) to denote the reward due to At because it emphasizes


that the next reward and next state, Rt +1 and St +1 , are jointly determined.
I Both conventions are widely used in the literature.
Markov Decision Processes and Dynamic Programming

Markov decision process

I A Markov decision process (MDP) is a tuple (S , A, p , γ) where


I S is the set of all possible states;
I A is the set of all possible actions;
I the Markov transition density p(s′, r | s, a) is the joint probability of a next state s′ and a
reward r, given a state s and action a;
I γ ∈ [0, 1] is a discount factor that trades off later rewards for earlier ones.
I The state is assumed to have the Markov property.
Markov Decision Processes and Dynamic Programming

The Markov transition density

I The Markov transition density p(s′, r | s, a) is referred to as the dynamics function


because it defines the dynamics of the MDP.
I It defines a probability distribution, and so

∑s′∈S ∑r∈R p(s′, r | s, a) = 1, for all s ∈ S, a ∈ A(s).

I From the four-argument dynamics function one can compute...
I ...the state-transition probabilities,

p(s′ | s, a) := P[St = s′ | St−1 = s, At−1 = a] = ∑r∈R p(s′, r | s, a);

I ...the expected rewards for state–action pairs,

r(s, a) = E[Rt | St−1 = s, At−1 = a] = ∑r∈R r ∑s′∈S p(s′, r | s, a);

I ...the expected rewards for state–action–next-state triples,

r(s, a, s′) = E[Rt | St−1 = s, At−1 = a, St = s′] = ∑r∈R r p(s′, r | s, a) / p(s′ | s, a).
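
I As an illustration, here is a minimal Python sketch (not from the primary sources; the states, actions and probabilities are made up) of computing the three derived quantities from a four-argument dynamics function stored as a dictionary:

```python
# Hypothetical four-argument dynamics: p[(s_next, r, s, a)] = probability.
# For each fixed (s, a) the probabilities sum to one; all names/numbers are made up.
p = {
    ("s1", 1.0, "s0", "a0"): 0.7,
    ("s0", 0.0, "s0", "a0"): 0.3,
    ("s0", 2.0, "s0", "a1"): 1.0,
    ("s1", 0.0, "s1", "a0"): 1.0,
}

def state_transition_prob(p, s_next, s, a):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(prob for (sn, r, ss, aa), prob in p.items()
               if sn == s_next and ss == s and aa == a)

def expected_reward(p, s, a):
    """r(s, a) = sum_r r * sum_{s'} p(s', r | s, a)."""
    return sum(r * prob for (sn, r, ss, aa), prob in p.items()
               if ss == s and aa == a)

def expected_reward_triple(p, s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    num = sum(r * prob for (sn, r, ss, aa), prob in p.items()
              if sn == s_next and ss == s and aa == a)
    return num / state_transition_prob(p, s_next, s, a)

print(state_transition_prob(p, "s1", "s0", "a0"))   # 0.7
print(expected_reward(p, "s0", "a0"))               # 0.7 * 1.0 + 0.3 * 0.0 = 0.7
print(expected_reward_triple(p, "s0", "a0", "s1"))  # 1.0
```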
Markov Decision Processes and Dynamic Programming

Example: recycling robot [SB18] (i)

I A mobile robot has the job of collecting empty soda cans in an office environment.
I High-level decisions about how to search for cans are made by a RL agent based on
the current charge level of the battery.
I There are two charge levels (two states), S = {high, low}.
I In each of these states, the agent can decide whether to
1. actively search for a can for a certain period of time,
2. remain stationary and wait for someone to bring it a can, or
3. head back to its home base to recharge its battery.
(These are the actions.)
I When the energy level is high, recharging would always be foolish, so we do not
include it in the action set for this state.
I The action sets are then
I A(high) = {search, wait} and
I A(low) = {search, wait, recharge}.
Markov Decision Processes and Dynamic Programming

Example: recycling robot [SB18] (ii)

I The best way to find cans is to actively search for them, but this runs down the robot’s
battery, whereas waiting does not.
I Whenever the robot is searching, the possibility exists that its battery will become
depleted. In this case the robot must shut down and wait to be rescued (producing a
low reward).
I If the energy level is high, then a period of active search can always be completed
without a risk of depleting the battery.
I A period of searching that begins with a high energy level leaves the energy level
high with probability α and reduces it to low with probability 1 − α. A period of
searching undertaken when the energy level is low leaves it low with probability β and
depletes the battery with probability 1 − β.
I In the latter case the robot must be rescued, and the battery is then recharged back to
high.
Markov Decision Processes and Dynamic Programming

Example: recycling robot [SB18] (iii)

I The rewards are zero most of the time, but become positive when the robot secures an
empty can, or large and negative if the battery runs all the way down.
I Each can collected by the robot counts as a unit reward, whereas a reward of −3
results whenever the robot has to be rescued.
I Let rsearch and rwait, with rsearch > rwait, respectively denote the expected number of cans
the robot will collect (and hence the expected reward) while searching and while waiting.
I No cans can be collected during a run home for recharging, and no cans can be
collected on a step in which the battery is depleted.
Markov Decision Processes and Dynamic Programming

Example: recycling robot [SB18] (iv)

s      a         s′     p(s′ | s, a)    r(s, a, s′)
high   search    high   α               rsearch
high   search    low    1 − α           rsearch
low    search    high   1 − β           −3
low    search    low    β               rsearch
high   wait      high   1               rwait
high   wait      low    0               rwait
low    wait      high   0               rwait
low    wait      low    1               rwait
low    recharge  high   1               0
low    recharge  low    0               0

Note that there is a row in the table for each possible combination of current state, s, action
a ∈ A(s), and next state, s′.
Markov Decision Processes and Dynamic Programming

Example: recycling robot [SB18] (v)

s      a         s′     r          p(s′, r | s, a)
high   search    high   rsearch    α
high   search    low    rsearch    1 − α
low    search    high   −3         1 − β
low    search    low    rsearch    β
high   wait      high   rwait      1
high   wait      low    rwait      0
low    wait      high   rwait      0
low    wait      low    rwait      1
low    recharge  high   0          1
low    recharge  low    0          0

This table is analogous to the previous one, but it uses p(s′, r | s, a) instead of p(s′ | s, a)
and r(s, a, s′).
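
The table translates directly into a data structure. A minimal Python sketch, assuming illustrative values α = 0.9, β = 0.6, rsearch = 2 and rwait = 1 (these numbers are not from [SB18]):

```python
alpha, beta = 0.9, 0.6          # assumed battery dynamics, not from [SB18]
r_search, r_wait = 2.0, 1.0     # assumed expected can counts, with r_search > r_wait

# dynamics[(s, a)] = list of (s_next, reward, probability), probabilities summing to 1
dynamics = {
    ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("low",  "search"):   [("high", -3.0, 1 - beta), ("low", r_search, beta)],
    ("high", "wait"):     [("high", r_wait, 1.0)],
    ("low",  "wait"):     [("low", r_wait, 1.0)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}

# Sanity check: each (s, a) pair defines a probability distribution over (s', r).
for (s, a), outcomes in dynamics.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-12, (s, a)
```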
Markov Decision Processes and Dynamic Programming

Example: recycling robot [SB18] (vi)

I Another useful way of summarising the dynamics of a finite MDP is a transition


graph.
I There are two kinds of nodes: state nodes and action nodes.

Figure: Transition graph for the recycling robot example


Markov Decision Processes and Dynamic Programming

Example: student MDP [Sil15]

Figure: Transition graph for the student MDP example


Markov Decision Processes and Dynamic Programming

Goals and rewards

I The reward hypothesis1 :


All of what we mean by goals and purposes can be well thought of as the maxi-
mization of the expected value of the cumulative sum of a received scalar signal
called reward.
I The use of a reward signal to formalize the idea of a goal is one of the most distinctive
features of reinforcement learning.
I If we want an agent to do something for us, we must provide rewards to it in such a
way that in maximizing them the agent will also achieve our goals—this is sometimes
referred to as reward engineering.

1 http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html
Markov Decision Processes and Dynamic Programming

Examples of rewards

I To make a robot learn to find and collect empty soda cans for recycling, one might give
it a reward of zero most of the time, and then a reward of +1 for each can collected.
One might also give the robot negative rewards when it bumps into things or when
somebody shouts at it.
I For an agent to learn to play checkers or chess, the natural rewards are +1 for winning,
-1 for losing, and 0 for drawing and for all nonterminal positions.
Markov Decision Processes and Dynamic Programming

What, not how

I The reward signal is not the place to impart to the agent prior knowledge about how to
achieve what we want it to do.
I Better places for imparting this kind of prior knowledge are the initial policy or initial
value function, or in influences on these.
I For example, a chess-playing agent should be rewarded only for actually winning, not
for achieving subgoals such as taking its opponent’s pieces or gaining control of the
centre of the board.
I If achieving these sorts of subgoals were rewarded, then the agent might find a way to
achieve them without achieving the real goal. For example, it might find a way to take
the opponent’s pieces even at the cost of losing the game.
I The reward signal is your way of communicating to the robot what you want it to
achieve, not how you want it achieved.
Markov Decision Processes and Dynamic Programming

Returns and episodes

I If the sequence of rewards received after time step t is denoted

Rt +1 , Rt +2 , Rt +3 , . . . ,

then what precise aspect of this sequence do we wish to maximize?


I In general, we seek to maximize the expected return, where the return, denoted Gt ,
is defined as some specific function of the reward sequence.
I In the simplest case the return is the sum of the rewards:

Gt := Rt +1 + Rt +2 + . . . + RT ,

where T is a final time step.


I This approach makes sense in applications in which there is a natural notion of final
time step, that is, when the agent–environment interaction breaks naturally into
subsequences, which we call episodes (sometimes called trials in the literature),
such as plays of a game, trips through a maze, or any sort of repeated interaction.
I Each episode ends in a special state called the terminal state, followed by a reset to a
standard starting state or to a sample from a standard distribution of starting states.
Markov Decision Processes and Dynamic Programming

Episodic versus continuing tasks

I Tasks with episodes of this kind are called episodic tasks.


I In episodic tasks we sometimes need to distinguish the set of all nonterminal states,
denoted S , from the set of all states plus the terminal states, denoted S + .
I The time of termination, T , is a random variable that normally varies from episode to
episode.
I On the other hand, in many cases the agent–environment interaction does not break
naturally into identifiable episodes, but goes on continually without limit.
I We call these continuing tasks.
I In order to formulate returns for continuing tasks we need the additional concept of
discounting.
Markov Decision Processes and Dynamic Programming

Example: student MDP [Sil15]

Sample episodes for student MDP starting


from S1 = C1:

S1 , S2 , . . . , ST .

I C1, C2, C3, Pass, Sleep.


I C1, FB, FB, C1, C2, Sleep.
I C1, C2, C3, Pub, C2, C3, Pass, Sleep.
I C1, FB, FB, C1, C2, C3, Pub, C1, FB,
FB, FB, C1, C2, C3, Pub, C2, Sleep.
Figure: Transition graph for the student MDP
example
Markov Decision Processes and Dynamic Programming

Discounted return
I The discounted return is given by

Gt := Rt+1 + γRt+2 + γ²Rt+3 + . . . = ∑k≥0 γ^k Rt+k+1,

where γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate.

I If γ = 0, the agent is “myopic” in being concerned only with maximizing immediate
rewards. As γ approaches 1, the return objective takes future rewards into account
more strongly; the agent becomes more “farsighted”.
I Note that

Gt := Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + . . .
= Rt+1 + γ(Rt+2 + γRt+3 + . . .)
= Rt+1 + γGt+1.

I Although the return is a sum of an infinite number of terms, it is still finite if the reward
is nonzero and constant — provided γ < 1.
I For example, if the reward is a constant +1, then the return is

Gt = ∑k≥0 γ^k = 1/(1 − γ).
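
I The recursion Gt = Rt+1 + γGt+1 suggests computing returns backwards from the end of a finite reward sequence. A minimal sketch on a made-up reward sequence:

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for a finite reward sequence R_1, ..., R_T,
    computed backwards via G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))   # [2.71, 1.9, 1.0]
print(1.0 / (1.0 - 0.9))   # limit of the constant-(+1) return as T grows: 10.0
```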
Markov Decision Processes and Dynamic Programming

Why discount?

Most Markov decision processes are discounted. Why?


I Mathematically convenient to discount rewards.
I Avoids infinite returns in cyclic Markov processes.
I Uncertainty about the future may not be fully represented.
I If the reward is financial, immediate rewards may earn more interest than delayed
rewards.
I Animal/human behaviour shows preference for immediate reward—instant versus
delayed gratification2 .

2 But see [ME70, MEZ72].


Markov Decision Processes and Dynamic Programming

Example: pole balancing [SB18]

The objective in this task is to apply forces to a cart moving along a track so as to keep a pole
hinged to the cart from falling over: A failure is said to occur if the pole falls past a given angle
from vertical or if the cart runs off the track. The pole is reset to vertical after each failure.
The task could be treated as episodic, where the natural episodes are the repeated attempts
to balance the pole. The reward in this case could be +1 for every time step on which failure
did not occur, so that the return at each time would be the number of steps until failure. In
this case, successful balancing forever would mean a return of infinity. Alternatively, we could
treat pole-balancing as a continuing task, using discounting. In this case the reward would
be −1 on each failure and zero at all other times. The return at each time would then be
related to −γ^K, where K is the number of time steps before failure. In either case, the return
is maximized by keeping the pole balanced for as long as possible.
Markov Decision Processes and Dynamic Programming

Policy

I A policy is a mapping from states to probabilities of selecting each possible action.


I If the agent is following policy π at time t, then π (a | s ) is the probability that At = a if
St = s.
I π (a | s ) defines a probability distribution over a ∈ A(s ) for each s ∈ S .
I Reinforcement learning methods specify how the agent’s policy is changed as a result
of its experience.
Markov Decision Processes and Dynamic Programming

The state-value function and the action-value function

I The state-value function of a state s under a policy π , denoted vπ (s ), is the


expected return when starting in s and following π thereafter.
I For an MDP, we can define vπ formally by

vπ(s) := Eπ[Gt | St = s] = Eπ[∑k≥0 γ^k Rt+k+1 | St = s],

for all s ∈ S, where Eπ denotes the expected value of a random variable given that
the agent follows policy π, and t is any time step.
I Note that the value of a terminal state, if any, is always zero.
I Similarly, we define the action-value function of taking action a in state s under a
policy π as the expected return starting from s, taking the action a, and thereafter
following policy π:

qπ(s, a) := Eπ[Gt | St = s, At = a] = Eπ[∑k≥0 γ^k Rt+k+1 | St = s, At = a].
I We shall refer to vπ and qπ collectively as the value functions.


Markov Decision Processes and Dynamic Programming

A relationship between vπ and qπ

I We can obtain an equation for vπ in terms of qπ and π:

vπ(s) = ∑a∈A π(a | s) qπ(s, a).

I We can obtain an equation for qπ in terms of vπ and the four-argument p:

qπ(s, a) = ∑s′∈S ∑r∈R p(s′, r | s, a) [r + γvπ(s′)].
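
I Both identities are straightforward to express in code. A minimal sketch, assuming tabular dictionaries policy[s][a] = π(a | s) and dynamics[(s, a)] = list of (s′, r, probability) triples:

```python
def v_from_q(policy, q, s):
    """v_pi(s) = sum_a pi(a | s) q_pi(s, a)."""
    return sum(pi_a * q[(s, a)] for a, pi_a in policy[s].items())

def q_from_v(dynamics, v, s, a, gamma):
    """q_pi(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v_pi(s')]."""
    return sum(prob * (r + gamma * v[s_next])
               for s_next, r, prob in dynamics[(s, a)])

# Tiny illustrative usage on a two-state toy MDP (made-up numbers).
dynamics = {("s0", "a"): [("s1", 1.0, 1.0)], ("s1", "a"): [("s1", 0.0, 1.0)]}
policy = {"s0": {"a": 1.0}, "s1": {"a": 1.0}}
v = {"s0": 0.0, "s1": 0.0}
q = {("s0", "a"): q_from_v(dynamics, v, "s0", "a", gamma=0.9)}
print(q[("s0", "a")], v_from_q(policy, q, "s0"))  # 1.0 1.0
```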
Markov Decision Processes and Dynamic Programming

Estimating the value functions

I The value functions vπ and qπ can be estimated from experience.


I For example, if an agent follows policy π and maintains an average, for each state
encountered, of the actual returns that have followed that state, then the average will
converge to the state’s value vπ (s ), as the number of times that state is encountered
approaches infinity.
I If separate averages are kept for each action taken in each state, then these averages
will similarly converge to the action values, qπ (s , a ).
I We call estimation methods of this kind Monte Carlo methods because they involve
averaging over many random samples of actual returns.
I If there are very many states, then it may not be practical to keep separate averages
for each state individually. Instead, the agent would have to maintain vπ and qπ as
parameterised functions (with fewer parameters than states) and adjust the
parameters to better match the observed returns.
I This can also produce accurate estimates, although much depends on the nature of
the parameterised function approximator.
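
I A minimal sketch of the averaging idea (every-visit Monte Carlo on a made-up two-state toy MDP; the representation and helper names are assumptions, not from the sources):

```python
import random
from collections import defaultdict

def sample_episode(dynamics, policy, start, terminal, max_steps=100):
    """Simulate one episode; returns the visited states and the rewards R_1..R_T.
    dynamics[(s, a)] = list of (s_next, reward, probability)."""
    states, rewards, s = [start], [], start
    for _ in range(max_steps):
        if s == terminal:
            break
        a = random.choices(list(policy[s]), weights=list(policy[s].values()))[0]
        outcomes = dynamics[(s, a)]
        s_next, r, _ = random.choices(outcomes, weights=[p for _, _, p in outcomes])[0]
        rewards.append(r)
        states.append(s_next)
        s = s_next
    return states, rewards

def mc_state_values(dynamics, policy, start, terminal, gamma, n_episodes=5000):
    """Every-visit Monte Carlo estimate of v_pi: average the sampled returns."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_episodes):
        states, rewards = sample_episode(dynamics, policy, start, terminal)
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            totals[states[t]] += g
            counts[states[t]] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Toy example with made-up numbers: from s0, the single action reaches the
# terminal state T with probability 0.5 (reward 1) or stays in s0 (reward 0).
dynamics = {("s0", "a"): [("T", 1.0, 0.5), ("s0", 0.0, 0.5)]}
policy = {"s0": {"a": 1.0}}
print(mc_state_values(dynamics, policy, "s0", "T", gamma=1.0))  # v(s0) ~ 1.0
```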
Markov Decision Processes and Dynamic Programming

Example: student MDP [Sil15] (i)

Figure: State-value function for the student MDP, γ = 0


Markov Decision Processes and Dynamic Programming

Example: student MDP [Sil15] (ii)

Figure: State-value function for the student MDP, γ = 0.9


Markov Decision Processes and Dynamic Programming

Example: student MDP [Sil15] (iii)

Figure: State-value function for the student MDP, γ = 1


Markov Decision Processes and Dynamic Programming

The Bellman equation (i)


I Value functions satisfy recursive relationships similar to that which we have already
established for return.
I For any policy π and any state s, the following consistency condition holds between
the value of s and the value of its possible successor states:

vπ(s) := Eπ[Gt | St = s]
= Eπ[Rt+1 + γGt+1 | St = s]
= ∑a π(a | s) ∑s′ ∑r p(s′, r | s, a) [r + γEπ[Gt+1 | St+1 = s′]]
= ∑a π(a | s) ∑s′,r p(s′, r | s, a) [r + γvπ(s′)], for all s ∈ S,

where it is implicit that the actions, a, are taken from the set A(s), that the next states,
s′, are taken from the set S (or from S+ in the case of an episodic problem), and that
the rewards, r, are taken from the set R.
I We have merged the two sums ∑s′ ∑r into one sum ∑s′,r over all the possible values
of both s′ and r. We use this kind of merged sum to simplify formulas.
I The final expression is an expected value: it is a sum over all values of the three
variables, a, s′, and r. For each triple, we compute its probability
π(a | s)p(s′, r | s, a), weight the quantity in brackets by that probability, then sum over
all possibilities to get an expected value.
Markov Decision Processes and Dynamic Programming

The Bellman equation (ii)


I The equation

vπ(s) = ∑a π(a | s) ∑s′,r p(s′, r | s, a) [r + γvπ(s′)], for all s ∈ S,

is the Bellman equation for vπ.


I It expresses a relationship between the value of a state and the values of its successor states.
I Think of looking ahead from a state to its possible successor states, as suggested by the backup
diagram:

I Each open circle represents a state and each solid circle represents a state-action pair.
I Starting from state s, the root node at the top, the agent could take any of some set of
actions—three are shown in the diagram—based on its policy π . From each of these, the
environment could respond with one of several next states, s 0 (two are shown in the figure), along
with a reward, r, depending on its dynamics given by the function p.
I The Bellman equation averages over all the possibilities, weighting each by its probability of
occurring. It states that the value of the start state must equal the (discounted) value of the
expected next state, plus the reward expected along the way.
Markov Decision Processes and Dynamic Programming

The Bellman equation (iii)

I The value function vπ is the unique solution to its Bellman equation

vπ(s) = ∑a π(a | s) ∑s′,r p(s′, r | s, a) [r + γvπ(s′)], for all s ∈ S.

I The Bellman equation forms the basis of a number of ways to


I compute,
I approximate,
I and learn
the value function vπ .
I These are collectively referred to as dynamic programming.
Markov Decision Processes and Dynamic Programming

The Bellman equation for state–action values (qπ )

We can derive the corresponding Bellman equation for state–action values, that is, for qπ :

qπ(s, a) = Eπ[Gt | St = s, At = a]
= Eπ[Rt+1 + γGt+1 | St = s, At = a]
= ∑s′ ∑r p(s′, r | s, a) [r + γEπ[Gt+1 | St+1 = s′]]
= ∑s′,r p(s′, r | s, a) [r + γ ∑a′ π(a′ | s′) Eπ[Gt+1 | St+1 = s′, At+1 = a′]]
= ∑s′,r p(s′, r | s, a) [r + γ ∑a′ π(a′ | s′) qπ(s′, a′)].
Markov Decision Processes and Dynamic Programming

Richard Bellman

Richard Ernest Bellman (1920–1984) was an Ameri-


can applied mathematician, who introduced dynamic
programming in 1953 [Bel53], and made important
contributions in other fields of mathematics.

Richard E. Bellman
(1920–1984)
Markov Decision Processes and Dynamic Programming

Why ‘Dynamic Programming’?

An interesting question is, ‘Where did the name, dynamic programming, come from?’ The
1950s were not good years for mathematical research. We had a very interesting gentleman in
Washington named Wilson. He was Secretary of Defense, and he actually had a pathological
fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely.
His face would suffuse, he would turn red, and he would get violent if people used the term,
research, in his presence. You can imagine how he felt, then, about the term, mathematical.
The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its
boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from
the fact that I was really doing mathematics inside the RAND Corporation. What title, what
name, could I choose? In the first place I was interested in planning, in decision making, in
thinking. But planning, is not a good word for various reasons. I decided therefore to use
the word, ‘programming’. I wanted to get across the idea that this was dynamic, this was
multistage, this was time-varying—I thought, let’s kill two birds with one stone. Let’s take
a word that has an absolutely precise meaning, namely dynamic, in the classical physical
sense. It also has a very interesting property as an adjective, and that is it’s impossible to use
the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly
give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a
good name. It was something not even a Congressman could object to. So I used it as an
umbrella for my activities. [Bel84]
Markov Decision Processes and Dynamic Programming

Example: student MDP [Sil15]

Figure: Bellman equation applied to the student MDP with γ = 1


Markov Decision Processes and Dynamic Programming

Example: Gridworld

I The cells of the grid correspond to the states of the environment.


I At each cell, four actions are possible: north, south, east, and west, which
deterministically cause the agent to move one cell in the respective direction on the
grid.
I Actions that would take the agent off the grid leave its location unchanged, but also
result in a reward of −1.
I Other actions result in a reward of 0, except those that move the agent out of the
special states A and B.
I From state A, all four actions yield a reward of +10 and take the agent to A′.
I From state B, all actions yield a reward of +5 and take the agent to B′.
I Above we have diagrammatically shown the exceptional reward dynamics (a) and the
state-value function for the equiprobable random policy.
Markov Decision Processes and Dynamic Programming

Exercise

I The Bellman equation must hold for each state for the value function vπ shown above.
I Show numerically that this equation holds for the centre state, valued at +0.7, with
respect to its four neighbouring states, valued at +2.3, +0.4, -0.4, +0.7.
I These values are accurate only to one decimal place.
Markov Decision Processes and Dynamic Programming

Solution

I Recall the Bellman equation

vπ(s) = ∑a π(a | s) ∑s′,r p(s′, r | s, a) [r + γvπ(s′)], for all s ∈ S.

I The possible actions from s = centre are A(s) = {north, east, south, west}.
I We are considering the equiprobable policy π, and so

π(north | centre) = π(east | centre) = π(south | centre) = π(west | centre) = 1/4.

I In this example, the state–action pair (s, a) determines a unique (s′, r) pair, and so all
the p(s′, r | s, a) are 1, and the Bellman equation becomes

vπ(s) = 1/4 × 1 × (0 + 0.9 × 2.3) + 1/4 × 1 × (0 + 0.9 × 0.4)
      + 1/4 × 1 × (0 + 0.9 × (−0.4)) + 1/4 × 1 × (0 + 0.9 × 0.7)
      = 1/4 × 0.9 × (2.3 + 0.4 − 0.4 + 0.7) = 1/4 × 0.9 × 3.0 = 0.675 ≈ 0.7.
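
I The same check in code (the values and γ = 0.9 are those used above):

```python
gamma = 0.9
neighbour_values = [2.3, 0.4, -0.4, 0.7]   # north, east, south, west neighbours
# Each action has probability 1/4, deterministic transition, zero immediate reward.
v_centre = sum(0.25 * 1.0 * (0.0 + gamma * v) for v in neighbour_values)
print(round(v_centre, 3))  # 0.675, which rounds to 0.7
```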
Markov Decision Processes and Dynamic Programming

Ordering on policies, optimal policy

I A policy π is said to be better than or equal to a policy π 0 if its expected return is


greater than or equal to that of π 0 for all states.
I In symbols, π ≥ π 0 iff vπ (s ) ≥ vπ 0 (s ) for all s ∈ S .
I There is always at least one policy that is better than or equal to all other policies. This
policy is an optimal policy.
I Although there may be more than one, we denote all the optimal policies by π∗ .
Markov Decision Processes and Dynamic Programming

Optimal state-value function and optimal action-value function

I The optimal policies π∗ share the same state-value function, called the optimal
state-value function, denoted v∗ , and defined as

v∗(s) := maxπ vπ(s),

for all s ∈ S .
I Optimal policies also share the same optimal action-value function, denoted q∗ , and
defined as
q∗(s, a) := maxπ qπ(s, a),

for all s ∈ S and a ∈ A(s ).


I For the state–action pair (s , a ), this function gives the expected return for taking action
a in state s and thereafter following an optimal policy.
I Thus we can write q∗ in terms of v∗ as follows:

q∗ (s , a ) = E [Rt +1 + γv∗ (St +1 ) | St = s , At = a ] .


Markov Decision Processes and Dynamic Programming

Bellman optimality equation for v∗


I Because v∗ is the value function for a policy, it must satisfy the self-consistency
condition given by the Bellman equation for state values

vπ(s) = ∑a π(a | s) ∑s′,r p(s′, r | s, a) [r + γvπ(s′)], for all s ∈ S.

I Because v∗ is the optimal value function, this consistency condition can be written in a
special form without reference to any specific policy.
I This form is the Bellman equation for v∗ , or the Bellman optimality equation for v∗ .
I Intuitively, the Bellman optimality equation expresses the fact that the value of a state
under an optimal policy must equal the expected return for the best action from that
state:

v∗(s) = maxa∈A(s) qπ∗(s, a)
= maxa Eπ∗[Gt | St = s, At = a]
= maxa Eπ∗[Rt+1 + γGt+1 | St = s, At = a]
= maxa E[Rt+1 + γv∗(St+1) | St = s, At = a]
= maxa ∑s′,r p(s′, r | s, a) [r + γv∗(s′)].
Markov Decision Processes and Dynamic Programming

Bellman optimality equation for q∗

I The Bellman optimality equation for q∗ is


 
q∗(s, a) = E[Rt+1 + γ maxa′ q∗(St+1, a′) | St = s, At = a]
= ∑s′,r p(s′, r | s, a) [r + γ maxa′ q∗(s′, a′)].
Markov Decision Processes and Dynamic Programming

Backup diagrams

I The backup diagrams above show graphically the spans of future states and actions
considered in the Bellman optimality equations for (a) v∗ and (b) q∗ .
I These are the same as the backup diagrams for vπ and qπ given earlier except that
arcs have been added at the agent’s choice points to represent that the maximum over
that choice is taken rather than the expected value given some policy.
Markov Decision Processes and Dynamic Programming

Solving the Bellman optimality equation for v∗

I For finite MDPs, the Bellman optimality equation for v∗ has a unique solution
independent of the policy.
I The Bellman optimality equation is actually a system of equations, one for each state,
so if there are n states, then there are n equations in n unknowns.
I If the dynamics p of the environment are known, then in principle one can solve this
system of equations for v∗ using any one of a variety of methods for solving systems of
nonlinear equations.
I One can solve a related set of equations for q∗ .
Markov Decision Processes and Dynamic Programming

Determining the optimal policy from v∗


I Once one has v∗ , it is relatively easy to determine an optimal policy.
I For each state s, there will be one or more actions at which the maximum is obtained
in the Bellman optimality equation.
I Any policy that assigns nonzero probability only to these actions is an optimal policy.
I You can think of this as a one-step-ahead search. If you have the optimal value
function, v∗ , then the actions that appear best after a one-step-ahead search will be
optimal actions.
I Another way of saying this is that any policy that is greedy with respect to the optimal
evaluation function v∗ is an optimal policy.
I The term greedy is used in computer science to describe any search or decision
procedure that selects alternatives based only on local or immediate considerations,
without considering the possibility that such a selection may prevent future access to
even better alternatives. Consequently, it describes policies that select actions based
only on their short-term consequences.
I The beauty of v∗ is that if one uses it to evaluate the short-term consequences of
actions — specifically, the one-step-ahead consequences — then a greedy policy is
actually optimal in the long-term sense in which we are interested because v∗ already
takes into account the reward consequences of all possible future behaviour.
I By means of v∗ , the optimal expected long-term return is turned into a quantity that is
locally and immediately available for each state. Hence, a one-step-ahead search
yields the long-term optimal actions.
Markov Decision Processes and Dynamic Programming

Determining the optimal policy from q∗

I Having q∗ makes choosing optimal actions even easier.


I With q∗ , the agent does not even have to do a one-step-ahead search: for any state s,
it can simply find any action that maximises q∗ (s , a ).
I The action-value function effectively caches the results of all one-step-ahead
searches. It provides the optimal expected long-term return as a value that is locally
and immediately available for each state–action pair.
I Hence, at the cost of representing a function of state–action pairs, instead of just
states, the optimal action-value function allows optimal actions to be selected without
having to know anything about possible successor states and their values, that is,
without having to know anything about the environment’s dynamics.
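
I A minimal sketch of this point: given a tabular q (a dictionary over made-up state–action pairs), acting optimally is a single argmax per state, with no model of the dynamics required:

```python
def greedy_action(q, state, actions):
    """Pick any action maximising q(state, a); ties broken by max's ordering."""
    return max(actions, key=lambda a: q[(state, a)])

# Made-up action values for illustration only.
q = {("s0", "left"): 1.0, ("s0", "right"): 2.5,
     ("s1", "left"): 0.3, ("s1", "right"): 0.1}
print(greedy_action(q, "s0", ["left", "right"]))  # right
print(greedy_action(q, "s1", ["left", "right"]))  # left
```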
Markov Decision Processes and Dynamic Programming

Dynamic programming

I The term dynamic programming (DP) refers to a collection of algorithms that can be
used to compute optimal policies given a perfect model of the environment as a
Markov decision process.
I Classical DP algorithms are of limited utility in reinforcement learning both because of
their assumption of a perfect model and because of their great computational expense,
but they are still important theoretically.
I DP provides an essential foundation for understanding RL methods.
I In fact, all RL methods can be viewed as attempts to achieve much the same effect as
DP, only with less computation and without assuming a perfect model of the
environment.
Markov Decision Processes and Dynamic Programming

Key idea

I The key idea of DP, and of reinforcement learning generally, is the use of value
functions to organise and structure the search for good policies.
I We shall show how DP can be used to compute the value functions we have given in
the previous examples.
Markov Decision Processes and Dynamic Programming

Policy evaluation (prediction)

I First, we consider how to compute the state-value function vπ for an arbitrary policy π .
I This is called policy evaluation in the DP literature.
I We also refer to it as the prediction problem.
I Recall the Bellman equation for vπ :

vπ(s) = ∑a π(a | s) ∑s′,r p(s′, r | s, a) [r + γvπ(s′)], for all s ∈ S.

I If the environment’s dynamics are completely known, then this is a system of |S|
simultaneous linear equations in |S| unknowns (the vπ (s ), s ∈ S ).
I In principle, its solution is a straightforward, if tedious, computation.
I For our purposes, iterative solution methods are most suitable.
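
I For a small MDP the linear system can indeed be solved directly. A minimal NumPy sketch: stack the Bellman equations as (I − γPπ)vπ = rπ, where Pπ and rπ are the policy-averaged transition matrix and reward vector (a made-up two-state example):

```python
import numpy as np

gamma = 0.9
# Policy-averaged dynamics of a made-up two-state MDP:
# P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a), r_pi[s] = sum_a pi(a|s) r(s, a).
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])

# Bellman equation v = r_pi + gamma * P_pi v  <=>  (I - gamma * P_pi) v = r_pi.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v)  # v(s1) = 0 (absorbing, zero reward); v(s0) = 1 / 0.55 ~ 1.818
```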
Markov Decision Processes and Dynamic Programming

Iterative policy evaluation

I Consider a sequence of approximate value functions v0 , v1 , v2 , . . . each mapping S to


R.
I The initial approximation, v0 , is chosen arbitrarily (except that the terminal state, if any,
must be given value 0), and each successive approximation is obtained by using the
Bellman equation for vπ as an update rule:

vk+1(s) := Eπ[Rt+1 + γvk(St+1) | St = s]
= ∑a π(a | s) ∑s′,r p(s′, r | s, a) [r + γvk(s′)],

for all s ∈ S.
I Clearly, vk = vπ is a fixed point for this update rule because the Bellman equation for
vπ assures us of equality in this case.
I Indeed, the sequence {vk } can be shown in general to converge to vπ as k → ∞
under the same conditions that guarantee the existence of vπ .
I This algorithm is called iterative policy evaluation.
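
I A minimal sketch of iterative policy evaluation in the tabular dictionary representation used in the earlier sketches (two-array variant; it sweeps until the largest change falls below a tolerance):

```python
def iterative_policy_evaluation(states, policy, dynamics, gamma, tol=1e-8):
    """Repeatedly apply v_{k+1}(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a)[r + gamma v_k(s')]."""
    v = {s: 0.0 for s in states}
    while True:
        v_new = {s: sum(pi_a * sum(prob * (r + gamma * v[s_next])
                                   for s_next, r, prob in dynamics[(s, a)])
                        for a, pi_a in policy[s].items())
                 for s in states}
        delta = max(abs(v_new[s] - v[s]) for s in states)
        v = v_new
        if delta < tol:
            return v

# The same made-up two-state MDP as in the direct-solve sketch above.
states = ["s0", "s1"]
policy = {"s0": {"a": 1.0}, "s1": {"a": 1.0}}
dynamics = {("s0", "a"): [("s0", 1.0, 0.5), ("s1", 1.0, 0.5)],
            ("s1", "a"): [("s1", 0.0, 1.0)]}
print(iterative_policy_evaluation(states, policy, dynamics, gamma=0.9))
# v(s0) ~ 1.818, v(s1) = 0, matching the direct linear solve
```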
Markov Decision Processes and Dynamic Programming

Implementation variants

I To write a sequential computer program to implement iterative policy evaluation you


would have to use two arrays, one for the old values, vk (s ), and one for the new
values, vk +1 (s ).
I With two arrays, the new values can be computed one by one from the old values
without the old values being changed.
I Of course, it is easier to use one array and update the values “in place”, that is, with
each new value immediately overwriting the old one.
I Then, depending on the order in which the states are updated, sometimes new values
are used instead of old ones on the right-hand side of the expected update equation.
I This in-place algorithm also converges to vπ ; in fact, it usually converges faster than
the two-array version, as you might expect, because it uses new data as soon as they
are available.
I We think of the updates as being done in a sweep through the state space.
I For the in-place algorithm, the order in which states have their values updated during
the sweep has a significant influence on the rate of convergence.
I We usually have the in-place version in mind when we think of DP algorithms.
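
I A minimal sketch of a single sweep in both variants (same tabular representation as before; the in_place flag is the only difference):

```python
def evaluation_sweep(states, policy, dynamics, v, gamma, in_place=True):
    """One sweep of iterative policy evaluation over the state space.
    in_place=True : overwrite v[s] immediately, so later states in the sweep
                    may already read the new values.
    in_place=False: two-array variant, every update reads only the old values."""
    source = v if in_place else dict(v)   # the values to read from
    for s in states:
        v[s] = sum(pi_a * sum(prob * (r + gamma * source[s_next])
                              for s_next, r, prob in dynamics[(s, a)])
                   for a, pi_a in policy[s].items())
    return v
```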
Markov Decision Processes and Dynamic Programming

Example: Gridworld

I Consider again the 4 × 4 gridworld.


I The nonterminal states are S = {1, 2, 3, . . . , 14}.
I There are four actions possible in each state, A = {up, down, right, left}, which
deterministically cause the corresponding state transitions, except that actions that
would take the agent off the grid in fact leave the state unchanged.
I The reward is −1 on all transitions until the terminal state is reached.
I The terminal state is shaded in the figure.
I Although it is shown in two places, it is formally one state.
Markov Decision Processes and Dynamic Programming

Exercise

I Suppose the agent follows the equiprobable random policy (all actions equally likely).
I Use the two-array version of the iterative policy evaluation algorithm to compute vπ for
π = the equiprobable random policy.
Markov Decision Processes and Dynamic Programming

Solution (i)
Markov Decision Processes and Dynamic Programming

Solution (ii)
Markov Decision Processes and Dynamic Programming

Solution (iii)

I The left column is the sequence of approximations of the state-value function for the
random policy (all actions equally likely).
I The right column is the sequence of greedy policies corresponding to the value
function estimates (arrows are shown for all actions achieving the maximum, and the
numbers shown are rounded to two significant digits).
I The last policy is guaranteed only to be an improvement over the random policy, but in
this case it, and all policies after the third iteration, are optimal.
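
I A minimal sketch of the computation behind these figures (two-array iterative policy evaluation on the 4 × 4 gridworld, equiprobable random policy, γ = 1, reward −1 per step until termination); the state numbering is an assumption made for illustration:

```python
# Cells 0..15 row by row; 0 and 15 are the (single, duplicated) terminal state.
ROWS = COLS = 4
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]   # up, down, right, left

def step(s, move):
    """Deterministic gridworld transition; off-grid moves leave the state unchanged."""
    row, col = divmod(s, COLS)
    nr, nc = row + move[0], col + move[1]
    if 0 <= nr < ROWS and 0 <= nc < COLS:
        return nr * COLS + nc
    return s

def policy_evaluation(gamma=1.0, tol=1e-6, max_sweeps=10_000):
    v = [0.0] * (ROWS * COLS)
    for _ in range(max_sweeps):
        v_new = list(v)                      # two-array version: read only old values
        for s in range(ROWS * COLS):
            if s in TERMINAL:
                continue
            v_new[s] = sum(0.25 * (-1.0 + gamma * v[step(s, a)]) for a in ACTIONS)
        delta = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if delta < tol:
            break
    return v

v = policy_evaluation()
for r in range(ROWS):
    print(["{:6.1f}".format(v[r * COLS + c]) for c in range(COLS)])
# Converges to the values shown in [SB18]: first row 0, -14, -20, -22, and so on.
```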
Markov Decision Processes and Dynamic Programming

Policy improvement
I One reason for computing the value function for a policy is to help find better policies.
I Suppose we have determined the value function vπ for an arbitrary deterministic policy
π.
I For some state s we would like to know whether or not we should change the policy to
deterministically choose an action a ≠ π(s).
I We know how good it is to follow the current policy from s — that is vπ (s ) — but would
it be better or worse to change to the new policy?
I One way to answer this question is to consider selecting a in s and thereafter following
the existing policy π .
I The value of this way of behaving is

qπ(s, a) := E[Rt+1 + γvπ(St+1) | St = s, At = a]
= ∑s′,r p(s′, r | s, a) [r + γvπ(s′)].

I The key criterion is whether this is greater than or less than vπ (s ).


I If it is greater — that is, if it is better to select a once in s and thereafter follow π than it
would be to follow π all the time — then one would expect it to be better still to select a
every time s is encountered, and that the new policy would in fact be a better one
overall.
Markov Decision Processes and Dynamic Programming

Policy improvement theorem

I Let π and π′ be any pair of deterministic policies such that, for all s ∈ S,

qπ(s, π′(s)) ≥ vπ(s). (1)

I Then the policy π′ must be as good as, or better than, π. That is, it must obtain greater
or equal expected return from all states s ∈ S:

vπ′(s) ≥ vπ(s). (2)

I Moreover, if the inequality in (1) is strict at any state, then there must be strict
inequality in (2) at at least one state.
I This result applies in particular to the two policies that we have just considered, an
original policy, π, and a changed policy, π′, that is identical to π except that
π′(s) = a ≠ π(s). Obviously (1) holds at all states other than s.
Markov Decision Processes and Dynamic Programming

Greedy policy
I So far we have seen how, given a policy and its value function, we can easily evaluate
a change in the policy at a single state to a particular action.
I It is a natural extension to consider changes to all states and to all possible actions,
selecting at each state the action that appears best according to qπ (s , a ).
I In other words, to consider the new greedy policy, π′, given by

π′(s) := arg maxa qπ(s, a)
= arg maxa E[Rt+1 + γvπ(St+1) | St = s, At = a]
= arg maxa ∑s′,r p(s′, r | s, a) [r + γvπ(s′)],

where arg maxa denotes the value of a at which the expression is maximised (with ties
broken arbitrarily).
I The greedy policy takes the action that looks best in the short term — after one step
lookahead — according to vπ .
I By construction, the greedy policy meets the conditions of the policy improvement
theorem, so we know that it is as good as, or better than, the original policy.
I The process of making a new policy that improves on an original policy, by making it
greedy with respect to the value function of the original policy, is called policy
improvement.
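
I A minimal sketch of the policy-improvement step in the same tabular representation (dynamics[(s, a)] = list of (s′, r, probability)); it returns the deterministic greedy policy with respect to a given value function:

```python
def greedy_policy(states, actions, dynamics, v, gamma):
    """pi'(s) = argmax_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')], ties broken by max."""
    def q(s, a):
        return sum(prob * (r + gamma * v[s_next])
                   for s_next, r, prob in dynamics[(s, a)])
    return {s: max(actions[s], key=lambda a: q(s, a)) for s in states}

# Tiny usage on a one-state, two-action toy (made-up numbers): action "b" pays more.
dynamics = {("s0", "a"): [("s0", 0.0, 1.0)], ("s0", "b"): [("s0", 1.0, 1.0)]}
print(greedy_policy(["s0"], {"s0": ["a", "b"]}, dynamics, {"s0": 0.0}, gamma=0.9))
# {'s0': 'b'}
```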
Markov Decision Processes and Dynamic Programming

Policy iteration

I Once a policy, π, has been improved using vπ to yield a better policy, π′, we can then
compute vπ′ and improve it again to yield an even better π′′.
I We can thus obtain a sequence of monotonically improving policies and value
functions:

π0 −E→ vπ0 −I→ π1 −E→ vπ1 −I→ · · · −I→ π∗ −E→ v∗,

where −E→ denotes a policy evaluation and −I→ denotes a policy improvement.
I Each policy is guaranteed to be a strict improvement over the previous one (unless it is
already optimal).
I Because a finite MDP has only a finite number of policies, this process must converge
to an optimal policy and optimal value function in a finite number of iterations.
I This way of finding an optimal policy is called policy iteration.
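
I Putting the two steps together gives a minimal policy-iteration sketch (tabular dictionaries as before; iterative policy evaluation as the E step, the greedy policy as the I step; the toy MDP is made up):

```python
def evaluate(states, dynamics, policy, gamma, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy (dict: s -> a), in-place sweeps."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new = sum(prob * (r + gamma * v[s_next])
                      for s_next, r, prob in dynamics[(s, policy[s])])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

def policy_iteration(states, actions, dynamics, gamma):
    policy = {s: actions[s][0] for s in states}          # arbitrary initial deterministic policy
    while True:
        v = evaluate(states, dynamics, policy, gamma)    # E: policy evaluation
        def q(s, a):
            return sum(prob * (r + gamma * v[s_next])
                       for s_next, r, prob in dynamics[(s, a)])
        new_policy = {s: max(actions[s], key=lambda a: q(s, a)) for s in states}  # I: improvement
        if new_policy == policy:                         # policy stable => optimal
            return policy, v
        policy = new_policy

# Made-up two-state toy: "risky" pays more now but can drop the agent into a bad state.
states = ["good", "bad"]
actions = {"good": ["safe", "risky"], "bad": ["safe"]}
dynamics = {
    ("good", "safe"):  [("good", 1.0, 1.0)],
    ("good", "risky"): [("good", 2.0, 0.5), ("bad", 2.0, 0.5)],
    ("bad",  "safe"):  [("good", 0.0, 0.1), ("bad", 0.0, 0.9)],
}
print(policy_iteration(states, actions, dynamics, gamma=0.9))
# optimal policy here: always 'safe', with v(good) = 10
```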
Markov Decision Processes and Dynamic Programming

Value iteration
I One drawback to policy iteration is that each of its iterations involves policy evaluation,
which may itself be a protracted iterative computation requiring multiple sweeps
through the state set.
I If policy evaluation is done iteratively, then convergence exactly to vπ occurs only in
the limit.
I Must we wait for exact convergence, or can we stop short of that?
I The policy evaluation step of policy iteration can be truncated in several ways without
losing the convergence guarantees of policy iteration.
I One important special case is when policy evaluation is stopped after just one sweep
(one update of each state).
I This algorithm is called value iteration.
I It can be written as a particularly simple update operation that combines the policy
improvement and truncated policy evaluation steps:
vk+1(s) = maxa E[Rt+1 + γvk(St+1) | St = s, At = a]
= maxa ∑s′,r p(s′, r | s, a) [r + γvk(s′)],

for all s ∈ S.
I For arbitrary v0 , the sequence {vk } can be shown to converge to v∗ under the same
conditions that guarantee the existence of v∗ .
I Another way of understanding value iteration is by reference to the Bellman optimality
equation.
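
I A minimal value-iteration sketch in the same representation; each sweep applies the Bellman optimality update, and a greedy policy is read off at the end:

```python
def value_iteration(states, actions, dynamics, gamma, tol=1e-8):
    """v_{k+1}(s) = max_a sum_{s',r} p(s',r|s,a) [r + gamma v_k(s')]."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new = max(sum(prob * (r + gamma * v[s_next])
                          for s_next, r, prob in dynamics[(s, a)])
                      for a in actions[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            break
    # Extract a greedy (hence optimal, up to the tolerance) policy from v*.
    policy = {s: max(actions[s],
                     key=lambda a: sum(prob * (r + gamma * v[s_next])
                                       for s_next, r, prob in dynamics[(s, a)]))
              for s in states}
    return v, policy

# The same made-up toy MDP as in the policy-iteration sketch.
states = ["good", "bad"]
actions = {"good": ["safe", "risky"], "bad": ["safe"]}
dynamics = {
    ("good", "safe"):  [("good", 1.0, 1.0)],
    ("good", "risky"): [("good", 2.0, 0.5), ("bad", 2.0, 0.5)],
    ("bad",  "safe"):  [("good", 0.0, 0.1), ("bad", 0.0, 0.9)],
}
print(value_iteration(states, actions, dynamics, gamma=0.9))
# v*(good) = 10, optimal action from 'good' is 'safe', as found by policy iteration
```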
Markov Decision Processes and Dynamic Programming

Policy evaluation and policy improvement


Markov Decision Processes and Dynamic Programming
Bibliography

Richard Bellman.
An introduction to the theory of dynamic programming.
Technical report, RAND Corporation, 1953.
Richard Bellman.
Dynamic Programming.
Princeton University Press, NJ, 1957.
Richard Bellman.
Eye of the Hurricane: An Autobiography.
World Scientific, 1984.
Dimitri P. Bertsekas.
Dynamic programming and optimal control, Volume I.
Athena Scientific, Belmont, MA, 2001.
Dimitri P. Bertsekas.
Dynamic programming and optimal control, Volume II.
Athena Scientific, Belmont, MA, 2005.
Gely P. Basharin, Amy N. Langville, and Valeriy A. Naumov.
Numerical Solution of Markov Chains, chapter The Life and Work of A. A. Markov,
pages 1–22.
CRC Press, 1991.
Nicole Bäuerle and Ulrich Rieder.
Markov Decision Processes with Applications to Finance.
Springer, 2011.
Markov Decision Processes and Dynamic Programming
Bibliography

Dimitri P. Bertsekas and Steven E. Shreve.


Stochastic optimal control.
Academic Press, New York, 1978.
Eugene A. Feinberg and Adam Shwartz.
Handbook of Markov decision processes.
Kluwer Academic Publishers, Boston, MA, 2002.
Onesimo Hernández-Lerma and Jean B. Lasserre.
Discrete-time Markov control processes.
Springer-Verlag, New York, 1996.
Ronald A. Howard.
Dynamic programming and Markov processes.
The Technology Press of M.I.T., Cambridge, Mass., 1960.
Walter Mischel and Ebbe B. Ebbesen.
Attention in delay of gratification.
Journal of Personality and Social Psychology, 16(2):329–337, 1970.
Walter Mischel, Ebbe B. Ebbesen, and Antonette Raskoff Zeiss.
Cognitive and attentional mechanisms in delay of gratification.
Journal of Personality and Social Psychology, 21(2):204–218, 1972.
Warren B. Powell.
Approximate dynamic programming.
Wiley-Interscience, Hoboken, NJ, 2007.
Markov Decision Processes and Dynamic Programming
Bibliography

Martin L. Puterman.
Markov decision processes: discrete stochastic dynamic programming.
John Wiley & Sons, New York, 1994.
Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning: An Introduction.
MIT Press, 2nd edition, 2018.
Lloyd Stowell Shapley.
Stochastic games.
Proceedings of the National Academy of Sciences of the United States of America,
39(10):1095–1100, October 1953.
David Silver.
Lectures on reinforcement learning.
url: https://www.davidsilver.uk/teaching/, 2015.
Csaba Szepesvári.
Algorithms for Reinforcement Learning.
Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan &
Claypool, 2010.
Hado van Hasselt.
Lectures on reinforcement learning.
url: https://hadovanhasselt.com/2016/01/12/ucl-course/, 2016.
