RL-Notes Book
Mathukumalli Vidyasagar
This is a draft. To quote Oliver Goldsmith, “There are a hundred faults in this Thing and a hundred things
might be said to prove them beauties.” I hope that, as time passes, the faults will decrease while the
“beauties” will increase. In the meantime, “caveat emptor” is the watchword for the reader.
Feedback of all kinds would be gratefully received at [email protected]
Contents

Preface

1 Introduction
  1.1 Introduction to Reinforcement Learning
  1.2 Some Examples of Reinforcement Learning
  1.3 About These Notes

3 Stochastic Approximation
  3.1 An Overview of Stochastic Approximation
  3.2 Introduction to Martingales
  3.3 Standard Stochastic Approximation
  3.4 Batch Asynchronous Stochastic Approximation
  3.5 Two Time Scale Stochastic Approximation
  3.6 Finite-Time Stochastic Approximation

7 Finite-Time Bounds
  7.1 Finite Time Bounds on Regret
  7.2 Finite Time Bounds for Reinforcement Learning
  7.3 Probably Approximately Correct Markov Decision Processes

8 Background Material
  8.1 Random Variables and Stochastic Processes
  8.2 Markov Processes
  8.3 Contraction Mapping Theorem
  8.4 Some Elements of Lyapunov Stability Theory
Chapter 1
Introduction
Many if not most decision-making problems fall into the lower-left quadrant of “good model, no alter-
ation.” For example, a well-studied control system such as a fighter aircraft has an excellent model
thanks to aerodynamical modelling and/or wind tunnel tests. To be specific, the dynamical model of
a fighter aircraft depends on the so-called “flight condition,” consisting of the altitude and velocity
(measured as its Mach number). While the dependence of the dynamical model on the flight condition
is nonlinear and somewhat complex, usually sufficient modelling studies are carried out, both before
the aircraft is flown and afterwards, that the dynamical model can be assumed to be “known.” In
turn this permits the control system designers to formulate an optimal (or some other form of) control
problem, which can be solved.
Controlling a chemical reactor would be an example from the lower-right quadrant. As a traditional
control system, it can be assumed that the dynamical model of such a reactor does not change as a
consequence of the control strategy adopted. However, due to the complexity of a reactor, it is difficult
to obtain a very accurate model, in contrast with a fighter aircraft for example. In such a case, one can
adopt one of two approaches. The first, which is a traditional approach in control system theory, is to
use a nominal model of the system and to treat the deviations from the nominal model as uncertainties
in the model. The second, which would move the problem from the lower right to the upper right
quadrant, is to attempt to “learn” the unknown dynamical model by probing its response to various
inputs. This approach is suggested in [33, Example 3.1]. A similar statement can be made about
robots, where the geometry determines the form of the dynamical equations describing it, but not the
parameters in the equations; see for example [SHV20]. In this case too, it is possible to “learn” the
dynamics through experimentation. In practice, such an approach is far slower than the traditional
control systems approach of using a nominal model and designing a “robust” controller. However,
“learning control” is a popular area in the world of machine learning.
[Figure: quadrant diagram of decision-making problems; the horizontal axis is “Model Quality” and the vertical axis is whether the action interacts with (alters) the environment.]
A classic example of a problem belonging to the upper-left corner is a Markov Decision Process (MDP).
In this class of problems, at each time instant the actor (or agent) decides on the action to be taken
at that time. In turn the action affects the probabilities of the future evolution of the system. As this
class of problems forms the starting point for RL (upper-right quadrant), the first part of these notes
are addressed to a study of MDPs. Board games without an element of randomness would also belong
to the upper-left quadrant, at least in principle. Games such as tic-tac-toe belong here, because the
rules of the game are clear, and the number of possible games is manageable. In principle, games such
as chess which are “deterministic” (i.e., there is no throwing of dice as in Backgammon for example)
would also belong here. Chess is a two-person game in which, for each board position, it is possible to
assign the likelihood of the three possible outcomes: White wins, Black wins, or it is a draw. However,
due to the enormous number of possibilities, it is often not possible to determine these likelihoods
precisely. It is pointed out explicitly in [30] that, merely because we cannot explicitly compute this
likelihood function, it does not follow that the function does not exist! However, as a practical matter, it is not
a bad idea to treat this likelihood function as being unknown, and to infer it on the basis of experiment
/ experience. Thus, as with chemical reactors, it is not uncommon to move chess-playing, in this case from the
upper-left corner to the upper-right corner.
The upper-right quadrant is the focus of these notes. Any problems where the actions taken by
the learner alter the environment, in ways that are not known to the learner, are referred to as
“reinforcement learning” (RL). Despite the lack of knowledge about the consequences, the learner
has no option but to keep trying out various actions in order to “explore” the environment in which
the unknown system is operating. As time goes on, some amount of knowledge is gained, and it is
therefore possible, at least in principle, to “exploit” the knowledge to improve decision making. The
trade-off between exploration and exploitation is a standard topic in RL. A canonical example is MDPs
where the underlying parameters are not known, and these occupy a major part of these notes. As
mentioned above, often complex problems from the lower-right quadrant (such as chemical reactors),
or the upper-left quadrant (such as Chess), are also treated as RL problems.
Now we will give a general description of the problem. In a RL problem, there are “states” and then
there are “actions.” At each time t, the agent, also sometimes referred to as the learner, measures the state
Xt at time t, which belongs to a state space X . Based on this measurement, the agent chooses a control Ut
from a menu of “actions,” which we denote by U. While it is possible for the state space X and the range
of possible actions U to be infinite, in these notes we simplify our lives by restricting U to be a finite set.
In the same way, it is possible to treat “time” as a continuum, but again we simplify life by treating t as a
discrete variable assuming values in the set of natural numbers N = {0, 1, · · · }. Thus RL requires the agent
to take a set of sequential decisions from a finite menu, at discrete instants of time. When the agent chooses
an action Ut ∈ U, two things happen.
1. The agent receives a “reward” Rt . The reward could either be deterministic, or random, and both
possibilities are permitted in these notes. The reward could be a negative number, suggesting a
penalty instead of a reward, but the phrase “reward” is standard phraseology. In case the reward is
random, it is assumed that the reward lies in a bounded interval in R which is known a priori, in
which case the reward can be translated to belong to an interval [0, M ]. The same transformation can
of course be applied if the reward is deterministic. Note that some authors speak of a “cost” which
is to be minimized, rather than a reward which is to be maximized. The modifications required to
tackle this situation are obvious and we will not comment upon this further. The reward depends not
just on the action chosen Ut , but also the state Xt of the environment at time t. There can be two
sources of uncertainty in the reward. In a Markov Decision Problem (MDP), the reward could be a
random function of Xt and Ut , but with a known probability distribution. In an RL problem, even
the probability distribution of the reward is not necessarily known. However, for technical reasons, it
is assumed that the upper bound M on the reward is known.
2. The action Ut affects the dynamics of the system. A consequence is that the same action taken at a
different time need not lead to the same reward, because in the meantime the “state” of the environment
may have changed.
Over the years, the RL research community has given some “structure” to the above rather vague and
general description. Specifically:
1. The environment is taken as a Markov process (see Section 8.2) in which the state transition matrix
depends on the action taken. So there are |U| state transition matrices, one for each possible action.
2. If Xt denotes the state of the Markov process at time t and Ut is the action taken at time t, then the
reward R is taken to be a function R(Xt , Ut ). This formalism explains why the same action Ut ∈ U
taken at a different time may lead to a different reward, because the state Xt may have changed. It is
also possible for R to be a “random” function of Xt and Ut , so that Xt , Ut only specify the probability
distribution of R(Xt , Ut ). In such a case, even if the same state-action pair (Xt , Ut ) were to occur at
a different time, the resulting reward need not be the same.
3. Yet another variation is that the reward R(Xt , Ut ) (whether random or deterministic) is paid at the
next time instant t + 1. This is the case in some books, notably [35, 33]. In other words, if the Markov
process is in state Xt and the action Ut is applied, the reward is Rt+1 = R(Xt , Ut ). This allows those
authors to consider the situation where the “next state” Xt+1 and “next reward” Rt+1 can share a
joint probability distribution, which depends on Xt and Ut . This is the convention adopted in these
notes. Note that some other authors assume that the reward is immediate, so that Rt = R(Xt , Ut ).
4. There are two distinct types of Markov Decision Processes that are widely studied, namely: Discounted
reward processes and average reward processes. Each of them has rather a distinct behavior from the
other. In discounted reward processes, there is a “discount factor” γ ∈ (0, 1) that is applied to future
rewards. The objective is to maximize the sum of the future rewards, where the reward at time t is
discounted by the factor γ t . Because this future discounted reward is itself random, we maximize the
expected value of this random variable. In the average reward process, the objective is to maximize the
expected value of the average of future rewards over time (both objectives are previewed in symbols after
this list). Because there is no discounting of future
rewards, a reward paid at any time contributes just as much to the average as a reward paid at any
other time.
5. In the simplest version of the problem, the |U| state transition matrices, one for each possible action,
are assumed to be known, as is the reward function. In the case where the reward is a random function
of Xt and Ut , it is assumed that the probability distribution of R(Xt , Ut ) is known. It is also assumed
that the state Xt of the Markov process can be observed by the agent, and can be used to decide the
action Ut . A key concept in RL is that of a “policy” π which is a map from the state space X of
the Markov process to the set of actions U. The objective here is to choose the optimal policy, which
maximizes the expected value of the discounted future reward over all possible policies. This version
of the problem is usually known as a Markov Decision Process (MDP).¹ It is usually viewed as
a precursor to RL. In “proper” RL, neither the Markovian dynamics nor the reward are assumed to
be known, and must be learned on the fly, so to speak. However, knowing the solution approaches to
the MDP is very useful in solving RL problems. It should be pointed out that some authors also apply
the phrase RL to the problem of finding the optimal policy in an MDP where the parameters of the
problem are completely known.
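In symbols, the two objectives in item 4 can be previewed roughly as follows; the precise definitions, under the convention that the reward R_{t+1} is paid at time t + 1, appear in Chapter 2 as (2.2) and (2.12).

\[
V(x_i) \;=\; E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R_{t+1} \;\middle|\; X_0 = x_i\right]
\quad\text{(discounted reward)},
\qquad
c^{*} \;=\; \lim_{T\to\infty} \frac{1}{T}\, E\!\left[\sum_{t=0}^{T} R_{t+1}\right]
\quad\text{(average reward)}.
\]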
A dominant theme in RL is the trade-off between “exploration” and “exploitation.” By definition, the
agent in an RL problem is operating in an unknown environment. However, after some time a reasonably good
model of the environment is available, and a set of actions that is reasonably “rewarding” is also identified.
Should the agent then persist with this set of actions, or occasionally attempt something new, just on the
off-chance that there is a better set of actions available? Let us take a concrete example. A successful chess
player would have evolved, over the years, a set of strategies that work well for him/her. Should the player
persist with the time-proven strategies (exploitation) until someone starts beating him/her, or occasionally
try something completely different just to see what happens (exploration)? The answer is not clear, and is
likely to vary from one domain to another. To illustrate the domain dependence of the solution, suppose
a person moves to a new town and wishes to find the best coffee shop. Then it is probably sufficient to
try each nearby coffee shop just once (or just a few times), because most coffee shops have standardized
protocols for preparing coffee, so that the quality is not likely to vary very much from one visit to the next.
Therefore a person can stick to the coffee shop that is most appealing after a few visits, and there is very
little incentive for further “exploration,” only “exploitation.” In contrast, it can be assumed that the course
of a chess match between two players at the highest level almost invariably leads to a previously unexplored
set of positions. Thus persisting with a stock strategy would invariably lead to suboptimal results, and there
must be greater emphasis on exploration than in the coffee shop example.
There are a couple of methods for quantifying the trade-off between exploration and exploitation. We
begin with the observation that almost any “sensible” learning algorithm would converge to a nearly optimal
policy within a finite number of time steps. Here are two ways to measure how good the algorithm is:
1. Given an accuracy ε, one can measure how many time steps are required for the policy to be within ε
of the optimal policy.² The faster a policy becomes ε-suboptimal, the better it is. Implicit in this
characterization is the assumption that a policy is not penalized for how badly it performs before it
achieves ε-suboptimality – just the time it takes to achieve ε-suboptimality status.
2. The other measure is to see what the reward would have been, had the learner somehow magically
implemented the optimal policy right at the outset, and compare it against the actually achieved
performance. This quantity is called the “regret” and is defined precisely later on. The difference
between minimizing the regret and minimizing the time for achieving ε-optimality is that in the latter,
the performance of the algorithm before achieving ε-optimality is not penalized, whereas it is counted
as a part of the regret.
Clearly, the two criteria are not the same. A learning strategy that converges relatively quickly, but performs
poorly along the way would be rated highly under the first criterion, and poorly under the second criterion.
¹There is a variant where the state Xt cannot be observed directly; instead one observes an output Yt which is either a
deterministic or a random function of Xt. This problem is known as a Partially Observed Markov Decision Process (POMDP).
This problem is not discussed at all in these notes.
²This idea is made precise in subsequent chapters.
to be the mean or expected value of Xi . Of course, the player does not know either µi or φi (·). But the player
is able to “pull the arm” of each bandit and see what happens. This generates (we assume) statistically
independent samples xi1 , · · · , xim of the random variable Xi . Based on the outcome of these experiments,
the player is able to make some estimate of µi for each bandit i. These estimates can be used to determine
future strategies.
Note that if the quantities µ1 , · · · , µm are known, then the problem is simple: The player should always
play the machine that has the highest expected payout. But the challenge is to determine which machine
this is, on the basis of experimentation. As stated above, there are many reasonable algorithms that will
asymptotically (as the number of trials increases towards infinity) determine the arm(s) with the best re-
turn(s). Therefore one way to assess the performance of an algorithm is its “regret,” that is, the return
achieved over the course of learning, subtracted from the optimal return of always choosing the arm with
the highest return. Interestingly, there are theorems that give quite tight upper and lower bounds on the
achievable regret, and these are discussed in Section 7.1.
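A minimal simulation of this notion of regret for a toy bandit is sketched below; the Gaussian payouts, the horizon, and the ε-greedy exploration rule are illustrative choices, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.45, 0.1])   # true (unknown to the player) mean payouts
m, T, eps = len(mu), 10000, 0.1

est = np.zeros(m)                      # running estimates of mu_i
pulls = np.zeros(m)                    # number of times each arm has been pulled
total_reward = 0.0

for t in range(T):
    if rng.random() < eps or pulls.min() == 0:
        arm = rng.integers(m)          # explore: pull a random arm
    else:
        arm = int(np.argmax(est))      # exploit: pull the arm that currently looks best
    x = mu[arm] + rng.standard_normal()        # one noisy sample of X_arm
    pulls[arm] += 1
    est[arm] += (x - est[arm]) / pulls[arm]    # update the sample mean for that arm
    total_reward += x

regret = T * mu.max() - total_reward   # optimal expected return minus the return actually achieved
print(regret)
```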
Table 1.1: Reward R as a function of the player total P and the House total H.

(P, H)              R
P < H              −2
P = H               1
P > H, P ≠ W        2
(P, H) = (W, ∗)     5
(P, H) = (L, ∗)    −5
At each stage of the game, the player has two choices: to roll the die and take a chance on the outcome, or
not to roll it. We can ask: What is the best strategy for a player as a function of the square currently being
occupied? Clearly, it depends on whether the expected return from playing exceeds the expected return
from not playing.
1.2.3 Blackjack
Blackjack is a popular game in gambling casinos around the world. The player plays against the “house.”³
The player and the house draw cards in alternation. The objective is to draw cards such that the total of
the cards is as close to 21 as possible without exceeding it. That is why sometimes Blackjack is also called
“Twenty-One.” The formulation of Blackjack as a problem in RL is discussed in [33, Example 5.1]. At each
time instant, the player has only two possible actions: To ask for one more card, or not. These are known
as “hit” and “stick” respectively. So the set of possible actions U has cardinality two. If the player draws a
card, the outcome is obviously random. Either way, the house also draws a card whose outcome is random.
It is shown in [33, Example 5.1] that the process can be modelled by a Markov process with 200 states, so
that |X | = 200. However, tracing out all possible future evolutions of the game, starting from the current
state, is nearly impossible, and simulations are the only way to analyze the problem.
³Actually, it is possible to have more than one player plus the “house.” However, to simplify the problem, we study only the case of a single player.
We now present a simplified version of Blackjack. Obviously, drawing a card leads to the player’s total
increasing by anywhere from 1 to 11. So if the player’s current total is 10 or less, the player cannot possibly
lose by drawing, and may get closer to winning. So the optimal strategy from such a position is not in doubt.
With that in mind, we replace the drawing of a card by the rolling of a fair four-sided die, with all four
outcomes being equally probable. It does not matter what the “target” total is, because if the target total
is T , then so long as the player’s total is T − 4 or less, the player should roll the die. With this in mind,
we can think of the player’s states as {0, 1, 2, 3, W, L}, with W and L denoting Win and Lose respectively.
If the player’s current total plus the outcome of the die exactly equals 4, the player wins, and if the total
exceeds 4, the player loses. But there is an added complication, which is the total of the “House.” Let us
assume that the House policy is to “stick” whenever it gets within 3 of the designated total. Hence it can
be assumed that the House total is in {1, 2, 3}. Now the object of the game is not merely to get as close to
W without going over, but also to beat the House total. Hence the reward for this game can be specified
as shown in Table 1.1. With this reward structure, at each position, the player has the option of rolling the
die, or not. It turns out that this game is more complex than just the player playing snakes and ladders.
We will analyze this game also in later chapters.
1.2.4 Backgammon
Backgammon is a board game played by two players on a board with (essentially) 24 positions, with each
player throwing two six-sided dice at each turn. Figure 1.3 shows a typical board position. The game
combines chance (the random outcome of throwing the dice) and strategy (what a player does based on the
outcome of the dice).
Unlike in Blackjack, the range of possible actions available to a player at each turn is quite large. This
game is well-suited to a technique called “temporal difference” or TD-learning, which is studied in Section
4.2. Tesauro has published several articles on how to program a computer to play backgammon, including
[36, 37, 38]. See [33, Section 16.1] for a detailed description of the rules of backgammon and the TD
implementation of Tesauro.
Figure 1.4: DeepMind’s AlphaGo program playing Lee Sedol. Source: [50].
program is able to “teach itself” to play different games. A popular description of how AlphaZero goes about
its self-appointed task can be found in [12]. Those interested in the mathematical details can find them in
[31].
One of the intriguing philosophical aspects of AlphaZero is the fact that, as its name implies, AlphaZero
starts from zero, that is, without any prior knowledge. Its superior performance compared to other programs
that make use of prior knowledge has led some AI researchers to claim that “prior knowledge”
is not necessary to achieve top performance. To understand why this is interesting, let us consider the same
question, but changing “chess” to “cooking.” Suppose you wish to become a master chef. Should you first
learn under someone who is already a master chef, and experiment on your own only after you have achieved
some level of proficiency? Or is it better for you to undertake trial and error right from Day One? Most of
us would instinctively answer that learning from a master (i.e., tapping domain knowledge) would be better.
One of the intriguing aspects of the success of AlphaZero is that, when it comes to a computer learning to
play chess, domain knowledge apparently does not confer any advantage. However, at the moment the role
of prior domain knowledge in AI is still a topic for further research. It is not clear whether the success of
AlphaZero is a one-off phenomenon, or a manifestation of a more universally applicable principle.
maps from X into U, which has cardinality |U|^{|X|}. The value function V : X → R (see Chapter 2) can be
viewed as a vector of dimension |X|. However, the action-value function Q : X × U → R can be viewed
as a vector of dimension d := |X| × |U|, which can be a large number even if |X| and |U| are individually
rather small. In such a case, instead of examining all possible vectors of dimension d, we can choose a set of
linearly independent functions ψi(·, ·) : X × U → R for i = 1, · · · , m (basically, just a set of m ≪ d linearly
independent vectors), and then express Q as a linear combination of these m vectors. By restricting Q to be
a linear combination of these vectors, we lose generality but gain tractability of the problem. At some point
we may decide that limiting Q to be a linear combination of some basis functions is too restrictive, and
opt for a nonlinear function, that is, a multi-layer neural network. To over-simplify grossly, that is one way
to think of deep reinforcement learning. The final theme studied in these notes is that of PAC (Probably
Approximately Correct) solutions to Markov Decision Problems. The idea here is that it is not necessary to
strive for absolutely the best possible solution – it is sufficient if the solution is nearly optimal. Moreover,
if these nearly optimal solutions are to be found using probabilistic methods, then it is permissible if these
probabilistic methods work with high probability. To put it in colloquial terms, a PAC algorithm is one that
gives more or less the right answer most of the time. PAC learning is a well-established subject, and [45]
offers a fairly complete (though advanced and rather abstract) treatment of the topic. For reinforcement
learning, it is not clear whether so much generality and abstraction are required. Perhaps there are ways
to present the theory in a simplified setting, and at the same time, explore questions that are relevant to
reinforcement learning. These questions are still being studied, and an attempt will be made to summarize
the current research in PAC-MDP in the last part of the notes.
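As a tiny illustration of the linear-combination idea described above, the sketch below represents Q by m basis functions; the feature map, the dimensions, and the weights are all made-up placeholders, not anything prescribed in these notes.

```python
import numpy as np

# Hypothetical sizes: a 100-state, 4-action problem, so Q has d = 400 entries.
n_states, n_actions, m = 100, 4, 8          # m basis functions, with m much smaller than d

def features(x, u):
    """psi(x, u): an m-dimensional feature vector for the state-action pair (x, u).
    Here it is a fixed pseudo-random embedding, standing in for problem-specific
    basis functions psi_1, ..., psi_m."""
    return np.random.default_rng(x * n_actions + u).standard_normal(m)

theta = np.zeros(m)                         # weights of the linear combination (to be learned)

def Q_hat(x, u):
    """Linear approximation Q(x, u) ~ sum_i theta_i * psi_i(x, u)."""
    return features(x, u) @ theta

# Only the m weights in theta are stored and learned, instead of a table
# with n_states * n_actions entries.
print(Q_hat(3, 1))
```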
Chapter 2
Markov Decision Processes
A widely used mathematical formalism for reinforcement learning problems is Markov Decision Processes
(MDPs) where the dynamics of the Markov process are not known, and must somehow be “inferred” on the
fly. Before tackling that problem, we must first understand MDPs when the dynamics are known. That is
the aim of the present chapter. In the interests of simplicity, the discussion is limited to the situation where
the state and action spaces underlying the MDP are finite sets. MDPs where the underlying state space
and/or action space is countable, or an arbitrary measurable space, are also of interest in some applications.
However, we do not study the more general situations in these notes. The area of MDP is quite well-studied,
and there are several excellent books on the subject. The reader is directed to [28] for a comprehensive
treatment of the subject, which also studies the case of infinite state and action spaces. The book [10]
contains several practical examples of MDPs. The theory of MDPs is also studied in [33] and [35].
Thus the i-th row of A is the conditional probability vector of Xt+1 when Xt = xi . Clearly the row sums of
the matrix A are all equal to one. Therefore the induced norm ‖A‖∞→∞ also equals one.
Up to now there is nothing new beyond the contents of Section 8.2. Now suppose that there is a “reward”
function R : X → R associated with each state. There is no consensus within the community about whether
the reward corresponding to the state Xt is paid at time t, or time t + 1. We choose to follow [28, 33] and
assume that the reward is paid at time t + 1. This allows us to talk about the joint probability distribution
Pr{(Xt+1 , Rt+1 )|Xt }, where both the next state Xt+1 and the reward Rt+1 are random functions of the
current state Xt . In particular, if Rt+1 is a deterministic function of Xt , then we have that
Pr{(Xt+1, Rt+1) = (xj, rl) | Xt = xi} = aij if rl = R(xi), and 0 otherwise,
We often just use “discounted reward” instead of the longer phrase. Note that, because the set X is finite,
the reward function Rt+1 is bounded if it is a deterministic function of Xt . If Rt+1 is a random variable
dependent on Xt, then it is customary to assume that it is bounded. With these assumptions, because γ < 1,
the above summation converges and is well-defined. The quantity V (xi ) is referred to as the value function
associated with xi , and the vector
v = [ V(x1) · · · V(xn) ]ᵀ, (2.3)
is referred to as the value vector. Note that, throughout these notes, we view the value as both a function
V : X → R as well as a vector v ∈ Rn . The relationship between the two is given by (2.3). We shall use
whichever interpretation is more convenient in a given context.
This raises the question as to how the value function and/or value vector is to be determined.
Define the vector r ∈ Rn ,
r := [ r1 · · · rn ]ᵀ, (2.4)
where, if Rt+1 is a random function of Xt , then
ri := E[Rt+1 |Xt = xi ]. (2.5)
If Rt+1 is a deterministic function Rd (Xt+1 ), then
ri = Σ_{j=1}^{n} aij Rd(xj).
[Figure: the toy snakes-and-ladders board, with squares S, 1–9, W, and L.]
Example 2.1.
In the second step we use the fact that the Markov process is stationary. Substituting from (2.9) into (2.8) gives
the recursive relationship (2.14).
Until now it has been assumed that the discount factor γ is strictly less than one. However, if the Markov
process has one or more absorbing states, then (8.40) gives an explicit formula for the average time before
a sample path terminates in an absorbing state. Moreover, this time is always finite. Therefore, in Markov
processes with absorbing states, it is possible to use an undiscounted sum, with the understanding that the
summation ends as soon as Xt is an absorbing state. Equivalently, it is possible to assign a reward of zero
to every absorbing state, so that the infinite sum of rewards contains only a finite number of nonzero terms.
It is left to the reader to state and prove the analog of Theorem 2.1 for this case.
We analyze the toy snakes and ladders game of Example 8.3. As shown therein, the state transition
matrix of this game is given by
S 1 4 5 6 7 8 W L
S 0 0.25 0.25 0.25 0 0.25 0 0 0
1 0 0 0.25 0.50 0 0.25 0 0 0
4 0 0 0 0.25 0.25 0.25 0.25 0 0
5 0 0.25 0 0 0.25 0.25 0.25 0 0
6 0 0.25 0 0 0 0.25 0.25 0.25 0
7 0 0.25 0 0 0 0 0.25 0.25 0.25
8 0 0.25 0 0 0 0 0.25 0.25 0.25
W 0 0 0 0 0 0 0 1 0
L 0 0 0 0 0 0 0 0 1
To define a reward function for this problem, we will set Rt+1 = f (Xt+1 ), where f is defined as follows:
f (W ) = 5, f (L) = −2, f (x) = 0 for all other states. Thus there is no immediate reward. However, there
is an expected reward depending on the state at the next time instant. For example, if X0 = 6, then the
expected value of R1 is 5/4, whereas if X0 = 7 or X0 = 8, then the expected value of R1 is 3/4.
Now let us see how the implicit equation (2.6) can be solved to determine the value vector v. Since the
induced matrix norm ‖A‖∞→∞ = 1 and γ < 1, it follows that the matrix I − γA is nonsingular. Therefore,
for every fixed assignment of rewards to states, there is a unique v that satisfies (2.6). In principle it is
possible to deduce from (2.6) that
v = (I − γA)^{−1} r. (2.10)
The difficulty with this formula however is that in most actual applications of Markov Decision Problems,
the integer n denoting the size of the state space X is quite large. Moreover, inverting a matrix has cubic
complexity in the size of the matrix. Therefore it may not be practicable to invert the matrix I − γA. So
we are forced to look for alternate approaches. A feasible approach is provided by the Contraction Mapping
Theorem (CMT), namely Theorem 8.13. With the contraction mapping theorem in hand, we can apply it
to the problem of computing the value of a discounted Markov reward process.
Theorem 2.2. The map y 7→ T y := r + γAy is monotone and is a contraction with respect to the ℓ∞-norm,
with contraction constant γ.
Proof. The first statement is that if y1 ≤ y2 componentwise (and note that the vectors y1 , y2 need not
consist of only positive components), then T y1 ≤ T y2 . This is obvious from the fact that the matrix A has
only nonnegative components, so that Ay1 ≤ Ay2 . For the second statement, note that, because the matrix
A is row-stochastic, the induced norm of A with respect to ‖·‖∞ is equal to one. Therefore
‖Ty1 − Ty2‖∞ = γ ‖A(y1 − y2)‖∞ ≤ γ ‖A‖∞→∞ ‖y1 − y2‖∞ = γ ‖y1 − y2‖∞.
Therefore one can solve (2.6) by repeated application of the contraction map T . In other words, we can
choose some vector y0 arbitrarily, and then define
yi+1 = r + γAyi .
Then the contraction mapping theorem tells us that yi converges to the value vector v. Moreover, from
(8.49) one can estimate how far the current iteration is from the solution v. Note that the contraction
constant ρ in the statement of the theorem can be taken as the discount factor γ. Define the constant
c := ‖r + γAy0 − y0‖∞,
which measures how far away the initial guess y0 is from satisfying (2.6). Then we have the estimate
‖yi − v‖∞ ≤ [ γ^i / (1 − γ) ] c. (2.11)
In this approach to finding the value function, each iteration has quadratic complexity in n, the size of the
state space. Moreover, (2.11) can be used to decide how many iterations should be run to get an acceptable
estimate for v. This approach to determining v (albeit approximately) is known as “value iteration.” Thus
if we use I iterations, then the complexity of value iteration is O(In2 ) as opposed to O(n3 ) for using (2.10).
Hence the value iteration approach is preferable if I ≪ n. Note that the faster future rewards are discounted
(i.e., the smaller γ is), the faster the iterations will converge. Moreover, if the reward vector r is nonnegative,
and we choose y0 = r, then the value iterations result in a sequence of estimates {yi } that is componentwise
monotonically nondecreasing. (See Problem 2.1.)
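As a concrete illustration, the following sketch computes the value vector of the toy snakes-and-ladders reward process above, both by solving (2.10) directly and by value iteration with the bound (2.11). The discount factor γ = 0.9 is an arbitrary choice made for the illustration.

```python
import numpy as np

# State ordering S, 1, 4, 5, 6, 7, 8, W, L, as in the table above.
A = np.array([
    [0, 0.25, 0.25, 0.25, 0,    0.25, 0,    0,    0   ],  # S
    [0, 0,    0.25, 0.50, 0,    0.25, 0,    0,    0   ],  # 1
    [0, 0,    0,    0.25, 0.25, 0.25, 0.25, 0,    0   ],  # 4
    [0, 0.25, 0,    0,    0.25, 0.25, 0.25, 0,    0   ],  # 5
    [0, 0.25, 0,    0,    0,    0.25, 0.25, 0.25, 0   ],  # 6
    [0, 0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],  # 7
    [0, 0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],  # 8
    [0, 0,    0,    0,    0,    0,    0,    1,    0   ],  # W
    [0, 0,    0,    0,    0,    0,    0,    0,    1   ],  # L
])
f = np.array([0, 0, 0, 0, 0, 0, 0, 5, -2])    # f(W) = 5, f(L) = -2, zero elsewhere
r = A @ f                                     # r_i = sum_j a_ij f(x_j)
print(r[4], r[5])                             # expected R_1 from states 6 and 7: 1.25 and 0.75

gamma = 0.9                                   # illustrative discount factor
v_direct = np.linalg.solve(np.eye(9) - gamma * A, r)   # v = (I - gamma A)^{-1} r, as in (2.10)

# Value iteration: y_{i+1} = r + gamma * A * y_i, starting from y_0 = 0.
y = np.zeros(9)
c = np.linalg.norm(r + gamma * A @ y - y, ord=np.inf)  # the constant c in (2.11)
for i in range(1, 1001):
    y = r + gamma * A @ y
    if gamma**i / (1 - gamma) * c < 1e-8:     # a priori bound (2.11) on ||y_i - v||_inf
        break

print(np.max(np.abs(y - v_direct)))           # actual error, no larger than the bound
```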
where φ ∈ S(X ) is a probability distribution on X . Compared with the definition (2.2) of the discounted
reward, two points of contrast would strike us at once.
1. In (2.2), the existence of the sum is not in question, because γ < 1. However, in the present instance,
there is no a priori reason to assume that the limit in (2.12) exists.
2. The value function V in (2.2) is associated with an initial state xi . It is implicit in the definition that
V (xi ) need not equal V (xj ) if xi 6= xj . In (2.12), the initial state is replaced by an initial distribution
φ, which is more general. However, we write c∗ , instead of c∗ (φ), suggesting that the limit, if it exists,
is independent of φ.
Theorem 2.3 presents a simple sufficient condition to address both of the above observations.
Theorem 2.3. Suppose A is irreducible, and let µ denote its unique stationary distribution. Then the limit in (2.12) exists for every initial distribution φ and is independent of φ.
Therefore
c∗ = φ [ lim_{T→∞} (1/T) Σ_{t=0}^{T} A^t ] r = φ 1n µ r = µ r = E[R, µ], (2.14)
because φ1n = 1. This is the desired result.
Next we introduce an important concept known variously as the bias or the transient reward. For a
discussion (albeit with “reward” replaced by “cost”), see [28, Section 8.2.3] or [1, Section 4.1].
Definition 2.1. Suppose A is primitive,² and define c∗ as in (2.14). For each index i, the transient reward
Ji∗ ∈ R is defined as
Ji∗ = Σ_{t=0}^{∞} { E[R(Xt) | X0 = xi] − c∗ }. (2.15)
A priori it is not clear why the sum in (2.15) is well-defined, because there is no averaging over time. It
is now shown that the transient reward is indeed well-defined, and several explicit expressions are given for
it.
Theorem 2.4. Suppose A is primitive, and let µ denote its stationary distribution. Define M := 1n µ ∈
[0, 1]^{n×n}, and J∗ ∈ Rn as the vector [Ji∗]. Then the following statements are true:
1. The vector J∗ is well-defined.
2. An explicit expression for J∗ is given by
J∗ = (I − A + M)^{−1} (I − M) r.
3. J∗ is the unique solution of the equation
J = r − c∗ 1n + AJ (2.17)
that satisfies
µJ = 0. (2.18)
²This is equivalent to assuming that A is irreducible and aperiodic; see Theorem 8.7.
Proof. Note that µ, 1n are row and column eigenvectors of A corresponding to the eigenvalue λ = 1, and
that all other eigenvalues of A have magnitude less than one. So if we define
A2 = A − 1n µ = A − M,
then the spectrum of A2 is the same as that of A, except that the eigenvalue at 1 is replaced by 0. In
particular, ρ(A2 ) < 1, and as a consequence
Σ_{t=0}^{∞} A2^t = (I − A2)^{−1} = (I − A + M)^{−1}. (2.19)
To prove Statements 1 and 2, let ei denote the i-th elementary basis vector. Then X0 = xi is equivalent
to X0 ∼ eiᵀ, so that Xt ∼ eiᵀ A^t, and
Ji∗ = Σ_{t=0}^{∞} [ eiᵀ A^t r − c∗ ],
or, in vector form,
J∗ = Σ_{t=0}^{∞} (A^t r − c∗ 1n) = Σ_{t=0}^{∞} A^t (r − c∗ 1n) = (I − A + M)^{−1} (I − M) r. (2.21)
Here we use the fact that c∗ 1n = c∗ At 1n for all t. This establishes Statements 1 and 2.
Now we come to Statement 3. From (2.15), we get
Ji∗ = Σ_{t=0}^{∞} { E[R(Xt) | X0 = xi] − c∗ }
= ri − c∗ + Σ_{t=1}^{∞} { E[R(Xt) | X0 = xi] − c∗ }
= ri − c∗ + Σ_{j=1}^{n} aij Σ_{t=1}^{∞} { E[R(Xt) | X1 = xj] − c∗ }
= ri − c∗ + Σ_{j=1}^{n} aij Jj∗,
which is just (2.17) written out in component form. Hence J∗ is a particular solution of (2.17).
Finally, observe that if J is another solution of (2.17), then (J∗ − J) = A(J∗ − J), which implies that
J = J∗ + α1n for some constant α. Thus {J∗ + α1n : α ∈ R} is the set of all solutions to (2.17). Now, since
µ(r − c∗ 1n ) = 0, it follows that
µJ∗ = µ Σ_{t=0}^{∞} A^t (r − c∗ 1n) = Σ_{t=0}^{∞} µ (r − c∗ 1n) = 0.
Moreover, if µ(J∗ +α1n ) = 0, then α = 0. Hence J∗ is the unique solution of (2.17) that satisfies µJ = 0.
It is possible to give an alternate proof of Statement 3, and we do so now. Suppose J∗ is given by (2.21).
Observe that
µ(I − A + M) = µM = µ, or equivalently µ(I − A + M)^{−1} = µ.
Also, µ(I − M) = 0. Therefore µJ∗ = µ(I − A + M)^{−1}(I − M)r = µ(I − M)r = 0, and consequently
MJ∗ = 1n µJ∗ = 0. Premultiplying (2.21) by (I − A + M) and using Mr = c∗1n then gives
J∗ − AJ∗ = r − c∗ 1n,
which is precisely (2.17).
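As a quick numerical check of Theorem 2.4, the sketch below uses an arbitrary primitive three-state chain; the matrix and rewards are made up for the illustration and are not taken from the text.

```python
import numpy as np

# An arbitrary primitive (irreducible, aperiodic) transition matrix and reward vector.
A = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
r = np.array([1.0, -2.0, 3.0])
n = len(r)

# Stationary distribution mu: left eigenvector of A for eigenvalue 1, normalized.
w, V = np.linalg.eig(A.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()

c_star = mu @ r                      # average reward c* = mu r, as in (2.14)
M = np.outer(np.ones(n), mu)         # M = 1_n mu

# Explicit expression for the transient reward J* = (I - A + M)^{-1} (I - M) r.
J = np.linalg.solve(np.eye(n) - A + M, (np.eye(n) - M) @ r)

print(np.allclose(J - A @ J, r - c_star * np.ones(n)))   # J* satisfies (2.17)
print(np.isclose(mu @ J, 0.0))                           # and the normalization (2.18)
```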
Obviously, for each fixed uk ∈ U, the corresponding state transition matrix Auk is row-stochastic. In addition,
there is also a “reward” function R : X × U → R. Note that in a Markov reward process, the reward depends
only on the current state, whereas in a Markov decision process, the reward depends on both the current
state as well as the action taken. As in Markov reward processes studied in Section 2.1, it is possible to
permit R to be a random function of Xt and Ut as opposed to a deterministic function. Moreover, to be
consistent with the earlier convention, it is assumed that the reward R(Xt , Ut ) is paid at the next time
instant t + 1 and not at time t.
The most important aspect of an MDP is the concept of a “policy,” which is just a systematic way
of choosing Ut given Xt . One can make a distinction between deterministic and probabilistic policies. A
deterministic policy is just a map from X to U. A probabilistic policy is a map from X to the set of
probability distributions on U, denoted by S(U). Let Πd, Πp denote respectively the set of deterministic,
and the set of probabilistic, policies. Clearly the number of deterministic policies is |U|^{|X|}, while Πp is
uncountable. Observe that a policy π ∈ Πd can be represented by a |X| × |U| matrix P, where each row of
P contains a single one and the rest are zeros. Thus in row i, the one is in column π(xi) and the rest are
zeros. If π ∈ Πp , then P need not be binary, but P must have only nonnegative elements, and the sum of
each row must equal one.
Now we make an important observation. Whether a policy π is deterministic or probabilistic, the resulting
stochastic process {Xt } is Markov with the state transition matrix determined as follows: If π ∈ Πd , then
Pr{Xt+1 = xj | Xt = xi, π} = a^{π(xi)}_{ij}. (2.23)
If π ∈ Πp and
π(xi) = [ φi1 · · · φim ], (2.24)
where m = |U|, then
Pr{Xt+1 = xj | Xt = xi, π} = Σ_{k=1}^{m} φik a^{u_k}_{ij}. (2.25)
Equation (2.25) contains (2.23) as a special case, by setting φij = 1 if π(xi) = uj, and zero otherwise. In a
similar manner, for every policy π, the reward function R : X × U → R can be converted into a reward map
Rπ : X → R, as follows: If π ∈ Πd, then
Rπ(xi) = R(xi, π(xi)), (2.26)
whereas if π ∈ Πp, then
Rπ(xi) = Σ_{k=1}^{m} φik R(xi, uk). (2.27)
Despite its simplicity, the above observation is very useful, because it states that a Markov process with
an action variable remains a Markov process under any choice of policy, deterministic or probabilistic. The
reason why this is so is that the policy has no memory. To illustrate, suppose π : X → U is a deterministic
policy, with (for example) π(xi) = uk. Then the action variable Ut equals uk whenever the state Xt
equals xi; the time t at which this happens is immaterial. Similar remarks apply to the reward function.
For this reason, we define Aπ to be the state transition matrix that results from applying the policy π, and
Rπ to be the reward function that results from applying the policy π.
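The construction of Aπ and Rπ from (2.25) and (2.27) is easy to express in code. In the sketch below, the two-action, three-state transition matrices, rewards, and policy are made-up numbers used only for illustration.

```python
import numpy as np

# Hypothetical MDP data: one row-stochastic matrix per action, and a reward table.
A_u = {                                     # A_u[k][i, j] = a_{ij}^{u_k}
    0: np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.1],
                 [0.0, 0.3, 0.7]]),
    1: np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]]),
}
R = np.array([[1.0, 0.0],                   # R[i, k] = R(x_i, u_k)
              [0.0, 2.0],
              [3.0, 1.0]])

# A probabilistic policy: row i is the distribution phi_i over actions in state x_i.
P = np.array([[1.0, 0.0],
              [0.3, 0.7],
              [0.0, 1.0]])

n, m = R.shape
A_pi = sum(P[:, [k]] * A_u[k] for k in range(m))    # (2.25): weight row i of A^{u_k} by phi_{ik}
R_pi = np.sum(P * R, axis=1)                        # (2.27): R_pi(x_i) = sum_k phi_{ik} R(x_i, u_k)

print(A_pi.sum(axis=1))   # each row of A_pi still sums to one
print(R_pi)
```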
For a MDP, one can pose three questions:
1. Policy evaluation: For a given policy π, define Vπ (xi ) to be the “value” associated with the policy
π and initial state xi , that is, the expected discounted future reward with X0 = xi . How can Vπ (xi )
be computed for each xi ∈ X ?
2. Optimal value determination: For each initial state xi, define V∗(xi) := max_{π∈Πp} Vπ(xi) (2.28)
to be the optimal value over all policies for that initial state. How can V∗(xi) be computed? Note
that in (2.28), the optimum is taken over all probabilistic policies. It is shown in Theorem 2.9 in the
sequel that the optimum can actually be achieved by a deterministic policy.
3. Optimal policy determination: How can the optimal policy map π∗ be determined? Note that we can restrict to π ∈ Πd because, as
stated above, the maximum over π ∈ Πp is not any larger. Moreover, it is again shown in Theorem 2.9
that there exists one common optimal policy for all initial states.
Policy Evaluation:
Suppose a policy π ∈ Πd is specified. Then the corresponding state transition matrix and reward are given
by (2.23) and (2.26) respectively. Then, exactly as in (2.6), the value vector vπ satisfies
vπ = rπ + γAπ vπ. (2.32)
As before, it is inadvisable to compute vπ via vπ = (I − γAπ )−1 rπ . Instead, one should use value iteration
to solve (2.32).
For future use we introduce another function Q : X × U → R, known as the action-value function,
which is defined as follows:
"∞ #
X
t
Qπ (xi , uk ) := R(xi , uk ) + Eπ γ Rπ (Xt )|X0 = xi , U0 = uk . (2.33)
t=1
Apparently this function was first defined in [47]. Note that Qπ is defined only for deterministic policies.
In principle it is possible to define it for probabilistic policies, but this is not commonly done. In the above
definition, the expectation Eπ is with respect to the evolution of the state Xt under the policy π. When the
reward is a random function of Xt and Ut , then inside the summation we would need to take the expected
value of R(Xt , π(Xt )) for a deterministic policy.
The way in which a MDP is set up is that at time t, the Markov process reaches a state Xt , based on the
previous state Xt−1 and the state transition matrix Aπ corresponding to the policy π. Once Xt is known,
the policy π determines the action Ut = π(Xt ), and then the reward Rπ (Xt ) = R(Xt , π(Xt )) is generated at
time t+1. In particular, when defining the value function Vπ (xi ) corresponding to a policy π, we start off the
MDP in the initial state X0 = xi , and choose the action U0 = π(xi ). However, in defining the action-value
function Q, we do not feel compelled to set U0 = π(X0 ) = π(xi ), and can choose an arbitrary action uk ∈ U.
From t = 1 onwards however, the action Ut is chosen as Ut = π(Xt ). This seemingly small change leads to
some simplifications. Specifically, it will be seen in later chapters that it is often easier to approximate (or
to “learn”) the action-value function than it is to approximate the value function.
Just as we can interpret V : X → R as a |X |-dimensional vector, we can interpret Q : X × U → R as an
|X | · |U|-dimensional vector. Consequently the Q-vector has higher dimension than the value vector.
Proof. Observe that at time t = 0, the state transition matrix is Auk . So, given that X0 = xi and U0 = uk ,
the next state X1 has the distribution
X1 ∼ [ a^{u_k}_{ij}, j = 1, · · · , n ].
Moreover, U1 = π(X1 ) because the policy π is implemented from time t = 1 onwards. Therefore
Qπ(xi, uk) = R(xi, uk) + Eπ [ Σ_{j=1}^{n} a^{u_k}_{ij} ( γ R(xj, π(xj)) + Σ_{t=2}^{∞} γ^t Rπ(Xt) ) | X1 = xj, U1 = π(xj) ]
= R(xi, uk) + Eπ [ γ Σ_{j=1}^{n} a^{u_k}_{ij} ( R(xj, π(xj)) + Σ_{t=1}^{∞} γ^t Rπ(Xt) ) | X1 = xj, U1 = π(xj) ]
= R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} Qπ(xj, π(xj)).
This is the desired conclusion. Note that in the above summation, we have written Rπ(Xt) for the reward to be
paid at time t + 1.
This is the same as (2.22) written out componentwise. We know that (2.22) has a unique solution. This
shows that (2.35) holds.
Define the map Tπ : Rn → Rn by
Tπ v := rπ + γAπ v. (2.37)
Then it follows from Theorem 2.2 that Tπ is monotone and is a contraction with respect to the ℓ∞-norm,
with contraction constant γ.
Now we introduce one of the key ideas in Markov Decision Processes. Define the Bellman iteration
map B : Rn → Rn via
(Bv)i := max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} vj ]. (2.38)
Theorem 2.7. The map B is monotone and a contraction with respect to the ℓ∞-norm.
Proof. The theorem has two claims: The first claim is that the map B is monotone, meaning that if v1 ≤ v2
componentwise, then B(v1) ≤ B(v2) componentwise. The second claim is that B is a contraction with
respect to the ℓ∞-norm. Note that, unlike the value iteration map Tπ defined in (2.37), the map B is not
affine.
To establish the first claim, suppose that v1 ≤ v2 componentwise. Then, for each index i,
(B(v1))i = max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} v1j ] ≤ max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} v2j ] = (B(v2))i.
Here we use the fact that auijk ≥ 0 for all i, j. This establishes that B is monotone, which is the first claim.
The proof of the second claim is a bit more elaborate. We begin by establishing that
max_{uk∈U} g(xi, uk) − max_{uk∈U} h(xi, uk) ≤ max_{uk∈U} |g(xi, uk) − h(xi, uk)|, ∀xi ∈ X. (2.39)
To prove (2.39), we begin with the obvious observation that, if α, β are real numbers, then
α − β ≤ |α − β| =⇒ α ≤ |α − β| + β.
Note that this inequality holds irrespective of the signs of α and β. Fix xi ∈ X , uk ∈ U and apply the above
inequality with α = g(xi, uk), β = h(xi, uk). This gives
g(xi, uk) ≤ |g(xi, uk) − h(xi, uk)| + h(xi, uk) ≤ max_{uk∈U} |g(xi, uk) − h(xi, uk)| + max_{uk∈U} h(xi, uk).
Taking the maximum over uk ∈ U on the left side and rearranging gives
max_{uk∈U} g(xi, uk) − max_{uk∈U} h(xi, uk) ≤ max_{uk∈U} |g(xi, uk) − h(xi, uk)|.
Now apply (2.39) with g(xi, uk) := R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} v1j and h(xi, uk) := R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} v2j. This gives
(B(v1))i − (B(v2))i ≤ max_{uk∈U} | R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} v1j − R(xi, uk) − γ Σ_{j=1}^{n} a^{u_k}_{ij} v2j |
= max_{uk∈U} γ | Σ_{j=1}^{n} a^{u_k}_{ij} (v1j − v2j) | ≤ max_{uk∈U} γ Σ_{j=1}^{n} a^{u_k}_{ij} |v1j − v2j|
≤ γ ‖v1 − v2‖∞. (2.40)
Because the inequality (2.40) holds for every index i, it follows that
‖B(v1) − B(v2)‖∞ ≤ γ ‖v1 − v2‖∞.
This shows that the map B is a contraction with respect to the ℓ∞-norm, which is the second claim.
Theorem 2.8. Define v̄ ∈ Rn to be the unique fixed point of B, and define v∗ ∈ Rn to equal [V ∗ (xi ), xi ∈ X ],
where V ∗ (xi ) is defined in (2.28). Then v̄ = v∗ .
Proof. By definition, for every π ∈ Πd , we have that
[Tπ(v̄)]i = R(xi, π(xi)) + γ Σ_{j=1}^{n} a^{π(xi)}_{ij} V̄j
≤ max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} V̄j ] = (B(v̄))i = V̄i. (2.41)
Similarly, if π ∈ Πp is represented as in (2.24), then
[Tπ(v̄)]i = Σ_{l=1}^{m} φil [ R(xi, ul) + γ Σ_{j=1}^{n} a^{u_l}_{ij} V̄j ]
≤ max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} V̄j ] = (B(v̄))i = V̄i. (2.42)
Because (2.41) and (2.42) hold for every index i, it follows that
Tπ(v̄) ≤ v̄.
Since Tπ is monotone, applying Tπ repeatedly gives Tπ^l(v̄) ≤ v̄ for every integer l ≥ 1. Now let l → ∞. Then
the left side approaches the fixed point of the map Tπ, which is vπ. Thus we conclude
that, for all policies in Πd or Πp , we have that
vπ ≤ v̄. (2.43)
Next, define a policy π̄ ∈ Πd by setting, for each index i,
π̄(xi) := arg max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} V̄j ]. (2.45)
In case of ties, choose any deterministic tie-breaking rule, e.g., choose the uk with the lowest index. Then,
since the right side of (2.45) equals (B(v̄))i = V̄i , we conclude that
V̄i = R(xi, π̄(xi)) + γ Σ_{j=1}^{n} a^{π̄(xi)}_{ij} V̄j, ∀i. (2.46)
Hence Tπ̄ (v̄) = v̄. But since Tπ̄ is a contraction, it has a unique fixed point, which shows that V̄i = Vπ̄ (xi )
for all i. Therefore, for each index i, we have that
Vπ̄(xi) = V̄i ≥ Vπ(xi) for every policy π ∈ Πp,
where the last inequality is (2.43). Hence V̄i = V∗(xi) for every i, that is, v̄ = v∗.
By replacing v̄ in Theorem 2.8 by v∗ (which equals v̄), we derive the following fundamental result for
Markov Decision Processes.
Theorem 2.9. Define the optimal value function V ∗ (xi ) as in (2.28). Then
1. The optimal value function V ∗ : X → R is the unique solution of the following recursive relationship,
known as the Bellman optimality equation:
V∗(xi) = max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} V∗(xj) ]. (2.47)
2. The maximum in (2.28) is achieved, and one common deterministic policy is optimal for every initial
state. Specifically, the policy obtained by restating (2.45) with V̄j replaced by Vj∗, namely
π∗(xi) = arg max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} Vj∗ ], (2.49)
is optimal.
Note that Item 2 of the theorem states that enlarging the policy space to include probabilistic policies
does not increase the maximum value. Also, there is one common policy that achieves the optimal value for
every state xi . Perhaps neither of these statements is obvious on the surface.
In analogy with the optimal value function, we can also define an optimal action-value function, namely
Q∗(xi, uk) := R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} V∗(xj). (2.50)

Theorem 2.10. The function Q∗ satisfies the recursion
Q∗(xi, uk) = R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} max_{wl∈U} Q∗(xj, wl), (2.51)
and the optimal value function satisfies
V∗(xi) = max_{uk∈U} Q∗(xi, uk). (2.52)
Moreover, the policy
π∗(xi) = arg max_{uk∈U} Q∗(xi, uk) (2.53)
is optimal.
Proof. Since Q∗ (·, ·) is defined by (2.50), it follows that
max_{uk∈U} Q∗(xi, uk) = max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} V∗(xj) ] = V∗(xi),
by (2.47). This establishes (2.52) and (2.53). Substituting from (2.52) into (2.50) gives (2.51).
Now we define an iteration on action-value functions that is analogous to (2.38) for value functions. As with
the value function, the action-value function can either be viewed as a map Q : X × U → R, or as a vector
in R^{|X|·|U|}. Define F : R^{|X|·|U|} → R^{|X|·|U|} by
[F(Q)](xi, uk) := R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} max_{wl∈U} Q(xj, wl). (2.54)
Theorem 2.11. The map F is monotone and is a contraction. Therefore, for all Q0 : X × U → R, the
sequence of iterations {F^t(Q0)} converges to Q∗ as t → ∞.
Proof. The proof is very similar to that of Theorem 2.9. Given a map Q : X × U → R, define the associated
map M(Q) : X → R by
[M(Q)](xi) := max_{uk∈U} Q(xi, uk).
Also, if Q, Q′ : X × U → R, let Q ≤ Q′ denote that Q(xi, uk) ≤ Q′(xi, uk) for all xi, uk. Then it is clear that
if Q ≤ Q′, then M(Q) ≤ M(Q′). Because a^{u_k}_{ij} is always nonnegative, it follows that the map F is monotone.
Next, as in the proof of Theorem 2.7, for arbitrary maps Q1, Q2 : X × U → R, we have
|[M(Q1)](xi) − [M(Q2)](xi)| ≤ max_{uk∈U} |Q1(xi, uk) − Q2(xi, uk)|, ∀xi ∈ X.
As a result,
‖M(Q1) − M(Q2)‖∞ ≤ ‖Q1 − Q2‖∞.
Substituting this into (2.55) gives
‖F(Q1) − F(Q2)‖∞ ≤ γ ‖Q1 − Q2‖∞,
which shows that F is a contraction with respect to the ℓ∞-norm, with contraction constant γ.
If we were to rewrite (2.47) and (2.51) in terms of expected values, the advantages of the Q-function
would become apparent. We can rewrite (2.47) as
V∗(Xt) = max_{Ut∈U} { R(Xt, Ut) + γ E[ V∗(Xt+1) ] }, (2.57)
and (2.51) as
Q∗(Xt, Ut) = R(Xt, Ut) + γ E[ max_{Ut+1∈U} Q∗(Xt+1, Ut+1) ]. (2.58)
Thus in the Bellman formulation and iteration, the maximization occurs outside the expectation, whereas
with the Q-formulation and F-iteration, the maximization occurs inside the expectation. As shown in later
chapters, learning Q∗ is easier than learning V∗.
The idea of learning Q∗ instead of learning V ∗ is introduced in [47].
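A minimal sketch of iterating the map F of (2.54) is given below; the two-action, three-state MDP data and the discount factor are made-up numbers used only for illustration.

```python
import numpy as np

gamma = 0.9
A_u = np.array([                           # A_u[k, i, j] = a_{ij}^{u_k}
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],
])
R = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])   # R[i, k] = R(x_i, u_k)
m, n, _ = A_u.shape

def F(Q):
    """One application of (2.54): [F(Q)](x_i,u_k) = R + gamma * sum_j a^{u_k}_{ij} max_w Q(x_j, w)."""
    v = Q.max(axis=1)                      # M(Q): elementwise maximum over actions
    return R + gamma * np.stack([A_u[k] @ v for k in range(m)], axis=1)

Q = np.zeros((n, m))
for _ in range(500):                       # F is a gamma-contraction, so this converges to Q*
    Q = F(Q)

V_star = Q.max(axis=1)                     # V*(x_i) = max_k Q*(x_i, u_k)
pi_star = Q.argmax(axis=1)                 # the greedy policy with respect to Q* is optimal
print(V_star, pi_star)
```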
Theorem 2.12. Suppose πk ∈ Πd is a given policy with value vector vπk, and define the updated policy πk+1 by
πk+1(xi) := arg max_{uk∈U} [ R(xi, uk) + γ Σ_{j=1}^{n} a^{u_k}_{ij} Vπk(xj) ]. (2.59)
Then
1. vπk+1 ≥ vπk , where the dominance is componentwise.
2. {vπk } ↑ v∗ as k → ∞.
The proof is quite straightforward. The key step is to verify that if we define the updated policy πk+1
according to (2.59), then Tπk+1 vπk = Bvπk ≥ Tπk vπk = vπk; the monotonicity of Tπk+1 then yields vπk+1 ≥ vπk.
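A corresponding sketch of the policy iteration scheme is given below, on the same kind of made-up data as before; each pass evaluates the current policy exactly via (2.32) and then improves it greedily as in (2.59).

```python
import numpy as np

gamma = 0.9
A_u = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],
])
R = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
m, n, _ = A_u.shape

pi = np.zeros(n, dtype=int)                # start from an arbitrary deterministic policy
for _ in range(50):
    # Policy evaluation: v_pi = (I - gamma * A_pi)^{-1} r_pi, as in (2.32).
    A_pi = A_u[pi, np.arange(n), :]        # row i of A^{pi(x_i)}
    r_pi = R[np.arange(n), pi]
    v = np.linalg.solve(np.eye(n) - gamma * A_pi, r_pi)

    # Policy improvement: choose the greedy action with respect to v_pi.
    Q = R + gamma * np.einsum('kij,j->ik', A_u, v)   # Q[i, k] = R(x_i,u_k) + gamma * sum_j a^{u_k}_{ij} v_j
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):         # policy is stable, hence optimal
        break
    pi = new_pi

print(pi, v)
```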
Example 2.2. Now we return to the game of Blackjack. A detailed discussion of the game is given in [33,
Example 5.1]. To describe the original game briefly, it is played between a player and the “House.” (It is
possible to have more than one player playing against the House, but we don’t study that problem in the
interests of simplicity.) At each turn, the player and the House have the option of drawing a card (“hit”) or
not drawing (“stick”). Each card is counted as its face value, with picture cards counted as 10. An ace can
count as either 1 or 11, at the player’s preference. The objective of the player is to exceed the total of the
House without going over 21.
From the description, it is obvious that if the player’s current total is eleven or less, then the best strategy
is to hit, because there is no chance of losing on the next draw. Hence the issue of what to do arises only
when the player’s total reaches 12 or higher. Indeed, if the target were to be changed to some number N ,
then it is clear that if the player’s total is N − 10 or less, then the correct solution is to hit. It can also be
assumed that the probability of any particular card being the next card drawn is the same, no matter what
cards have been drawn until then (infinitely many card decks being used). In the original Blackjack game,
only one card of the House is visible. In what follows, for the purposes of illustration, we eliminate all of
these complications, and introduce a simplified game.
Suppose that, instead of drawing a card, the player rolls a fair four-sided die. Since there are only four
possible outcomes, irrespective of what the target total might be, it is reasonable to suppose that the state Pt
of the player lies in the set {0, 1, 2, 3, W, L}, with 0 being the start state. It can be assumed that the current
state is in {0, 1, 2, 3}, while W and L are terminal states. To simplify the problem further, suppose that the
House adopts the strategy that it does not roll the die further once its state is in {1, 2, 3} (i.e., it does not try
for a win from any of these states). Therefore the state Ht of the house lies in the set {1, 2, 3}. The overall
state (Pt , Ht ) lies in the Cartesian product {0, 1, 2, 3, W, L} × {1, 2, 3}. Out of these, there are twelve possible
current states, namely {0, 1, 2, 3} × {1, 2, 3} where the first number is the state of the player and the second
is the state of the House. If the player rolls the die, the possible next states are {1, 2, 3, W, L} × {1, 2, 3},
or a total of fifteen states. In this game, as in the snakes and ladders game, the reward is random and is a
function of the next state.
As a part of the problem statement, we need to specify the dynamics of the Markov process. For the
House, which does not play, the state transition matrix is the 3 × 3 identity matrix, which ensures that
Ht+1 = Ht . As for the player’s state Pt , if the action is to “stick,” then the state transition matrix AS is
the 6 × 6 identity matrix. If the action is to “hit,” then the state transition matrix AH is given by (rows
and columns ordered 0, 1, 2, 3, W, L):

      0     1     2     3     W     L
0     0     0.25  0.25  0.25  0.25  0
1     0     0     0.25  0.25  0.25  0.25
2     0     0     0     0.25  0.25  0.50
3     0     0     0     0     0.25  0.75
W     0     0     0     0     1     0
L     0     0     0     0     0     1
To complete the problem formulation, we need to specify the reward. Unlike the state transition matrix
above, which is based on nothing more than the assumption that all four outcomes of the die are equally
likely, the reward is to some extent arbitrary. Let us assign the following rewards:
Condition    Reward
Pt > Ht         2
Pt = Ht         1
Pt < Ht         0
Pt = W          5
Pt = L         −5
With this problem specification, we should strive to find an optimal policy. Note that the action space
U = {H, S} (for “hit” or “stick”) has cardinality two. Hence the number of policies is 2^12 = 4096, which is
already large enough that simply enumerating all possibilities is not practicable. Hence some kind of policy
iteration is the only way.
For evaluating a specific policy, it can be noted that the duration of the game cannot exceed four time
steps. This is because the player’s position has to advance by at least one at each time step. So discount
factors very close to 1 do not make sense. The discount γ should be chosen much smaller, say 0.5.
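Before carrying out any policy iteration, the model can be sanity-checked by simulation. The sketch below estimates, by Monte Carlo, the discounted reward of the simple threshold policies “hit while the player’s total is at most k,” for each House total. The episode-termination convention and the number of episodes are assumptions made only for this rough check, not a substitute for the policy iteration suggested above.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
W, L = "W", "L"

def step_reward(p, h):
    """Reward f(P_{t+1}, H) from the table above: compare totals, with W/L worth +5/-5."""
    if p == W:
        return 5
    if p == L:
        return -5
    return 2 if p > h else (1 if p == h else 0)

def episode(threshold, h):
    """Play one game: roll a fair 4-sided die while the player's total is <= threshold, then stop.
    Rewards are discounted by gamma per step; the episode ends when the player sticks or reaches
    W or L.  (One simple reading of the rules, used only for a rough check.)"""
    p, total, t = 0, 0.0, 0
    while p not in (W, L) and p <= threshold:
        roll = rng.integers(1, 5)                 # 1, 2, 3 or 4, equally likely
        nxt = p + roll
        p = W if nxt == 4 else (L if nxt > 4 else nxt)
        total += gamma**t * step_reward(p, h)
        t += 1
    return total

for h in (1, 2, 3):
    est = [np.mean([episode(k, h) for _ in range(20000)]) for k in (0, 1, 2, 3)]
    print(f"House total {h}: estimated values for thresholds 0..3 = {np.round(est, 2)}")
```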
Problem 2.2. Suppose that a Markov decision problem has four states and two actions. Suppose further
that the two row-stochastic matrices corresponding to the two actions are as follows:
Au1 =
0.1  0.3  0.3  0.3
0.3  0.4  0.1  0.2
0    0.4  0.4  0.2

Au2 =
0.3  0.2  0    0.5
0.1  0.1  0.2  0.6
0.2  0.5  0.1  0.2
Suppose further that the reward map R : X × U → R is as follows (note that we write e.g., (3, 1) instead of (x3, u1)
to save space):

(x, u):    (1,1)  (1,2)  (2,1)  (2,2)  (3,1)  (3,2)  (4,1)  (4,2)
R(x, u):     2      5     −1      4      3      3      6     −1
Suppose we define a deterministic policy π by
π = [ 0 1 1 0
      1 0 0 1 ].
In other words, π(x1 ) = u2 , π(x2 ) = u1 , π(x3 ) = u1 , π(x4 ) = u2 . Compute the corresponding state
transition matrix Aπ and reward map Rπ .
Suppose we define a probabilistic policy π by
π = [ 0.3 0.4 0.2 0.6
      0.7 0.6 0.8 0.4 ].
With a discount factor of γ = 0.9, compute the optimal value and optimal policy using Theorem 2.12.
Chapter 3
Stochastic Approximation
In this chapter, we discuss stochastic approximation (SA), which provides the mathematical foundation for
many of the reinforcement learning (RL) algorithms presented in subsequent chapters. In RL, the learner
attempts to identify an optimal (or nearly optimal) policy for an MDP with possibly unknown dynamics.
Even if the MDP dynamics were to be known, computing an optimal policy using the policy iteration
approach described in Theorem 2.12 would require, at each iteration, the exact computation of the value
function associated with the policy. One of the applications of SA is that the policy can be updated even as
the value function is being estimated. In the case where the MDP dynamics are not known, the learner has
two possible approaches:
1. Estimate the MDP parameters and use this estimate to formulate the corresponding optimal policy.
This is often referred to as “indirect” RL, drawing inspiration from the phrase “indirect adaptive
control” from control theory.
2. Directly estimate the optimal policy without attempting to estimate the MDP parameters. This is
often referred to as “direct” RL, again drawing upon the phrase “direct adaptive control.”
where {αt} is a predetermined sequence of step sizes.¹ Note that αt is permitted to be a random number; in
this case, its probability distribution depends only on information up to and including time t. This is made
precise later on. If only noisy measurements of the form (3.1) are available, then (3.2) is replaced by
However, there is some ambiguity about the updating rule (3.3). Note that h(θ) also equals ‖−f(θ)‖₂². So
we could just as easily use the updating rule
if θ ≠ θ∗, then we should use the updating rule (3.4). However, if the sign is reversed in (3.5), then we should
use the updating rule (3.3). Going forward, we will use (3.4). The main reason is that, if the function f(·)
satisfies (3.5), then under mild additional conditions θ∗ is a globally attractive equilibrium of the associated
differential equation

  θ̇ = f(θ).        (3.6)

We will refer to a function f : R^d → R^d satisfying (3.5) for all θ as a passive function, borrowing a term from circuit
theory.
Stochastic approximation theory is devoted to the study of conditions under which a sequence {θ t } defined
as in (3.3) converges to a zero of the function f (·). Due to the presence of the noise, the convergence can only
be probabilistic. The SA algorithm was introduced in [29], and some generalizations and/or simplifications
followed very quickly; see [49, 20, 15, 13]. An excellent survey can be found in [26]. Book-length treatments
of SA can be found in [24, 3, 25, 8].
The above formulation of SA can be used to address some related problems. For instance, suppose
g : Rd → Rd is some function, and it is desired to find a fixed point of the map g, that is, a vector θ ∗ such
that g(θ ∗ ) = θ ∗ . As shown in Chapter 2, computing the value of a Markov reward problem, or the value
of a policy in an MDP, both fall into this category. This problem can be formulated as that finding a zero
of the function f (θ) = g(θ) − θ. If we were to substitute this expression into (3.3), we get what might be
called the “fixed point version” of SA, namely
Another application is that of finding a stationary point of a function J : Rd → R, that is, finding a θ ∗ ∈ Rd
such that ∇J(θ ∗ ) = 0. Again, the above problem can be formulated in the present framework by defining
f(θ) := −∇J(θ). Here again, one can ask why we choose f(θ) := −∇J(θ) and not f(θ) := ∇J(θ). The answer is that
if we write f (θ) := −∇J(θ), then under suitable conditions we can expect SA to find a local (or global)
minimum of J. If we wish to maximize J, then of course we should choose f (θ) := ∇J(θ). As in the previous
case, the learner has available only noisy measurements of the gradient, in the form yt+1 = −∇J(θ t ) + ξ t+1 .
For this reason, the above formulation is sometimes referred to as “stochastic gradient descent.” The reader is
cautioned that the same phrase is also used with an entirely different meaning in the deep learning literature.
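As an illustration of the kind of iteration being described, the following Python sketch runs the update θ_{t+1} = θ_t + α_t(f(θ_t) + ξ_{t+1}) with f(θ) = −∇J(θ) for a simple quadratic J, using only noisy gradient measurements. The precise form of the update, the choice of J, and the step-size schedule used here are assumptions made for illustration, not a restatement of the displayed equations of this section.

import numpy as np

rng = np.random.default_rng(0)
d = 3
theta_star = np.array([1.0, -2.0, 0.5])

def grad_J(theta):
    # J(theta) = 0.5 * ||theta - theta_star||^2, so grad J(theta) = theta - theta_star.
    return theta - theta_star

theta = np.zeros(d)
for t in range(1, 20001):
    alpha_t = 1.0 / t                     # sum alpha_t = infinity, sum alpha_t^2 < infinity
    xi = rng.normal(scale=0.5, size=d)    # zero-mean measurement noise
    y = -grad_J(theta) + xi               # noisy measurement of f(theta) = -grad J(theta)
    theta = theta + alpha_t * y           # assumed form of the SA update

print(np.round(theta, 3))                 # should be close to theta_star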
Equation 3.3 describes what might be called the “standard” SA. Two variants of SA are germane to RL,
namely “asynchronous” SA and “two time-scale” SA. Each of these is briefly described next.
1 In earlier years, methods such as steepest descent used to consist of two parts: (i) a choice of the search direction, and
(ii) a one-dimensional minimization along the search direction to determine the step size. However, in recent times, this one-dimensional
minimization has been dispensed with. Instead, the step size is chosen according to a predetermined schedule.
We begin with a description of “asynchronous” SA. Let [d] denote the set {1, · · · , d}, and let there be a
rule I that maps Z+ , the set of nonnegative integers, into [d]. At time t ∈ Z+ , let I(t) be the corresponding
element of [d]. Then the update rule (3.3) is applied only to the I(t)-th component of θ. Thus
  θ_{t+1,j} = { θ_{t,j} − α_t y_{t+1,j},   if j = I(t),
              { θ_{t,j},                   if j ≠ I(t).        (3.8)
From an implementation standpoint, the asynchronous update rule (3.8) presumably requires less storage.
However, convergence with the asynchronous update rule might be slower than with (3.3).
Note that there is a great deal of flexibility in the update rule, which could either be deterministic or
probabilistic. To illustrate, updating each component of θ sequentially would be an example of a deterministic
update rule, while choosing an index i from [d] according to some probability distribution would be an
example of a probabilistic rule. We shall see in subsequent chapters that, in some RL applications, the
process {I(t)} is itself a Markov process assuming values in [d]. For a given integer T , let T (i) denote the
number of times that component i ∈ [d] is chosen to be updated, until time T . Then we insist that there
exists a ν > 0 such that
  lim inf_{T→∞} T(i)/T ≥ ν > 0,   ∀i ∈ [d].        (3.9)
If i is chosen in a random fashion, for example in accordance with some probability distribution on [d], or
as a Markov process on [d], then we insist only that (3.9) must hold almost surely. The purpose of (3.9) is
to ensure that, though only one component of θ t is updated at time t, as time goes on, the fraction of times
that each component of θ t is updated is bounded below by some positive constant.
This variant of asynchronous SA is introduced in [39], and builds on “asynchronous optimization” intro-
duced in [40]. Another variant of asynchronous SA is introduced in [7], and studied further in [9]. In this
variant, the update rule (3.8) is changed to
  θ_{t+1,j} = { θ_{t,j} − α_{ν(t+1;i)} y_{t+1,j},   if j = I(t),
              { θ_{t,j},                            if j ≠ I(t),        (3.10)

where

  ν(t + 1; i) = Σ_{τ=1}^{t+1} I_{{I(τ)=i}}.
Thus ν(t+1; i) counts the number of time instants up to time t+1 when component i is selected for updating.
The difference between (3.8) and (3.10) is this: In (3.8), the step size is αt , which depends on the “global
counter” t. In contrast, in (3.10), the step size is αν(t+1;i) , which depends on the “local counter” ν(t + 1; i).
Establishing the convergence of this particular variant makes use of far more stringent requirements on the
noise {ξ t }, compared to the version in (3.8). This can be seen by comparing the contents of [39] with those
of [7]. Moreover, standard RL algorithms such as Q-learning, introduced in subsequent chapters, correspond
to the update rule (3.8) and not (3.10). For this reason, the update rule in (3.10) is not studied further.
The third variant of SA studied here is “two time-scale” SA, in which we attempt to solve “coupled”
equations of the form
f (θ, φ) = 0, g(θ, φ) = 0, (3.11)
where θ ∈ Rd , φ ∈ Rl , f : Rd × Rl → Rd , and g : Rd × Rl → Rl . As before, for given “current” guesses
θ t , φt , we have access only to noise-corrupted measurements of the form
Note the provision to have different step sizes for updating θ t and φt . Now, if we were to choose the step
sizes in such a way that αt = βt , or that the ratio αt /βt is bounded above and also below away from zero,
then nothing much would be gained by permitting two different step sizes. However, suppose that αt /βt → 0
as t → ∞. Then φt is updated more rapidly than θ t (or θ t is updated more slowly than φt ). In this setting,
(3.13) is said to represent “two time-scale” SA.
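Since the displayed equations are not reproduced here, the Python sketch below simply assumes the natural coupled form θ_{t+1} = θ_t + α_t(f(θ_t, φ_t) + ξ_{t+1}), φ_{t+1} = φ_t + β_t(g(θ_t, φ_t) + ζ_{t+1}) with α_t/β_t → 0; the particular functions f and g are hypothetical, chosen so that the coupled equations (3.11) have the solution θ = φ = 1.

import numpy as np

rng = np.random.default_rng(1)

def f(theta, phi):           # "slow" equation; zero when phi = 1
    return 1.0 - phi

def g(theta, phi):           # "fast" equation; zero when phi = theta
    return theta - phi

theta, phi = 0.0, 0.0
for t in range(1, 200001):
    alpha_t = 1.0 / t            # slow step size
    beta_t = 1.0 / t ** 0.6      # fast step size; alpha_t / beta_t -> 0
    xi = rng.normal(scale=0.1)
    zeta = rng.normal(scale=0.1)
    theta = theta + alpha_t * (f(theta, phi) + xi)
    phi = phi + beta_t * (g(theta, phi) + zeta)

print(round(theta, 3), round(phi, 3))    # both should approach 1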
The final section of this chapter deals with “finite-time” SA. Traditional results in SA are asymptotic,
and assert that, under suitable conditions, the iterates converge to a solution of the problem under study as
t → ∞. In contrast, in “finite-time” SA, the emphasis is on providing (probabilistic) estimates of how far
the current guess is from a solution. The finite-time approach is applicable to each of the three variants of
SA mentioned above.
Throughout this chapter, the emphasis is on stating, and wherever possible proving, theorems about the
behavior of various SA algorithms. The application of these results to problems in RL is deferred to later
chapters.
1. Zt ∈ M(Fτ ) whenever τ ≥ t.
If {Zt }t≥0 is an Rd -valued stochastic process on (Ω, F, P ), then we can define the “natural filtration” by
Ft = σ(Z0t ),
where σ(Z0t ) ⊆ F is the σ-algebra generated by Z0t . However, we do not always use the natural filtration.
Suppose {Ft } is a filtration on (Ω, F), and that {Zt }t≥0 is an Rd -valued stochastic process on (Ω, F, P ).
Then the pair ({Zt }, {Ft }) is said to be a martingale if the following hold:
If we use the natural filtration Ft = σ(Z0t ), then (3.16) can be replaced by Condition (M3’), namely
If (3.16) is replaced by
E(Zt+1 |Ft ) ≤ Zt , a.s., ∀t ≥ 0, (3.18)
then {Zt }t≥0 is called a supermartingale, whereas if (3.17) is replaced by
The equality is replaced by ≤ for a supermartingale, and by ≥ for a submartingale. Next, by the expected
value preservation property (Item 3 of Theorem 8.1), it follows that2
Then it is obvious that {Zt } is also adapted to {Ft }. The sequence ({ξt }, {Ft }) is said to be a martingale
difference sequence if ({Zt }, {Ft }) is a martingale. It is easy to show using (3.16) that, if {ξt } is a
martingale difference sequence, then
If ξ0 = 0 almost surely (that is, if ξ0 is a constant), then it follows that E[ξt , P ] = 0 for all t ≥ 1. The picture
is clearer if each ξt belongs to L2 (Ω, Ft ). Then, by the projection property (Item 9) of Theorem 8.1, (3.26)
is equivalent to the statement that ξt+1 is orthogonal to every element of L2 (Ω, Ft ).
Example 3.1. Suppose {ξt} is a sequence of independent (but not necessarily identically distributed) zero-
mean random variables. Then {ξt} is a martingale difference sequence, and the sequence {Zt} is a martingale.
If E[ξt, P] ≥ 0 for all t, then {Zt} is a submartingale, whereas if E[ξt, P] ≤ 0 for all t, then {Zt} is a
supermartingale.
2 The reader is reminded that, wherever possible, we use parentheses for the conditional expectation, which is a random
variable, and square brackets for the expected value, which is a real number.
Now we present some results on the convergence of martingales, which are useful in proving the conver-
gence of various SA algorithms. The material is taken from [48, Chapter 12] and/or [14, Chapter 4] and is
stated without proof. Citations from these sources are given for individual results stated below.
We begin with a preliminary concept. Given a filtration {Ft } and a stochastic process {At } that is
adapted to Ft , we say that {(At , Ft )} is predictable if At ∈ M(Ft−1 ) for all t ≥ 1. Note that there is no
A0 . Also, note that in [48], such processes are said to be “previsible.” However, the phrase “predictable”
is used in [14] and appears to be more commonly used. We say that a martingale {Zt } (adapted to Ft ) is
null at zero if Z0 = 0 a.s., and that a predictable process {At } is null at zero if A1 = 0 a.s..3 With these
preliminaries, we can now state the following:
Theorem 3.1. (Doob decomposition theorem. See [48, Theorem 12.11] or [14, Theorem 4.3.2].) Suppose
{Ft } is a filtration and {Yt } is a stochastic process adapted to {Ft }. Then Yt can be expressed as
Yt = Y0 + Zt + At , (3.26)
where {Zt}t≥0 is a martingale null at zero, and {At} is a predictable process null at zero. If {Z′t} and {A′t}
also satisfy the above conditions, then

  P{ω : Z_t(ω) = Z′_t(ω) and A_t(ω) = A′_t(ω), ∀t} = 1.        (3.27)

Moreover, {Yt} is a submartingale if and only if {At} is an increasing process, that is

  P{ω : A_{t+1}(ω) ≥ A_t(ω), ∀t} = 1.        (3.28)
Note that an “explicit” expression for At is given by

  A_t = Σ_{τ=0}^{t−1} E((Y_{τ+1} − Y_τ)|F_τ) = Σ_{τ=0}^{t−1} [E(Y_{τ+1}|F_τ) − Y_τ].        (3.29)
Next, suppose Yt = Mt2 , where {Mt } is a martingale in L2 (Ω, P ) null at zero. Then it is easy to show using
the conditional Jensen’s inequality (not covered here) that {Yt } is a submartingale null at zero. Therefore
the Doob decomposition of Yt = Mt2 is
Mt2 = Zt + At , (3.30)
where {Zt } is a martingale and {At } is an increasing predictable process, both null at zero. It is customary
to refer to {At} as the quadratic variation process and to denote it by ⟨M_t⟩. Note that

  A_{t+1} − A_t = E((M_{t+1}² − M_t²)|F_t) = E((M_{t+1} − M_t)²|F_t).        (3.31)
Define A∞ (ω) = limt→∞ At (ω) for (almost all) ω ∈ Ω. Then we have the following:
Theorem 3.2. (See [48, Theorem 12.13].) If A∞(·) is bounded almost everywhere as a function of ω, then
{Mt(ω)} converges almost everywhere as t → ∞.
Actually [48, Theorem 12.13] is more powerful and gives “almost necessary and sufficient” conditions for
convergence. We have simply extracted what is needed for present purposes.
Theorem 3.3. (See [14, Theorem 4.2.12].) If {Zt} is a nonnegative (i.e., Zt ≥ 0 a.s.) supermartingale, then
there exists a Z ∈ L1(Ω, P) such that Zt → Z almost surely, and E[Z, P] ≤ E[Z0, P].
Theorem 3.4. Suppose {Zt } is a martingale wherein Zt ∈ Lp (Ω, P ) for some p > 1, and suppose further
that the martingale is bounded in ‖ · ‖_p, that is,

  sup_t E[|Z_t|^p, P] < ∞.        (3.32)
Then there exists a Z ∈ Lp (Ω, P ) such that Zt → Z as t → ∞, almost surely and in the p-th mean.
The above theorem is false if p = 1. The convergence is almost sure but need not be in the mean. See
[14, Example 4.2.13].
3 This is a slight inconsistency because actually At is “null at time one,” but the usage is common.
Theorem 3.5. (See [16, Theorem 1].) Suppose the following assumptions hold:
1. f (·) is passive with a zero at θ ∗ , that is, (3.35) holds.
2. There exists a constant d_1 such that

  ‖f(θ)‖₂² ≤ d_1(1 + ‖θ − θ∗‖₂²).        (3.38)

3. Let F_t := σ(θ_0^t, ξ_1^t).⁴ Then the noise sequence {ξ_{t+1}} satisfies two conditions: First,

  E(ξ_{t+1}|F_t) = 0 a.s.        (3.39)

Second, there exists a constant d_2 > 0 such that

  E(‖ξ_{t+1}‖₂²|F_t) ≤ d_2(1 + ‖θ_t‖₂²) a.s., ∀t.        (3.40)

⁴ The reader is reminded that θ_0^t is a shorthand for (θ_0, · · · , θ_t), and that σ(θ_0^t, ξ_1^t) denotes the σ-algebra generated by θ_0^t and ξ_1^t.
For notational simplicity, suppose θ is changed to θ − θ∗, and f(θ) is changed to f(θ − θ∗), so that the
unique solution to f(θ) = 0 is θ = 0. Then (3.40) continues to hold with possibly a different (but still finite)
constant d_2. Next,

  E(‖θ_{t+1}‖₂²|F_t) = ‖θ_t‖₂² + α_t²‖f(θ_t)‖₂² + α_t² E(‖ξ_{t+1}‖₂²|F_t) + 2α_t θ_t^⊤ f(θ_t).

Now we use (3.38), (3.39), and (3.40), and define d = d_1 + d_2. This gives

  E(‖θ_{t+1}‖₂²|F_t) ≤ ‖θ_t‖₂² + dα_t²(1 + ‖θ_t‖₂²) + 2α_t θ_t^⊤ f(θ_t).        (3.42)

Now define

  Z_t := a_t‖θ_t‖₂² + b_t,        (3.43)
and choose the constants at , bt recursively so as to ensure that Zt is a nonnegative supermartingale. For this
purpose, note that, in view of (3.42) and the passivity property θ_t^⊤ f(θ_t) ≤ 0,

  E(Z_{t+1}|F_t) = a_{t+1} E(‖θ_{t+1}‖₂²|F_t) + b_{t+1} ≤ a_{t+1}(1 + dα_t²)‖θ_t‖₂² + a_{t+1} dα_t² + b_{t+1}.

Now substitute

  ‖θ_t‖₂² = (1/a_t) Z_t − b_t/a_t.

This gives

  E(Z_{t+1}|F_t) ≤ [a_{t+1}(1 + dα_t²)/a_t] Z_t − [a_{t+1}(1 + dα_t²)/a_t] b_t + a_{t+1} dα_t² + b_{t+1}.

So we get the supermartingale property

  E(Z_{t+1}|F_t) ≤ Z_t

provided we define a_{t+1} and b_{t+1} in such a way that

  a_{t+1}(1 + dα_t²)/a_t = 1, or a_{t+1} = a_t/(1 + dα_t²),

and

  b_{t+1} = b_t − dα_t² a_{t+1}.
The square-summability of the step size sequence {α_t} ensures that both a_0 and b_0 are well-defined. This
gives the closed-form expressions

  a_t = Π_{k=t}^{∞} (1 + dα_k²),   b_t = Σ_{k=t}^{∞} dα_k² a_{k+1} = Σ_{k=t}^{∞} dα_k² Π_{m=k+1}^{∞} (1 + dα_m²).        (3.44)
Moreover, we have

  1 < a_{t+1} < a_t < a_0,   0 < b_{t+1} < b_t < b_0,   ∀t.

Therefore a_t ↓ a_∞ ≥ 1 and b_t ↓ b_∞ ≥ 0.
With the above choices for a_t, b_t, {Z_t} is a nonnegative supermartingale. By Theorem 3.3, {Z_t} converges
almost surely to some random variable. Because a_t → a_∞ ≥ 1 and b_t → b_∞, it follows from (3.43) that
‖θ_t‖₂² also converges almost surely to some random variable ζ. Thus the proof is complete once it is shown
that ζ = 0 a.s.
Since {Z_t} is a nonnegative supermartingale, it follows that

  E[Z_t, P] ≤ E[Z_0, P], ∀t ≥ 0.

Since a_t ≥ 1 and b_t ≥ 0, this implies that E[‖θ_t‖₂², P] ≤ E[Z_t, P] ≤ E[Z_0, P] =: c
for some constant c. This is the first desired conclusion. Now let us return to (3.42). After taking the
expected value with respect to P on both sides, we get

  E[‖θ_{t+1}‖₂², P] ≤ E[‖θ_t‖₂², P] + α_t² d(1 + E[‖θ_t‖₂², P]) + 2α_t E[θ_t^⊤ f(θ_t), P].
Summing this inequality from k = 0 to k = t, and using the bound E[‖θ_k‖₂², P] ≤ c, gives

  2 Σ_{k=0}^{t} α_k |E[θ_k^⊤ f(θ_k), P]| ≤ c + d(1 + c) Σ_{k=0}^{∞} α_k² − E[‖θ_{t+1}‖₂², P] ≤ c + d(1 + c) Σ_{k=0}^{∞} α_k²,

where we use the fact that θ_k^⊤ f(θ_k) ≤ 0, so that −E[θ_k^⊤ f(θ_k), P] = |E[θ_k^⊤ f(θ_k), P]|. In short, we have

  Σ_{k=0}^{t} α_k |E[θ_k^⊤ f(θ_k), P]| ≤ (1/2) [ c + d(1 + c) Σ_{k=0}^{∞} α_k² ] =: c_1, ∀t.

Now let t → ∞ and also, change the index of summation from k to t. This gives

  Σ_{t=0}^{∞} α_t |E[θ_t^⊤ f(θ_t), P]| ≤ c_1 < ∞.        (3.47)
Now recall that Σ_t α_t = ∞. This implies that

  lim inf_{t→∞} |E[θ_t^⊤ f(θ_t), P]| = 0.        (3.48)

To see this, suppose that (3.48) is false. Then there exist an ε > 0 and a finite K such that

  |E[θ_t^⊤ f(θ_t), P]| ≥ ε, ∀t ≥ K.

Also

  Σ_{t=K}^{∞} α_t = ∞,

because dropping a finite number of terms in the summation does not affect the divergence of the sum.
These two observations taken together contradict (3.47). Therefore (3.48) holds. As a result, there is a
subsequence of {θ_t}, call it {θ_{t_k}}, such that θ_{t_k}^⊤ f(θ_{t_k}) → 0 in probability. In turn this implies that there is
yet another subsequence, which is again denoted by {θ_{t_k}}, such that θ_{t_k}^⊤ f(θ_{t_k}) → 0 almost surely as k → ∞.
Again, if

  lim inf_{k→∞} ‖θ_{t_k}‖₂² > 0,

then applying (3.35) leads to a contradiction. Hence there exists a subsequence of times {t_k} such that
‖θ_{t_k}‖₂² → 0 almost surely as k → ∞. Now ‖θ_t‖₂² → ζ a.s., and a subsequence converges almost surely to 0.
This shows that ζ = 0 a.s.
As a concrete application of Theorem 3.5, we consider the convergence of the iterations of a contraction
map with noisy measurements. Though the result is a corollary of Theorem 3.5, it is stated separately as a
theorem in view of its importance.
First we describe the set-up. Suppose g : R^d → R^d is a global contraction with respect to the Euclidean
norm. Specifically, suppose there exists a constant ρ < 1 such that

  ‖g(θ) − g(φ)‖₂ ≤ ρ‖θ − φ‖₂, ∀θ, φ ∈ R^d.

To determine the unique fixed point θ∗ of g(·), define the iterations as in (3.7), namely
Hence f (·) is passive, and moreover, θ ∗ is the unique zero of f (·). Also, we have that
where the definitions of a and b are obvious. Now using the easily proven identity
we conclude that f (·) satisfies (3.38). Then the desired result follows from Theorem 3.5.
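The following Python sketch illustrates this corollary for a linear contraction g(θ) = Gθ + r with ‖G‖_S < 1. The iteration used, θ_{t+1} = θ_t + α_t(g(θ_t) − θ_t + ξ_{t+1}), is the “fixed point version” of SA described in Section 3.1; since the displayed equation (3.7) is not reproduced above, this should be read as an assumed implementation rather than a restatement of it.

import numpy as np

rng = np.random.default_rng(3)
d = 4
# A linear map g(theta) = G theta + r with spectral norm 0.9 < 1 (a 2-norm contraction).
M = rng.standard_normal((d, d))
G = 0.9 * M / np.linalg.norm(M, 2)
r = rng.standard_normal(d)
theta_star = np.linalg.solve(np.eye(d) - G, r)   # the unique fixed point of g

theta = np.zeros(d)
for t in range(1, 50001):
    alpha_t = 1.0 / t
    xi = rng.normal(scale=0.3, size=d)
    # Assumed fixed-point form of the SA update.
    theta = theta + alpha_t * (G @ theta + r - theta + xi)

print(np.round(theta - theta_star, 3))           # entries close to zero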
Another application of Theorem 3.5 is to minimizing a convex function h : Rd → R using noisy measure-
ments of the gradient ∇h. Recall that a function h : Rd → R is said to be convex if
Moreover, θ ∗ is a global minimum of h if and only if ∇h(θ ∗ ) = 0. Hence one can attempt to find θ ∗ by
finding a zero of the function f (θ) = −∇h(θ), using noisy measurements of the form
Now the following assumptions are made about the function f. The first assumption is that the function
f is globally Lipschitz continuous with constant L. Thus

  ‖f(θ) − f(φ)‖₂ ≤ L‖θ − φ‖₂, ∀θ, φ ∈ R^d.
The next several assumptions concern the existence of a function V satisfying various assumptions. Though
it is not immediately obvious, these are actually assumptions about the function f . See Section 8.4 to see
the connection. It is now assumed that there exists a C 2 function V : Rd → R+ , and constants a, b, c > 0
and M < ∞ such that
akθ − θ ∗ k22 ≤ V (θ) ≤ bkθ − θ ∗ k22 , ∀θ ∈ Rd , (3.55)
This implies, among other things, that V(θ∗) = 0 and that V(θ) > 0 for all θ ≠ θ∗. Next, it is assumed that

  V̇(θ) ≤ −c‖θ − θ∗‖₂², ∀θ ∈ R^d,        (3.56)

where V̇ : R^d → R is defined as

  V̇(θ) := ∇V(θ)f(θ).        (3.57)
Note that the gradient ∇V(θ) is taken as a row vector. The last assumption on V is that

  ‖∇²V(θ)‖_S ≤ 2M, ∀θ ∈ R^d,        (3.58)

where ∇²V is the Hessian matrix of V, and ‖ · ‖_S denotes the spectral norm of a matrix, that is, its largest
singular value.
As before, let Ft := σ(θ t0 , ξ t1 ). Then the noise sequence {ξt+1 } is assumed to satisfy the same two
conditions as before, reprised here for convenience: First,
Finally, the step size sequence {αt } satisfies the RM conditions (3.37), restated here:
  Σ_{t=1}^{∞} α_t = ∞,   Σ_{t=1}^{∞} α_t² < ∞.        (3.61)
Proof. The proof is analogous to that of Theorem 3.5, with kθ t k22 replaced by V (θ t ). To begin with, we
translate coordinates so that θ ∗ = 0. This may cause the constant d in (3.60) to change, but it would still
be finite. Note that
V (θ t+1 ) = V (θ t + αt (f (θ t ) + ξ t+1 )).
Now by Taylor’s expansion around θ t , we get
Now apply

  ‖f(θ_t)‖₂² ≤ L²‖θ_t‖₂² ≤ (L²/a) V(θ_t),   ‖θ_t‖₂² ≤ (1/a) V(θ_t).

This leads to

  E(V(θ_{t+1})|F_t) ≤ (1 + α_t² M(L² + d)/a) V(θ_t) + dM α_t².
This is entirely analogous to (3.42). By replicating the earlier reasoning, we conclude that V(θ_t) is bounded
almost surely and converges to some random variable ζ. In the last part of the proof, we restore the term
α_t ∇V(θ_t)f(θ_t) which was neglected earlier, and observe that
In fact, this part of the proof is simpler than that of Theorem 3.5, because in that case there is no bound
on how small the product |θ_t^⊤ f(θ_t)| can be, whereas here the term |∇V(θ_t)f(θ_t)| is bounded below by the
term (c/b)V(θ_t). This allows us to conclude that ζ = 0 a.s., so that V(θ_t) is bounded almost surely and
approaches zero as t → ∞. Finally, using (3.55) gives that {θ_t} is also bounded almost surely and approaches
zero as t → ∞. These details are simple and are left to the reader.
Next we present two examples: one where convergence follows from Theorem 3.5 but Theorem 3.8 does
not apply, and another where the situation is reversed.
Example 3.2. As shown in Section 8.4, the hypotheses of Theorem 3.8 imply that the origin is a globally
exponentially stable (GES) equilibrium of the ODE θ̇ = f (θ). Now suppose d = 1 (scalar problem), and
define f (θ) = − tanh(θ). Then |f (θ)| is bounded as a function of θ. Thus the ODE θ̇ = − tanh(θ) cannot be
GES. Hence Theorem 3.8 does not apply. However, since θ tanh(θ) > 0 for every θ ≠ 0, Theorem 3.5 applies.
Example 3.3. In the other direction, suppose we wish to solve the fixed point equation
v = Gv + r
for some vector r ∈ Rd . We can cast this into the standard stochastic approximation framework by defining
f (θ) = r + (G − Id )θ.
This is like the standard relationship of a discounted reward Markov reward process. Now choose a matrix
G ∈ R^{d×d} that satisfies ‖G‖_S > 1 and ρ(G) < 1, where ρ(·) denotes the spectral radius. Then choose a
vector x ∈ R^d such that x^⊤x < x^⊤Gx. For example, let d = 2, λ ∈ (0.5, 1), and

  G = [ λ  1 ],   x = [ 1 ].
      [ 0  λ ]        [ 1 ]
Because ρ(G) < 1, the matrix G − Id is nonsingular, so that for each vector r, there is a unique solution θ ∗
to the equation f(θ) = 0, namely θ∗ = (G − I_d)^{−1} r. Now let θ = θ∗ + x where x is chosen as above. Then

  (θ − θ∗)^⊤ f(θ) = x^⊤(G − I_d)x = x^⊤Gx − x^⊤x > 0.

Hence (3.5) does not hold, and Theorem 3.5 does not apply.
On the other hand, because ρ(G) < 1, the eigenvalues of the matrix G − I_d all have negative real parts,
and as a result, the Lyapunov matrix equation

  (G − I_d)^⊤ P + P(G − I_d) = −I_d

has a unique solution P which is positive definite. Then, if we define V(θ) = (θ − θ∗)^⊤ P (θ − θ∗), Theorem 3.8
applies, and implies that the iterates converge to θ∗.
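Example 3.3 is easy to check numerically. The sketch below uses λ = 0.8 (one admissible value in (0.5, 1)), verifies that x^⊤Gx > x^⊤x and that ‖G‖_S > 1 while ρ(G) < 1, and solves the Lyapunov equation for G − I_2 with scipy; the particular solver call is one standard way to do this, not something prescribed in the text.

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

lam = 0.8
G = np.array([[lam, 1.0], [0.0, lam]])
x = np.array([1.0, 1.0])

print(x @ G @ x, x @ x)                                         # 2.6 > 2, so passivity fails at theta* + x
print(np.linalg.norm(G, 2), max(abs(np.linalg.eigvals(G))))     # ||G||_S > 1 while rho(G) = 0.8 < 1

# Solve (G - I)^T P + P (G - I) = -I for P, and check that P is positive definite.
A = (G - np.eye(2)).T
P = solve_continuous_lyapunov(A, -np.eye(2))
print(np.linalg.eigvals(P))                                     # both eigenvalues positive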
  ‖θ_s^t‖_∞ = max_{s≤τ≤t} ‖θ_τ‖_∞.
The symbol N denotes the set of natural numbers plus zero, so that N = {0, 1, · · · }. If {Ft } is a filtration
with t ∈ N, then M(Ft ) denotes the set of functions that are measurable with respect to Ft . Recall that the
symbol [d] denotes the set {1, · · · , d}.
In Section 3.3, the objective is to compute the zero of a function f : Rd → Rd . Theorems 3.5 and 3.8
provide sufficient conditions under which “synchronous” stochastic approximation can be used to solve this
problem. In particular, if g : R^d → R^d is a contraction map with respect to the ℓ₂-norm, then Theorem 3.6
shows that a fixed point of g can be found using the iterative scheme (3.7). However, as seen in Chapter 2,
often one has to determine the fixed point of a map g : R^d → R^d which is a contraction in the ℓ∞-norm.
Theorem 3.5 does not apply to this case. Proving a version that works in this case is the main motivation
for introducing “asynchronous” stochastic approximation in [39]. The treatment below basically extends
those arguments to the case where more than one component of θ t is updated at any time. An alternate
version of ASA is introduced in [7] using local clocks. However, the assumptions on the measurement noise
process {ξ t } are fairly restrictive, in that the process is assumed to be an i.i.d. sequence, an assumption that
does not hold in RL problems. In contrast, in [39], the noise process is assumed only to be a martingale
difference sequence, which is satisfied in RL problems. Thus, even if only one component of θ t is updated
at any one time t, the treatment here combines the best features of both [39] (noise process is a martingale
difference sequence) and [7] (updating can use a local clock). The possibility of “batch” updating is a bonus.
One important difference between standard SA and BASA is that even the step sizes can now be random
variables. In principle random step sizes can be permitted even in standard SA, but it is not common.
The proof is divided into two parts. In the first, it is shown that iterations using a very general updating
rule are bounded almost surely. This result would appear to be of independent interest; it is referred to
as “stability” by some authors. Then this result is applied to show convergence of the iterations to a fixed
point of a contractive map. In contrast with both [39] and [7], for the moment we do not permit delayed
information.
For the first theorem on almost sure boundedness, we set up the problem as follows: We consider a
function h : N × (Rd )N → (Rd )N , and say that it is nonanticipative if, for each t ∈ N,
  θ_0^∞, φ_0^∞ ∈ (R^d)^N, θ_0^t = φ_0^t  ⟹  h(τ, θ_0^∞) = h(τ, φ_0^∞), 0 ≤ τ ≤ t.
Note that such functions are also referred to as “causal” in control and system theory. There are also d
distinct “update processes” {νt,i }t≥0 for each i ∈ [d]. Finally, there is a “step size” sequence {βt }t≥0 , which
for convenience we take to be deterministic, though this is not essential. Also, it is assumed that βt ∈ (0, 1)
for all t.
The “core” stochastic processes are the parameter sequence {θ t }t≥0 , and the noise sequence {ξ t }t≥1 .
Note the mismatch in the initial values of t. Often it is assumed that θ 0 is deterministic, but this is not
essential. We define the filtration
  F_0 = σ(θ_0),   F_t = σ(θ_0^t, ξ_1^t) for t ≥ 1,
where σ(·) denotes the σ-algebra generated by the random variables inside the parentheses.
Now we can begin to state the problem set-up.
(U1). The update processes νt,i ∈ M(Ft ) for all t ≥ 0, i ∈ [d].
(U2.) The update processes satisfy the conditions that ν0,i equals either 0 or 1, and that νt,i equals either
νt−1,i or νt−1,i + 1. In other words, the process can only increment by at most one at each time instant
t, for each index i. This automatically guarantees that νt,i ≤ t for all i ∈ [d].
(S1). For “batch asynchronous updating” with a “global clock,” the step size αt,i for each index i is defined
as
  α_{t,i} = { β_t,   if ν_{t,i} = ν_{t−1,i} + 1,
            { 0,     if ν_{t,i} = ν_{t−1,i}.        (3.62)
Therefore αt,i equals βt for those indices i that get incremented at time t, and zero for other indices.
(S2.) For “batch asynchronous updating” with a “local clock,” the step size αt,i for each index i is defined
as
  α_{t,i} = { β_{ν_{t,i}},   if ν_{t,i} = ν_{t−1,i} + 1,
            { 0,             if ν_{t,i} = ν_{t−1,i}.        (3.63)
Note that, with a local clock, we can also write
  ν_{t,i} = ν_{0,i} + Σ_{τ=1}^{t} I_{{ν_{τ,i} = ν_{τ−1,i}+1}},        (3.64)
where I denotes the indicator function: It equals 1 if the subscripted statement is true, and equals 0
otherwise. Note that with a global clock, the step size for each index that has a nonzero αt,i is the
same, namely βt . However, with a local clock, this is not necessarily true.
With this set-up, we can define the basic asynchronous iteration scheme.
  θ_{t+1,i} = θ_{t,i} + α_{t,i}(η_{t,i} − θ_{t,i} + ξ_{t+1,i}),   i ∈ [d], t ≥ 0,        (3.65)

where

  η_t = h(t, θ_0^t).        (3.66)
Note that in our view it makes no sense to define

  η_{t,i} = h_i(ν_{t,i}, θ_0^{ν_{t,i}}).

We cannot think of any application for such an updating rule.
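To fix ideas, here is a small Python sketch of the update (3.65) with single-component updating, using a local clock as in (3.63) and the function h(t, θ_0^t) = g(θ_t) of Theorem 3.10 for an ℓ∞-contraction g. The choice of g, the update process, and the noise model are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
d = 5
# g(theta) = G theta + r with small row sums, hence a contraction in the infinity-norm.
G = rng.uniform(0.0, 0.15, size=(d, d))          # each row sum is below 0.75 < 1
r = rng.standard_normal(d)
theta_star = np.linalg.solve(np.eye(d) - G, r)   # unique fixed point of g

def beta(k):                                     # deterministic step-size sequence in (0, 1)
    return 1.0 / (k + 1)

theta = np.zeros(d)
nu = np.zeros(d, dtype=int)                      # local counters nu_{t,i}
for t in range(200000):
    i = rng.integers(d)                          # update process I(t): uniform over [d]
    nu[i] += 1
    alpha_ti = beta(nu[i])                       # local-clock step size, as in (3.63)
    eta_ti = G[i, :] @ theta + r[i]              # i-th component of g(theta_t)
    xi = rng.normal(scale=0.2)
    theta[i] = theta[i] + alpha_ti * (eta_ti - theta[i] + xi)   # the update (3.65)

print(np.round(theta - theta_star, 3))           # entries close to zero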
The question under study in the first part of the paper is the almost sure boundedness of the iterations
{θ t }. For this purpose we introduce a few assumptions.
(A2.) The step sizes αt,i satisfy the analogs of the Robbins-Monro conditions of [29], namely
  Σ_{t=0}^{∞} α_{t,i} = ∞ a.s., ∀i ∈ [d],        (3.71)

  Σ_{t=0}^{∞} α_{t,i}² < ∞ a.s., ∀i ∈ [d].        (3.72)

Note that if Σ_{t=0}^{∞} β_t² < ∞, then the square-summability of the α_{t,i} is automatic. However, the
divergence of the summation of αt,i requires additional conditions on both the step sizes βt and the
update processes νt,i .
Now we can state the result on the almost sure boundedness of the iterates.
Theorem 3.9. With the set up above, and subject to assumptions (A1), (A2) and (A3), we have that the
sequence {θ t } is bounded almost surely.
Next, we state a result on convergence. For this purpose, we significantly reduce the generality of the
function h(·).
Theorem 3.10. Suppose that, in addition to the conditions of Theorem 3.9, we have that the function h(·)
is defined by
h(t, θ t0 ) = g(θ t ),
where the function g : Rd → Rd satisfies
for some constant γ < 1.⁵ Let θ∗ denote the unique fixed point of the map g. Then θ_t → θ∗ almost surely
as t → ∞.
5 In other words, the function g is a contraction with respect to the `∞ -norm.
Now we prove the two theorems. The proof of Theorem 3.9 is very long and involves several auxiliary
lemmas. Before proceeding to the lemmas, a matter of notation is cleared up. Throughout we are dealing
with stochastic processes. So in principle we should be writing, for instance, θ t (ω), where ω is the element
of the probability space that captures the randomness. We do not do this in the interests of brevity, but the
presence of the argument ω is implicit throughout.
Lemma 3.1. Define a real-valued stochastic process {Ui (0; t)}t≥0 by
where Ui (0, 0) ∈ F0 , {αt,i } satisfy (3.71) and (3.72), and {ξ t } satisfies (3.73) and (3.74). Then Ui (0; t) → 0
almost surely as t → ∞.
This lemma is a ready consequence of Theorem 3.5. Note that (3.75) attempts to find a fixed-point of
the function f(U) = −U, and that U f(U) < 0 whenever U ≠ 0. Now apply the theorem for U = U_i.
Lemma 3.2. Define a doubly-indexed real-valued stochastic process Wi (τ ; t) by
where Wi (t0 , t0 ) = 0, {αt,i } satisfy (3.71) and (3.72), and {ξ t } satisfies (3.73) and (3.74). Then, for each
δ > 0, there exists a t0 such that |Wi (s; t)| ≤ δ, for all t0 ≤ s ≤ t.
Proof. It is easy to prove using induction that W_i(s; t) satisfies the recursion

  U_i(0; t) = [ Π_{r=s}^{t−1} (1 − α_{r,i}) ] U_i(0; s) + W_i(s; t),   0 ≤ s ≤ t.
Note that the product in the square brackets is no larger than one. Hence
Now, given δ > 0, choose a t0 such that |Ui (0; t)| ≤ δ/2 whenever t ≥ t0 . Then choosing t0 ≤ s ≤ t and
applying the triangle inequality leads to the desired conclusion.
Now we come to the proof of Theorem 3.9.
Proof. For t ≥ 0, define
  Γ_t := max{‖θ_0^t‖_∞, c_1},        (3.77)

where c_1 = c_0/(ρ − γ) as before. With this definition, it follows from (3.70) and (3.66) that
Next, choose any ε ∈ (0, 1) such that ρ(1 + ε) < 1.⁶ Now we define a sequence of constants recursively.
Let Λ_0 = Γ_0, and define

  Λ_{t+1} = { Λ_t,       if Γ_{t+1} ≤ Λ_t(1 + ε),
            { Γ_{t+1},   if Γ_{t+1} > Λ_t(1 + ε).        (3.79)

Define λ_t = 1/Λ_t. Then it is clear that {Λ_t} is a nondecreasing sequence, starting at Γ_0 = ‖θ_0‖_∞. Conse-
quently, {λ_t} is a bounded, nonnegative, nonincreasing sequence. Further, λ_tΓ_t ≤ 1 + ε for all t. Moreover,
either Λ_{t+1} = Λ_t, or else Λ_{t+1} > Λ_t(1 + ε). Hence, saying that Λ_{t+1} > Λ_t is the same as saying that
Λ_{t+1} > Λ_t(1 + ε). Let us refer to this as an “updating” of Λ at time t + 1.
Next, observe that
⁶ In the proof of Theorem 3.9, it is sufficient that ρ(1 + ε) ≤ 1. However, in the proof of Theorem 3.10, we require that
ρ(1 + ε) < 1. To avoid proliferating symbols, we use the same ε in both proofs.
Λ_T = Λ_{T+1} = · · · = Λ_{t−2} = Λ_{t−1}.
If

  ‖θ_{t+1}‖_∞ λ_t ≤ 1 + ε, ∀t ≥ T,        (3.81)
then Λ is not updated after time t, and as a result, {θ t } is bounded. So henceforth our efforts are focused
on establishing (3.81).
Now we derive a “closed form” expression for λT θT +1,i . Towards this end, observe the following: A closed
form solution to the recursion
  a_{t+1} = (1 − x_t) a_t + b_t        (3.82)

is given by

  a_{T+1} = [ Π_{r=s}^{T} (1 − x_r) ] a_s + Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − x_r) ] b_t,        (3.83)

for every 0 ≤ s ≤ T. The proof by induction is easy and is left to the reader. Observe that

  G_{T+1,i} = [ Π_{r=s}^{T} (1 − α_{r,i}) ] G_s + Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t v_{t,i}.        (3.87)
Adding these two equations, noting that Fs + Gs = λs θs,i , and expanding vt,i as ηt,i + ξt+1,i gives
  λ_{T+1} θ_{T+1,i} = Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] θ_{t+1,i} (λ_{t+1} − λ_t)
                    + Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} (λ_t η_{t,i} + λ_t ξ_{t+1,i}) + [ Π_{r=s}^{T} (1 − α_{r,i}) ] λ_s θ_{s,i}.        (3.88)
Cancelling the common term λT +1 θT +1,i and moving λT θT +1,i to the left side gives the desired expression,
namely
λT θT +1,i = Ai (s, T ) + Bi (s, T ) + Ci (s, T ) + Di (s, T ), (3.90)
where for 0 ≤ s < T

  A_i(s, T) = Σ_{t=s}^{T−1} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] θ_{t+1,i} (λ_{t+1} − λ_t),        (3.91)

  B_i(s, T) = [ Π_{r=s}^{T} (1 − α_{r,i}) ] λ_s θ_{s,i},        (3.92)

  C_i(s, T) = Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t η_{t,i},        (3.93)

  D_i(s, T) = Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t ξ_{t+1,i}.        (3.94)
Now we carry on with the proof of Theorem 3.9. First, we find an upper bound for C_i(s, T). Observe
that, because η(·) is a contraction with constant ρ, and ρ(1 + ε) ≤ 1, we get from (3.79) that Γ_{t+1} ≤ Λ_t(1 + ε),
or Γ_t λ_t ≤ 1 + ε. Since ‖θ_0^t‖_∞ ≤ Γ_t, all this implies that
Next, turning to D_i(s, T), observe that

  D_i(s, T) = α_{T,i} λ_T ξ_{T+1,i} + Σ_{t=s}^{T−1} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t ξ_{t+1,i}
            = α_{T,i} λ_T ξ_{T+1,i} + (1 − α_{T,i}) Σ_{t=s}^{T−1} [ Π_{r=t+1}^{T−1} (1 − α_{r,i}) ] α_{t,i} λ_t ξ_{t+1,i}
            = α_{T,i} λ_T ξ_{T+1,i} + (1 − α_{T,i}) D_i(s, T − 1).        (3.97)
If Λ is not updated at any time after t_0, then it follows from the earlier discussion that {θ_t} is bounded.
Otherwise, define s to be any time after t_0 + 1 at which Λ is updated. Thus ‖θ_s‖_∞ λ_s = 1 due to the
updating. Next, suppose that, for some T ≥ s, we have that

  ‖θ_t‖_∞ λ_t ≤ 1 + ε, ∀s ≤ t ≤ T.        (3.99)

Finally, from Claim 3, we have that |D_i(s, T)| ≤ ε. Combining all these shows that
Proof. Now we come to the proof of Theorem 3.10. We study the common situation where

  h(t, θ_0^t) = g(θ_t),

where g : R^d → R^d is a contraction with respect to the ℓ∞-norm. In other words, there exists a γ < 1 such
that

  |g_i(θ) − g_i(φ)| ≤ γ‖θ − φ‖_∞, ∀i ∈ [d], ∀θ, φ ∈ R^d.
In addition, define c0 = kg(0)k∞ . This notation is consistent with (3.67) and (3.70). These hypotheses
imply that there is a unique fixed point θ ∗ ∈ Rd of g(·). Theorem 3.10 states that θ t → θ ∗ almost surely as
t → ∞.
For notational convenience, it is assumed that θ∗ = 0. The modifications required to handle the case
where θ∗ ≠ 0 are obvious, and can be incorporated at the expense of messier notation. If θ∗ = 0, then
g(0) = 0, so that c_0 = 0. Hence we can take c_1 = 0, ρ ∈ (γ, 1), and (??) becomes
Let Ω_0 ⊆ Ω be the subset of Ω such that U_i(0, t)(ω) → 0 as t → ∞, for all i ∈ [d]. So for each ω ∈ Ω_0,
there exists a bound H_0 = H_0(ω) such that

  ‖θ_0^∞(ω)‖_∞ ≤ H_0(ω),   or   |θ_{t,i}(ω)| ≤ H_0(ω), ∀i ∈ [d], ∀t ≥ 0.

Hereafter we suppress the dependence on ω. Now choose ε ∈ (0, 1) such that ρ(1 + ε) < 1. Note that in the
proof of Theorem 3.9, we required only that ρ(1 + ε) ≤ 1. For such a choice of ε, Theorem 3.9 continues to
apply. Therefore ‖θ_0^∞‖_∞ ≤ H_0. Now define H_{k+1} = ρ(1 + ε)H_k for each k ≥ 0. Then clearly H_k → 0 as
k → ∞.
Now we show that there exists a sequence of times {t_k} such that ‖θ_{t_k}^∞‖_∞ ≤ H_k, for each k. This is
enough to show that θ_t(ω) → 0 for all ω ∈ Ω_0, which is the claimed almost sure convergence to the fixed point
of g. The proof of this claim is by induction. It is shown that if there exists a t_k such that ‖θ_{t_k}^∞‖_∞ ≤ H_k,
then there exists a t_{k+1} such that ‖θ_{t_{k+1}}^∞‖_∞ ≤ H_{k+1}. The statement holds for k = 0 – take t_0 = 0 because H_0
is a bound on ‖θ_0^∞‖_∞. To prove the inductive step, suppose that ‖θ_{t_k}^∞‖_∞ ≤ H_k. Choose a τ_k ≥ t_0 such that
the solution to (3.76) satisfies

  |W_i(τ_k; t)| ≤ (ρε/2) H_k, ∀i ∈ [d], ∀t ≥ τ_k.        (3.100)
Then clearly Y_{t,i} → ρH_k as t → ∞ for each i ∈ [d]. Now it is claimed that
The proof of (3.102) is also by induction on t. The bound (3.102) holds for t = τ_k because W_i(τ_k; τ_k) = 0
and the inductive assumption on k implies that

  |θ_{τ_k,i}| ≤ ‖θ_{τ_k}^∞‖_∞ ≤ ‖θ_{t_k}^∞‖_∞ ≤ H_k = Y_{τ_k,i}.

Now note that if (3.102) holds for a specific value of t, then for t ≥ τ_k we have
This proves the upper bound in (3.102) for t + 1. So the inductive step is established, showing that the
upper bound in (3.102) is true for all t ≥ τ_k. Now note that Y_{t,i} → ρH_k as t → ∞. Hence there exists a
t′_{k+1} ≥ τ_k such that

  Y_{t,i} ≤ ρH_k + (ρε/2)H_k = ρ(1 + ε/2)H_k, ∀t ≥ t′_{k+1}.

This, combined with (3.100), shows that
Hence

  θ_{t,i} ≤ H_{k+1}, ∀i ∈ [d], t ≥ t′_{k+1}.

A parallel argument gives a lower bound: There exists a t″_{k+1} such that
This establishes the inductive step for k and completes the proof.
Chapter 4

Approximate Solution of MDPs via Simulation

The contents of Chapter 2 are based on the assumption that the parameters of the Markov Decision Process
are all known. In other words, the |U| possible state transition matrices Auk , as well as the reward map
R : X × U → R (or its random version), are all available to the agent to aid in the choice of an optimal
policy. One can say that the distinction between MDP theory and reinforcement learning (RL) theory is
that in the latter, it is not assumed that the parameters of the MDP are known. Thus, in RL, one attempts
to learn these parameters based on observations.
In the RL literature, a couple of phrases are widely used without always being defined precisely. The
first phrase is “tabular methods.” As we will see, the methods presented in this chapter attempt to form
estimates of the value function, or the action-value function, for a specific policy. These estimates are almost
invariably iterative, in that the next estimate is based on the previous one. To elaborate this point, let V̂t (xi )
denote the estimate, at step t, of the value of the state xi . In many if not most iterative procedures, V̂t (xi )
depends on the estimates for all states V̂t−1 (x1 ) through V̂t−1 (xn ) at step t−1. When we attempt to estimate
the action-value function, the estimate Q_t(x_i, u_k) depends on all of the previous estimates Q_{t−1}(x_j, u_l). This
requires that the problems under study be sufficiently small that these estimates will fit into the computer
storage. The second phrase that is used is “on-policy” simulation and its twin, “off-policy” simulation. What
this means is the following: Suppose we wish to approximate the value function Vπ for a particular policy
π, known as the target policy. For this purpose, we have available a sample path of the MDP under some
other policy θ, which is known as the behavior policy. If the behavior policy is the same as the target
policy, that is, if we are able to observe a sample path of the Markov process under the same policy that we
wish to evaluate, then that situation is referred to as on-policy; otherwise the situation is called off-policy.
Pr{Xt+1 = xi |Xt = xi } = 1,
or equivalently, the row of the state transition matrix corresponding to the state xi consists of a 1 in column
i and zeros in other columns. The Markov process can have more than one absorbing state. While the
dynamics of the MDP are otherwise assumed to be unknown, it is assumed that the learner knows which
states are absorbing. By tradition, it is assumed that the reward R(xi) = 0 whenever xi is an absorbing state.
Observe too that the state transition matrix Aπ depends on the policy under study. So different policies
could lead to different absorbing states. However, since we study one policy at a time, it is acceptable to
assume that the set of absorbing states under the policy being studied is known.
In this setting, an episode refers to any sample path {(X_t, U_t, W_{t+1})}_{t≥0} that terminates in an absorbing
state. Since the policy π is chosen by the learner and is deterministic, it is always the case that Ut = π(Xt ).
Therefore Ut does not add any new information. Once Xt reaches an absorbing state, the episode terminates.
The underlying assumption is that, once the Markov process reaches an absorbing state, it can be restarted
with the initial state distributed according to its stationary (or some other) distribution. This assumption
may not always hold in practice.
Now we discuss how to generate an estimate for the discounted future value, based on a single episode.
The assumption mentioned above implies that we can repeat the estimation process over multiple episodes,
and then average all of those estimates, to arrive finally at an overall estimate. Define
  G_t = Σ_{i=0}^{∞} γ^i W_{t+i}.        (4.1)
If an absorbing state is reached after a finite time, say T , then the summation can be truncated at time T ,
because Wt+1 = 0 for t ≥ T . In this connection, recall Theorem 8.12, which gives a formula for the average
time needed to hit an absorbing state. Now by definition, for a state xi ∈ X , we have
Accordingly, suppose that an episode contains the state of interest xi at time τ , that is, Xτ = xi . Let us
also suppose that the episode terminates at time T. In such a case, we note that

  E[ Σ_{i=0}^{T−τ} γ^i W_{τ+i} ] = E[ Σ_{i=0}^{∞} γ^i W_{τ+i} ] = V(x_i).
provides an unbiased estimate for V (xi ). It is therefore one method of estimating V (xi ). Now suppose we
have L episodes, call them E1, · · · , EL. Let k denote the number of these episodes in which the state of interest xi
occurs. Without loss of generality, renumber the episodes so that these are E1 through Ek . It is of course
possible that the state of interest xi occurs more than once in the sample path. Therefore there are a couple
of different ways of estimating V (xi ) using this collection of episodes. For each such episode, let τ denote
the first time at which xi appears in the state sequence, and T the time at which the episode terminates.1
Further, define

  H_l := Σ_{i=0}^{T−τ} γ^i W_{τ+i}.

Then

  (1/k) Σ_{l=1}^{k} H_l        (4.2)
provides an estimate for V (xi ), known as the first-time estimate. If the state of interest xi occurs multiple
times within the same episode, then one can form multiple estimates Hl each time the state of interest xi
occurs in the trajectory, and then average them. This is called the everytime estimate.
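The two estimates are easy to express in code. The sketch below is only an illustration: the episode format (a list of (state, reward) pairs, with the reward collected at that step) is an assumed convention, not something fixed by the text.

def mc_estimates(episodes, state, gamma):
    """First-time and everytime Monte Carlo estimates of V(state).

    Each episode is a list of (x_t, w_t) pairs, where w_t is the reward
    collected at step t (an assumed convention for this illustration).
    """
    first_returns, every_returns = [], []
    for episode in episodes:
        # Discounted return-to-go from each time step, computed backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = False
        for t, (x, _) in enumerate(episode):
            if x == state:
                every_returns.append(returns[t])
                if not seen:
                    first_returns.append(returns[t])
                    seen = True
    avg = lambda v: sum(v) / len(v) if v else None
    return avg(first_returns), avg(every_returns)

# Hypothetical episodes with states labelled "A", "B", "C", as in Example 4.1:
episodes = [[("C", 3), ("B", 2), ("C", 3), ("A", 0)],
            [("B", 2), ("B", 2), ("A", 0)]]
print(mc_estimates(episodes, "B", 0.9))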
Example 4.1. The objective of this example is to illustrate the difference between a first-time estimate
and an everytime estimate. Suppose n = 3, and for convenience label the states as A, B, C, where A is
an absorbing state and B, C are nonabsorbing. Suppose further that R(C) = 3, R(B) = 2 and of course
R(A) = 0. Suppose L = 3 and that the three episodes (all terminating at A) are:
Now suppose we wish to estimate the value V (C). Then the episode E2 does not interest us because C does
not occur in it. If the discount factor γ equals 0.9, then we can form the following quantities:
it can be stated that the first-time estimate converges to the true value V (xi ). Moreover, one can invoke
Hoeffding’s inequality of Theorem 8.13 to generate a bound on just how reliable this estimate is, in terms of
accuracy and confidence. The everytime estimates are not statistically independent. Therefore the analysis
of their convergence is more delicate. It is shown in [32] that even the everytime estimate converges to
the true value V (xi ), if the number of episodes in which the state of interest xi occurs approaches infinity.
However, unlike the first-time estimate, the everytime estimate is biased, and has higher variance.
Example 4.2. This example, taken from [32] illustrates the bias of the everytime estimate. Consider a
Markov process with just two states, called S for state and A for absorbing respectively. Suppose the state
transition matrix is
          S      A
   S  [ 1 − p    p ]
   A  [   0      1 ]
So all trajectories starting in S look like SS · · · SA, where there are say l occurrences of S. If we were to
attach a reward R(S) = 1, and set the discount factor γ to 1, then the first-time estimate for V (S) would be
l, the length of the sample path before hitting A. Moreover, the analysis of hitting times given in Theorem
8.12 shows that the average length of this sample path is 1/p (which may not be an integer, but which is
the correct answer). On the other hand, there are l everytime estimates of V for such a trajectory, and their
sum is l(l + 1)/2. So the everytime estimate is (l + 1)/2 for a trajectory that consists of l occurrences of S
followed by A. Hence the expected value of the everytime estimate is ((1/p) + 1)/2, which is erroneous by a
factor of 2 if p is very small.
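The factor-of-two bias is easy to reproduce by simulation. In the sketch below, the lengths of trajectories of the two-state chain are drawn directly from the geometric distribution, and the averages of the first-time estimate l and the everytime estimate (l + 1)/2 are compared with the true value 1/p.

import numpy as np

rng = np.random.default_rng(5)
p = 0.05
n_paths = 200000

# Number of visits to S before absorption is geometric with parameter p.
l = rng.geometric(p, size=n_paths)

first_time = l.astype(float)       # first-time estimate of V(S), with gamma = 1 and R(S) = 1
every_time = (l + 1) / 2.0         # average of the l everytime estimates on each path

print(1.0 / p, first_time.mean(), every_time.mean())
# true value 20.0; first-time average is about 20; everytime average is about 10.5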
The Monte Carlo method for estimating the value of a policy suffers from many drawbacks. In the case
of first-time estimates, the total number of samples of a particular value V (xi ) equals the total number
of episodes that contain the state xi . However, each episode can be quite long. Thus the total number
of time steps elapsed can be far larger than the number of episodes. This makes the estimation process
very slow, with a long observation leading to a small number of samples. Also, because the number of
observations is very high, the variance of the estimate can also be very high. Another requirement is that,
in order to be able to form an estimate for V (xi ) for a particular state xi , that state must occur in a large
fraction of episodes. In turn this requires that when the Markov process with state transition matrix Aπ
is started from a randomly selected initial state, the resulting sample path must pass through the state xi
with high probability before hitting an absorbing state. Moreover, the method is based on the assumption
that, once the Markov process reaches an absorbing state, it can be restarted with a specified probability
distribution for the initial state. This assumption often does not hold in practice. Finally, note that “partial
episodes,” that is to say, trajectories of the Markov process that do not terminate in an absorbing state,
are of no use in forming an estimate of V (xi ). Everytime estimates provide a larger number of samples for
V (xi ) than first-time estimates. This is because every episode provides only one first-time estimate, but can
provide multiple everytime estimates. The difficulty is that these everytime estimates are not statistically
independent. Moreover, all of the above comments regarding the drawbacks of first-time estimates apply
also to everytime estimates.
to overcome this problem is to restrict π to be a probabilistic policy, with the additional requirement that
each state-action pair (xi , uk ) has a positive probability under π. To put it another way, for each xi ∈ X ,
the probability distribution π(xi , ·) on U has all positive elements. Even in this case however, given that
the number of state-action pairs is much larger than the number of states alone, the number of episodes
required for reliably estimating the Q-function would be far larger than the number of episodes required
for reliably estimating the V -function. Another possible approach is to use “off-policy” sampling, that is,
generate samples using a policy that is different from the policy that we wish to evaluate. In this case,
the policy π for which we wish to estimate Qπ is called the “target” policy, which can be deterministic,
and/or have several missing pairs (xi , uk ). In contrast, the policy φ that is used to generate the samples is
called the “behavior” policy. One adjusts for the fact that φ may be different from π using a method called
“importance sampling,” which is described next.
Suppose we wish to estimate vπ for a policy π, but all we have is a sample path under another policy φ.
Let the sequence of observations be {Xt , Ut , Wt+1 }Tt=0 . It is assumed that the policy φ is probabilistic and
“dominates” π. That is,
Pr{π(xi ) = uk } > 0 =⇒ Pr{φ(xi ) = uk } > 0. (4.3)
The easiest way to satisfy (4.3) is to choose φ ∈ Πp (a probabilistic policy) such that the right side of the
equation is always positive, that is
Pr{φ(xi ) = uk } > 0 ∀xi ∈ X , uk ∈ U.
If we had a sample path under the policy π, then we could estimate Vπ (xi ) for each fixed state xi ∈ X
as follows: Take all episodes that contain xi , and discard the initial part of the episode that happens before
the first occurrence of xi . For example, suppose a Markov process has the state space {B, C, D, A} where A
is an absorbing state, and B is the state of interest. Suppose there are three episodes, namely
E1 = CBDBCDA, E2 = CDDCA, E3 = BCDBCA.
Then we ignore E2 and discard the first part of E1 . This gives
Ē1 = BDBCDA, Ē2 = BCDBCA.
Now back to the discussion. After discarding sample paths that do not contain x_i and the initial parts of
the sample paths that do contain x_i, we concatenate them while keeping markers of where one sample path
ends and the next one begins. Let J(x_i) denote the set of distinct time instants when X_t = x_i. For
first-time estimates, we count only the occasions when a sample path starts with x_i, whereas for everytime
estimates, we count all occasions when X_t = x_i. For instance, in the example above, with x_i = B, for
first-time estimates we choose J(B) = {1, 7}, while for everytime estimates, we choose J(B) = {1, 3, 7, 10}.
In either case, we compute

  V̂_π(x_i) = (1/|J(x_i)|) Σ_{t∈J(x_i)} G_t,        (4.4)

where

  G_t = Σ_{l=0}^{T−t} γ^l W_{t+l}

is the discounted reward.
Now we describe importance sampling. In case π is a probabilistic policy, the likelihood of the observed
state-action sequence is

  Pr{U_t, X_{t+1}, U_{t+1}, · · · , X_T | X_t, U_t^{T−1} ∼ π},

where U_t^{T−1} ∼ π means that Pr{U_τ|X_τ} has the distribution π(X_τ), for t ≤ τ ≤ T − 1. This quantity can
be expressed as

  P_π = Π_{τ=t}^{T−1} π(U_τ|X_τ) Pr{X_{τ+1}|X_τ, U_τ}.
However, because the sample path is generated using the policy φ, what we can actually measure is

  P_φ = Π_{τ=t}^{T−1} φ(U_τ|X_τ) Pr{X_{τ+1}|X_τ, U_τ}.

Now note that there is a simple formula for the ratio of the two likelihoods. Specifically,

  ρ_{[t,T−1]} := P_π/P_φ = Π_{τ=t}^{T−1} π(U_τ|X_τ)/φ(U_τ|X_τ).        (4.5)
In other words, the unknown transition probabilities Pr{ Xτ +1 |Xτ , Uτ } simply cancel out. Therefore the
quantity ρ[t,T −1] can be computed because π, φ are known policies, and Uτ , Xτ can be observed.
With this background, we can modify the estimate in (4.4) in one of two possible ways. The estimate

  V̂_π(x_i) = (1/|J(x_i)|) Σ_{t∈J(x_i)} ρ_{[t,T−1]} G_t        (4.6)

is called “ordinary importance sampling,” while the estimate obtained by normalizing by the sum of the ratios instead of by |J(x_i)|, namely

  V̂_π(x_i) = Σ_{t∈J(x_i)} ρ_{[t,T−1]} G_t / Σ_{t∈J(x_i)} ρ_{[t,T−1]},

is called “weighted importance sampling.” In each case, the term ρ_{[t,T−1]} compensates for off-policy
sampling. The ordinary estimate is unbiased but can have very large variance, whereas the weighted one is biased
but consistent, and has lower variance. For further details, see [33, p. 105].
This approach can also be used to estimate Qπ on the basis of a sample path run on another policy φ;
see [33, p. 110].
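A direct implementation of the ratio (4.5) and the two off-policy estimates is sketched below. The policies are represented as dictionaries mapping a state to a probability distribution over actions; this representation and the episode format are assumptions made for the illustration. Note that if φ = π, the ratios are identically 1 and the ordinary estimate reduces to (4.4).

def is_estimates(episodes, state, pi, phi, gamma):
    """Ordinary and weighted importance-sampling estimates of V_pi(state).

    Each episode is a list of (x_t, u_t, w_{t+1}) triples generated under the
    behavior policy phi; pi[x][u] and phi[x][u] are the action probabilities.
    """
    weighted_returns, weights = [], []
    for episode in episodes:
        for t, (x, _, _) in enumerate(episode):
            if x != state:
                continue                      # everytime variant: use every visit to `state`
            rho, G = 1.0, 0.0
            for k in range(len(episode) - 1, t - 1, -1):
                xk, uk, wk = episode[k]
                G = wk + gamma * G            # discounted return from time t onwards
                rho *= pi[xk][uk] / phi[xk][uk]   # likelihood ratio as in (4.5)
            weighted_returns.append(rho * G)
            weights.append(rho)
    ordinary = sum(weighted_returns) / len(weighted_returns)
    weighted = sum(weighted_returns) / sum(weights)
    return ordinary, weighted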
Then
Vφ (xi ) ≥ Vπ (xi ), ∀xi ∈ X . (4.10)
Moreover, suppose there is a state xi ∈ X such that (4.9) holds with strict inequality, that is
Then there is a state xj ∈ X such that (4.10) holds with strict inequality, that is,
Now look at the second term on the right side, without the γ. This equals
Noting that
Vπ (Xt+2 ) = Qπ (Xt+2 , π(Xt+2 )) ≤ Qπ (Xt+2 , φ(Xt+2 )),
we can repeat the above reasoning. This gives, for every integer l,

  V_π(x_i) ≤ E_φ[ Σ_{i=0}^{l−1} γ^i R(X_{t+i}) | X_t = x_i ] + γ^l E_φ[ Q_π(X_{t+l}, φ(X_{t+l})) | X_t = x_i ].
As l → ∞, the second term approaches zero, while the first term approaches Vφ (xi ). Hence (4.10) follows.
The statements about strict inequalities are easy to prove and are left as an exercise.
The policy improvement theorem suggests the following “greedy” approach to finding an optimal policy.
Suppose that we have an initial policy and corresponding action-value function Qπ . We can define a new
“greedy” policy via
  k∗ := arg max_{u_k∈U} Q_π(x_i, u_k),   φ(x_i) = u_{k∗}, ∀x_i ∈ X .        (4.12)
Then (4.9) holds by construction, and it follows from Theorem 4.1 that V_φ(x_i) ≥ V_π(x_i) for all x_i ∈ X ,
or equivalently, v_φ ≥ v_π. Now we can compute the action-value function corresponding to φ and repeat.
A consequence of Theorem 4.1 is that unless equality holds for all x_i ∈ X in (4.9), we have that at least
one component of v_φ exceeds that of v_π. Now, it is easy to show that if the above greedy policy update
terminates with (4.9) holding with equality for all x_i ∈ X , then not only is v_φ = v_π, but both φ and π are optimal
policies. Another noteworthy point is that the incremental update in (4.12) can be implemented for just one
index i; in other words, the update can be done asynchronously.
Now we show how to use Theorem 4.1 to construct “ε-greedy” policies. The update rule (4.12) can initially
perform poorly when the current guess π is far from being optimal. Moreover, in reinforcement learning,
there is always a trade-off between exploration and exploitation. One way to achieve both exploration and
exploitation simultaneously is to use probabilistic policies, where for a given state x_i, every action is chosen
with positive probability.
Suppose π ∈ Πp is a probabilistic policy. Then we use the notation
Suppose π ∈ Πp is the current ε-soft policy. We can generate an updated ε-soft policy φ as follows: Define
a deterministic policy ψ ∈ Πd by
which is (4.9). In the next to last equation, we reason as follows: Define nonnegative constants λ_k that add
up to one, as follows:

  λ_k := (1/(1 − ε)) [ π(u_k|x_i) − ε/|U| ].

Then

  Σ_{u_k∈U} λ_k Q_π(x_i, u_k) ≤ max_{u_k∈U} Q_π(x_i, u_k).

Because (4.9) holds, we can apply the ε-greedy updating rule to improve the policy.
The import of Theorem 4.2 is that the above ε-greedy updating rule will eventually converge to the
optimal ε-soft policy. The ε-greedy policy and updating rule can be combined with a “schedule” for reducing
ε to zero, which would presumably converge to the optimal policy.
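A sketch of the ε-greedy construction in code: given the current estimate of Q_π, the greedy action is computed in each state, and the updated ε-soft policy places probability 1 − ε + ε/|U| on it and ε/|U| on every other action. This is the standard construction; the array layout below is an assumption made for the illustration.

import numpy as np

def epsilon_greedy(Q, eps):
    """Return an epsilon-soft policy phi from an action-value table Q.

    Q has shape (n_states, n_actions); phi[i, k] = Pr{action u_k in state x_i}.
    """
    n, m = Q.shape
    phi = np.full((n, m), eps / m)
    greedy = np.argmax(Q, axis=1)
    phi[np.arange(n), greedy] += 1.0 - eps
    return phi

Q = np.array([[1.0, 0.2], [0.0, 0.4], [2.0, 2.5], [0.3, 0.1]])   # hypothetical Q values
print(epsilon_greedy(Q, 0.1))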
where
Rt+τ +1 = R(Xt+τ , Ut+τ )
is the reward, but paid at time t + τ + 1. If the reward is random, then the expectation includes this
randomness also. Now (4.14) can be expanded as

  V(x_i) = R_{t+1} + E[ Σ_{τ=1}^{∞} γ^τ R_{t+τ+1} | X_t = x_i ].
With this background, we can think of Monte Carlo simulation as a way to approximate V (xi ) using
the formula (4.14). Specifically, suppose an episode passes through the state of interest x_i at time t and
terminates in an absorbing state at time T. Then the quantity

  V̂(x_i) = Σ_{τ=0}^{T−t} γ^τ R_{t+τ+1}
where the symbol ∼ denotes that the random variables on both sides of the formula are the same. Check
this statement, and if necessary, elaborate. So, if v̂ ∈ Rn is a current guess for the value vector at time
t, then we can take V̂ (Xt ) as a proxy for V (Xt ), Wt+1 as a proxy for Rt+1 , and V̂ (Xt+1 ) as a proxy for
V (Xt+1 ). Therefore the error
δt+1 = Wt+1 + γ V̂t (Xt+1 ) − V̂t (Xt ), (4.17)
gives a measure of how erroneous the estimate V̂ (xi ) is. Call this the “Bellman gradient” and give an
interpretation. But it does not tell us how far off the estimates V̂(x_j), x_j ≠ x_i, are. So we choose a
predetermined sequence of step sizes {αt }, and adjust the estimate v̂ as follows:
  V̂_{t+1}(x_j) = { V̂_t(x_j) + α_t δ_{t+1},   if X_t = x_j,
                 { V̂_t(x_j),                 otherwise.        (4.18)
Note that only the component of v̂ corresponding to the current state at time t is adjusted, while the
remaining components are left unaltered.
Note that if v̂t = v, the true value vector, then
Thus δt+1 is the “temporal difference” at time t + 1 between what we expect to see and what we actually
see. If that difference is not zero, we correct the corresponding component of V̂ by moving in that direction.
Then the estimated value vector v̂t converges to the true value vector v almost surely.
Proof. (See [35, Section 3.1.1].) To analyze the behavior of TD learning, define the map F : R^n → R^n as
F y := r + (γA − I_n)y, where A is the state transition matrix of the unknown Markov process. Then (4.15) can be rewritten in
vector notation as
v = r + γAv,
where v is the unknown value vector. With this reformulation, it can be seen that TD learning is just
asynchronous stochastic approximation with the function f : Rn → Rn equal to F . The conditions for the
asynchronous version of stochastic approximation to converge are broadly similar to those in Theorem ??.
Specifically, it suffices to establish that the differential equation
ẏ = F y = r + (γA − In )y (4.22)
is globally asymptotically stable around the equilibrium y = v, the true value function, and moreover, there
is a Lyapunov function that satisfies the conditions of (??) and (??). (Note that the V in those equations is
the Lyapunov function, and not the value function.) But this is immediate. Note that the equation (4.22) is
linear. Moreover, because ρ(γA) = γ < 1, all eigenvalues of γA − In have negative real parts. So the global
asymptotic stability of this equilibrium can be established using a quadratic Lyapunov function, so that (??)
and (??) are automatically satisfied. Therefore v̂t → v as t → ∞ almost surely.
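For concreteness, the TD(0) recursion (4.17)–(4.18) can be implemented in a few lines. In the sketch below, the Markov reward process itself is a hypothetical example, used only to generate the stream (X_t, W_{t+1}); the per-state step-size rule is likewise an illustrative choice.

import numpy as np

rng = np.random.default_rng(6)
n = 4
A = np.array([[0.20, 0.40, 0.40, 0.00],
              [0.10, 0.30, 0.30, 0.30],
              [0.50, 0.00, 0.25, 0.25],
              [0.25, 0.25, 0.25, 0.25]])
r = np.array([1.0, 0.0, 2.0, -1.0])        # reward received upon leaving each state (assumed)
gamma = 0.9
v_true = np.linalg.solve(np.eye(n) - gamma * A, r)   # solves v = r + gamma * A v

v_hat = np.zeros(n)
counts = np.zeros(n)
x = 0
for t in range(300000):
    x_next = rng.choice(n, p=A[x])
    w = r[x]                                          # W_{t+1}
    counts[x] += 1
    alpha = 1.0 / counts[x]                           # per-state step size
    delta = w + gamma * v_hat[x_next] - v_hat[x]      # temporal difference (4.17)
    v_hat[x] += alpha * delta                         # update (4.18)
    x = x_next

print(np.round(v_hat - v_true, 2))                    # entries close to zero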
Note that the recursion formula for TD learning does not require the trajectory to be an episode. There-
fore, for a given sample path of length T , the TD updates operate T times, whereas MC updates operate
only as many times as the number of episodes contained in the sample path. Indeed, aside from the fact that
v̂ is updated at every time instant, this is one more advantage, namely, that partial episodes are also useful.
Moreover, TD learning can also (apparently) be used with Markov processes that do not have an absorbing
state. However, in case the Markov process does have an absorbing state and the sample path corresponds
to an episode, it is possible to derive a useful formula that can be used to relate Monte Carlo simulation to
TD learning. Specifically, the Monte Carlo return over an episode can be expressed as a sum of temporal
differences. Define, as before,
$$G_t := \sum_{\tau=0}^{T-t} \gamma^\tau W_{t+\tau+1} \qquad (4.23)$$
to be the total return over an episode starting at time $t$ and ending at $T$. Then $G_t$ satisfies the recursion
$$G_t = W_{t+1} + \gamma G_{t+1}.$$
So we can write
$$G_t - \hat{V}(X_t) = W_{t+1} + \gamma G_{t+1} - \hat{V}(X_t) = \delta_{t+1} + \gamma [G_{t+1} - \hat{V}(X_{t+1})], \qquad (4.24)$$
where $\delta_{t+1}$ is the temporal difference defined in (4.17). Now repeat until the end of the episode, when both $G_{T+1}$ and $\hat{V}(X_{T+1})$ are zero (because $X_T$ is an absorbing state). This leads to
$$G_t - \hat{V}(X_t) = \sum_{\tau=0}^{T-t} \gamma^\tau \delta_{t+\tau+1}. \qquad (4.25)$$
In TD learning, it is not necessary to look “only” one step ahead. It is possible to derive an “l-step
look-ahead” TD predictor. Note that if we define
$$G_t^{t+l} := \sum_{\tau=0}^{l-1} \gamma^\tau R_{t+\tau+1}, \qquad (4.26)$$
then $G_t^{t+l}$ is the discounted return accumulated over the next $l$ steps. (Note that the one-step look-ahead predictor corresponds to $l = 1$.) For $l \geq 1$, the analog of (4.16) is
$$V(X_t) \sim G_t^{t+l} + \gamma^l V(X_{t+l}). \qquad (4.27)$$
So, given an estimated value vector $\hat{v}$, we observe that if $\hat{v} = v$, the true value vector, then
$$E\left[ G_t^{t+l} + \gamma^l \hat{V}(X_{t+l}) - \hat{V}(X_t) \,\Big|\, X_t \right] = 0. \qquad (4.28)$$
Let us define
$$\delta_{t+1}^{t+l+1} := G_t^{t+l} + \gamma^l \hat{V}(X_{t+l}) - \hat{V}(X_t). \qquad (4.29)$$
The extra ones in the subscript and the superscript are to ensure that when $l = 1$, the quantity $\delta_{t+1}^{t+l+1}$ coincides with $\delta_{t+1}$ as defined in (4.17). In view of (4.28), we can think of $\delta_{t+1}^{t+l+1}$ as an $l$-step temporal difference. The analogous updating rule for this $l$-step TD learning is
$$\hat{V}_{t+1}(x_j) = \begin{cases} \hat{V}_t(x_j) + \alpha_t \delta_{t+1}^{t+l+1}, & \text{if } X_{t+1} = x_j, \\ \hat{V}_t(x_j), & \text{otherwise.} \end{cases} \qquad (4.30)$$
The limiting behavior of first-time Monte Carlo and of TD is quite different. Suppose that the duration of the data is $T$, consisting of $K$ episodes of durations $T_1, \cdots, T_K$ respectively. For each episode $k$, form the first-time estimate $G_k(x_i)$ for each state $x_i \in \mathcal{X}$. If the state $x_i$ does not occur in episode $k$, then this term is set equal to zero, which is functionally equivalent to omitting it. Then
$$\hat{V}_{MC}(x_i) \to \arg\min_{V(x_i)} \sum_{k=1}^{K} [G_k(x_i) - V(x_i)]^2, \qquad (4.31)$$
that is, the Monte Carlo estimate converges to the least-squares fit to the observed returns. In contrast, batch TD(0) converges to the value function of the maximum-likelihood (certainty-equivalence) model of the Markov process constructed from the observed data. As a result, TD methods converge more quickly than Monte Carlo methods. These statements are made in [33, p. 128].
However, for the case of probabilistic policies, it is advantageous to write the recursion as
Given a sample path {(Xt , Ut , Wt+1 )}t≥0 , start with some initial guess Q̂0 for the action-value function.
Observe that, if at time t, the estimate Q̂t,π were to be correct, then
It can be surmised that, if $\{\alpha_t\}$ approaches zero as per the conditions (4.20), then $Q_{t,\pi}$ converges almost surely to $Q_\pi$. However, this requires that every possible pair $(x_i, u_k) \in \mathcal{X} \times \mathcal{U}$ must be visited infinitely often by the sample path, which is impossible if $\pi$ is a deterministic policy. Hence, in order to apply the above method to a deterministic policy, one should use an $\epsilon$-soft version of $\pi$.
All this produces the action-value function for one fixed policy. However, it is possible to combine this with an $\epsilon$-greedy improvement approach to converge to an optimal policy. Specifically, we can update the policy $\pi$ using an $\epsilon$-greedy approach on the current policy, as before: Define
where all policies are functions of $t$, but the dependence is not explicitly displayed in the interests of clarity. Now we can reduce $\epsilon$ to zero over time. Then perhaps $\phi \to \pi^*$ and $\hat{Q} \to Q^*$ as $t \to \infty$. However, this is just
a hope. The Q-learning approach given next turns this hope into a reality by adopting a slightly different
approach to updating Q.
4.2.3 Q-Learning
A substantial advance in RL came in the paper [47]. In this paper, the authors propose an iterative scheme
for learning the optimal action-value function Q∗ (xi , uk ), by starting with an arbitrary initial guess, and
then updating the guess at each time instant.
Recall the following definitions and facts from Section 2.2. The action-value function is defined as
follows, for a given policy π (cf. (2.33)):
"∞ #
X
Qπ (xi , uk ) := Eπ γ t Rπ (Xt )|X0 = xi , U0 = uk .
t=0
If the return R is a random function of Xt , Ut , then the first term on the right side would be the expected
value of the reward R(Xt , Ut ). The optimal action-value function Q∗ satisfies the relationships
$$Q^*(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} \max_{w_l \in \mathcal{U}} Q^*(x_j, w_l), \quad \forall (x_i, u_k) \in \mathcal{X} \times \mathcal{U}.$$
The recursion relationship for Q∗ suggests the following iterative scheme for estimating this function.
Start with some arbitrary function Q0 : X × U → R. Choose a sequence of positive step sizes {αt }t≥0
satisfying
$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty. \qquad (4.36)$$
As the time series $\{(X_t, U_t, W_{t+1})\}_{t \geq 0}$ is observed, update the function $Q_t$ as follows:
$$Q_{t+1}(x_j, w_l) = \begin{cases} Q_t(x_j, w_l) + \alpha_{t+1}\left[ W_{t+1} + \gamma V_t(x_j) - Q_t(x_j, w_l) \right], & \text{if } (X_{t+1}, U_{t+1}) = (x_j, w_l), \\ Q_t(x_j, w_l), & \text{otherwise,} \end{cases} \qquad (4.37)$$
where
$$V_t(x_j) = \max_{u_l \in \mathcal{U}} Q_t(x_j, u_l). \qquad (4.38)$$
In other words, if $(X_{t+1}, U_{t+1}) = (x_j, w_l)$, then the corresponding estimate $Q_{t+1}(x_j, w_l)$ is updated, but the estimates $Q_t(x_i, u_k)$ for all other pairs $(x_i, u_k) \neq (x_j, w_l)$ are not updated.
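The following is a minimal tabular sketch of the Q-learning recursion (4.37)-(4.38), assuming a hypothetical simulator `step(x, u)` that returns the next state and the reward; an epsilon-greedy behaviour policy is used so that all state-action pairs keep being visited. All names and parameters are illustrative, not taken from the text.

```python
import numpy as np

def q_learning(step, n_states, n_actions, gamma, alphas, x0, T, eps=0.1):
    """Sketch of the Q-learning scheme (4.36)-(4.38); names are illustrative."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))          # arbitrary initial guess Q_0
    x = x0
    for t in range(T):
        if rng.random() < eps:                   # explore
            u = int(rng.integers(n_actions))
        else:                                    # exploit the current estimate
            u = int(np.argmax(Q[x]))
        x_next, w = step(x, u)
        target = w + gamma * np.max(Q[x_next])   # reward plus gamma * max_u Q_t
        Q[x, u] += alphas[t] * (target - Q[x, u])  # update only the visited pair
        x = x_next
    return Q
```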
The convergence of the above scheme is analyzed in the next theorem.
Theorem 4.4. (See [47, p. 282].) Suppose there exists a finite constant $R_M$ such that $|R(X_t, U_t)| \leq R_M$ for all $t$, that the step sizes satisfy (4.36), and that every pair $(x_i, u_k) \in \mathcal{X} \times \mathcal{U}$ is visited infinitely often by the sample path. Then $Q_t(x_i, u_k) \to Q^*(x_i, u_k)$ almost surely as $t \to \infty$, for every pair $(x_i, u_k)$.
One of the main advantages of Q-learning over learning the value function $V_\pi$ is the following: If $(X_t, U_t) = (x_i, u_k)$, then the quantity
$$R(x_i, u_k) + \gamma \max_{w_l \in \mathcal{U}} Q(X_{t+1}, w_l)$$
can be computed from the observed transition alone, without any knowledge of the state transition matrices; moreover, once $Q^*$ is known (at least approximately), an optimal action at state $x_i$ can be read off directly as $\arg\max_{u_k \in \mathcal{U}} Q^*(x_i, u_k)$.
Problem 4.1. The objective of this problem is to analyze the l-step look-ahead Temporal Difference Learn-
ing rule proposed in (4.30). Suppose we have a Markov decision process with a prespecified policy π (deter-
ministic, or probabilistic, it does not matter).
Start with (2.32) for the value vector associated with the policy $\pi$, namely
$$v = r + \gamma A v,$$
where we have omitted $\pi$ as it is anyway fixed. Show that the following $l$-step look-ahead version of the above equation is also true:
$$v = \left( \sum_{i=0}^{l-1} \gamma^i A^i \right) r + \gamma^l A^l v.$$
Show that (4.30) is a stochastic approximation approach to solving the above equation.
Hence establish that if the conditions in (4.20) hold, then the l-step look-ahead TD learning converges
almost surely to the true value function.
Hint: If $A$ is a row-stochastic matrix, $\gamma < 1$, and $l$ is a positive integer, show that all eigenvalues of the matrix
$$\gamma^l A^l - I_n$$
have negative real parts.
Problem 4.2. Generate a 20 × 20 row-stochastic matrix A, and a 20 × 1 reward vector r, in any manner
you wish.
Compute the actual value function for the A and r you have generated using (2.32).
Generate a sample path of length 200 time steps starting from some arbitrary initial state. Compute the associated rewards as a part of the sample path. (Note that the action is not required as
the policy remains constant.)
Choose the discount factor γ = 0.95 and the look-ahead step length l = 5 or l = 10. Using the sample
path {(Xt , Wt+1 )} generated above, estimate the value function using conventional TD learning, and
l-step TD-learning with l = 5 and l = 10. Plot the Euclidean norm of the error (difference between
the true and estimated value vector) as a function of the iteration number, for each of these three
approaches. What conclusions do you draw, if any?
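A minimal sketch of the experiment in Problem 4.2 is shown below, under the convention that the reward paid at time $t+1$ is $R(X_t)$. The generated matrix, reward vector, constant step size, and seeds are all arbitrary illustrative choices, and $l = 1$ plays the role of conventional one-step TD learning.

```python
import numpy as np

def random_stochastic_matrix(n, rng):
    """Generate an n x n row-stochastic matrix in an arbitrary manner."""
    A = rng.random((n, n))
    return A / A.sum(axis=1, keepdims=True)

def l_step_td(A, r, gamma, l, T, alpha=0.05, rng=None):
    """l-step TD estimate of the value vector, per (4.26)-(4.30); l = 1
    recovers ordinary one-step TD.  Returns the estimate and error history."""
    rng = rng or np.random.default_rng(0)
    n = len(r)
    v_true = np.linalg.solve(np.eye(n) - gamma * A, r)   # (2.32): v = r + gamma A v
    v_hat = np.zeros(n)
    errors = []
    states = [int(rng.integers(n))]                      # sample path of the chain
    for _ in range(T + l):
        states.append(int(rng.choice(n, p=A[states[-1]])))
    for t in range(T):
        G = sum(gamma**tau * r[states[t + tau]] for tau in range(l))
        delta = G + gamma**l * v_hat[states[t + l]] - v_hat[states[t]]
        v_hat[states[t]] += alpha * delta
        errors.append(np.linalg.norm(v_hat - v_true))
    return v_hat, errors

rng = np.random.default_rng(1)
A = random_stochastic_matrix(20, rng)
r = rng.random(20)
for l in (1, 5, 10):
    _, err = l_step_td(A, r, gamma=0.95, l=l, T=200, rng=np.random.default_rng(2))
    print(f"l = {l:2d}: final error {err[-1]:.4f}")
```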
Chapter 5
Parametric Approximation Methods
Until now we have presented various iterative methods for computing the value function associated with a
prespecified policy, and for recursively improving a starting policy towards an optimum. These methods
converge asymptotically to the right answer, as the number of episodes (for Monte Carlo methods) or the
number of samples (for Temporal Difference and Q-learning methods) approaches infinity. So in principle
one could truncate these methods when some termination criterion is met, such as the change from one
iteration to the next being smaller than some prespecified threshold. However, the emphasis is still finding
the exact value function, or the exact optimal policy.
This approach can be used to find the value function if the size n of the state space |X | is sufficiently
small. Similarly, this approach can be used to find the action-value function if the product nm of the size
of the state space |X | and the size of the action space |U| is sufficiently small. However, in many realistic
applications, these numbers are too large. Note that the value function V : X → R is an n-dimensional
vector. Hence we can identify the “value space” with Rn . Similarly, the policy π can be associated with a
matrix $B \in \mathbb{R}^{n \times m}$, where $b_{ij} = \Pr\{\pi(x_i) = u_j\}$. The associated action-value function $Q : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ can be thought of as an $nm$-dimensional vector. In principle, every vector in $\mathbb{R}^n$ can be the value vector
of some Markov reward process, and every vector in Rnm can be the action-value function of some Markov
Decision Process.
Therefore, when these numbers are too large, it becomes imperative to approximate the value or action-
value functions using a smaller number of parameters (smaller than n for value functions and smaller than
nm for action-value functions and policies). The present chapter is devoted to a study of such approximation
methods. We begin with methods for approximating the value function. Then we can ask: Is it possible to find directly an approximation to the optimal policy using a suitable set of basis functions? This question lies at
the heart of “policy gradient” methods and is studied in Section 5.3. The solution is significantly facilitated
by a result known as the “policy gradient theorem.” Proceeding further, we can attempt simultaneously to
approximate both the policy function and the value function over a common set of basis functions. This is
known as the “actor-critic” approach and is studied in Section 5.3.
5.1 Value Approximation Methods
Suppose that $\pi$ is a fixed policy, and let $v_\pi$ denote the true $n$-dimensional value vector corresponding to the policy $\pi$. Suppose we try to approximate $v_\pi$ by another vector of the form $\hat{v}_\pi(\theta)$, where $\theta \in \mathbb{R}^d$ is a vector of adjustable parameters, and $d \ll n$. Thus $\hat{v}_\pi : \mathbb{R}^d \to \mathbb{R}^n$,
where n = |X |. Depending on the context, we can also think of v̂ as a collection of n maps, indexed by the
state xi , where each V̂ (xi , ·) : Rd → R. In general it is not possible to choose the parameter vector θ so
as to make v̂π (θ) equal to vπ . In the approaches suggested in Chapter 4, the estimate of Vπ (xi ) did not
depend on the estimate of Vπ (xj ) for xj 6= xi . However, in the present instance, since our aim is to estimate
all values Vπ (xi ) simultaneously, we need to define a suitable error criterion.
Suppose we have a sample path $\{X_t\}_{t=0}^{T}$ that is used to estimate $V_\pi$. A natural choice is the least-squares criterion given by
$$E(\theta) := \frac{1}{2(T+1)} \sum_{t=0}^{T} [V_\pi(X_t) - \hat{V}(X_t, \theta)]^2.$$
The above error criterion suggests that an error at any one time t is as significant as an error at any other
time τ . The above error criterion can be rewritten as
$$E(\theta, \mu) := \frac{1}{2} \sum_{x_i \in \mathcal{X}} \mu_i [V_\pi(x_i) - \hat{V}_\pi(x_i, \theta)]^2, \qquad (5.1)$$
where
$$\mu_i := \frac{1}{T+1} \sum_{t=0}^{T} I_{\{X_t = x_i\}}$$
is the fraction of times that Xt = xi , the state of interest, in the sample path. With this definition of the
coefficients µi , the error criterion can be further rewritten as
$$E(\theta, \mu) := \frac{1}{2} [v_\pi - \hat{v}_\pi(\theta)]^\top M [v_\pi - \hat{v}_\pi(\theta)], \qquad (5.2)$$
where M ∈ Rn×n is the diagonal matrix with the µi as its diagonal elements.
This raises the question as to what the coefficients $\mu_i$ should be. Clearly, in a Markov process, the seriousness of an error in estimating $V_\pi(x_i)$ should depend on how likely the state $x_i$ is to occur in a sample path. Thus it would be reasonable to choose the coefficient vector $\mu$ as the stationary distribution
of the Markov process. However, there are two difficulties with this approach. The first difficulty is that
we do not know what this stationary distribution is. So we try to construct an estimate of the stationary
distribution, before we start the process of approximating the unknown value vector vπ . The second difficulty
is that if the Markov process has one or more absorbing states, then the above approach would be meaningless
as shown below.
If we assume that, under the policy π, the resulting Markov process is irreducible (see Section 8.2), then
there is a unique stationary distribution µ of the process. Moreover, µ > 0, i.e., every component µi is
positive. The question is how to estimate this stationary distribution. Theorem 8.8 and specifically (8.32)
provide an approach. For any function f : X → R, we have that
$$E[f, \mu] = \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} f(X_\tau). \qquad (5.3)$$
Now define a function fi : X → R by fi (xi ) = 1, and fi (xj ) = 0 if j 6= i. Then it is easy to see that the
expected value E[fi , µ] = µi . Moreover, (5.3) implies that
$$\mu_i = E[f_i, \mu] = \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} I_{\{X_\tau = x_i\}}. \qquad (5.4)$$
Thus, in a sample path, we count what fraction of the observed states are equal to the state of interest $x_i$. As $t \to \infty$, this fraction
converges almost surely to µi . Hence, if we take any sample path (or collection of sample paths), then the
fraction of occurrences of xi in the sample path is a good approximation to µi . These estimates can be used
in defining the error criterion in (5.1).
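A minimal sketch of this estimation procedure is given below: the visit frequencies along a simulated path are compared with the exact stationary distribution obtained from $\mu^\top A = \mu^\top$. The two-state chain and all numbers are made up purely for illustration.

```python
import numpy as np

def empirical_state_distribution(path, n_states):
    """Estimate mu_i as the fraction of time the path spends in each state, per (5.4)."""
    counts = np.bincount(path, minlength=n_states)
    return counts / len(path)

A = np.array([[0.1, 0.9], [0.5, 0.5]])        # made-up irreducible chain
rng = np.random.default_rng(0)
path = [0]
for _ in range(200_000):
    path.append(int(rng.choice(2, p=A[path[-1]])))
mu_hat = empirical_state_distribution(np.array(path), 2)

# Exact stationary distribution: left eigenvector of A for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(A.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
print(mu_hat, mu / mu.sum())
```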
Much of reinforcement learning is based on Markov processes with absorbing states, and using episodes
that terminate in an absorbing state. It is obvious that a Markov process with an absorbing state is not
irreducible. Specifically, there is no path from an absorbing state to any other state. Therefore Theorem 8.8
does not apply. Let us change notation and suppose that a Markov process has nonabsorbing states $\mathcal{S}$ and absorbing states $\mathcal{A}$. Then it is easy to see that the state transition matrix $A$ looks like
$$A = \begin{bmatrix} A_{11} & A_{12} \\ 0 & I_{|\mathcal{A}|} \end{bmatrix}. \qquad (5.5)$$
For example, if there is only one absorbing state, the bottom right corner is just a single element equal to 1. In this case it can be shown that $\rho(A_{11}) < 1$ (see [46, Chapter 4]). Therefore there is a unique stationary distribution, with a 1 in the last component and zeros elsewhere. If $|\mathcal{A}| > 1$, then it is again true that $\rho(A_{11}) < 1$, but now there are infinitely many stationary distributions; see Problem 5.1. Even if it is assumed that there is only one absorbing state (for the policy under study), it is obvious that trying to use the stationary distribution to assign the weights $\mu_i$ would be meaningless, because the weights would all equal zero for nonabsorbing states. A reasonable approach is to set $\mu_i$ equal to the fraction of states $X_t$ in a "typical" sample path that
equal xi , before the sample path hits an absorbing state. For that purpose, the following theorem is useful.
Theorem 5.1. Suppose a reducible Markov process with absorbing states has a state transition matrix of
the form (5.5). Suppose an initial state X0 ∈ S is chosen in accordance with a probability distribution φ.
Then the fraction of states in S along all nonterminal paths (that is, paths until the time when they hit an
absorbing state) has the distribution µ given by
$$\mu^\top = \phi^\top (I_{|\mathcal{S}|} - A_{11})^{-1}. \qquad (5.6)$$
Proof to be supplied later.
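As a quick numerical illustration of (5.6), the sketch below computes $\phi^\top(I - A_{11})^{-1}$ for a small made-up absorbing chain, together with its normalized version (the fraction of pre-absorption time spent in each nonabsorbing state). The matrix and the initial distribution are hypothetical.

```python
import numpy as np

# Two transient states and one absorbing state; numbers are made up.
A11 = np.array([[0.2, 0.5],
                [0.4, 0.3]])          # transitions among nonabsorbing states S
phi = np.array([0.5, 0.5])            # initial distribution on S

mu = phi @ np.linalg.inv(np.eye(2) - A11)   # right side of (5.6)
print(mu)             # expected number of visits to each transient state
print(mu / mu.sum())  # normalized: fraction of pre-absorption time per state
```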
This theorem still leaves open the question of how to choose the initial distribution φ.
Either way, once the weight vector µ has been chosen, the next step is to minimize the error function
E defined in (5.1). This is known as “value approximation.” Before discussing how value approximation is
achieved, let us digress to discuss whether this is the “right” problem to solve.
Once we have minimized the error in (5.1), we have an approximate value function v̂π (θ) for a fixed
policy π. However, the objective in MDPs is to choose an optimal or nearly optimal policy. Therefore, in
order to use the approximate value vectors v̂π to guide this choice, we must construct such approximations
for all policies π. Moreover, we must ensure that v̂π is a uniformly good approximation over all possible
policies. In other words, the error in (5.1) needs to be minimized for all policies $\pi$. Even if we restrict to deterministic policies, the cardinality of $\Pi_d$ is $m^n = |\mathcal{U}|^{|\mathcal{X}|}$, which can become very large very quickly. Even otherwise, just because $\hat{v}_\pi$ is uniformly close to $v_\pi$ for all $\pi$, it does not follow that the minimizer of $\hat{v}_\pi$ over $\pi$ is anywhere close to the minimizer of $v_\pi$ with respect to $\pi$. Therefore we can ask: Is it possible to find directly an approximation to the optimal policy using a suitable set of basis functions? This question lies at
the heart of “policy gradient” methods and is studied in Section 5.3. The solution is significantly facilitated
by a result known as the “policy gradient theorem.” Proceeding further, we can attempt simultaneously to
approximate both the policy function and the value function over a common set of basis functions. This is
known as the “actor-critic” approach and is studied in Section 5.3.
To compute the gradient of E with respect to θ, which is the variable of optimization, set θ ← θ + ∆θ.
Then
$$\hat{v}(\theta + \Delta\theta) \approx \hat{v}(\theta) + \nabla_\theta \hat{v}(\theta) \Delta\theta.$$
Substituting this into the expression for $E$ and retaining only first-order terms shows that
$$E(\theta + \Delta\theta) \approx E(\theta) - [v_\pi - \hat{v}_\pi(\theta)]^\top M \nabla_\theta \hat{v}(\theta) \Delta\theta,$$
so that the gradient is $\nabla_\theta E(\theta) = -[\nabla_\theta \hat{v}(\theta)]^\top M [v_\pi - \hat{v}_\pi(\theta)]$.
Now let us discuss what form the function v̂(θ) may take. There are two natural categories, namely:
linear and nonlinear. In the linear case, there is a set of linearly independent vectors $\psi_l \in \mathbb{R}^n$ for $l = 1, \ldots, d$, where $d$ is the number of weights. (Recall that $\theta \in \mathbb{R}^d$.) Define the matrix $\Psi := [\psi_1 \ \cdots \ \psi_d] \in \mathbb{R}^{n \times d}$, so that the approximation is
$$\hat{v}(\theta) = \Psi \theta, \qquad (5.10)$$
or, componentwise,
$$\hat{V}(x_i, \theta) = \Psi_{x_i} \theta,$$
where $\Psi_{x_i}$ denotes the $x_i$-th row of the matrix $\Psi$. With this in mind, let us define the error
$$E(x_i, \theta) = \frac{1}{2} \mu_i [V(x_i) - \hat{V}(x_i, \theta)]^2. \qquad (5.12)$$
Now suppose, as we have been doing, that a sample path {(Xt , Ut , Wt+1 )}t≥0 is given. We wish to find the
weight vector $\theta$ that (nearly) minimizes the error criterion in (5.1). In what follows, we make a distinction
between so-called gradient methods (or updates), and so-called semi-gradient methods (or updates).
We begin with gradient methods, which are based on Monte Carlo simulation, and thus require the
Markov process to have one or more absorbing states. The sample path is also required to terminate in an
absorbing state at some time $T$. As before, define the accumulated discounted return
$$G_{t+1} = \sum_{\tau=0}^{T-t-1} \gamma^\tau W_{t+\tau+1}, \quad t \leq T.$$
The iterations are started off at time $t = 0$ with some initial weight vector $\theta_0$. Then for each time $t$ in $0, \cdots, T-1$, we adjust $\theta_t$ as follows:
$$\theta_{t+1} = \theta_t + \alpha_t \mu_{X_t} [G_{t+1} - \hat{V}(X_t, \theta_t)] \nabla_\theta \hat{V}(X_t, \theta_t). \qquad (5.13)$$
Here $\{\alpha_t\}$ is a predetermined sequence of step sizes that can either be constant or decrease slowly to zero.
However, if we are to interpret (5.13) as moving θ t in the direction of the gradient of the error function
E(Xt , θ) defined in (5.12), then the presence of the coefficient µXt is essential.
The semi-gradient updating method is reminiscent of the Temporal Difference method. We proceed as
follows: Start at time t = 0 at any initial state X0 . Generate a sample path {(Xt , Ut , Wt+1 )} by choosing
Ut ∼ π(·|Xt ) in case the policy is probabilistic, and Ut = π(Xt ) if the policy is deterministic. Observe Wt+1
and Xt+1 . Repeat. Start with any initial vector θ 0 . On the basis of this sample path, update the weight
vector θ t at each time instant as follows:
θ t+1 = θ t + αt+1 [Rt+1 + γ V̂ (Xt+1 , θ t ) − V̂ (Xt , θ t )]∇θ V̂ (Xt , θ t ). (5.14)
If the value approximation function v̂(θ) is linear as in (5.10), then it is easy to see that
V̂ (Xt , θ) = ΨXt θ.
Therefore
$$\nabla_\theta \hat{V}(X_t, \theta) = (\Psi_{X_t})^\top,$$
where $\Psi_{X_t}$ denotes the $X_t$-th row of the matrix $\Psi$. If we define
$$y_t = (\Psi_{X_t})^\top, \qquad (5.15)$$
that is, the transpose of row $X_t$ of the matrix $\Psi$, then we can write
$$\hat{V}(X_t, \theta) = \langle y_t, \theta \rangle = y_t^\top \theta, \qquad \nabla_\theta \hat{V}(X_t, \theta) = y_t.$$
Therefore the updating rule (5.14) becomes
$$\theta_{t+1} = \theta_t + \alpha_{t+1} (W_{t+1} + \gamma y_{t+1}^\top \theta_t - y_t^\top \theta_t) y_t. \qquad (5.16)$$
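A minimal sketch of the semi-gradient rule (5.16) with a linear approximation is given below; the inputs (a recorded sample path, its rewards, and a feature matrix `Psi`) are assumed to be given, and all names are illustrative.

```python
import numpy as np

def semi_gradient_td0(path, rewards, Psi, gamma, alphas):
    """Semi-gradient TD(0) with linear approximation, per (5.16).

    path[t] is X_t, rewards[t] is W_{t+1}, and Psi is the n x d feature
    matrix, so that y_t is the transpose of row X_t of Psi.
    """
    theta = np.zeros(Psi.shape[1])
    for t in range(len(path) - 1):
        y_t = Psi[path[t]]                       # row X_t of Psi
        y_next = Psi[path[t + 1]]                # row X_{t+1} of Psi
        delta = rewards[t] + gamma * y_next @ theta - y_t @ theta
        theta = theta + alphas[t] * delta * y_t  # update (5.16)
    return theta
```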
If the approximation function v̂(θ) is a linear map of the form Ψθ, we can establish the convergence of
the semi-gradient method by viewing it as an implementation of stochastic approximation.
Theorem 5.2. Suppose the value approximation function v̂(θ) is as in (5.10). Suppose the state transition
matrix A is irreducible, and that µ is its unique stationary distribution. Define the error function E(θ) as
in (5.1), and use the update rule (5.16). Finally, suppose the usual conditions on {αt } hold, namely
$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty. \qquad (5.17)$$
Then the sequence of estimates $\{\theta_t\}$ converges almost surely to $\theta^*$, which is the unique solution of the equation
$$C\theta^* + b = 0, \qquad (5.18)$$
where
$$C = \Psi^\top M (\gamma A - I_n) \Psi, \qquad b = E[R_{t+1} y_t], \qquad M = \mathrm{Diag}(\mu_{x_i}). \qquad (5.19)$$
To prove this theorem, we state and prove a couple of preliminary results.
Definition 5.1. A matrix $F \in \mathbb{R}^{d \times d}$ is said to be row diagonally dominant if
$$f_{ii} > \sum_{j \neq i} |f_{ij}|, \quad \forall i, \qquad (5.20)$$
which automatically implies that all diagonal elements of $F$ are strictly positive. The matrix $F$ is said to be column diagonally dominant if $F^\top$ is row diagonally dominant, or equivalently
$$f_{ii} > \sum_{j \neq i} |f_{ji}|, \quad \forall i. \qquad (5.21)$$
Finally, the matrix F is said to be diagonally dominant if F is both row and column diagonally dominant.
Lemma 5.1. Suppose $F$ is diagonally dominant. Then so is $(F + F^\top)/2$, the symmetric part of $F$.
The proof is omitted as it is obvious.
Lemma 5.2. Suppose $F$ is diagonally dominant. Then the symmetric matrix $F + F^\top$ is positive definite.
Proof. The proof is a ready consequence of Lemma 5.1 and the Gershgorin circle theorem. Recall that, for any matrix $G \in \mathbb{R}^{d \times d}$, the eigenvalues of $G$ are contained in the union of the $d$ Gershgorin circles
$$C_i := \{ z \in \mathbb{C} : |z - g_{ii}| \leq \sum_{j \neq i} |g_{ij}| \}, \quad i = 1, \cdots, d.$$
Now let $G = F + F^\top$. Then $G$ is symmetric and thus has only real eigenvalues. So the Gershgorin circles turn into Gershgorin intervals. The diagonal dominance of $G$ implies that each interval is a subset of $(0, \infty)$. Therefore all eigenvalues of $G$ are positive, and $G$ is positive definite.
Lemma 5.3. (Lyapunov Matrix Equation) Suppose C ∈ Rd×d . Then all eigenvalues of C have negative real
parts if there is a positive definite matrix P such that
$$-(C^\top P + P C) =: Q$$
is positive definite.
This is a standard result in linear control theory and the proof can be found in many places, for example
[44, Theorem 5.4.42].
At last we come to the proof of the main theorem.
Proof. (Of Theorem 5.2.) Note that $y_{t+1}^\top$ equals $\Psi_{X_{t+1}}$, which is row $X_{t+1}$ of the matrix $\Psi$. Thus the conditional expectation satisfies
$$E[y_{t+1}^\top \mid X_t] = A_{X_t} \Psi,$$
where $A$ is the state transition matrix and $A_{X_t}$ denotes its $X_t$-th row. Therefore, when $X_t$ has the stationary distribution $\mu$,
$$E[(W_{t+1} + \gamma y_{t+1}^\top \theta_t - y_t^\top \theta_t) y_t] = b + C\theta_t.$$
Now the update rule (5.16) is just stochastic approximation used to solve the linear equation
$$C\theta^* + b = 0.$$
If all eigenvalues of the matrix $C$ have negative real parts, then the differential equation
$$\dot{\theta} = C\theta + b$$
is globally asymptotically stable around its unique equilibrium θ ∗ . Thus the proof is completed once it is
established that all eigenvalues of C have negative real parts.
From Lemma 5.3 (applied with $P = I$), a sufficient condition for this is that $-(C + C^\top)$ is positive definite. Now note that
$$-(C + C^\top) = \Psi^\top (H + H^\top) \Psi,$$
where
$$H = M(I_n - \gamma A). \qquad (5.22)$$
It is now shown that $H$ is both row and column diagonally dominant. Observe first that $H$ has $\mu_i(1 - \gamma a_{ii})$ on the diagonal (which are all positive because $a_{ii} \leq 1$, $\gamma < 1$, and $\mu_i > 0$), and has $-\gamma \mu_i a_{ij}$ as the off-diagonal elements. Therefore, to establish the diagonal dominance of $H$, it is enough to show that
$$\mu_i (1 - \gamma a_{ii}) > \gamma \sum_{j \neq i} \mu_i a_{ij} \ \ \forall i, \qquad \mu_j (1 - \gamma a_{jj}) > \gamma \sum_{i \neq j} \mu_i a_{ij} \ \ \forall j.$$
For the first inequality, we make use of the fact that $A 1_n = 1_n$ due to the row-stochasticity of $A$, so that $\sum_{j \neq i} a_{ij} = 1 - a_{ii}$, and the inequality reduces to $1 > \gamma$. For the second inequality, note that
$$1_n^\top M = \mu^\top,$$
where $\mu$ is the stationary distribution and thus satisfies $\mu^\top A = \mu^\top$. Therefore $\sum_{i \neq j} \mu_i a_{ij} = \mu_j - \mu_j a_{jj}$, and this inequality also reduces to $1 > \gamma$.
Now, by Lemma 5.2, the diagonal dominance of $H$ implies that $H + H^\top$ is positive definite. The fact that $\Psi$ has full column rank $d$ implies that
$$\Psi^\top (H + H^\top) \Psi = -(C + C^\top)$$
is also positive definite. Finally, it follows from Lemma 5.3 that all eigenvalues of $C$ have negative real parts.
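The sketch below illustrates the conclusion numerically: for a randomly generated row-stochastic $A$, its stationary distribution $\mu$, and a random full-column-rank $\Psi$, the matrix $C$ of (5.19) has eigenvalues with negative real parts and $H + H^\top$ is positive definite. The dimensions and random construction are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 8, 3, 0.9

A = rng.random((n, n)); A /= A.sum(axis=1, keepdims=True)     # row-stochastic
w, V = np.linalg.eig(A.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))]); mu /= mu.sum()  # stationary distribution
M = np.diag(mu)
Psi = rng.standard_normal((n, d))                             # full column rank a.s.

C = Psi.T @ M @ (gamma * A - np.eye(n)) @ Psi                 # as in (5.19)
H = M @ (np.eye(n) - gamma * A)                               # as in (5.22)

print(np.linalg.eigvals(C).real.max())    # negative: C is Hurwitz
print(np.linalg.eigvalsh(H + H.T).min())  # positive: H + H^T is positive definite
```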
Problem 5.1. Consider a Markov process with two or more absorbing states, so that its state transition
matrix looks like
$$A = \begin{bmatrix} A_{11} & A_{12} \\ 0 & I_s \end{bmatrix},$$
where $s = |\mathcal{A}|$. Show that the set of all stationary distributions of this Markov process consists of all vectors of the form
$$\begin{bmatrix} 0_{|\mathcal{S}|} \\ \phi \end{bmatrix},$$
where
$$\phi_i \geq 0, \quad i = 1, \cdots, s, \qquad \sum_{i=1}^{s} \phi_i = 1.$$
Problem 5.2. Give an example of a Markov process that is reducible but does not have any absorbing
states.
Problem 5.3. Show that the proof of Theorem 5.2 remains valid if µ is sufficiently close to the stationary
distribution of A. State and prove a theorem to this effect.
5.2 Value Approximation via TD(λ)-Methods
This is the same as $\delta_{t+1}$ defined in (5.14). The conventional TD updating rule is
$$\theta_{t+1} = \theta_t + \alpha_{t+1} \delta_{t+1} \nabla_\theta \hat{V}(X_t, \theta_t),$$
which is the same as (5.16). To define TD(λ)-updating, choose a number $\lambda \in [0, 1]$, and define
$$\theta_{t+1} = \theta_t + \alpha_{t+1} \delta_{t+1} \left( \sum_{\tau=0}^{t} (\gamma\lambda)^{t-\tau} \nabla_\theta \hat{V}(X_\tau, \theta_\tau) \right). \qquad (5.24)$$
If we choose $\lambda = 0$ (with the convention $(\gamma\lambda)^0 = 1$ when $\lambda = 0$), then it is obvious that (5.24) becomes (5.16), the standard TD updating rule.
In principle the TD(λ) -updating rule (5.24) can be applied with any function V̂ . However, in the remain-
der of this section, we focus on the case of linear approximation, where there is a matrix Ψ ∈ Rn×d , and
V̂ (xi ) = Ψxi θ, for each xi ∈ X . Assume that Ψ has full column rank, so that none of the components of θ
is redundant. Also, define as before the vector
$$y_\tau := [\Psi_{X_\tau}]^\top \in \mathbb{R}^d.$$
With this new notation, the TD(λ)-updating rule (5.24) can be written as follows. Define the "eligibility vector"
$$z_t := \sum_{\tau=0}^{t} (\gamma\lambda)^{t-\tau} y_\tau, \qquad (5.25)$$
with the convention $z_{-1} = 0$; then
$$\theta_{t+1} = \theta_t + \alpha_{t+1} \delta_{t+1} z_t. \qquad (5.26)$$
Note that if $\lambda = 0$ then $z_t = y_t$ and (5.26) becomes (5.16). Also note that $z_t$ satisfies the recursion
$$z_t = \gamma\lambda z_{t-1} + y_t. \qquad (5.27)$$
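A minimal sketch of the TD(λ) rule in its eligibility-vector form (5.26)-(5.27) is given below; setting `lam = 0` recovers the sketch of (5.16) given earlier. Inputs and names are illustrative.

```python
import numpy as np

def td_lambda(path, rewards, Psi, gamma, lam, alphas):
    """TD(lambda) with linear value approximation, using recursion (5.27)."""
    theta = np.zeros(Psi.shape[1])
    z = np.zeros(Psi.shape[1])                  # eligibility vector, z_{-1} = 0
    for t in range(len(path) - 1):
        y_t = Psi[path[t]]
        y_next = Psi[path[t + 1]]
        delta = rewards[t] + gamma * y_next @ theta - y_t @ theta
        z = gamma * lam * z + y_t               # recursion (5.27)
        theta = theta + alphas[t] * delta * z   # update (5.26)
    return theta
```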
In [42], the convergence properties of TD(λ) -updating are studied. As one might expect, the updating
rule is interpreted as the stochastic approximation algorithm applied to solve a linear equation of the form
Cθ ∗ = b.
If the ODE
θ̇ = Cθ − b
is globally asymptotically stable, then the stochastic approximation converges to a solution of the algebraic
equation. However, the matrix $C$ is more complicated than in Section 5.1. We give a brief description of the results and direct the reader to [42] for full details.
We begin with an alternative to Theorem 2.2. That theorem states that, whenever the discount factor $\gamma < 1$, the map $y \mapsto Ty := r + \gamma A y$ is a contraction with respect to the $\ell_\infty$-norm, with contraction constant $\gamma$. The key to the proof is the fact that if $A$ is row-stochastic, then the induced norm $\|A\|_{\infty \to \infty} = 1$. Now it is shown that a similar statement holds for an entirely different norm.
Suppose A is row-stochastic and irreducible, and let µ denote its stationary distribution. Note that, by
Theorem 8.6, $\mu$ is uniquely defined and has all positive elements. Define $M = \mathrm{Diag}(\mu_i)$ and define a norm $\| \cdot \|_M$ on $\mathbb{R}^n$ by
$$\|v\|_M = (v^\top M v)^{1/2}. \qquad (5.28)$$
Then the corresponding distance between two vectors $v_1, v_2$ is given by $\|v_1 - v_2\|_M$.
Lemma 5.4. Suppose $A \in [0,1]^{n \times n}$ is row-stochastic and irreducible. Let $\mu$ be the stationary distribution of $A$. Then
$$\|Av\|_M \leq \|v\|_M, \quad \forall v \in \mathbb{R}^n. \qquad (5.29)$$
Consequently, the map $v \mapsto r + \gamma A v$ is a contraction with respect to $\| \cdot \|_M$.
In order to prove Lemma 5.4, we make use of a result known as “Jensen’s inequality,” of which only a
very simple special case is presented here.
Lemma 5.5. (Jensen’s Inequality) Suppose Y is a real-valued random variable taking values in a finite set
Y = {y1 , . . . , yn }, and let p denote the probability distribution of Y. Suppose f : R → R is a convex function.
Then
f (E[Y, p]) ≤ E[f (Y ), p]. (5.30)
It can be shown that Jensen’s inequality holds for arbitrary real-valued random variables. But what is
stated above is adequate for present purposes.
Proof. (Of Lemma 5.5.) Note that, because $f(\cdot)$ is convex, we have
$$f(E[Y, p]) = f\left( \sum_{i=1}^{n} p_i y_i \right) \leq \sum_{i=1}^{n} p_i f(y_i) = E[f(Y), p].$$
Proof. (Of Lemma 5.4.) It is enough to establish that
$$\|Av\|_M^2 \leq \|v\|_M^2, \quad \forall v \in \mathbb{R}^n,$$
and to note that
$$\|Av\|_M^2 = \sum_{i=1}^{n} \mu_i \left( \sum_{j=1}^{n} A_{ij} v_j \right)^2.$$
However, for each fixed index i, the row Ai is a probability distribution, and the function f (Y ) = Y 2 is
convex. If we apply Jensen's inequality with $f(Y) = Y^2$, we see that
$$\left( \sum_{j=1}^{n} A_{ij} v_j \right)^2 \leq \sum_{j=1}^{n} A_{ij} v_j^2, \quad \forall i.$$
Therefore
$$\|Av\|_M^2 \leq \sum_{i=1}^{n} \mu_i \sum_{j=1}^{n} A_{ij} v_j^2 = \sum_{j=1}^{n} \left( \sum_{i=1}^{n} \mu_i A_{ij} \right) v_j^2 = \sum_{j=1}^{n} \mu_j v_j^2 = \|v\|_M^2,$$
where the second-last step uses the fact that $\mu^\top A = \mu^\top$.
Next, to analyze the behavior of the TD(λ)-updating, we define a map $T^{(\lambda)} : \mathbb{R}^n \to \mathbb{R}^n$ via
$$[T^{(\lambda)} f]_i := (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l E\left[ \sum_{\tau=0}^{l} \gamma^\tau R(X_{\tau+1}) + \gamma^{l+1} f_{X_{l+1}} \,\Big|\, X_0 = x_i \right]. \qquad (5.31)$$
In vector notation, this becomes
$$T^{(\lambda)} f = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l \left[ \sum_{\tau=0}^{l} \gamma^\tau A^\tau r + \gamma^{l+1} A^{l+1} f \right], \qquad (5.32)$$
where, as before,
$$r = [\, R(x_1) \ \cdots \ R(x_n) \,]^\top, \qquad R(x_j) = E[R_1 \mid X_0 = x_j].$$
Lemma 5.6. The map T (λ) is a contraction with respect to k · kM , with contraction constant [γ(1 − λ)]/(1 −
γλ).
Proof. Note that the first term on the right side of (5.32) does not depend on $f$. Therefore
$$T^{(\lambda)} f_1 - T^{(\lambda)} f_2 = \gamma(1 - \lambda) \sum_{l=0}^{\infty} (\gamma\lambda)^l A^{l+1} (f_1 - f_2).$$
Therefore, using (5.29),
$$\|T^{(\lambda)} f_1 - T^{(\lambda)} f_2\|_M \leq \gamma(1 - \lambda) \sum_{l=0}^{\infty} (\gamma\lambda)^l \|f_1 - f_2\|_M = \frac{\gamma(1 - \lambda)}{1 - \gamma\lambda} \|f_1 - f_2\|_M.$$
Now consider the fixed-point equation
$$T^{(\lambda)} v = v.$$
Then it is easy to verify that v is in fact the value function, because the value function also satisfies the
above equation, and the equation has a unique solution because T (λ) is a contraction. Presumably we could
find v by running a stochastic approximation type of iteration, without value approximation. However, our
interest is in what happens with value approximation. Towards this end, define a projection Π : Rn → Rn
by
$$\Pi a := \Psi (\Psi^\top M \Psi)^{-1} \Psi^\top M a. \qquad (5.33)$$
Then
$$\Pi a = \arg\min_{b \in \Psi(\mathbb{R}^d)} \|a - b\|_M. \qquad (5.34)$$
Thus $\Pi a$ is the closest point to $a$ in the subspace $\Psi(\mathbb{R}^d)$. With this definition, we can state the following:
Theorem 5.3. Suppose the sequence {θ t } is defined by (5.24). Suppose further that the standard conditions
$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty \qquad (5.35)$$
hold. Then the sequence $\{\theta_t\}$ converges almost surely to $\theta^*$, where $\theta^*$ is the unique solution of
$$\Pi T^{(\lambda)} (\Psi \theta^*) = \Psi \theta^*. \qquad (5.36)$$
Moreover
$$\|\Psi \theta^* - v\|_M \leq \frac{1 - \gamma\lambda}{1 - \gamma} \|\Pi v - v\|_M. \qquad (5.37)$$
Let us understand what the above bound states. Because we are approximating the true value vector $v$ by a vector of the form $\Psi\theta$, the best possible approximator in terms of the distance $\| \cdot \|_M$ is given by $\Pi v$, while $\|\Pi v - v\|_M$ is the smallest possible approximation error. Now (5.37) states that the limit of TD(λ) iterations satisfies an error bound that is larger than the lowest possible error by an "expansion" factor of $(1 - \gamma\lambda)/(1 - \gamma)$. In order to make this expansion factor smaller, we should choose $\lambda$ closer to one. However, the closer $\lambda$ is to one, the more slowly the infinite series in (5.32) will converge. These tradeoffs are still not well-understood.
The proof as usual consists of constructing a matrix $C$ whose eigenvalues all have negative real parts, and showing that $\theta_t$ converges to the solution $\theta^*$ of $C\theta^* = b$. The details can be found in the paper.
But let us discuss what the bound (5.37) means. Note that, since $\Psi\theta \in \Psi(\mathbb{R}^d)$, the quantity $\|\Pi v - v\|_M$ is the best that we could hope to achieve. The distance between the limit $\Psi\theta^*$ and the true value vector $v$ is bounded by a factor $(1 - \gamma\lambda)/(1 - \gamma)$ times this minimum. Note that $(1 - \gamma\lambda)/(1 - \gamma) > 1$. So this is the extent to which the TD(λ) iterations miss the optimal approximation.
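As a small numerical illustration of the projection in (5.33)-(5.34), the sketch below builds $\Pi$ for made-up $\Psi$ and $\mu$, and checks that $\Pi v$ is at least as close to $v$ (in $\| \cdot \|_M$) as random elements of the range of $\Psi$. All names and numbers are illustrative.

```python
import numpy as np

def weighted_projection(Psi, mu):
    """The projection Pi of (5.33): M-orthogonal projection onto range(Psi)."""
    M = np.diag(mu)
    return Psi @ np.linalg.inv(Psi.T @ M @ Psi) @ Psi.T @ M

rng = np.random.default_rng(0)
n, d = 6, 2
Psi = rng.standard_normal((n, d))
mu = rng.random(n); mu /= mu.sum()
M = np.diag(mu)
v = rng.standard_normal(n)
best = weighted_projection(Psi, mu) @ v
for _ in range(5):   # random competitors in range(Psi) are never closer, per (5.34)
    b = Psi @ rng.standard_normal(d)
    assert (v - best) @ M @ (v - best) <= (v - b) @ M @ (v - b) + 1e-12
```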
Note that c∗ is a constant that is independent of the initial state. Now define
" T
#
1 X
J(xi ) := lim E R(Xt )|X0 = xi − c∗ (5.40)
T →∞ T + 1 t=0
to be the average reward starting in state xi , minus the overall average reward c∗ . So we can think of J(xi )
as the relative advantage or disadvantage of starting in state xi . Define J as the vector J(xi ) as xi varies
over X . Then J satisfies the so-called Poisson equation
J = r − c∗ 1n + AJ. (5.41)
The vector J is called the differential reward. Note that if we replace J by J + α1n for some constant α,
then because A1n = 1n , the vector J + α1n also satisfies (5.41). Therefore there is a unique vector J∗ ∈ Rn
that satisfies (5.41) and also $\mu^\top J^* = 0$. Then the set of all solutions to (5.41) is given by $\{J^* + \alpha 1_n : \alpha \in \mathbb{R}\}$.
Further, J∗ satisfies
$$J^* = \sum_{t=0}^{\infty} A^t (r - c^* 1_n). \qquad (5.42)$$
Now the problem studied in this subsection is how to approximate J∗ . Choose a matrix Ψ ∈ Rn×d , and
approximate J∗ by Ψθ. A key assumption is that 1n 6∈ Ψ(Rd ). In other words, the vector of all ones does
not belong to the range of Ψ.
Suppose we have a sample path $\{(X_t, W_{t+1})\}_{t \geq 0}$. As before, for $X_t \in \mathcal{X}$, define
$$y_t := [\Psi_{X_t}]^\top \in \mathbb{R}^d, \qquad (5.43)$$
and let
$$\hat{J}(x_i, \theta) := \Psi_{x_i} \theta \qquad (5.44)$$
denote the approximation to $J(x_i)$. There is also another sequence $\{c_t\}$ in $\mathbb{R}$ that tries to approximate the constant $c^*$. At time $t$, define the temporal difference
$$\delta_{t+1} := W_{t+1} - c_t + \hat{J}(X_{t+1}, \theta_t) - \hat{J}(X_t, \theta_t). \qquad (5.45)$$
Next, choose a constant $\lambda \in [0, 1)$ and use a TD(λ) update of the form
$$c_{t+1} = c_t + \eta_t (W_{t+1} - c_t), \qquad (5.46)$$
$$\theta_{t+1} = \theta_t + \alpha_t \delta_{t+1} z_t, \qquad (5.47)$$
where $\eta_t, \alpha_t$ are step sizes. We will choose these step sizes in tandem, so that $\eta_t = \beta \alpha_t$ for each time $t$; we
can also just choose β = 1 so that ηt = αt . Also, yτ is defined as per (5.43). Define the eligibility vector
$$z_t := \sum_{\tau=0}^{t} \lambda^{t-\tau} y_\tau \in \mathbb{R}^d, \qquad (5.48)$$
which satisfies the recursion
$$z_t = \lambda z_{t-1} + y_t. \qquad (5.49)$$
Here $\| \cdot \|_M$ is the weighted Euclidean distance as before. For each $\lambda \in [0, 1)$, define $T^{(\lambda)} : \mathbb{R}^n \to \mathbb{R}^n$ by
$$T^{(\lambda)} v := (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l \left( \sum_{\tau=0}^{l} A^\tau (r - c^* 1_n) + A^{l+1} v \right). \qquad (5.51)$$
Just as before, the map $T^{(\lambda)}$ is a contraction with respect to $\| \cdot \|_M$, with constant $1 - \lambda$. Note that, for each finite $l$, we have
$$\sum_{\tau=0}^{l} A^\tau (r - c^* 1_n) + A^{l+1} J^* = J^*. \qquad (5.52)$$
With these observations, we can state the following theorem:
Theorem 5.4. Suppose ηt = βαt for some fixed constant β. Suppose further that
$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty. \qquad (5.53)$$
Then, as t → ∞, almost surely we have that (i) ct → c∗ , and (ii) θ t → θ ∗ , where θ ∗ is the unique solution
of
ΠT (λ) (Ψθ ∗ ) = Ψθ ∗ . (5.54)
Define
$$\phi_t = \begin{bmatrix} c_t \\ \theta_t \end{bmatrix} \in \mathbb{R}^{d+1}. \qquad (5.55)$$
The idea behind the proof is the standard one, namely to use stochastic approximation to solve a linear
equation of the form
Cφ = b.
However, there are a few wrinkles. The matrix $C$ turns out to be
$$C = \begin{bmatrix} -\beta & 0 \\ -\frac{1}{1-\lambda} \Psi^\top M 1_n & \Psi^\top M (A^{(\lambda)} - I_n) \Psi \end{bmatrix}, \qquad (5.56)$$
where
$$A^{(\lambda)} := (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l A^{l+1}. \qquad (5.57)$$
Now note that, since $\lambda < 1$, we have that $\rho(\lambda A) = \lambda < 1$. Therefore
$$\sum_{l=0}^{\infty} \lambda^l A^l = (I_n - \lambda A)^{-1}.$$
Therefore
$$A^{(\lambda)} = (1 - \lambda)(I_n - \lambda A)^{-1} A. \qquad (5.58)$$
It is easy to verify that $A^{(\lambda)} 1_n = 1_n$. Therefore, without additional assumptions, the matrix $C$ is not
asymptotically stable in the sense of having its eigenvalues in the open left half-plane. However, $(A^{(\lambda)} - I_n)v \neq 0$ if
$v$ is not a multiple of $1_n$. Now the assumption that $1_n \notin \Psi(\mathbb{R}^d)$ guarantees that successive approximations
Ψθ t are not multiples of 1n . In the complement of the one-dimensional subspace generated by 1n , the matrix
C is asymptotically stable.
5.3 Policy Gradient and Actor-Critic Methods
5.3.1 Preliminaries
The set-up is the standard Markov Decision Process (MDP) with a finite state space X of cardinality n,
a finite action space U of cardinality m, and state transition matrices Auk for each uk ∈ U. Until now,
much of the focus has been on deterministic policies π : X → U, where the current state Xt determines
the current action Ut uniquely as π(Xt ). However, policy approximation methods that have been widely
studied do not work very well with deterministic policies. Instead, the focus is on probabilistic policies.
Given the action set U, let S(U) denote the set of probability distributions on U. Since U is a finite set,
S(U) consists of m-dimensional nonnegative vectors whose components add up to one, that is, the simplex in
Rm+ . A probabilistic policy π : X → S(U) is a map from the current state Xt to a corresponding probability
distribution on U. Until now there is no difference with earlier notation. Now we introduce a parameter
vector θ ∈ Rd , and denote the policy as π(xi , uk , θ). Thus
π(xi , uk , θ) = Pr{Ut = uk |Xt = xi , θ}. (5.59)
In order for policy approximation theory to work, it is usually assumed that some or all of the following
assumptions hold:
P1. We have that
π(xi , uk , θ) > 0, ∀xi ∈ X , uk ∈ U, θ ∈ Rd . (5.60)
Note that this assumption rules out deterministic policies.
P2. The quantity π(xi , uk , θ) is twice continuously differentiable with respect to θ. Moreover, there exists
a finite constant $M$ such that
$$\left| \frac{\partial \ln \pi(x_i, u_k, \theta)}{\partial \theta_l} \right| \leq M, \quad \forall x_i \in \mathcal{X}, u_k \in \mathcal{U}, \theta \in \mathbb{R}^d, \ l = 1, \cdots, d. \qquad (5.61)$$
Note that a ready consequence of (5.61) is that
$$\left| \frac{\partial \pi(x_i, u_k, \theta)}{\partial \theta_l} \right| \leq M, \quad \forall x_i \in \mathcal{X}, u_k \in \mathcal{U}, \theta \in \mathbb{R}^d, \ l = 1, \cdots, d. \qquad (5.62)$$
See Problem 5.4.
In policy approximation, the objective is to identify an optimal choice of θ that would maximize a chosen
objective function. As stated above, this objective function is usually the average reward under the policy.
In principle, the value approximation methods studied earlier in this chapter could be used for this purpose,
as follows: For each possible policy π, compute an approximate value v̂π . These approximations are a
function of an auxiliary parameter vector θ which is suppressed in the interests of clarity. Ensure that these
approximations are uniformly close in the sense that
kvπ − v̂π k∞ ≤ , ∀π ∈ Π, (5.63)
5.3. POLICY GRADIENT AND ACTOR-CRITIC METHODS 79
where $\Pi$ is the class of policies under study. Note that (5.63) is equivalent to
$$|\hat{V}_\pi(x_i) - V_\pi(x_i)| \leq \epsilon, \quad \forall \pi \in \Pi, \ \forall x_i \in \mathcal{X}.$$
Now define
$$\hat{\pi}_i^* := \arg\min_{\pi \in \Pi} \hat{V}_\pi(x_i). \qquad (5.64)$$
Then $\hat{\pi}_i^*$ is an approximately optimal policy when starting from the state $x_i$. Define $\pi_i^*$ to be the "true" optimal policy, that is
$$\pi_i^* = \arg\min_{\pi \in \Pi} V_\pi(x_i). \qquad (5.65)$$
Then
$$| \hat{V}_{\hat{\pi}_i^*}(x_i) - V_{\pi_i^*}(x_i) | \leq \epsilon. \qquad (5.66)$$
In other words, the optimum of the approximate value function (for a given starting state) is within $\epsilon$ of the true optimal value starting from $x_i$. The proof is left as an exercise; see Problem 5.5. Thus it is possible,
at least in principle, to compute a good approximation to the optimal value. However, there is no reason to
assume that the policy π̂i∗ is in any way “close” to the true optimal policy πi∗ .
To address this issue, in policy optimization one directly parametrizes the policy π as π(xi , uk , θ), and
then chooses θ so as to optimize an appropriate objective function. As mentioned above, the preferred choice
is the average reward. In order to carry out this process, it is highly desirable to have an expression for the
gradient of this objective function with respect to the parameter vector θ. This is provided by the “policy
gradient theorem” given below. However, there are other considerations that need to be taken into account.
Numerical examples have shown that if a nearly optimal policy is chosen on the basis of value approxi-
mation, the resulting set of policies may not converge; see for example [41]. To quote from the article:
This analysis is particularly interesting, since the algorithm is closely related to Q-learning
(Watkins and Dayan, 1992) and temporal-difference learning (T D(λ)) (Sutton, 1988), with λ
set to 0. The counter-example discussed demonstrates the short-comings of some (but not all)
variants of Q-learning and temporal-difference learning that are employed in practice.
This means the following: For a fixed tolerance $\epsilon$, choose an approximately optimal policy $\hat{\pi}_i^*(\epsilon)$ as in (5.64). As the tolerance $\epsilon$ is reduced towards zero, we will get a sequence of nearly optimal policies, one for each $\epsilon$. The question is: As $\epsilon \to 0^+$, does the corresponding policy $\hat{\pi}_i^*(\epsilon)$ approach the true optimal policy $\pi_i^*$? In general, the answer is no. So this would seem to rule out value approximation as a way
of determining nearly optimal policies, though it does allow us to determine nearly optimal values. On
the other side, choosing nearly optimal policies using policy parametrization leads to slow convergence and
high variance. A class of algorithms known as “actor-critic” consist of simultaneously carrying out policy
optimization (approximately) coupled with value approximation. Numerical experiments show that actor-
critic algorithms lead to faster convergence and lower variance than pure policy optimization alone. It must
be emphasized that “actor-critic” refers to a class of algorithms (or a philosophy), and not one specific
algorithm. Thus, within the umbrella actor-critic, there are multiple variants currently in use.
As stated above, actor-critic algorithms update the policy and the value in parallel. Here “actor” refers
to policy updating, while “critic” refers to value updating. Some assumptions about actor-critic algorithms
are often understated or even unstated, so we mention them now.
The behavior of actor-critic algorithms has been analyzed for the most part only when the objective
function to be maximized is the average reward.
In order to prove the convergence of actor-critic algorithms, it is usually assumed that the actor (policy)
is updated more slowly than the critic (value).
In what follows, to simplify notation, we denote various quantities with the subscript or superscript θ,
instead of π(θ). To illustrate, above we have used Aθ to denote Aπ(θ) .
Assume that Aθ is irreducible and aperiodic for each θ, and let µθ denote the corresponding stationary
distribution. Therefore $\mu_\theta$ satisfies
$$\sum_{x_i \in \mathcal{X}} \mu_\theta(x_i) \sum_{u_k \in \mathcal{U}} \pi(x_i, u_k, \theta) a_{ij}^{u_k} = \mu_\theta(x_j). \qquad (5.67)$$
The above equation illustrates another notational convention. For a vector such as µθ , we use bold-faced
letters, while for a component of the vector such as µθ (xi ), we don’t use bold-faced letters.
Associated with each policy π is an average reward c∗ , defined in analogy with (5.38) and (5.39), namely
$$c^*(\theta) := \lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} R_\theta(X_t) = E[R_\theta, \mu_\theta]. \qquad (5.68)$$
In addition to the constant $c^*$, which depends only on $\theta$ and nothing else, we also define the function $Q_\theta : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ by
$$Q_\theta(x_i, u_k) = \sum_{t=1}^{\infty} E[R_\theta(X_t) - c^*(\theta) \mid X_0 = x_i, U_0 = u_k, \pi(\theta)]. \qquad (5.71)$$
Note that, because $c^*(\theta)$ is the "weighted average" of the various rewards, the summation in (5.71) is well-defined even though there is no discount factor. Also, $Q_\theta(x_i, u_k)$ satisfies the recursion
$$Q_\theta(x_i, u_k) = R(x_i, u_k) - c^*(\theta) + \sum_{x_j \in \mathcal{X}} a_{ij}^{u_k} V_\theta(x_j), \qquad (5.72)$$
1 As per our convention, if Xt = xi , the reward R(xi ) is paid at time t + 1.
Now the objective of policy approximation is to choose the parameter θ so as to maximize c∗ (θ). For
this purpose, it is highly desirable to have an expression for the gradient ∇θ c∗ (θ). This is precisely what the
policy gradient theorem gives us. The expression for the gradient can be combined with various methods to
look for a minimizing choice of θ.
With all this notation, the policy gradient theorem can be stated very simply. The source for this proof
is [34]. Note that in that paper, the policy gradient theorem is proved not only for the average reward but
for another type of reward as well. We do not present that result here and the interested reader can consult
the original paper.
Theorem 5.5. (Policy Gradient Theorem) We have that
$$\nabla_\theta c^*(\theta) = \sum_{x_i \in \mathcal{X}} \mu_\theta(x_i) \sum_{u_k \in \mathcal{U}} \nabla_\theta \pi(x_i, u_k, \theta) Q_\theta(x_i, u_k). \qquad (5.74)$$
Remark: As we change the parameter θ, the corresponding stationary distribution µθ also changes.
Moreover, it is difficult to compute this distribution, and even more difficult to compute the gradient with
respect to $\theta$. Therefore the main benefit of the policy gradient theorem is that the only function whose gradient appears is $\pi(x_i, u_k, \theta)$, and this is a function that is chosen by the learner; therefore it is easy to compute the gradient $\nabla_\theta \pi(x_i, u_k, \theta)$.
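The remark above is easiest to appreciate with an explicit parametrization. A common choice, used here purely as an illustration and not taken from the text, is a Gibbs/softmax policy with linear preferences, for which $\nabla_\theta \ln \pi$, and hence $\nabla_\theta \pi = \pi \, \nabla_\theta \ln \pi$ as needed in (5.74), is available in closed form.

```python
import numpy as np

def softmax_policy(theta, features):
    """pi(x, u, theta) over actions; features[u] is a feature vector in R^d.
    This parametrization is one hypothetical choice satisfying P1-P2."""
    prefs = np.array([f @ theta for f in features])
    prefs -= prefs.max()                      # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def score(theta, features, u):
    """grad_theta log pi(x, u, theta) for the softmax policy; multiplying by
    pi(x, u, theta) gives grad_theta pi, the only gradient appearing in (5.74)."""
    p = softmax_policy(theta, features)
    return features[u] - sum(p[k] * features[k] for k in range(len(features)))
```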
For convenience we will write ∂/∂θ instead of ∇θ , even though θ is a vector. Recall the expression (5.73)
for $V_\theta(x_i)$. This leads to
$$\frac{\partial V_\theta(x_i)}{\partial \theta} = \sum_{u_k \in \mathcal{U}} \left[ \frac{\partial \pi(x_i, u_k, \theta)}{\partial \theta} Q_\theta(x_i, u_k) + \pi(x_i, u_k, \theta) \frac{\partial Q_\theta(x_i, u_k)}{\partial \theta} \right]$$
$$= \sum_{u_k \in \mathcal{U}} \frac{\partial \pi(x_i, u_k, \theta)}{\partial \theta} Q_\theta(x_i, u_k) + \sum_{u_k \in \mathcal{U}} \pi(x_i, u_k, \theta) \frac{\partial}{\partial \theta} \left[ R(x_i, u_k) - c^*(\theta) + \sum_{x_j \in \mathcal{X}} a_{ij}^{u_k} V_\theta(x_j) \right].$$
However, $R(x_i, u_k)$ does not depend on $\theta$, so its gradient is zero. Therefore the above equation can be expressed as
$$\frac{\partial V_\theta(x_i)}{\partial \theta} = \sum_{u_k \in \mathcal{U}} \frac{\partial \pi(x_i, u_k, \theta)}{\partial \theta} Q_\theta(x_i, u_k) + \sum_{u_k \in \mathcal{U}} \pi(x_i, u_k, \theta) \left[ -\frac{\partial c^*(\theta)}{\partial \theta} + \sum_{x_j \in \mathcal{X}} a_{ij}^{u_k} \frac{\partial V_\theta(x_j)}{\partial \theta} \right]$$
$$= \sum_{u_k \in \mathcal{U}} \frac{\partial \pi(x_i, u_k, \theta)}{\partial \theta} Q_\theta(x_i, u_k) - \frac{\partial c^*(\theta)}{\partial \theta} + \sum_{u_k \in \mathcal{U}} \pi(x_i, u_k, \theta) \sum_{x_j \in \mathcal{X}} a_{ij}^{u_k} \frac{\partial V_\theta(x_j)}{\partial \theta}, \qquad (5.75)$$
where the last step uses the fact that $\sum_{u_k \in \mathcal{U}} \pi(x_i, u_k, \theta) = 1$.
Now multiply both sides of (5.75) by $\mu_\theta(x_i)$ and sum over $x_i \in \mathcal{X}$. This gives
$$\sum_{x_i \in \mathcal{X}} \mu_\theta(x_i) \frac{\partial V_\theta(x_i)}{\partial \theta} = \sum_{x_i \in \mathcal{X}} \mu_\theta(x_i) \sum_{u_k \in \mathcal{U}} \frac{\partial \pi(x_i, u_k, \theta)}{\partial \theta} Q_\theta(x_i, u_k) - \frac{\partial c^*(\theta)}{\partial \theta} + \sum_{x_i \in \mathcal{X}} \mu_\theta(x_i) \sum_{u_k \in \mathcal{U}} \pi(x_i, u_k, \theta) \sum_{x_j \in \mathcal{X}} a_{ij}^{u_k} \frac{\partial V_\theta(x_j)}{\partial \theta}. \qquad (5.76)$$
Here again we use the fact that $c^*(\theta)$ does not depend on $x_i$, so that
$$\sum_{x_i \in \mathcal{X}} \mu_\theta(x_i) \frac{\partial c^*(\theta)}{\partial \theta} = \frac{\partial c^*(\theta)}{\partial \theta}.$$
Now we invoke (5.67), which involves $\mu_\theta$. This shows that the last term on the right side of (5.76) can be expressed as
$$\sum_{x_i \in \mathcal{X}} \mu_\theta(x_i) \sum_{u_k \in \mathcal{U}} \pi(x_i, u_k, \theta) \sum_{x_j \in \mathcal{X}} a_{ij}^{u_k} \frac{\partial V_\theta(x_j)}{\partial \theta} = \sum_{x_j \in \mathcal{X}} \mu_\theta(x_j) \frac{\partial V_\theta(x_j)}{\partial \theta}.$$
Obviously the left side of (5.76) is the same as the last term on its right side. Cancelling these two terms and rearranging gives
$$\frac{\partial c^*(\theta)}{\partial \theta} = \sum_{x_i \in \mathcal{X}} \mu_\theta(x_i) \sum_{u_k \in \mathcal{U}} \frac{\partial \pi(x_i, u_k, \theta)}{\partial \theta} Q_\theta(x_i, u_k),$$
which is precisely (5.74).
where we use only the last term in the summation (as it is the only one that depends on $\omega_t$). Note that we use the symbol $\propto$ because for the moment we are not worried about the choice of the step size.
As is by now familiar practice, we can write $J(\omega)$ equivalently as
$$J(\omega) = \frac{1}{2} \sum_{x_i \in \mathcal{X}} \sum_{u_k \in \mathcal{U}} \eta(x_i, u_k) [f(x_i, u_k, \omega) - \hat{Q}(x_i, u_k)]^2,$$
where η is the stationary distribution of the joint Markov process {(Xt , Ut )}. It is easy to see that
Pr{Xt = xi , Ut = uk |π(θ)} = Pr{Xt = xi |π(θ)} · Pr{Ut = uk |Xt = xi , π(θ)} = µθ (xi )π(xi , uk , θ).
We will encounter the above formula again. Also, since Q̂θ is an unbiased estimate of Qθ , we can replace Q̂θ
by Qθ when we take the expectation (i.e., sum over xi and uk ). So when ω is chosen such that ∇ω J(ω) = 0
(i.e., at a stationary point of J(·)), we have that
$$\sum_{x_i \in \mathcal{X}} \sum_{u_k \in \mathcal{U}} \mu_\theta(x_i) \pi(x_i, u_k, \theta) [f(x_i, u_k, \omega) - Q_\theta(x_i, u_k)] \nabla_\omega f(x_i, u_k, \omega) = 0. \qquad (5.77)$$
Now suppose that the Q̂θ -approximating function f is “compatible” with the policy-approximating func-
tion π in the sense that
$$\nabla_\omega f(x_i, u_k, \omega) = \frac{1}{\pi(x_i, u_k, \theta)} \nabla_\theta \pi(x_i, u_k, \theta). \qquad (5.78)$$
One possibility is simply to choose $f(\cdot, \cdot, \omega)$ to be linear in $\omega$, with the gradient given by the right side of (5.78). Now (5.78) can be rewritten as
$$\nabla_\omega f(x_i, u_k, \omega) = \nabla_\theta \ln \pi(x_i, u_k, \theta).$$
Combining the above with (5.77) and (5.74) gives an alternate formula for the gradient of the value function $c^*(\theta)$, namely
$$\nabla_\theta c^*(\theta) = \sum_{x_i \in \mathcal{X}} \sum_{u_k \in \mathcal{U}} \mu_\theta(x_i) \nabla_\theta \pi(x_i, u_k, \theta) f(x_i, u_k, \omega). \qquad (5.79)$$
The significance of this formula is that the unknown function Qθ can be replaced by the approximating
function f (xi , uk , ω). So if we use a gradient search method (for example) to update θ t , we can use (5.79)
instead of (5.74). Note that this introduces a “coupling” between θ t and ω t . In other words, as the policy
approximation gets updated, so does the value approximation. This is a precursor to actor-critic methods
which we study next.
Thus c∗ (θ) is the expected value of the reward R under the joint stationary distribution of (Xt , Ut ). Moreover,
$V_\theta$ as defined in (5.73) satisfies the Poisson equation
$$c^*(\theta) + V_\theta(x_i) = \sum_{u_k \in \mathcal{U}} \pi(x_i, u_k, \theta) \left[ R(x_i, u_k) + \sum_{x_j \in \mathcal{X}} a_{ij}^{u_k} V_\theta(x_j) \right]. \qquad (5.82)$$
Now we give a very brief description of the actor-critic methods in [23]. Suppose q1 , q2 are two functions
that map $\mathcal{X} \times \mathcal{U}$ into $\mathbb{R}$. In other words, $q_1, q_2$ are $|\mathcal{X}| \cdot |\mathcal{U}|$-dimensional vectors. Suppose we define an inner product $\langle q_1, q_2 \rangle_\eta$ as
$$\langle q_1, q_2 \rangle_\eta = \sum_{x_i \in \mathcal{X}} \sum_{u_k \in \mathcal{U}} \eta_\theta(x_i, u_k) q_1(x_i, u_k) q_2(x_i, u_k).$$
This is similar to Lemma 5.4, where we defined an inner product making use of a stationary distribution. Then
$$\frac{\partial c^*(\theta)}{\partial \theta_i} = \langle q_\theta, \omega_\theta \rangle_\eta,$$
where qθ is the function Qθ written out as a vector of dimension nm, and ω θ ∈ Rnm is the vector defined in
(5.83). Hence, in order to compute this partial derivative, it is not necessary to know the vector Qθ – it is
enough to know a projection of it on the space spanned by the $\omega_\theta$ vectors. Now suppose we use a "linear" parametrization of the type
$$\pi(\theta) = \sum_{l=1}^{d} \psi_l \theta_l,$$
where each $\psi_l \in \mathbb{R}^{nm}$ and the components can be thought of as $\psi_l(x_i, u_k)$. Recall that $\pi(\theta)$ is a probability distribution on $\mathcal{U}$, so the set of policies consists of summations of the above form that are in $\mathbb{S}(\mathcal{U})$. One possibility is to choose the columns to correspond to policies, and to restrict $\theta$ to vary over the unit simplex in $\mathbb{R}^d$, so that the policy $\pi_\theta$ is a convex combination of the columns of $\Psi$, as opposed to a linear combination of the columns. In any case, we have
$$\frac{\partial \pi(\theta)}{\partial \theta_l} = \psi_l,$$
and as a result
$$\omega_l = \frac{1}{\pi(\theta)} \psi_l \propto \psi_l.$$
Therefore the span of the vectors {ω l } is the same as the span of the vectors ψ l , that is, Ψ(Rd ). So all we
need to find is a projection of the nm-dimensional vector qθ onto Ψ(Rd ). The projection is computed using
the inner product h·, ·iη defined above. Nevertheless, the projection belongs to Ψ(Rd ).
In [23], the authors slightly over-parametrize by taking
$$\hat{q}_\theta = \sum_{l=1}^{s} \phi_l \zeta_l = \Phi \zeta,$$
as an approximation for the projection of $q$, where $\Phi \in \mathbb{R}^{nm \times s}$ and $\zeta \in \mathbb{R}^s$. Here $s \ll nm$, so that the above approximation results in a reduction in the dimension of $q$. At the same time, one chooses $s > d$, so that $\zeta$
has more parameters than θ. Moreover, one chooses the vectors in such a way that
span{ψ 1 , · · · , ψ d } ⊆ span{φ1 , · · · , φs }.
Therefore, the approximant q̂ may not belong to Ψ(Rd ). However, the projection of q̂ onto Ψ(Rd ) can be
used to determine the gradient of c∗ (θ).
Now we present the actor-critic methods in [23]. Unlike in earlier situations where we had a fixed policy,
here the policy also varies with t. Nevertheless, we will get a sample path {(Xt , Ut , Wt+1 )}. That sample
path is used below.
The critic tries to form an estimate of the average reward process. The equations are analogous to (5.45),
(5.46) and (5.47). There are two estimates, one for the value of the process c∗ (θ), and another for the
parameter ζt that gives an approximation q̂ζ t = Φζ t . These updates are as follows:
Let $y_t$ denote the $(X_t, U_t)$-th row of the matrix $\Phi$, transposed. Then the TD(λ) update is defined via the eligibility vector
$$z_t = \sum_{\tau=0}^{t} \lambda^{t-\tau} y_\tau,$$
which, for the choice $\lambda = 1$, satisfies the recursion
$$z_t = z_{t-1} + y_t.$$
This is at the opposite end of the spectrum from TD(0), which is the standard TD-learning.
For updating the policy, we choose a globally Lipschitz-continuous function $\Gamma : \mathbb{R}^s \to \mathbb{R}_+$ with the property that, for some finite constant $C > 0$, we have
$$\Gamma(\zeta) \leq \frac{C}{1 + \|\zeta\|}, \quad \forall \zeta \in \mathbb{R}^s.$$
Then the policy update is just a gradient update, given by
for some positive constant . With this assumption we can state the following theorem:
and in addition
$$\frac{\alpha_t}{\beta_t} \to 0 \text{ as } t \to \infty.$$
If $\{\theta_t\}$ is bounded almost surely, then
$$\lim_{t \to \infty} \|\nabla_\theta c^*(\theta_t)\| = 0.$$
Note that the policy parameter $\theta_t$ is updated more slowly than the value parameter $\zeta_t$.
Problem 5.4. Show that (5.62) is a consequence of (5.61).
Problem 5.5. Prove (5.66).
Chapter 7
Finite-Time Bounds
In previous chapters, the convergence results presented are mostly asymptotic in nature. By and large, they
do not tell us what happens after a particular learning algorithm has been run a finite number of times. In
contrast, in the present chapter, several results are presented that are “finite time” in nature.
$$\mu^* := \max_{1 \leq i \leq K} \mu_i.$$
Thus $\mu^*$ is the best possible return. If the learner knew which machine has the return $\mu^*$, then s/he would just
keep playing that machine, which would lead to the maximum possible expected return per play. On the
other hand, there is some amount of “exploration” of all the machines, during the course of which many
suboptimal choices will be made. The regret attempts to capture the return foregone, which is seen as the
cost of exploration. Specifically, suppose that some strategy for choosing the machine to be played at each
time generates a sequence of indices {Mt }t≥1 where each Mt lies between 1 and K. After n plays, let Ti (n)
denote the number of times that machine i is played. Where convenient, we also denote this as ni (n) or
just ni if n is obvious from the context. Now, the actual returns of machine i at these ni time instants are
random, and can be denoted by {Xi,1 , · · · , Xi,ni }. By playing machine i a total of ni times until time n, the
reward foregone is
$$\mu^* n - \sum_{i=1}^{K} \mu_i n_i.$$
However, this is a random number, because the number of plays $n_i$ is random. Therefore, corresponding to a policy for choosing the next machine to be played, the regret is defined as
$$C = \mu^* n - \sum_{i=1}^{K} \mu_i E[n_i]. \qquad (7.1)$$
There are several equivalent ways to express the regret. For example, define the empirical return of machine $i$ after $n$ plays as
$$\hat{\mu}_i := \frac{1}{n_i} \sum_{t=1}^{n_i} X_{i,t}. \qquad (7.2)$$
Due to the assumption that the returns are i.i.d., it is obvious that $E[\hat{\mu}_i] = \mu_i$. We can write
$$C = E\left[ \sum_{i=1}^{K} n_i (\mu^* - \hat{\mu}_i) \right].$$
In this section we follow [2]. We study two strategies, and derive upper bounds for the regret of each strategy. In the second case, our result is a slight generalization of the contents of [2].
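Before turning to those strategies, here is a minimal sketch of how the regret (7.1)-(7.2) can be evaluated empirically for a given sequence of plays; the uniformly random strategy used below is only a placeholder for comparison and is not one of the strategies of [2], and all numbers are made up.

```python
import numpy as np

def empirical_regret(mu, plays):
    """Regret (7.1) of a sequence of arm choices `plays` against the true
    expected returns `mu`, with E[n_i] replaced by the observed counts."""
    mu = np.asarray(mu)
    counts = np.bincount(plays, minlength=len(mu))
    return mu.max() * len(plays) - mu @ counts

rng = np.random.default_rng(0)
mu = [0.3, 0.5, 0.7]
plays = rng.integers(len(mu), size=1000)   # naive uniformly random strategy
print(empirical_regret(mu, plays))          # about (0.7 - mean(mu)) * 1000 = 200
```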
Chapter 8
Background Material
The objective of this chapter is to collect in one place the background material required to understand the
main body of the notes. While the exposition is rigorous, and several references are given throughout, these
notes are not, by themselves, sufficient to gain a mastery over these topics. A reader who is encountering
these topics for the first time is strongly encouraged to consult the various references in order to understand
the present discussion thoroughly. To avoid tedious repetition, we do not include the phrase “Introduction
to” in the title of each section, but that should be assumed.
Definition 8.1. Suppose $\Omega$ is a set. A collection $\mathcal{F}$ of subsets of $\Omega$ is called a $\sigma$-algebra if the following axioms hold:
S1. $\Omega \in \mathcal{F}$.
S2. If $A \in \mathcal{F}$, then its complement $A^c := \Omega \setminus A$ also belongs to $\mathcal{F}$.
S3. If $\{A_i\}_{i \geq 1}$ is a countable collection of sets in $\mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
The pair $(\Omega, \mathcal{F})$ is called a measurable space.
Definition 8.2. Suppose (Ω, F) is a measurable space. A function P : F → [0, 1] is called a probability
measure if it satisfies the following axioms:
1 The term σ-field is more popular, but this terminology is preferred here.
2 Note that [S1] and [S2] together imply that ∅ ∈ F .
P1. P (Ω) = 1.
P2. $P$ is countably additive, that is: Whenever $\{A_i\}$ are pairwise disjoint sets from $\mathcal{F}$, we have that
$$P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i). \qquad (8.2)$$
Note that if $\Omega$ is a finite or countable set, it is customary to take $\mathcal{F}$ to be the "power set" of $\Omega$, that is, the collection of all subsets of $\Omega$, often denoted by $2^\Omega$. If a nonnegative weight $p_i$ is assigned to each element $i \in \Omega$, with $\sum_{i \in \Omega} p_i = 1$, and if $P$ is defined as
$$P(A) = \sum_{i \in A} p_i = \sum_{i \in \Omega} p_i I_{\{i \in A\}}, \qquad (8.3)$$
then it is easy to verify that (Ω, 2Ω , P ) is a probability space. However, if Ω is an uncountable set, e.g., the
real numbers, then the above approach of assigning weights to individual elements does not work. Note that
in (8.3), I{i∈A} is the indicator function that equals 1 if i ∈ A and 0 if i 6∈ A.
Definition 8.3. Suppose (Ω, F) and (X , G) are measurable spaces. Then a map f : Ω → X is said to be
measurable if f −1 (S) ∈ F for all S ∈ G.
Thus a map from Ω into X is measurable if the preimage of every set in G under f belongs to F.
Definition 8.4. Suppose (Ω, F, P ) is a probability space, and (X , G) is a measurable space. A function
X : Ω → X is said to be a random variable if it is measurable, that is,
$$X^{-1}(S) \in \mathcal{F}, \quad \forall S \in \mathcal{G}. \qquad (8.4)$$
In this case, the set function $P_X$ on $(\mathcal{X}, \mathcal{G})$ defined by
$$P_X(S) := P(X^{-1}(S)), \quad \forall S \in \mathcal{G},$$
is itself a probability measure, called the law (or distribution) of the random variable $X$.
In the above definition, $(\mathcal{X}, \mathcal{G})$ is called the "event space," and the sets belonging to the $\sigma$-algebra $\mathcal{G}$ are called "events," because each such set has a probability associated with it (via (8.4)). The triple $(\Omega, \mathcal{F}, P)$ is called the "sample space."
Example 8.1. Suppose we wish to capture the notion of a two-sided coin that comes up H for heads 60% of
the time, and T for tails 40% of the time. In such a case, the event space (the set of possible outcomes) is just
X = {H, T }. Because the set X is finite, the corresponding σ-algebra G can be just 2X = {∅, {H}, {T }, X }.
The sample space $(\Omega, \mathcal{F}, P)$ can be anything, as can the map $f : \Omega \to \mathcal{X}$, provided only that two conditions hold: First, $P(f^{-1}(\{H\})) = 0.6$. (Actually, either one of the conditions would imply the other.) Second, $P(f^{-1}(\{T\})) = 0.4$.
Definition 8.5. Suppose X is a random variable defined on the sample space (Ω, F, P ) taking values in
(X , G). Then the σ-algebra generated by X is defined as the smallest σ-algebra contained in F with
respect to which X is measurable, and is denoted by σ(X).
Example 8.2. Consider again the random variable studied in Example 8.1. Thus X = {H, T } and G = 2X .
Now suppose X is a measurable map from some (Ω, F, P ) into (X , G). Then all possible preimages of sets
in G are:
∅ = X −1 (∅), Ω = X −1 (X ), A := X −1 ({H}), B := X −1 ({T }) = Ac = Ω \ A.
Thus the smallest possible σ-algebra on Ω with respect to which X is measurable consists of {∅, A, Ac , Ω}.
Therefore this is the σ-algebra of Ω generated by X. Any other sets in F are basically superfluous. We can
carry this argument further and simply take the sample space Ω to be the same as the event space X , and X
to be the identity operator on (Ω, F, P ) into (Ω, F). Thus Ω = {H, T }, and F = {∅, {H}, {T }, Ω}. Further,
we can define P ({H}) = 0.6, P ({T }) = 0.4. This is sometimes called the canonical representation of the
random variable X. Usually we can do this whenever the event space is finite or countable.
Originally, the phrase “random variable” was used only for the case where the event space X = R, and
the σ-algebra is the so-called Borel σ-algebra, which is defined as the smallest σ-algebra of subsets of R
that contains all closed subsets of R. Random quantities such as the outcomes of coin-toss experiments were
called something else (depending on the author). Subsequently, the phrase “random variable” came to be
used for any situation where the outcome is uncertain, as defined above. In the context of Markov Decision
Processes, often the Markov process evolves over a finite set, and the action space is also finite. So a lot of
the heavy machinery above is not needed to describe the evolution of an MDP. However, in reinforcement
learning, the parameters of the MDP need to be estimated, including the reward–and these are real-valued
quantities. So it is desirable to deal with random variables that assume values in a continuum. When we
do that, the sample set Ω equals R or some subset thereof, and F equals the Borel σ-algebra. Finally, the
probability measure P is defined on Ω.
Definition 8.6. Suppose (Ω, F, P) is a probability space, and that F_1, F_2 are σ-algebras contained in F. Then F_1 and F_2 are said to be independent if
P(S ∩ T) = P(S)P(T), ∀S ∈ F_1, T ∈ F_2. (8.5)
Two random variables X_1, X_2 defined on (Ω, F, P) are said to be independent if the corresponding σ-algebras σ(X_1), σ(X_2) are independent.
It is easy to show that two events S, T are independent if and only if the corresponding σ-algebras
F1 = {∅, S, S c , Ω} and F2 = {∅, T, T c , Ω} are independent.
The extension of the above definition to any finite number of events, or σ-algebras, or random variables,
is quite obvious. For more details, see [11, Section 3.1] or [48, Chapter 4].
Until now we have discussed what might be called “individual” random variables. Now we discuss the
concept of joint random variables, and the associated notion of joint probability. The definition below is
for two joint random variables, but it is obvious that a similar definition can be made for any finite number of joint random variables. In turn this leads to the concept of conditional probability.
Definition 8.7. Suppose (X , G) and (Y, H) are measurable spaces. Then the product of these two spaces
is (X × Y, G ⊗ H) where G ⊗ H is the smallest σ-algebra of subsets of X × Y that contains all products of
the form S × T, S ∈ G, T ∈ H.
Note that G ⊗ H is called the “product” σ-algebra, and is not to be confused with G × H, the Cartesian
product of the two collections G and H. In fact one can write G ⊗ H = σ(G × H), where σ(S) denotes the
smallest σ-algebra containing all sets in the collection S. Previously we had defined σ(X), the σ-algebra
generated by a random variable X. The two usages are consistent. Suppose X is a random variable on
(Ω, F, P ) mapping Ω into (X , G), and let S consist of all preimages in Ω of sets in G. Then σ(X) and σ(S)
are the same.
Suppose (Ω, F, P ) is a probability space, and that (X , G) and (Y, H) are measurable spaces. Let (X ×
Y, G ⊗ H) denote their product. Suppose further that Z : Ω → X × Y is measurable and thus a random
variable taking values in X × Y. Express Z as (X, Y ) where X, Y are the components of Z, so that
X : Ω → X , Y : Ω → Y. Then it can be shown that X and Y are themselves measurable and are thus
random variables in their own right. The probability measures associated with these two random variables
are as follows:
P_X(S) := P(Z^{-1}(S × Y)), ∀S ∈ G, P_Y(T) := P(Z^{-1}(X × T)), ∀T ∈ H. (8.6)
We refer to Z = (X, Y ) as a joint random variable with joint probability measure PZ , and to PX and PY
as the marginal probability measures (or just marginal probabilities) of PZ for X and Y respectively.
Definition 8.8. Suppose (Ω, F, P ) is a probability space. Suppose (X , G) and (Y, H) are measurable spaces,
and let Z = (X, Y) : Ω → X × Y be a joint random variable. Finally, suppose S ∈ G, T ∈ H are events
involving X and Y respectively. Then the conditional probability Pr{X ∈ S|Y ∈ T } is defined as
Pr{X ∈ S | Y ∈ T} = Pr{Z = (X, Y) ∈ S × T} / Pr{Y ∈ T} = P_Z(S × T) / P_Y(T). (8.7)
Further, X and Y are said to be independent if
PZ (S × T ) = PX (S) × PY (T ), ∀S ∈ G, T ∈ H. (8.8)
In the definition of the conditional probability (8.7), it is assumed that PY (T ) > 0.
A common application of conditional probabilities arises when both X and Y are finite sets. In this
case X, Y, Z are random variables assuming values in finite sets X , Y, X × Y. Suppose to be specific that
X = {x1 , · · · , xn } and Y = {y1 , · · · , ym }. Then it is convenient to represent the joint probability distribution
of Z = (X, Y ) as an n × m matrix Θ, where
θij = Pr{Z = (xi , yj )} = Pr{X = xi &Y = yj }.
Let us denote the marginal probabilities as
φi = Pr{X = xi }, ψj = Pr{Y = yj }.
Then it is easy to infer that
φ^T = Θ 1_m, ψ = 1_n^T Θ,
where 1k denotes a column vector of k ones. Note that we follow the convention that a probability distribution
is a row vector. Also, in this simple situation, it can be assumed without any loss of generality that φi > 0
for all i, and ψj > 0 for all j. If φi = 0 for some index i, it means that θij = 0 for all j; therefore the element
x_i can be deleted from X without affecting anything. Similar remarks apply to ψ as well. With these
notational conventions, it is easy to see that
Pr{X ∈ S | Y ∈ T} = Pr{Z = (X, Y) ∈ S × T} / Pr{Y ∈ T} = ( Σ_{x_i ∈ S, y_j ∈ T} θ_ij ) / ( Σ_{y_j ∈ T} ψ_j ).
All of the above definitions can be extended to more than two random variables.
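The finite case above is easy to carry out numerically. The following Python sketch is purely illustrative (the joint matrix Θ and the index sets S, T are my own choices): it computes the marginals φ, ψ from a joint matrix Θ and evaluates the conditional probability Pr{X ∈ S | Y ∈ T} exactly as in the last displayed formula.

```python
import numpy as np

# Illustrative joint distribution Theta on a finite product space (n = 2, m = 3).
Theta = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])

phi = Theta.sum(axis=1)    # phi^T = Theta 1_m  (marginal of X)
psi = Theta.sum(axis=0)    # psi = 1_n^T Theta  (marginal of Y)

def cond_prob(S, T):
    """Pr{X in S | Y in T} = (sum_{i in S, j in T} theta_ij) / (sum_{j in T} psi_j)."""
    return Theta[np.ix_(S, T)].sum() / psi[T].sum()

print(phi, psi)                 # marginals: [0.4 0.6], [0.15 0.45 0.4]
print(cond_prob([0], [1, 2]))   # Pr{X = x_1 | Y in {y_2, y_3}}
```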
If p = ∞, we define L∞ (Ω, P ) to be the set of functions that are essentially bounded, that is, bounded
except on a set of measure zero, and define the corresponding norm as the “essential supremum” of f (·),
that is
‖f‖_∞ = inf{c : P{|f(ω)| ≥ c} = 0}. (8.12)
Next, for a given p ∈ [1, ∞], define the conjugate index q ∈ [1, ∞] as the unique solution of the equation
1/p + 1/q = 1. (8.13)
In particular, if p ∈ (1, ∞), then q = p/(p − 1). If p = 1, then q = ∞ and vice versa. Then Hölder’s
inequality [4, p. 113] states that if f ∈ Lp (Ω, P ) and g ∈ Lq (Ω, P ) where p and q are conjugate indices, then
the product f g ∈ L1 (Ω, P ), and
‖fg‖_1 ≤ ‖f‖_p · ‖g‖_q, or equivalently,
∫_Ω |f(ω)g(ω)| P(dω) ≤ ( ∫_Ω |f(ω)|^p P(dω) )^{1/p} · ( ∫_Ω |g(ω)|^q P(dω) )^{1/q}. (8.14)
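As a quick sanity check, Hölder's inequality (8.14) can be verified numerically on a finite probability space; the sketch below (the numbers and names are my own, not from the text) uses p = 3 and the conjugate index q = 3/2.

```python
import numpy as np

rng = np.random.default_rng(1)

P = np.array([0.2, 0.3, 0.5])       # probability measure on a 3-point Omega
f = rng.normal(size=3)
g = rng.normal(size=3)
p, q = 3.0, 1.5                     # conjugate indices: 1/3 + 2/3 = 1

lhs = np.sum(np.abs(f * g) * P)                                   # ||f g||_1
rhs = np.sum(np.abs(f) ** p * P) ** (1 / p) * np.sum(np.abs(g) ** q * P) ** (1 / q)
assert lhs <= rhs + 1e-12           # Hoelder's inequality (8.14)
print(lhs, rhs)
```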
By using Hölder’s inequality and the fact that P(Ω) = 1, it is easy to show that ‖f‖_q ≤ ‖f‖_p whenever 1 ≤ q ≤ p ≤ ∞, so that L^p(Ω, P) ⊆ L^q(Ω, P).
In particular, if a real random variable X is square-integrable, it is also absolutely integrable, and thus has
a well-defined mean. Moreover, with µ := E[X, P ], we can write
∫_Ω [X(ω) − µ]^2 P(dω) = ∫_Ω X^2(ω) P(dω) − µ^2 =: V(X, P),
which is the variance of X.
Any two conditional expectations E(X|G) agree almost surely, and each is called a “version” of the conditional
expectation.
Note that E(X|G) is a (Ω, G)-measurable approximation to X such that, when restricted to sets in G, E(X|G) is functionally equivalent to X, as stated in (8.17). Note that (8.17) can also be expressed as
∫_Ω X(ω) I_D(ω) P(dω) = ∫_Ω Y(ω) I_D(ω) P(dω), ∀D ∈ G,
where Y denotes (a version of) E(X|G).
Now we interpret some of the statements in the theorem. The obvious ones are not discussed. Item 3
states that the expected value of the conditional expectation is the same as the expected value of the original
random variable. Item 5 states that if X is multiplied by a bounded random variable Z ∈ M(G), then
the multiplier Z just passes through the conditional expectation operation. Item 6 states that if we were
to first take the conditional expectation of X with respect to G, and then take the conditional expectation
of the resulting random variable with respect to a smaller σ-algebra H, then the answer would be the same
as if we had directly taken the conditional expectation with respect to H. Note that this property is called
the “tower property” on [48, p. 88]. A ready consequence of Items 7 and 8 is that, if X1 ≥ X2 almost
surely, then E(X1 |G) ≥ E(X2 |G) almost surely. Finally, Item 9 states that if X belongs to the smaller space
L^2(Ω, F, P), which is an inner product space, then its conditional expectation also belongs to L^2(Ω, F, P), and
can be computed using the projection theorem.
Example 8.3. In this example, we illustrate the concept of a conditional expectation in a very simple case,
namely, that of a random variable assuming only finitely many values. Suppose X = {x1 , · · · , xn } and
V = {v1 , · · · , vm } are finite sets, and that Z = (X, V ) is a joint random variable assuming values in X × V.
Let Θ ∈ [0, 1]n×m denote the joint probability distribution of Z written out as a matrix, and let φ, ψ denote
the marginal probability distributions of X and V respectively, written out as row vectors. Finally, suppose
f : X × V → R is a given function. Then f (Z) is a real-valued random variable assuming values in some
finite set.
Because both X and V are finite-valued, we can use the canonical representation, and choose Ω = X × V,
F = 2Ω , and P = Θ. Now suppose we define G to be the σ-algebra generated by V alone. Thus G =
{∅, X } ⊗ 2V . Again, because f (X, V ) assumes only finitely many values over a finite set, it is a bounded
random variable. Therefore E(f |G) is the best approximation to f (X, V ) using a function of V alone. From
Item 9 of the theorem this conditional expectation can be determined using projections.
As pointed out after Definition 8.8, it can be assumed without loss of generality that every component
of ψ is positive. Therefore the ratio
θ_ij / ψ_j = Pr{X = x_i | V = v_j}
is well-defined, though it could be zero.
In order to determine E(f |G), we should find a function g : V → R such that the error E[(f − g)2 , Θ] is
minimized. Let g_1, · · ·, g_m denote the values of g(·), write f_ij := f(x_i, v_j), and define the objective function
J = (1/2) Σ_{j=1}^m Σ_{i=1}^n (g_j − f_ij)^2 θ_ij.
Then the objective is to choose the constants g_1, · · ·, g_m so as to minimize J. This happens when
0 = ∂J/∂g_j = Σ_{i=1}^n (g_j − f_ij) θ_ij.
Since Σ_{i=1}^n θ_ij = ψ_j, solving for g_j gives
g_j = (1/ψ_j) Σ_{i=1}^n f_ij θ_ij = Σ_{i=1}^n f_ij Pr{X = x_i | V = v_j},
so that E(f|G), evaluated at V = v_j, is the average of f with respect to the conditional distribution of X given V = v_j.
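The formula for g_j can be checked numerically. The sketch below (the joint matrix Θ and the function f are my own illustrative choices) computes E(f|G) for a finite joint random variable and verifies that perturbing any component g_j does not decrease the objective J.

```python
import numpy as np

Theta = np.array([[0.10, 0.20, 0.10],
                  [0.05, 0.25, 0.30]])        # joint distribution of (X, V)
f = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])               # f(x_i, v_j) as an n x m matrix

psi = Theta.sum(axis=0)                       # marginal of V
g = (f * Theta).sum(axis=0) / psi             # g_j = E[f(X, V) | V = v_j]
print(g)

def J(gvec):
    """The quadratic objective of the example."""
    return 0.5 * np.sum((gvec[None, :] - f) ** 2 * Theta)

# g minimizes J, so small perturbations of any component cannot decrease it.
assert all(J(g) <= J(g + 0.01 * e) for e in np.eye(3))
```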
For future use, we introduce definitions of what it means for a sequence of real-valued random variables to
converge. Three commonly used notions of convergence are convergence in probability, almost sure convergence,
and convergence in the mean. All are defined here.
Definition 8.10. Suppose {Xn }n≥0 is a sequence of real-valued random variables, and X ∗ is a real-valued
random variable, on a common probability space (Ω, F, P ). Then the sequence {Xn }n≥0 is said to converge
to X ∗ in probability if
P({ω ∈ Ω : |X_n(ω) − X^*(ω)| > ε}) → 0 as n → ∞, ∀ε > 0. (8.23)
The sequence {Xn }n≥0 is said to converge to X ∗ almost surely (or almost everywhere) if
P ({ω ∈ Ω : Xn (ω) → X ∗ (ω) as n → ∞}) = 1. (8.24)
Now suppose that Xn , X ∗ ∈ L1 (Ω, P ). Then the sequence {Xn }n≥0 is said to converge to X ∗ in the mean
if
‖X_n − X^*‖_1 → 0 as n → ∞. (8.25)
Note that, for any p ∈ (1, ∞), “convergence in the p-th mean” can be defined in the space L^p(Ω, P), as ‖X_n − X^*‖_p → 0 as n → ∞. Also, the extension of Definition 8.10 to random variables assuming values in
a vector space Rd is obvious and is left to the reader.
The relationship between the various types of convergence is as follows:
Theorem 8.2. Suppose {Xn }, X ∗ are random variables defined on some probability space (Ω, F, P ). Suppose
Xn → X ∗ in probability as n → ∞. Then every subsequence of {Xn } contains a subsequence that converges
almost surely to X ∗ .
The converse of Theorem 8.2 is also true. See [4, Section 21, Problem 10(c)].
Theorem 8.3. Suppose {Xn }, X ∗ ∈ L1 (Ω, P ). Then
1. Xn → X ∗ a.s. implies that Xn → X ∗ in probability.
2. X_n → X^* in the mean implies that X_n → X^* in probability.
3. Suppose there is a nonnegative random variable Z ∈ L^1(Ω, P) such that |X_n| ≤ Z a.e., and suppose that X_n → X^* a.s. Then X_n → X^* in the mean.
These statements also apply to Rd -valued random variables.
Problem 8.1. Show that a consequence of Definition 8.1 is that ∅ ∈ F.
Problem 8.2. Show that a consequence of Definition 8.2 is that P (∅) = 0.
In other words, the conditional probability of the state Xt+1 depends only on the most recent value of Xt ;
adding information about the past values of Xτ for τ < t does not change the conditional probability. One
can also say that X_{t+1} is independent of X_0^{t−1} given X_t. This property is sometimes paraphrased as “the
future is conditionally independent of the past, given the present.”
A Markov process over a finite set X is completely characterized by its state transition matrix A,
where
aij := Pr{Xt+1 = xj |Xt = xi }, ∀xi , xj ∈ X .
Thus in aij , i denotes the current state and j the future state. The reader is cautioned that some authors
interchange the roles of i and j in the above definition. If the transition probability does not depend on t,
then the Markov process is said to be stationary; otherwise it is said to be nonstationary. We do not
deal with nonstationary Markov processes in these notes.
Note that a_ij ∈ [0, 1] for all i, j. Also, at any time t + 1, it must be the case that X_{t+1} ∈ X, no matter
what X_t is. Therefore, the sum of each row of A equals one, i.e.,
Σ_{j=1}^n a_ij = 1, i = 1, . . . , n. (8.27)
This can be expressed compactly as
A 1_n = 1_n, (8.28)
where 1n denotes the column vector consisting of n ones. For future purposes, let us refer to a matrix
A ∈ [0, 1]n×n that satisfies (8.27) as a row-stochastic matrix, and denote by Sn×n the set of all row-
stochastic matrices of dimension n × n.
The matrix A is often called the “one-step” transition matrix, because row i of A gives the probability distribution of X_{t+1} if X_t = x_i. So we can ask: What is the k-step transition matrix? In other words, what is the probability distribution of X_{t+k} if X_t = x_i? It is not difficult to show that this conditional distribution is just the i-th row of A^k. Thus the k-step transition matrix is just A^k.
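A short numerical illustration (the three-state chain below is my own example, not taken from the text): row i of A^k gives the conditional distribution of X_{t+k} given X_t = x_i, and an initial row vector φ is propagated to φA^k.

```python
import numpy as np

A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])      # a 3-state row-stochastic matrix

k = 5
Ak = np.linalg.matrix_power(A, k)    # k-step transition matrix
phi0 = np.array([1.0, 0.0, 0.0])     # start deterministically in state 1

print(Ak[0])        # distribution of X_k given X_0 = x_1
print(phi0 @ Ak)    # the same distribution, written as phi0 A^k
```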
Example 8.4. A good example of a Markov process is the “snakes and ladders” game. Take for example the
board shown in Figure 8.1. In this case, we can let Xt denote an integer between 1 and 100, corresponding to
the square on which the player is. Thus X = {1, · · · , 100}. Suppose the player throws a four-sided die with
each of the outcomes (1, 2, 3, 4) being equally probable. Then the resulting sequence of positions {Xt }t≥0
is a stochastic process. Suppose for example that the player is on square 60. Note that what happens next
after a player has reached square 60 (or any other square) does not depend on how the player reached that
square. That is why the sequence of positions is a Markov process. Now, if the player is on square 60 so that
Xt = 60, then with probability of 1/4, the position at time t + 1 will be 61, 19 (snake on 62), 81 (ladder on
63) and 60 (snake on 64). Hence, in row 60 of the 100 × 100 state transition matrix, there are elements of 1/4
in columns 19, 60, 61, 81 and zeros in the remaining 96 columns. In the same manner, the entire 100 × 100
state transition matrix can be determined.
Let us suppose that the snakes and ladders game always starts with the player being in square 1. Thus
X0 is not random, but is deterministic, and the “probability distribution” of X0 , viewed as a row vector,
has a 1 in column 1 and zeros elsewhere. If we multiply this row vector by Ak for any integer k, we get the
probability distribution of the player’s position after k moves.
An application of the Gerschgorin circle theorem [18, Theorem 6.1.1] shows that, whenever A is row-
stochastic, the spectral radius ρ(A) ≤ 1. Moreover, the relationship (8.27) shows that λ = 1 is an eigenvalue
of A with column eigenvector 1n , so that in fact ρ(A) = 1. Thus one can ask: What does the row eigenvector
corresponding to λ = 1 look like? If there is a nonnegative row eigenvector µ ∈ Rn+ , then it can be scaled
so that µ1n = 1. Such a µ is called a stationary distribution of the Markov process, because if Xt has
the probability distribution µ, then so does Xt+1 . More generally, if X0 has the probability distribution µ,
then so does Xt for all t ≥ 0.
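Numerically, a stationary distribution can be obtained as a suitably scaled row (left) eigenvector of A for the eigenvalue λ = 1, as in the sketch below (the chain is the same illustrative three-state example as above; this is only one of several ways to compute µ).

```python
import numpy as np

A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])

eigvals, eigvecs = np.linalg.eig(A.T)        # columns: left eigenvectors of A
idx = np.argmin(np.abs(eigvals - 1.0))       # eigenvalue closest to 1
mu = np.real(eigvecs[:, idx])
mu = mu / mu.sum()                           # scale so that mu 1_n = 1

print(mu)          # stationary distribution
print(mu @ A)      # equals mu, up to rounding
```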
Theorem 8.4. (See [5, Theorem 3.2, p. 8].) Every row-stochastic matrix A has a nonnegative row eigen-
vector corresponding to the eigenvalue λ = 1.
Note that Theorem 8.4 is a very weak statement. It states only that there exists a stationary distribution;
nothing is said about whether this is unique or not. To proceed further, it is helpful to make some assumptions
about A.
Definition 8.13. A row-stochastic matrix A is said to be irreducible if it is not possible to permute its rows and columns symmetrically (via a permutation matrix Π) such that
Π^{−1} A Π = [ B_{11}  0 ; B_{21}  B_{22} ],
where B_{11} and B_{22} are square blocks of positive dimension.
Thus a row-stochastic matrix is irreducible if it is not possible to turn it into a block-triangular matrix
through symmetric row and column permutations. The notion of irreducibility plays a crucial role in the
theory of Markov processes. So it is worthwhile to give an alternate characterization of irreducibility.
Lemma 8.1. A row-stochastic matrix A is irreducible if and only if, for any pair of states y_s, y_f ∈ X, there exists a sequence of states y_1, · · ·, y_l ∈ X such that, with y_0 = y_s and y_{l+1} = y_f, we have that
a_{y_k y_{k+1}} > 0, k = 0, 1, · · ·, l.
Thus the matrix A is irreducible if and only if, for every pair of states ys and yf , there is a path from
ys to yf such that every step in the path has a positive probability. In such a case we can say that yf
is reachable from ys . There are several equivalent characterizations of irreducibility, and for nonnegative
matrices in general, not necessarily satisfying (8.27); see [46, Chapter 3]. Another useful reference is [5],
which is devoted entirely to the study of nonnegative matrices. One such characterization is given next.
Theorem 8.5. (See [46, Corollary 3.8].) A row-stochastic matrix A is irreducible if and only if
Σ_{l=0}^{n−1} A^l > 0,
where the inequality is interpreted componentwise.
So we can start with M0 = I and define recursively Ml+1 = I + AMl . If Ml > 0 for any l, then A is
irreducible. If we get up to Mn−1 and this matrix is not strictly positive, then A is not irreducible.
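The recursive test just described is easy to implement; a minimal sketch follows (the function name and the two example matrices are my own, chosen only to exercise both outcomes).

```python
import numpy as np

def is_irreducible(A: np.ndarray) -> bool:
    """Test irreducibility via Theorem 8.5: A is irreducible iff
    sum_{l=0}^{n-1} A^l > 0 componentwise.  The partial sums are built up
    recursively as M_0 = I, M_{l+1} = I + A M_l, stopping early if possible."""
    n = A.shape[0]
    M = np.eye(n)
    for _ in range(n - 1):
        if np.all(M > 0):
            return True
        M = np.eye(n) + A @ M
    return bool(np.all(M > 0))

# A reducible chain (states 1 and 2 cannot reach state 3) and an irreducible
# cyclic chain.
A_red = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5]])
A_irr = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
print(is_irreducible(A_red), is_irreducible(A_irr))   # False True
```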
Theorem 8.6. (See [46, Theorem 3.25].) Suppose A is an irreducible row-stochastic matrix. Then
1. λ = 1 is a simple eigenvalue of A.
2. The associated row eigenvector can be chosen to have all positive components.
3. Thus A has a unique stationary distribution, whose elements are all positive.
4. There is an integer p, called the period of A, such that the spectrum of A is invariant under rotation by exp(i2π/p).
Definition 8.14. A row-stochastic matrix A is said to be primitive if there exists an integer l such that
Al > 0.
Theorem 8.7. (See [46, Theorem 3.15].) A row-stochastic matrix A is primitive if and only if it is irreducible
and aperiodic.
Theorem 8.8. (See [46, Lemma 4.12].) Suppose A is an irreducible row-stochastic matrix, and let µ denote
the corresponding stationary distribution. Then
lim_{T→∞} (1/T) Σ_{t=0}^{T−1} A^t = 1_n µ. (8.29)
Therefore, the average of I, A, · · ·, A^{T−1} approaches the rank-one matrix 1_n µ. So, if φ is any probability distribution on X, and the Markov process is started off with the initial distribution φ, then the distribution of the state X_t is φA^t. Note that, because φ is a probability distribution, we have that φ1_n = 1. Therefore (8.29) implies that
lim_{T→∞} (1/T) Σ_{t=0}^{T−1} φA^t = φ1_n µ = µ, ∀φ. (8.30)
The above relationship holds for every φ and forms the basis for the so-called Markov chain Monte Carlo (MCMC) algorithm. Suppose {X_t}_{t≥0} is a Markov process evolving over the state space X, with an irreducible state transition matrix A and stationary distribution µ. Suppose further that f : X → R is a real-valued function defined on the state space X. We wish to compute the expected value of the random variable f(X_t) with respect to the stationary distribution µ, namely
E[f(X), µ] = Σ_{x_i ∈ X} f(x_i) µ_i. (8.31)
While we may know A, often we may not know µ or may not wish to spend the effort to compute it due to
the high dimension of A. In such a case, we start off the Markov process with an arbitrary initial probability
distribution φ, let it run for some time t0 , and then compute the quantity
f̂_T = (1/T) Σ_{t=t_0+1}^{t_0+T} f(X_t). (8.32)
Because this quantity is based on the observed states X_t, which are random, f̂_T is also random. However, its expected value is close to E[f(X), µ] once t_0 is sufficiently large, and f̂_T converges to E[f(X), µ] as T → ∞, so it is a good approximation to the expected value for large finite T.
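A minimal simulation of this MCMC estimate is sketched below; the chain A, the function f, the burn-in t_0, and the horizon T are all my own illustrative choices. The running average f̂_T of (8.32) is compared against the exact value E[f(X), µ] computed from the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
f = np.array([1.0, 10.0, 100.0])      # f(x_i) for the three states

t0, T = 1_000, 100_000                # burn-in and averaging horizon
x, total = 0, 0.0                     # arbitrary initial state
for t in range(t0 + T):
    x = rng.choice(3, p=A[x])         # one step of the Markov chain
    if t >= t0:
        total += f[x]
f_hat = total / T                     # the estimate \hat{f}_T of (8.32)

# Exact value E[f(X), mu] from the stationary distribution.
eigvals, eigvecs = np.linalg.eig(A.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()
print(f_hat, float(f @ mu))           # the two numbers should be close
```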
The next result is analogous to Theorem 8.8 for primitive matrices.
Theorem 8.9. (See [46, Corollary 4.13].) Suppose A is a primitive row-stochastic matrix, and let µ denote
the corresponding stationary distribution. Then
A^l → 1_n µ as l → ∞. (8.33)
Now we prove a couple of useful lemmas about irreducible and primitive matrices respectively. These are
useful when we study so-called Markov Decision Processes.
Theorem 8.10. Suppose A is a nontrivial convex combination of row stochastic matrices A1 , · · · , Ak , and
that at least one Ai is irreducible. Then A is irreducible.
Proof. Without loss of generality, write
A = Σ_{i=1}^k γ_i A_i,
where γ_i ≥ 0 for all i, Σ_{i=1}^k γ_i = 1, γ_1 > 0, and A_1 is irreducible. Since every A_i is nonnegative, A ≥ γ_1 A_1, and hence A^l ≥ γ_1^l A_1^l for every l ≥ 0, where the inequalities are componentwise. By Theorem 8.5, Σ_{l=0}^{n−1} A_1^l > 0, where again the inequality is componentwise. Combining this with the above inequality shows that
Σ_{l=0}^{n−1} A^l ≥ Σ_{l=0}^{n−1} γ_1^l A_1^l ≥ γ_1^{n−1} Σ_{l=0}^{n−1} A_1^l > 0,
because γ_1^l ≥ γ_1^{n−1} whenever l ≤ n − 1. Therefore, by Theorem 8.5 again, A is irreducible.
Corollary 8.1. The set of irreducible matrices is convex.
Theorem 8.11. Suppose A is a nontrivial convex combination of row stochastic matrices A1 , · · · , Ak , and
that at least one Ai is primitive. Then A is primitive.
The proof is similar to that of Theorem 8.10, except that Theorem 8.5 is replaced by Definition 8.14.
Corollary 8.2. The set of primitive matrices is convex.
Now suppose that the state space X can be partitioned as X = T ∪ S, where T = {x_1, · · ·, x_m} consists of transient states and S = {a_1, · · ·, a_s} is an absorbing set: once the trajectory enters S, it never leaves it. If the states are ordered so that those in T come first, the state transition matrix takes the form
M = [ A  B ; 0  C ], (8.34)
where C ∈ S^{s×s} is a row-stochastic matrix in itself, and the matrix B has at least one nonzero element. Note too that the set S can be absorbing even if no individual state in S is absorbing; for example, suppose C is a permutation matrix over s indices. However, if C = I_s, the identity matrix, then not only is the set S absorbing, but every individual state in S is absorbing. In this case the matrix M looks like
M = [ A  B ; 0  I_s ]. (8.35)
An illustration of an absorbing state is provided by the snakes and ladders game. If the player’s position
hits 100, then the game is over. So 100 is an absorbing state. In other games like Blackjack, there are two
absorbing states, namely W and L (for win and lose). In the Markov process literature, any sample path X_0^l such that X_l is an absorbing state is called an episode.
It can be shown that if the state Xt of the Markov process enters the absorbing set S with probability
one as t → ∞, then B 6= 0, that is, B contains at least one nonzero element, and further, ρ(A) < 1. See
specifically Items 3 and 6 of [46, Theorem 4.7]. More details can be found in [46, Section 4.2.2]. (Note that
notation in [46] is different.) For the purposes of RL, it is useful to go beyond these facts, and to compute
the probability distribution of the time at which the state trajectory enters S. In turn this gives the average
number of time steps needed to reach the absorbing set. In case there are multiple absorbing states, it is also
possible to compute the probability of hitting an individual absorbing state ai within the overall absorbing
set S. To be specific, define θ_iS to be the first time that a sample path X_0^∞ hits the set S, starting at X_0 = x_i. Further, if M is of the form (8.35) so that each state in S is absorbing, define θ_ik to be the first time that a sample path X_0^∞ hits the absorbing state a_k, starting at X_0 = x_i. Then we have the following result:
Pr{θ_iS = l} = e_i^T A^{l−1} B 1_s, ∀l ≥ 1, (8.36)
where e_i denotes the i-th elementary column vector with a 1 in row i and zeros elsewhere. If M has the form (8.35), then for each k ∈ [s], we have
Pr{θ_ik = l} = e_i^T A^{l−1} b_k, ∀l ≥ 1, (8.37)
where b_k denotes the k-th column of B. The probability that a sample path X_0^∞ with X_0 = x_i terminates in the absorbing state a_k is given by
p_ik = e_i^T (I − A)^{−1} b_k. (8.38)
Moreover,
Σ_{k=1}^s p_ik = 1, ∀i ∈ [m].
The vector of probabilities that a sample path X_0^∞ terminates in the absorbing state a_k is given by
p_k = (I − A)^{−1} b_k. (8.39)
For each transient initial state x_i ∈ T, define the average hitting time to reach the absorbing set S starting from the initial state x_i to be the expected value of θ_iS, that is,
θ̄_iS = Σ_{l=1}^∞ l Pr{θ_iS = l},
and let θ̄ denote the vector of these average hitting times. Then
θ̄ = (I − A)^{−1} 1_m. (8.40)
Proof. We begin by deriving the expressions for the probability distributions. For each pair of indices
i, j ∈ [m] and each integer l, the value (A^l)_ij is the probability that, starting in state x_i at time t = 0, the state at time l equals x_j, while staying within the set T. Thus the probability that θ_iS = l is given by
Pr{θ_iS = l} = Σ_{j=1}^m (A^{l−1})_ij (B 1_s)_j = e_i^T A^{l−1} B 1_s.
This is (8.36). If S consists of individual absorbing states, and we wish to determine the probability distri-
bution that Xl = ak given that X0 = xi , then we simply replace B1s by the corresponding k-th column of
B. This is (8.37). Equation (8.38) is obtained by observing that, since ρ(A) < 1, we have that
Σ_{l=1}^∞ A^{l−1} = (I − A)^{−1}.
This is (8.38). Stacking these probabilities as i varies over [m] gives (8.39).
Next we deal with the hitting times. Define the vector b = B1s , and consider the modified Markov
process with the state transition matrix
M = [ A  b ; 0  1 ].
In effect, we have aggregated the set of absorbing states into one “virtual state.” From the standpoint of
computing θ̄, this is permissible, because once the trajectory hits the set S, or the virtual “last state” in the
modified formulation, the time counter stops. To prove (8.40), suppose the Markov process starts in state
xi . Then there are two possibilities: First, with probability bi , the trajectory hits the last virtual state. In
this case the counter stops, and we can say that the hitting time is 1. Second, with probability aij for each
j, the trajectory hits the state xj . In this case, the hitting time is now 1 + θ̄j . Therefore we have
θ̄_i = b_i + Σ_{j=1}^m a_ij (1 + θ̄_j),
or in matrix form,
(I − A) θ̄ = 1_m.
Clearly this is equivalent to (8.40).
Example 8.6. Consider the “toy” snakes and ladders game with two extra states, called W and L for win
and lose respectively. The rules of the game are as follows:
Initial state is S.
The resulting state transition matrix is as follows:

      S     1     4     5     6     7     8     W     L
S     0     0.25  0.25  0.25  0     0.25  0     0     0
1     0     0     0.25  0.50  0     0.25  0     0     0
4     0     0     0     0.25  0.25  0.25  0.25  0     0
5     0     0.25  0     0     0.25  0.25  0.25  0     0
6     0     0.25  0     0     0     0.25  0.25  0.25  0
7     0     0.25  0     0     0     0     0.25  0.25  0.25
8     0     0.25  0     0     0     0     0.25  0.25  0.25
W     0     0     0     0     0     0     0     1     0
L     0     0     0     0     0     0     0     0     1
The average duration of a game, which is the expected time before hitting one of the two absorbing states
W or L, is given by (8.40), and is
θ̄ = [5.5738  5.4426  4.7869  4.9180  3.9344  3.1475  3.1475]^T,
where the components correspond to the starting states S, 1, 4, 5, 6, 7, 8 in that order.
To compute the probabilities of reaching the absorbing states W or L from any nonabsorbing state, define A to be the 7 × 7 submatrix on the top left, and B to be the 7 × 2 submatrix on the top right. Then the probabilities of hitting W and L are given by (8.39):

[P_W  P_L] = (I − A)^{−1} B =
    0.5433  0.4567
    0.5457  0.4543
    0.5574  0.4426
    0.5550  0.4450
    0.6440  0.3560
    0.5152  0.4848
    0.5152  0.4848
Not surprisingly, the two columns add up to one in each row, showing that, irrespective of the starting state, the sample path will surely hit either W or L. Also not surprisingly, the probability of hitting W is largest in state 6, because it is possible to win in one throw of the die, but impossible to lose in one throw.
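The numbers quoted in this example can be reproduced directly from the formulas (8.39) and (8.40); the sketch below types in the transition matrix from the table above (states ordered S, 1, 4, 5, 6, 7, 8, W, L) and solves the two linear systems.

```python
import numpy as np

# Transition matrix of the toy snakes and ladders game, rows/columns ordered
# as S, 1, 4, 5, 6, 7, 8, W, L (copied from the table in Example 8.6).
M = np.array([
    [0,    0.25, 0.25, 0.25, 0,    0.25, 0,    0,    0   ],   # S
    [0,    0,    0.25, 0.50, 0,    0.25, 0,    0,    0   ],   # 1
    [0,    0,    0,    0.25, 0.25, 0.25, 0.25, 0,    0   ],   # 4
    [0,    0.25, 0,    0,    0.25, 0.25, 0.25, 0,    0   ],   # 5
    [0,    0.25, 0,    0,    0,    0.25, 0.25, 0.25, 0   ],   # 6
    [0,    0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],   # 7
    [0,    0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],   # 8
    [0,    0,    0,    0,    0,    0,    0,    1,    0   ],   # W
    [0,    0,    0,    0,    0,    0,    0,    0,    1   ],   # L
])
A = M[:7, :7]     # transient-to-transient block
B = M[:7, 7:]     # transient-to-absorbing block (columns W, L)

theta_bar = np.linalg.solve(np.eye(7) - A, np.ones(7))   # average game length, (8.40)
P_WL = np.linalg.solve(np.eye(7) - A, B)                 # absorption probabilities, (8.39)

print(np.round(theta_bar, 4))   # should match the vector quoted above
print(np.round(P_WL, 4))        # should match [P_W  P_L] quoted above
```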
Problem 8.3. Suppose the last rule of the toy snakes and ladders game is modified as follows: If imple-
menting a move causes the player to go past L, then the player moves to L (and loses the game). With this
modification, compute the average length of a game and the probability of winning / losing from each state.
For a given sample path y_0^l, the likelihood that this sample path is generated by a Markov process with state transition matrix A is given by
L(y_0^l | A) = Pr{y_0} ∏_{t=1}^{l} Pr{X_t = y_t | X_{t−1} = y_{t−1}, A} = Pr{y_0} ∏_{t=1}^{l} a_{y_{t−1} y_t}. (8.42)
The formula becomes simpler if we take the logarithm of the above. Clearly, maximizing the log-likelihood of observing y_0^l is equivalent to maximizing the likelihood of observing y_0^l. Thus
LL(y_0^l | A) = log Pr{y_0} + Σ_{t=1}^{l} log a_{y_{t−1} y_t}. (8.43)
A further simplification is possible. For each pair (x_i, x_j) ∈ X^2, let ν_ij denote the number of times that the string x_i x_j occurs (in that order) in the sample path y_0^l. Next, define
ν̄_i := Σ_{j=1}^n ν_ij. (8.44)
It is easy to see that, instead of summing over strings y_{t−1} y_t, we can sum over strings x_i x_j; the string y_{t−1} y_t equals x_i x_j for precisely ν_ij values of t. Therefore
LL(y_0^l | A) = log Pr{y_0} + Σ_{i=1}^n Σ_{j=1}^n ν_ij log a_ij. (8.45)
We can ignore the first term as it does not depend on A. Also, A needs to satisfy the stochasticity constraint (8.28). So we want to maximize the right side of (8.45) (without the term log Pr{y_0}) subject to this constraint. For this purpose we form the Lagrangian
J = Σ_{i=1}^n Σ_{j=1}^n ν_ij log a_ij + Σ_{i=1}^n λ_i (1 − Σ_{j=1}^n a_ij),
where the λ_i are Lagrange multipliers. Then
∂J/∂a_ij = ν_ij / a_ij − λ_i.
Setting this partial derivative to zero gives a_ij = ν_ij / λ_i, and enforcing the constraint Σ_{j=1}^n a_ij = 1 gives λ_i = ν̄_i. Therefore the maximum likelihood estimate for the state transition matrix of a Markov process, based on the sample path y_0^l, is given by
a_ij = ν_ij / ν̄_i. (8.46)
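The estimator (8.46) amounts to counting transitions along the observed path and normalizing each row; a minimal sketch follows (the function name, the true chain, and the path length are my own choices).

```python
import numpy as np

def estimate_transition_matrix(path, n):
    """Maximum likelihood estimate (8.46): a_ij = nu_ij / nu_bar_i, where nu_ij
    counts transitions x_i -> x_j along the observed sample path."""
    nu = np.zeros((n, n))
    for s, t in zip(path[:-1], path[1:]):
        nu[s, t] += 1
    nu_bar = nu.sum(axis=1, keepdims=True)
    return np.divide(nu, nu_bar, out=np.zeros_like(nu), where=nu_bar > 0)

# Illustration: simulate a path from a known chain and recover A.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1, 0.0],
                   [0.0, 0.5, 0.5],
                   [0.3, 0.0, 0.7]])
x, path = 0, [0]
for _ in range(50_000):
    x = rng.choice(3, p=A_true[x])
    path.append(x)
print(np.round(estimate_transition_matrix(path, 3), 3))   # close to A_true
```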
Theorem 8.13. Suppose f : R^n → R^n and that there exists a constant ρ < 1 such that
‖f(x) − f(y)‖ ≤ ρ ‖x − y‖, ∀x, y ∈ R^n. (8.47)
Then there exists a unique x^* ∈ R^n such that
f(x^*) = x^*. (8.48)
To find x^*, choose an arbitrary x_0 ∈ R^n and define x_{l+1} = f(x_l). Then {x_l} → x^* as l → ∞. Moreover, we have the explicit estimate
‖x^* − x_l‖ ≤ (ρ^l / (1 − ρ)) ‖x_1 − x_0‖. (8.49)
Therefore x^* satisfies (8.48). To show that x^* is unique, suppose f(y^*) = y^*. Then it follows from (8.47) that
‖x^* − y^*‖ = ‖f(x^*) − f(y^*)‖ ≤ ρ ‖x^* − y^*‖.
Since ρ < 1, the only way in which the above inequality can hold is if ‖x^* − y^*‖ = 0, i.e., if x^* = y^*. Finally, let m → ∞ in (8.51) so that x_m → x^* and ‖x_m − x_l‖ → ‖x^* − x_l‖. Then (8.51) becomes (8.49).
The bound (8.49) is extremely useful. Note that ‖x_1 − x_0‖ = ‖f(x_0) − x_0‖. Therefore ‖x_1 − x_0‖ is a measure of how far off the initial guess x_0 is from being a fixed point of f. Then (8.49) gives an explicit estimate of how far x^* is from x_l, at each iteration l. Note that the bound on the right side of (8.49) decreases by a factor of ρ at each iteration.
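The bound (8.49) also suggests a natural stopping rule for the fixed-point iteration: stop once ρ^l ‖x_1 − x_0‖ / (1 − ρ) falls below the desired accuracy. The sketch below (the function name, the tolerance, and the affine example map are my own) implements exactly this.

```python
import numpy as np

def fixed_point(f, x0, rho, tol=1e-10, max_iter=10_000):
    """Iterate x_{l+1} = f(x_l) and stop once the a-priori bound (8.49),
    rho^l / (1 - rho) * ||x_1 - x_0||, falls below tol."""
    x = np.asarray(x0, dtype=float)
    x_next = f(x)
    c = np.linalg.norm(x_next - x)          # ||x_1 - x_0||
    x = x_next
    for l in range(1, max_iter):
        if (rho ** l) / (1 - rho) * c < tol:
            return x, l
        x = f(x)
    return x, max_iter

# Example: f(x) = 0.5 x + b is a contraction with rho = 0.5 in the Euclidean
# norm; its unique fixed point is 2 b.
b = np.array([1.0, -2.0])
x_star, iters = fixed_point(lambda x: 0.5 * x + b, np.zeros(2), rho=0.5)
print(x_star, iters)                        # approximately [ 2. -4.]
```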
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[3] A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximation.
Springer-Verlag, 1990.
[5] A. Berman and R. J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. Academic Press,
1979.
[7] V. S. Borkar. Asynchronous stochastic approximations. SIAM Journal on Control and Optimization,
36(3):840–851, 1998.
[8] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press,
2008.
[9] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and
reinforcement learning. SIAM Journal on Control and Optimization, 38:447–469, 2000.
[10] R. J. Boucherie and N. M. van Dijk, editors. Markov Decision Processes in Practice. Springer Nature,
2017.
[11] L. Breiman. Probability. SIAM: Society for Industrial and Applied Mathematics, 1992.
[12] DeepMind. AlphaZero: Shedding new light on chess, shogi, and Go. https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go.
[13] C. Derman and J. Sacks. On Dvoretzky’s stochastic approximation theorem. Annals of Mathematical Statistics, 30(2):601–606, 1959.
[14] R. Durrett. Probability: Theory and Examples (5th Edition). Cambridge University Press, 2019.
[16] E. G. Gladyshev. On stochastic approximation. Theory of Probability and Its Applications, X(2):275–
278, 1965.
[17] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer-Verlag, Berlin and
Heidelberg, 2001.
[18] R. A. Horn and C. R. Johnson. Matrix Analysis (Second Edition). Cambridge University Press, 2013.
[20] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of
Mathematical Statistics, 23(3):462–466, 1952.
[21] A. N. Kolmogorov. Foundations of Probability (Second English Edition). Chelsea, 1950. (English
translation by Kai Lai Chung).
[22] V. Konda and J. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization,
42(4):1143–1166, 2003.
[23] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Neural Information Processing Systems
(NIPS1999), pages 1008–1014, 1999.
[24] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained
Systems. Applied Mathematical Sciences. Springer-Verlag, 1978.
[25] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications.
Springer-Verlag, 1997.
[26] T. L. Lai. Stochastic approximation (invited paper). The Annals of Statistics, 31(2):391–406, 2003.
[28] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley,
2005.
[29] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics,
22(3):400–407, 1951.
[30] C. E. Shannon. Programming a computer for playing chess. Philosophical Magazine, Ser.7, 41(314),
March 1950.
[31] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran,
T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement learning algorithm
that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, December 2018.
[32] S. P. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning,
22(1–3):123–158, 1996.
[33] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction (Second Edition). MIT Press,
2018.
[34] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (Proceedings
of the 1999 conference), pages 1057–1063. MIT Press, 2000.
[35] C. Szepesvári. Algorithms for Reinforcement Learning. Morgan and Claypool, 2010.
[36] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
[37] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68,
1995.
[38] G. Tesauro. Programming backgammon using self-teaching neural nets. Artificial Intelligence, 134(1-
2):181–199, 2002.
[39] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202,
1994.
[40] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic
gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, September
1986.
[41] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Machine
Learning, 22:59–94, 1996.
[42] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation.
IEEE Transactions on Automatic Control, 42(5):674–690, May 1997.
[43] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35:1799–1808,
1999.
[44] M. Vidyasagar. Nonlinear Systems Analysis (SIAM Classics Series). Society for Industrial and Applied
Mathematics (SIAM), 2002.
[45] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks and Control Sys-
tems. Springer-Verlag, 2003.
[46] M. Vidyasagar. Hidden Markov Processes: Theory and Applications to Biology. Princeton University
Press, 2014.
[47] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.