
Reinforcement Learning and Empirical Processes

Mathukumalli Vidyasagar

September 28, 2021


Preface

This is a draft. To quote Oliver Goldsmith, “There are a hundred faults in this Thing and a hundred things
might be said to prove them beauties.” I hope that, as time passes, the faults will decrease while the
“beauties” will increase. In the meantime, “caveat emptor” is the watchword for the reader.
Feedback of all kinds would be gratefully received at [email protected]

Contents

Preface

1 Introduction
1.1 Introduction to Reinforcement Learning
1.2 Some Examples of Reinforcement Learning
1.3 About These Notes

2 Markov Decision Processes
2.1 Markov Reward Processes
2.2 Markov Decision Processes

3 Stochastic Approximation
3.1 An Overview of Stochastic Approximation
3.2 Introduction to Martingales
3.3 Standard Stochastic Approximation
3.4 Batch Asynchronous Stochastic Approximation
3.5 Two Time Scale Stochastic Approximation
3.6 Finite-Time Stochastic Approximation

4 Approximate Solution of MDPs via Simulation
4.1 Monte-Carlo Methods
4.2 Temporal Difference Methods

5 Parametric Approximation Methods
5.1 Value Approximation Methods
5.2 Value Approximation via TD(λ)-Methods
5.3 Policy Gradient and Actor-Critic Methods
5.4 Zap Q-Learning

6 Introduction to Empirical Processes
6.1 Concentration Inequalities
6.2 Vapnik-Chervonenkis and Pollard Dimensions
6.3 Uniform Convergence of Empirical Means
6.4 PAC Learning
6.5 Mixing Stochastic Processes

7 Finite-Time Bounds
7.1 Finite Time Bounds on Regret
7.2 Finite Time Bounds for Reinforcement Learning
7.3 Probably Approximately Correct Markov Decision Processes
7.4 Unification of Regret and RL Bounds
7.5 Empirical Dynamic Programming

8 Background Material
8.1 Random Variables and Stochastic Processes
8.2 Markov Processes
8.3 Contraction Mapping Theorem
8.4 Some Elements of Lyapunov Stability Theory

Chapter 1

Introduction

1.1 Introduction to Reinforcement Learning


As with many phrases in common usage, there is no precise definition of what constitutes “reinforcement
learning,” often abbreviated to just RL. In the present set of notes, this phrase refers to decision-making with uncertain models in which, in addition, current decisions alter the model of the system. One conse-
quence of this alteration is that, if the same decision is taken at a future time, the consequences might not
be the same. This additional feature, namely that current decisions alter the dynamics of the system under
study, usually though not always by altering the surrounding environment, is what distinguishes RL from
“mere” decision-making under uncertainty.
Figure 1.1 rather arbitrarily divides decision-making problems into four quadrants. Examples from each
quadrant can be given.

• Many if not most decision-making problems fall into the lower-left quadrant of “good model, no alter-
ation.” For example, a well-studied control system such as a fighter aircraft has an excellent model
thanks to aerodynamical modelling and/or wind tunnel tests. To be specific, the dynamical model of
a fighter aircraft depends on the so-called “flight condition,” consisting of the altitude and velocity
(measured as its Mach number). While the dependence of the dynamical model on the flight condition
is nonlinear and somewhat complex, usually sufficient modelling studies are carried out, both before
the aircraft is flown and afterwards, that the dynamical model can be assumed to be “known.” In
turn this permits the control system designers to formulate an optimal (or some other form of) control
problem, which can be solved.

• Controlling a chemical reactor would be an example from the lower-right quadrant. As a traditional
control system, it can be assumed that the dynamical model of such a reactor does not change as a
consequence of the control strategy adopted. However, due to the complexity of a reactor, it is difficult
to obtain a very accurate model, in contrast with a fighter aircraft for example. In such a case, one can
adopt one of two approaches. The first, which is a traditional approach in control system theory, is to
use a nominal model of the system and to treat the deviations from the nominal model as uncertainties
in the model. The second, which would move the problem from the lower right to the upper right
quadrant, is to attempt to “learn” the unknown dynamical model by probing its response to various
inputs. This approach is suggested in [33, Example 3.1]. A similar statement can be made about
robots, where the geometry determines the form of the dynamical equations describing it, but not the
parameters in the equations; see for example [SHV20]. In this case too, it is possible to “learn” the
dynamics through experimentation. In practice, such an approach is far slower than the traditional
control systems approach of using a nominal model and designing a “robust” controller. However,
“learning control” is a popular area in the world of machine learning.


Figure 1.1: The four quadrants of decision-making under uncertainty. The horizontal axis is model quality (good model on the left, poor model on the right); the vertical axis indicates whether the action interacts with (alters) the environment (no alteration at the bottom, alteration at the top).

• A classic example of a problem belonging to the upper-left corner is a Markov Decision Process (MDP).
In this class of problems, at each time instant the actor (or agent) decides on the action to be taken
at that time. In turn the action affects the probabilities of the future evolution of the system. As this
class of problems forms the starting point for RL (upper-right quadrant), the first part of these notes
are addressed to a study of MDPs. Board games without an element of randomness would also belong
to the upper-left quadrant, at least in principle. Games such as tic-tac-toe belong here, because the
rules of the game are clear, and the number of possible games is manageable. In principle, games such
as chess which are “deterministic” (i.e., there is no throwing of dice as in Backgammon for example)
would also belong here. Chess is a two-person game in which, for each board position, it is possible to
assign the likelihood of the three possible outcomes: White wins, Black wins, or it is a draw. However,
due to the enormous number of possibilities, it is often not possible to determine these likelihoods
precisely. It is pointed out explicitly in [30] that, merely because we cannot explicitly compute this
likelihood function, that does not mean that it does not exist! However, as a practical matter, it is not
a bad idea to treat this likelihood function as being unknown, and to infer it on the basis of experiment
/ experience. Thus, as with chemical reactors, it is not uncommon to move chess-playing from the
upper-left corner to the upper-right corner.
• The upper-right quadrant is the focus of these notes. Any problems where the actions taken by
the learner alter the environment, in ways that are not known to the learner, are referred to as
“reinforcement learning” (RL). Despite the lack of knowledge about the consequences, the learner
has no option but to keep trying out various actions in order to “explore” the environment in which
the unknown system is operating. As time goes on, some amount of knowledge is gained, and it is
therefore possible, at least in principle, to “exploit” the knowledge to improve decision making. The
trade-off between exploration and exploitation is a standard topic in RL. A canonical example is MDPs
where the underlying parameters are not known, and these occupy a major part of these notes. As
mentioned above, often complex problems from the lower-right quadrant (such as chemical reactors),
or the upper-left quadrant (such as Chess), are also treated as RL problems.
Now we will give a general description of the problem. In an RL problem, there are “states” and then
there are “actions.” At each time t, the agent, also sometimes referred to as the learner, measures the state
Xt at time t, which belongs to a state space X . Based on this measurement, the agent chooses a control Ut
from a menu of “actions,” which we denote by U. While it is possible for the state space X and the range
of possible actions U to be infinite, in these notes we simplify our lives by restricting U to be a finite set.
In the same way, it is possible to treat “time” as a continuum, but again we simplify life by treating t as a
discrete variable assuming values in the set of natural numbers N = {0, 1, · · · }. Thus RL requires the agent
to take a set of sequential decisions from a finite menu, at discrete instants of time. When the agent chooses
an action Ut ∈ U, two things happen.

1. The agent receives a “reward” Rt. The reward could be either deterministic or random, and both
possibilities are permitted in these notes. The reward could be a negative number, suggesting a
penalty instead of a reward, but the phrase “reward” is standard phraseology. In case the reward is
random, it is assumed that the reward lies in a bounded interval in R which is known a priori, in
which case the reward can be translated to belong to an interval [0, M ]. The same transformation can
of course be applied if the reward is deterministic. Note that some authors speak of a “cost” which
is to be minimized, rather than a reward which is to be maximized. The modifications required to
tackle this situation are obvious and we will not comment upon this further. The reward depends not
just on the action chosen Ut, but also on the state Xt of the environment at time t. There can be two
sources of uncertainty in the reward. In a Markov Decision Problem (MDP), the reward could be a
random function of Xt and Ut , but with a known probability distribution. In an RL problem, even
the probability distribution of the reward is not necessarily known. However, for technical reasons, it
is assumed that the upper bound M on the reward is known.

2. The action Ut affects the dynamics of the system. A consequence is that the same action taken at a
different time need not lead to the same reward, because in the meantime the “state” of the environment
may have changed.

Over the years, the RL research community has given some “structure” to the above rather vague and
general description. Specifically:

1. The environment is taken as a Markov process (see Section 8.2) in which the state transition matrix
depends on the action taken. So there are |U| state transition matrices, one for each possible action.

2. If Xt denotes the state of the Markov process at time t and Ut is the action taken at time t, then the
reward R is taken to be a function R(Xt , Ut ). This formalism explains why the same action Ut ∈ U
taken at a different time may lead to a different reward, because the state Xt may have changed. It is
also possible for R to be a “random” function of Xt and Ut , so that Xt , Ut only specify the probability
distribution of R(Xt , Ut ). In such a case, even if the same state-action pair (Xt , Ut ) were to occur at
a different time, the resulting reward need not be the same. (A minimal code sketch of this structure appears after this list.)

3. Yet another variation is that the reward R(Xt , Ut ) (whether random or deterministic) is paid at the
next time instant t + 1. This is the case in some books, notably [35, 33]. In other words, if the Markov
process is in state Xt and the action Ut is applied, the reward is Rt+1 = R(Xt , Ut ). This allows those
authors to consider the situation where the “next state” Xt+1 and “next reward” Rt+1 can share a
joint probability distribution, which depends on Xt and Ut . This is the convention adopted in these
notes. Note that some other authors assume that the reward is immediate, so that Rt = R(Xt , Ut ).

4. There are two distinct types of Markov Decision Processes that are widely studied, namely: Discounted
reward processes and average reward processes. Each of them has rather a distinct behavior from the
other. In discounted reward processes, there is a “discount factor” γ ∈ (0, 1) that is applied to future
rewards. The objective is to maximize the sum of the future rewards, where the reward at time t is
discounted by the factor γ t . Because this future discounted reward is itself random, we maximize the
expected value of this random variable. In the average reward process, the objective is to maximize the
expected value of the average of future rewards over time. Because there is no discounting of future
rewards, a reward paid at any time contributes just as much to the average as a reward paid at any
other time.

5. In the simplest version of the problem, the |U| state transition matrices, one for each possible action,
are assumed to be known, as is the reward function. In the case where the reward is a random function
of Xt and Ut , it is assumed that the probability distribution of R(Xt , Ut ) is known. It is also assumed
that the state Xt of the Markov process can be observed by the agent, and can be used to decide the
action Ut . A key concept in RL is that of a “policy” π which is a map from the state space X of
the Markov process to the set of actions U. The objective here is to choose the optimal policy, which
maximizes the expected value of the discounted future reward over all possible policies. This version
of the problem is usually known as a Markov Decision Process (MDP).1 It is usually viewed as
a precursor to RL. In “proper” RL, neither the Markovian dynamics nor the reward are assumed to
be known, and must be learned on the fly so to speak. However, knowing the solution approaches to
the MDP is very useful in solving RL problems. It should be pointed out that some authors also apply the phrase RL to the problem of finding the optimal policy in an MDP where the parameters of the
problem are completely known.
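
The structure just described — per-action state transition matrices, and a reward R(Xt, Ut) that is paid at time t + 1 and may be random — is compact enough to write down directly. The sketch below is a minimal illustration assuming the numpy package; the two states, two actions, transition probabilities and reward values are illustrative choices of mine, not data from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["x0", "x1"]

# One row-stochastic state transition matrix per action, as in item 1 above.
A = {
    "u0": np.array([[0.9, 0.1],
                    [0.2, 0.8]]),
    "u1": np.array([[0.5, 0.5],
                    [0.6, 0.4]]),
}

# A deterministic reward R(x, u); a random reward would replace this lookup
# by a draw from a distribution that depends on (x, u).
R = {("x0", "u0"): 1.0, ("x0", "u1"): 0.0,
     ("x1", "u0"): 0.0, ("x1", "u1"): 2.0}

def step(x, u):
    """Apply action u in state x; return (next state, reward paid at time t + 1)."""
    i = states.index(x)
    j = rng.choice(len(states), p=A[u][i])
    return states[j], R[(x, u)]

x = "x0"
for t in range(5):
    u = str(rng.choice(["u0", "u1"]))   # a placeholder "policy": choose actions at random
    x_next, r_next = step(x, u)
    print(f"t={t}: X={x}, U={u}, reward paid at t+1: {r_next}, next state: {x_next}")
    x = x_next
```
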
A dominant theme in RL is the trade-off between “exploration” and “exploitation.” By definition, the
agent in an RL problem is operating in an unknown environment. However, after some time a reasonably good
model of the environment is available, and a set of actions that is reasonably “rewarding” is also identified.
Should the agent then persist with this set of actions, or occasionally attempt something new, just on the
off-chance that there is a better set of actions available? Let us take a concrete example. A successful chess
player would have evolved, over the years, a set of strategies that work well for him/her. Should the player
persist with the time-proven strategies (exploitation) until someone starts beating him/her, or occasionally
try something completely different just to see what happens (exploration)? The answer is not clear, and is
likely to vary from one domain to another. To illustrate the domain dependence of the solution, suppose
a person moves to a new town and wishes to find the best coffee shop. Then it is probably sufficient to
try each nearby coffee shop just once (or just a few times), because most coffee shops have standardized
protocols for preparing coffee, so that the quality is not likely to vary very much from one visit to the next.
Therefore a person can stick to the coffee shop that is most appealing after a few visits, and there is very
little incentive for further “exploration,” only “exploitation.” In contrast, it can be assumed that the course
of a chess match between two players at the highest level almost invariably leads to a previously unexplored
set of positions. Thus persisting with a stock strategy would invariably lead to suboptimal results, and there
must be greater emphasis on exploration than in the coffee shop example.
There are a couple of methods for quantifying the trade-off between exploration and exploitation. We
begin with the observation that almost any “sensible” learning algorithm would converge to a nearly optimal
policy within a finite number of time steps. Here are two ways to measure how good the algorithm is:
1. Given an accuracy ε, one can measure how many time steps are required for the policy to be within
ε of the optimal policy.2 The faster a policy becomes ε-suboptimal, the better it is. Implicit in this
characterization is the assumption that a policy is not penalized for how badly it performs before it
achieves ε-suboptimality – just the time it takes to achieve ε-suboptimality status.
2. The other measure is to see what the reward would have been, had the learner somehow magically
implemented the optimal policy right at the outset, and compare it against the actually achieved
performance. This quantity is called the “regret” and is defined precisely later on. The difference
between minimizing the regret and minimizing the time for achieving ε-optimality is that in the latter,
the performance of the algorithm before achieving ε-optimality is not penalized, whereas it is counted
as a part of the regret.
Clearly, the two criteria are not the same. A learning strategy that converges relatively quickly, but performs
poorly along the way would be rated highly under the first criterion, and poorly under the second criterion.
1 There is a variant where the state Xt cannot be observed directly; instead one observes an output Yt which is either a deterministic or a random function of Xt. This problem is known as a Partially Observable Markov Decision Process (POMDP). This problem is not discussed at all in these notes.
2 This idea is made precise in subsequent chapters.

Such issues are examined in Chapter 7.


Within the broad area of Machine Learning (ML) or Artificial Intelligence (AI), RL stands quite distinctly
apart from other popular areas such as supervised learning (which is what many people mean when they
talk about ML), and unsupervised learning. In supervised learning, the main goal is generalization. Thus
the learner is shown a body of “training data,” which consists of labelled examples. After the training phase,
the learner is then shown “testing data” for which the correct labels are known to the evaluator, and the
learner is asked to predict these correct labels. The extent to which the learner is able to match the correct
labels serves as a measure of the quality of the learning algorithm. A well-known recent example is the
ImageNet database [19], created as a part of the LSVRC (Large Scale Visual Recognition Challenge). It
consists of roughly 14 million images that are hand-curated. The full set, or some subset thereof, is presented
to some supervised learning algorithm, whose parameters are then adjusted to achieve good performance on
the training inputs. At the other end of the spectrum from supervised learning lies “unsupervised learning,”
a popular part of which is “clustering.” In this application, unlabelled inputs are collected into several
groups or clusters, whereby the elements of each cluster are closer to the centroid of their own cluster than
they are to that of any other cluster. Then a new input is assigned to the cluster whose centroid is closest
to the test input. Reinforcement learning lies in-between supervised and unsupervised learning. Expand.
There are several excellent texts on these two topics and we will not discuss these two branches of ML
hereafter. Book-length treatments of RL can be found, among others, in the following: [35, 33]. A book-length treatment of MDPs can be found in [28]. Add some references of Bertsekas-Tsitsiklis here.

1.2 Some Examples of Reinforcement Learning


In this section we briefly discuss a few motivating problems that can serve as illustrations of reinforcement
learning. We will return to a couple of these problems again in future chapters.
There are several examples of reinforcement learning available in the literature. The books [28, 33]
contain several examples, while the book [10] is primarily devoted to examples of RL in a variety of areas,
including healthcare, transportation, finance etc. Perhaps the most “famous” application of RL is a general-
purpose algorithm that can be taught to play a variety of games, including Chess, Shogi and Go [12, 31].
Robot control, including path-planning in the presence of (possibly unknown) obstacles is another popular
application. Some RL texts and papers study the problem of balancing a stick on a moving cart, which
is known in control theory as the “inverted pendulum” problem. This might not be a good application of
RL, because the system can be modelled very precisely, which in turn leads to very efficient control laws.
However, by viewing this well-studied problem in control theory as a problem in RL, the research community
has developed several new and interesting learning paradigms. Another application, which is of moderate
size, is that of deciding an optimal strategy for the game of Blackjack, sometimes also called Twenty One.
We will study this example, either in its full form or in a simplified form, in detail at appropriate places in
these notes.

1.2.1 Multi-Arm Bandit Problems


This problem is a generalization of the “slot machine” in gambling casinos around the world, whereby the
player pulls a lever and receives a random payoff. In order to pull the lever, the player has to insert some
money, and the expected value of the payoff is less than the amount to be inserted; that is how the casino
makes money. However, in our model, we ignore the fact that a player has to pay to play, and focus strictly
on the payout part of it.
Suppose a player is facing m slot machines, or “bandits,” each of which has a random payout. Specifically,
let Xi denote the random payout of the i-th bandit. Then Xi has an unknown expected (mean value) payout,
as well as an unknown probability distribution around this mean value. To avoid unnecessary technicalities,
it is assumed that all returns are nonnegative, and that there is a fixed known upper bound M on the payout
of each machine, which can be taken as 1 without any loss of generality. Therefore the return of each arm
has a probability distribution φi supported on the set [0, 1]. Define

µi = ∫_0^1 x φi(x) dx

to be the mean or expected value of Xi . Of course, the player does not know either µi or φi (·). But the player
is able to “pull the arm” of each bandit and see what happens. This generates (we assume) statistically
independent samples xi1 , · · · , xim of the random variable Xi . Based on the outcome of these experiments,
the player is able to make some estimate of µi for each bandit i. These estimates can be used to determine
future strategies.
Note that if the quantities µ1 , · · · , µm are known, then the problem is simple: The player should always
play the machine that has the highest expected payout. But the challenge is to determine which machine
this is, on the basis of experimentation. As stated above, there are many reasonable algorithms that will
asymptotically (as the number of trials increases towards infinity) determine the arm(s) with the best re-
turn(s). Therefore one way to assess the performance of an algorithm is its “regret,” that is, the return
achieved over the course of learning, subtracted from the optimal return of always choosing the arm with
the highest return. Interestingly, there are theorems that give quite tight upper and lower bounds on the
achievable regret, and these are discussed in Section 7.1.
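
As a concrete illustration of the exploration–exploitation trade-off and of regret, the sketch below estimates the means µi from observed pulls and accumulates the (expected) regret, i.e., the shortfall relative to always pulling the best arm. It is a minimal illustration assuming the numpy package; the arm means, the ε-greedy rule, and the horizon are arbitrary choices of mine, not prescriptions from these notes.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.3, 0.5, 0.7])     # unknown means of the m = 3 arms (illustrative values)
m, T, eps = len(mu), 5000, 0.1     # number of arms, horizon, exploration probability

counts = np.zeros(m)               # number of times each arm has been pulled
estimates = np.zeros(m)            # running estimates of the means mu_i
regret = 0.0

for t in range(T):
    if counts.min() == 0 or rng.random() < eps:
        k = int(rng.integers(m))           # explore: pull an arm at random
    else:
        k = int(np.argmax(estimates))      # exploit: pull the arm with the best estimate
    x = float(rng.random() < mu[k])        # Bernoulli payout, supported on [0, 1]
    counts[k] += 1
    estimates[k] += (x - estimates[k]) / counts[k]   # incremental update of the sample mean
    regret += mu.max() - mu[k]             # expected shortfall relative to the best arm

print("estimated means:", np.round(estimates, 3))
print("regret after", T, "pulls:", round(regret, 1))
```
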

1.2.2 Snakes and Ladders


We all know the ancient snakes and ladders game, where the objective is for a player to pass from the start
to the end while avoiding the snakes and taking advantage of the ladders. We will modify the game slightly
by adding the possibility of losing if the player overshoots the last square. A toy version of the game is shown in Figure 1.2; it is also studied in Section 8.2.

Figure 1.2: Toy Snakes and Ladders Game. The board consists of the squares S, 1, 2, 3, 4, 5, 6, 7, 8, 9, W, L in sequence.

The rules of the game are as follows:
• Initial state is S.
• A four-sided, fair die is thrown at each stage.
• Player advances as many squares as the outcome of the throw, followed by the impact of the snake or
ladder, if any.
• Player must land exactly on W to win.
• If implementing a move causes the crossing of L, then the player loses. Landing exactly on L also loses.
• Hitting the square W leads to a reward of 5 and hitting the square L leads to a reward of −5. The
reward in every other square is 0.

(P, H)              R
P < H              −2
P = H               1
P > H, P ≠ W        2
(P, H) = (W, ∗)     5
(P, H) = (L, ∗)    −5

Table 1.1: Reward Table for Simplified Blackjack Game

At each stage of the game, the player has two choices: to roll the die and take a chance on the outcome, or
not to roll it. We can ask: What is the best strategy for a player as a function of the square currently being
occupied? Clearly, it depends on whether the expected return from playing exceeds the expected return
from not playing.

1.2.3 Blackjack
Blackjack is a popular game in gambling casinos around the world. The player plays against the “house.”3
The player and the house draw cards in alternation. The objective is to draw cards such that the total of
the cards is as close to 21 as possible without exceeding it. That is why sometimes Blackjack is also called
“Twenty-One.” The formulation of Blackjack as a problem in RL is discussed in [33, Example 5.1]. At each
time instant, the player has only two possible actions: to ask for one more card, or not. These are known
as “hit” and “stick” respectively. So the set of possible actions U has cardinality two. If the player draws a
card, the outcome is obviously random. Either way, the house also draws a card whose outcome is random.
It is shown in [33, Example 5.1] that the process can be modelled by a Markov process with 200 states, so
that |X | = 200. However, tracing out all possible future evolutions of the game, starting from the current
state, is nearly impossible, and simulations are the only way to analyze the problem.
We now present a simplified version of Blackjack. Obviously, drawing a card leads to the player’s total
increasing by anywhere from 1 to 11.4 So if the player’s current total is 10 or less, the player cannot possibly
lose by drawing, and may get closer to winning. So the optimal strategy from such a position is not in doubt.
With that in mind, we replace the drawing of a card by the rolling of a fair four-sided die, with all four
outcomes being equally probable. It does not matter what the “target” total is, because if the target total
is T , then so long as the player’s total is T − 4 or less, the player should roll the die. With this in mind,
we can think of the player’s states as {0, 1, 2, 3, W, L}, with W and L denoting Win and Lose respectively.
If the player’s current total plus the outcome of the die exactly equals 4, the player wins, and if the total
exceeds 4, the player loses. But there is an added complication, which is the total of the “House.” Let us
assume that the House policy is to “stick” whenever it gets within 3 of the designated total. Hence it can
be assumed that the House total is in {1, 2, 3}. Now the object of the game is not merely to get as close to W as possible without going over, but also to beat the House total. Hence the reward for this game can be specified
as shown in Table 1.1. With this reward structure, at each position, the player has the option of rolling the
die, or not. It turns out that this game is more complex than just the player playing snakes and ladders.
We will analyze this game also in later chapters.
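
For later use, the reward specification of Table 1.1 is easy to encode. The following is a minimal sketch; the function name and the representation of the terminal positions W and L as strings are my own choices.

```python
def reward(P, H):
    """Reward of the simplified Blackjack game (Table 1.1).

    P is the player's final position (0, 1, 2, 3, 'W' or 'L');
    H is the House total (1, 2 or 3)."""
    if P == 'W':
        return 5
    if P == 'L':
        return -5
    if P < H:
        return -2
    if P == H:
        return 1
    return 2    # P > H and P is not W

# A few spot checks against Table 1.1:
assert reward('W', 3) == 5 and reward('L', 1) == -5
assert reward(1, 2) == -2 and reward(2, 2) == 1 and reward(3, 1) == 2
```
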

1.2.4 Backgammon
Backgammon is a board game played by two players on a board with (essentially) 24 positions, with each
player throwing two six-sided dice at each turn. Figure 1.3 shows a typical board position. The game
3 Actually, it is possible to have more than one player plus the “house.” However, to simplify the problem, we study only the case of one player against the “house.”


4 An Ace can be counted as either 1 or 11 as per the player’s choice.

Figure 1.3: A typical board position in backgammon

combines chance (random outcome of throwing the dice) and strategy (what a player does based on the
outcome of the dice).
Unlike in Blackjack, the range of possible actions available to a player at each turn is quite large. This
game is well-suited to a technique called “temporal difference” or TD-learning, which is studied in Section
4.2. Tesauro has published several articles on how to program a computer to play backgammon, including
[36, 37, 38]. See [33, Section 16.1] for a detailed description of the rules of backgammon and the TD
implementation of Tesauro.

1.2.5 AlphaGo and AlphaZero


It would not be an exaggeration to say that a great deal of the public attention to artificial intelligence arises
from the success of two programs, namely AlphaGo and AlphaZero. In 2016, a UK-based company called
Deep Mind (since acquired by Google) created a program called AlphaGo to play Go, a board game played
on a grid of 19 × 19 places. In a five-game match held in Seoul, Korea between the 9th and 15th of March,
AlphaGo played against Lee Sedol, who was an eighteen-time world champion, though he was not world
champion at that time. AlphaGo won four out of the five games. It was the first instance of a computer
defeating a ranking Go player. A year later, in 2017, AlphaGo defeated the top-ranked player Ke Jie. In a
series of three matches played between 23rd and 27th May, AlphaGo won all three matches.
Twenty years earlier IBM had developed the Deep Blue platform to play chess. Obviously, over such a
long period of time, there would be massive improvements in computing hardware. Indeed, AlphaGo ran
on a collection of Tensor Processing Units (TPUs), which are specially designed to carry out the type of
computations required by AlphaGo (as opposed to general-purpose CPUs, or Central Processing Units).
Even at that time, Deep Mind had in its possession a more advanced program called AlphaZero, but did
not deploy it against Ke Jie. AlphaZero could be programmed to play chess, Go and shogi (Japanese chess).
AlphaZero defeated AlphaGo while playing Go, defeated Stockfish (a popular chess-playing program), and
Elmo (a popular program to play shogi). However, in the eyes of many, the real interest in AlphaZero arose
from the manner in which it trained itself. Recall that the Deep Blue platform developed by IBM relied
on human inputs, and a search technique, in order to analyze board positions and determine its next move.
In contrast, AlphaZero used an entirely different approach, whereby it improved itself through “self-play”,
using a mathematical method known as Monte Carlo tree search (MCTS). Thus the same program is able to “teach itself” to play different games. A popular description of how AlphaZero goes about its self-appointed task can be found in [12]. Those interested in the mathematical details can find them in [31].

Figure 1.4: Deep Mind’s AlphaGo program playing Lee Sedol. Source [50].
One of the intriguing philosophical aspects of AlphaZero is the fact that, as its name implies, AlphaZero
starts from zero, that is, without any prior knowledge. Its superior performance compared to other programs
that make use of prior knowledge has been interpreted by some AI researchers to claim that “prior knowledge”
is not necessary to achieve top performance. To understand why this is interesting, let us consider the same
question, but changing “chess” to “cooking.” Suppose you wish to become a master chef. Should you first
learn under someone who is already a master chef, and experiment on your own only after you have achieved
some level of proficiency? Or is it better for you to undertake trial and error right from Day One? Most of
us would instinctively answer that learning from a master (i.e., tapping domain knowledge) would be better.
One of the intriguing aspects of the success of AlphaZero is that, when it comes to a computer learning to
play chess, domain knowledge apparently does not confer any advantage. However, at the moment the role
of prior domain knowledge in AI is still a topic for further research. It is not clear whether the success of
AlphaZero is a one-off phenomenon, or a manifestation of a more universally applicable principle.

1.3 About These Notes


This section is to be rewritten in its entirety. In particular, the role of stochastic approximation in RL is to
be highlighted. In the first part of these notes, the emphasis is on solution techniques for the conventional
Markov Decision Processes (MDPs) with known dynamics, and then MDPs with unknown dynamics. The
latter problems are the domain of Reinforcement Learning (RL). The techniques presented in this part of
the notes can either lead to “exact” solutions to the MDP under study, or to “approximate” solutions.
These techniques are useful when the number of possible policies (the number of possible maps from the
state space to the action space) is not overly large. When the size of the set of policies is too large to be
handled by a computer, the usual approach is to reduce the dimensionality of the problem. Specifically,
instead of considering all possible policies, attention is restricted to a subset. Suppose that the state space
X and the action space U are both finite. Then the set of possible policies is the set of all possible maps from X into U, which has cardinality |U|^{|X|}. The value function V : X → R (see Chapter 2) can be
viewed as a vector of dimension |X |. However, the action-value function Q : X × U → R can be viewed
as a vector of dimension d := |X | × |U |, which can be a large number even if |X | and |U| are individually
rather small. In such a case, instead of examining all possible vectors of dimension d, we can choose a set of linearly independent functions ψi(·, ·) : X × U → R for i = 1, · · · , m (basically, just a set of m ≪ d linearly independent vectors), and then express Q as a linear combination of these m vectors. By restricting Q to be a linear combination of these vectors, we lose generality but gain tractability of the problem. At some point
we may decide that limiting Q to be a linear combination of some basis functions is too restrictive, and
opt for a nonlinear function, that is, a multi-layer neural network. To over-simplify grossly, that is one way
to think of deep reinforcement learning. The final theme studied in these notes is that of PAC (Probably
Approximately Correct) solutions to Markov Decision Problems. The idea here is that it is not necessary to
strive for absolutely the best possible solution – it is sufficient if the solution is nearly optimal. Moreover,
if these nearly optimal solutions are to be found using probabilistic methods, then it is permissible if these
probabilistic methods work with high probability. To put it in colloquial terms, a PAC algorithm is one that
gives more or less the right answer most of the time. PAC learning is a well-established subject, and [45]
offers a fairly complete (though advanced and rather abstract) treatment of the topic. For reinforcement
learning, it is not clear whether so much generality and abstraction are required. Perhaps there are ways
to present the theory in a simplified setting, and at the same time, explore questions that are relevant to
reinforcement learning. These questions are still being studied, and an attempt will be made to summarize
the current research in PAC-MDP in the last part of the notes.
Chapter 2

Markov Decision Processes

A widely used mathematical formalism for reinforcement learning problems is Markov Decision Processes
(MDPs) where the dynamics of the Markov process are not known, and must somehow be “inferred” on the
fly. Before tackling that problem, we must first understand MDPs when the dynamics are known. That is
the aim of the present chapter. In the interests of simplicity, the discussion is limited to the situation where
the state and action spaces underlying the MDP are finite sets. MDPs where the underlying state space
and/or action space is countable, or an arbitrary measurable space, are also of interest in some applications.
However, we do not study the more general situations in these notes. The area of MDP is quite well-studied,
and there are several excellent books on the subject. The reader is directed to [28] for a comprehensive
treatment of the subject, which also studies the case of infinite state and action spaces. The book [10]
contains several practical examples of MDPs. The theory of MDPs is also studied in [33] and [35].

2.1 Markov Reward Processes


Recall the introduction to Markov processes in Section 8.2. Further facts about Markov processes can be
found in [46].
Suppose X is a finite set of cardinality n, written as {x1 , . . . , xn }. If {Xt }t≥0 is a stationary Markov
process assuming values in X , then the corresponding state transition matrix A is defined by

aij = Pr{Xt+1 = xj |Xt = xi }. (2.1)

Thus the i-th row of A is the conditional probability vector of Xt+1 when Xt = xi . Clearly the row sums of
the matrix A are all equal to one. Therefore the induced norm ‖A‖∞→∞ also equals one.
Up to now there is nothing new beyond the contents of Section 8.2. Now suppose that there is a “reward”
function R : X → R associated with each state. There is no consensus within the community about whether
the reward corresponding to the state Xt is paid at time t, or time t + 1. We choose to follow [28, 33] and
assume that the reward is paid at time t + 1.1 This allows us to talk about the joint probability distribution
Pr{(Xt+1 , Rt+1 )|Xt }, where both the next state Xt+1 and the reward Rt+1 are random functions of the
current state Xt . In particular, if Rt+1 is a deterministic function of Xt , then we have that

Pr{(Xt+1, Rt+1) = (xj, rl) | Xt = xi} = aij if rl = R(xi), and 0 otherwise,

where R : X → R is the reward function.


Two kinds of Markov reward processes are widely studied, namely: Discounted reward processes, and
average reward processes. Each of these is studied in a separate subsection.
1 Note that in [35], the reward is assumed to be “immediate,” that is, paid at time t.


2.1.1 Discounted Reward Processes


To study discounted Markov Reward Processes, we choose a “discount factor” γ ∈ (0, 1). Suppose xi ∈ X is
the “state of interest.” Then the expected discounted future reward V(xi) is defined as

V(xi) = E[ ∑_{t=0}^∞ γ^t Rt+1 | X0 = xi ]. (2.2)

We often just use “discounted reward” instead of the longer phrase. Note that, because the set X is finite,
the reward function Rt+1 is bounded if it is a deterministic function of Xt . If Rt+1 is a random variable
dependent on Xt, then it is customary to assume that it is bounded. With these assumptions, because γ < 1,
the above summation converges and is well-defined. The quantity V (xi ) is referred to as the value function
associated with xi , and the vector
v = [ V(x1) · · · V(xn) ]⊤, (2.3)
is referred to as the value vector. Note that, throughout these notes, we view the value as both a function
V : X → R as well as a vector v ∈ Rn . The relationship between the two is given by (2.3). We shall use
whichever interpretation is more convenient in a given context.
This raises the question as to how the value function and/or value vector is to be determined.
Define the vector r ∈ Rn ,
r := [ r1 · · · rn ]⊤, (2.4)
where, if Rt+1 is a random function of Xt , then
ri := E[Rt+1 |Xt = xi ]. (2.5)
If Rt+1 is a deterministic function Rd (Xt+1 ), then
ri = ∑_{j=1}^n aij Rd(xj).

Of course, if Rt+1 is a deterministic function R(Xt ), then ri is just R(xi ).


Theorem 2.1. The vector v satisfies the recursive relationship
v = r + γAv, (2.6)
or, in expanded form,
V(xi) = ri + γ ∑_{j=1}^n aij V(xj). (2.7)

Proof. Let xi ∈ X be arbitrary. Then by definition we have

V(xi) = E[ ∑_{t=0}^∞ γ^t Rt+1 | X0 = xi ] = ri + E[ ∑_{t=1}^∞ γ^t Rt+1 | X0 = xi ]. (2.8)

However, if X0 = xi, then X1 = xj with probability aij. Therefore we can write

E[ ∑_{t=1}^∞ γ^t Rt+1 | X0 = xi ] = ∑_{j=1}^n aij E[ ∑_{t=1}^∞ γ^t Rt+1 | X1 = xj ]
                                  = γ ∑_{j=1}^n aij E[ ∑_{t=0}^∞ γ^t Rt+1 | X0 = xj ]
                                  = γ ∑_{j=1}^n aij V(xj). (2.9)
In the second step we use the fact that the Markov process is stationary. Substituting from (2.9) into (2.8) gives the recursive relationship (2.7).
Until now it has been assumed that the discount factor γ is strictly less than one. However, if the Markov
process has one or more absorbing states, then (8.40) gives an explicit formula for the average time before
a sample path terminates in an absorbing state. Moreover, this time is always finite. Therefore, in Markov
processes with absorbing states, it is possible to use an undiscounted sum, with the understanding that the
summation ends as soon as Xt is an absorbing state. Equivalently, it is possible to assign a reward of zero
to every absorbing state, so that the infinite sum of rewards contains only a finite number of nonzero terms.
It is left to the reader to state and prove the analog of Theorem 2.1 for this case.
Example 2.1. We analyze the toy snakes and ladders game of Example 8.3. As shown therein, the state transition
matrix of this game is given by

S 1 4 5 6 7 8 W L
S 0 0.25 0.25 0.25 0 0.25 0 0 0
1 0 0 0.25 0.50 0 0.25 0 0 0
4 0 0 0 0.25 0.25 0.25 0.25 0 0
5 0 0.25 0 0 0.25 0.25 0.25 0 0
6 0 0.25 0 0 0 0.25 0.25 0.25 0
7 0 0.25 0 0 0 0 0.25 0.25 0.25
8 0 0.25 0 0 0 0 0.25 0.25 0.25
W 0 0 0 0 0 0 0 1 0
L 0 0 0 0 0 0 0 0 1

To define a reward function for this problem, we will set Rt+1 = f (Xt+1 ), where f is defined as follows:
f (W ) = 5, f (L) = −2, f (x) = 0 for all other states. Thus there is no immediate reward. However, there
is an expected reward depending on the state at the next time instant. For example, if X0 = 6, then the
expected value of R1 is 5/4, whereas if X0 = 7 or X0 = 8, then the expected value of R1 is 3/4.
Now let us see how the implicit equation (2.6) can be solved to determine the value vector v. Since the
induced matrix norm ‖A‖∞→∞ = 1 and γ < 1, it follows that the matrix I − γA is nonsingular. Therefore,
for every fixed assignment of rewards to states, there is a unique v that satisfies (2.6). In principle it is
possible to deduce from (2.6) that
v = (I − γA)−1 r. (2.10)
The difficulty with this formula, however, is that in most actual applications of Markov Decision Problems,
the integer n denoting the size of the state space X is quite large. Moreover, inverting a matrix has cubic
complexity in the size of the matrix. Therefore it may not be practicable to invert the matrix I − γA. So
we are forced to look for alternate approaches. A feasible approach is provided by the Contraction Mapping
Theorem (CMT), namely Theorem 8.13. With the contraction mapping theorem in hand, we can apply it
to the problem of computing the value of a discounted Markov reward process.

Theorem 2.2. The map y ↦ T y := r + γAy is monotone and is a contraction with respect to the ℓ∞-norm,
with contraction constant γ.

Proof. The first statement is that if y1 ≤ y2 componentwise (and note that the vectors y1 , y2 need not
consist of only positive components), then T y1 ≤ T y2 . This is obvious from the fact that the matrix A has
only nonnegative components, so that Ay1 ≤ Ay2 . For the second statement, note that, because the matrix
A is row-stochastic, the induced norm of A with respect to ‖ · ‖∞ is equal to one. Therefore

‖T y1 − T y2‖∞ = ‖γA(y1 − y2)‖∞ ≤ γ‖y1 − y2‖∞.

This completes the proof.

Therefore one can solve (2.6) by repeated application of the contraction map T . In other words, we can
choose some vector y0 arbitrarily, and then define

yi+1 = r + γAyi .

Then the contraction mapping theorem tells us that yi converges to the value vector v. Moreover, from
(8.49) one can estimate how far the current iteration is from the solution v. Note that the contraction
constant ρ in the statement of the theorem can be taken as the discount factor γ. Define the constant

c := ‖r + γAy0 − y0‖∞,

which measures how far away the initial guess y0 is from satisfying (2.6). Then we have the estimate

‖yi − v‖∞ ≤ (γ^i / (1 − γ)) c. (2.11)

In this approach to finding the value function, each iteration has quadratic complexity in n, the size of the
state space. Moreover, (2.11) can be used to decide how many iterations should be run to get an acceptable
estimate for v. This approach to determining v (albeit approximately) is known as “value iteration.” Thus
if we use I iterations, then the complexity of value iteration is O(In^2) as opposed to O(n^3) for using (2.10). Hence the value iteration approach is preferable if I ≪ n. Note that the faster future rewards are discounted
(i.e., the smaller γ is), the faster the iterations will converge. Moreover, if the reward vector r is nonnegative,
and we choose y0 = r, then the value iterations result in a sequence of estimates {yi } that is componentwise
monotonically nondecreasing. (See Problem 2.1.)
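
The iteration yi+1 = r + γAyi and the stopping rule suggested by (2.11) are easy to put into code. The sketch below is a minimal illustration assuming the numpy package; it applies value iteration to the toy snakes and ladders chain of Example 2.1, with the state transition matrix given above and the reward Rt+1 = f(Xt+1), f(W) = 5, f(L) = −2, so that r = Af. The discount factor γ = 0.9 and the tolerance are illustrative choices, not prescriptions from these notes.

```python
import numpy as np

# States in the order S, 1, 4, 5, 6, 7, 8, W, L (Example 2.1).
A = np.array([
    [0, 0.25, 0.25, 0.25, 0,    0.25, 0,    0,    0   ],
    [0, 0,    0.25, 0.50, 0,    0.25, 0,    0,    0   ],
    [0, 0,    0,    0.25, 0.25, 0.25, 0.25, 0,    0   ],
    [0, 0.25, 0,    0,    0.25, 0.25, 0.25, 0,    0   ],
    [0, 0.25, 0,    0,    0,    0.25, 0.25, 0.25, 0   ],
    [0, 0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],
    [0, 0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],
    [0, 0,    0,    0,    0,    0,    0,    1,    0   ],
    [0, 0,    0,    0,    0,    0,    0,    0,    1   ],
])

f = np.array([0, 0, 0, 0, 0, 0, 0, 5, -2])     # f(W) = 5, f(L) = -2, zero elsewhere
r = A @ f                                       # r_i = E[R_{t+1} | X_t = x_i]
gamma = 0.9                                     # illustrative discount factor

y = np.zeros_like(r, dtype=float)               # initial guess y_0
c = np.linalg.norm(r + gamma * A @ y - y, ord=np.inf)
for i in range(1, 1000):
    y = r + gamma * A @ y                       # one value iteration step
    if gamma**i / (1 - gamma) * c < 1e-8:       # a priori bound (2.11) on ||y_i - v||_inf
        break

v_exact = np.linalg.solve(np.eye(len(r)) - gamma * A, r)   # (2.10); feasible here since n = 9
print("iterations:", i)
print("max |y_i - v| =", np.abs(y - v_exact).max())
```
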

2.1.2 Average Reward Markov Processes


Now we discuss average reward Markov processes. As before, there is a Markov process {Xt }t≥0 on a finite
space X of cardinality n, with the state transition matrix A ∈ [0, 1]n×n , and a reward function R : X → R.
If the reward is random, it is assumed that the reward is bounded almost surely (to avoid technicalities),
and the symbol ri is used to denote the expected value of the reward to be paid at time t + 1, when Xt = xi .
The objective is to compute the average reward
c∗ := lim_{T→∞} (1/T) ∑_{t=0}^{T} E[R(Xt) | X0 ∼ φ], (2.12)

where φ ∈ S(X ) is a probability distribution on X . Compared with the definition (2.2) of the discounted
reward, two points of contrast would strike us at once.

1. In (2.2), the existence of the sum is not in question, because γ < 1. However, in the present instance,
there is no a priori reason to assume that the limit in (2.12) exists.

2. The value function V in (2.2) is associated with an initial state xi . It is implicit in the definition that
V (xi ) need not equal V (xj ) if xi 6= xj . In (2.12), the initial state is replaced by an initial distribution
φ, which is more general. However, we write c∗ , instead of c∗ (φ), suggesting that the limit, if it exists,
is independent of φ.
Theorem 2.3 presents a simple sufficient condition to address both of the above observations.
Theorem 2.3. Suppose A is irreducible, and let µ denote its unique stationary distribution. Then

c∗ = µr = E[R, µ], ∀φ ∈ S(X ), (2.13)

where r is the reward vector defined in (2.4).


Proof. If X0 ∼ φ, then Xt ∼ φA^t. Therefore

E[R(Xt) | X0 ∼ φ] = φA^t r.

Also, as stated in Theorem 8.8, we have


lim_{T→∞} (1/T) ∑_{t=0}^{T} A^t = 1n µ.

Therefore

c∗ = φ [ lim_{T→∞} (1/T) ∑_{t=0}^{T} A^t ] r = φ 1n µ r = µr = E[R, µ], (2.14)

because φ1n = 1. This is the desired result.
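
A quick numerical sanity check of Theorem 2.3 is shown below: the Cesàro average in (2.12) is compared with µr for two different initial distributions φ. This is a sketch only, assuming the numpy package; the chain and the reward vector are arbitrary illustrative choices (the chain is irreducible, in fact primitive), not examples from these notes.

```python
import numpy as np

A = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])       # an arbitrary irreducible (indeed primitive) chain
r = np.array([1.0, -2.0, 4.0])        # expected rewards r_i

# Stationary distribution mu: left eigenvector of A for the eigenvalue 1, normalized.
w, V = np.linalg.eig(A.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()

def cesaro_average(phi, T=20000):
    """(1/T) * sum over t < T of E[R(X_t) | X_0 ~ phi], computed as phi A^t r."""
    total, dist = 0.0, phi.astype(float)
    for _ in range(T):
        total += dist @ r
        dist = dist @ A
    return total / T

for phi in (np.array([1.0, 0.0, 0.0]), np.array([0.2, 0.5, 0.3])):
    print("phi =", phi, " average reward ~", round(cesaro_average(phi), 6))
print("mu r =", round(float(mu @ r), 6))     # the limit predicted by (2.13)
```
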
Next we introduce an important concept known variously as the bias or the transient reward. For a
discussion (albeit with “reward” replaced by “cost”), see [28, Section 8.2.3] or [1, Section 4.1].
Definition 2.1. Suppose A is primitive,2 and define c∗ as in (2.14). For each index i, the transient reward Ji∗ ∈ R is defined as

Ji∗ = ∑_{t=0}^∞ { E[R(Xt) | X0 = xi] − c∗ }. (2.15)

A priori it is not clear why the sum in (2.15) is well-defined, because there is no averaging over time. It
is now shown that the transient reward is indeed well-defined, and several explicit expressions are given for
it.
Theorem 2.4. Suppose A is primitive, and let µ denote its stationary distribution. Define M := 1n µ ∈
[0, 1]n×n , and J∗ ∈ Rn as [Ji∗ ]. Then the following statements are true:
1. The vector J∗ is well-defined.
2. An explicit expression for J∗ is given by

J∗ = (I − A + M )−1 (I − M )r = (I − A + M )−1 (r − c∗ 1n ). (2.16)

3. The vector J∗ satisfies the “Poisson equation”

J = r − c∗ 1n + AJ. (2.17)

Moreover, J∗ is the unique solution of (2.17) that satisfies

µJ = 0. (2.18)
2 This is equivalent to assuming that A is irreducible and aperiodic; see Theorem 8.7.

Proof. Note that µ, 1n are row and column eigenvectors of A corresponding to the eigenvalue λ = 1, and
that all other eigenvalues of A have magnitude less than one. So if we define

A2 = A − 1n µ = A − M,

then the spectrum of A2 is the same as that of A, except that the eigenvalue at 1 is replaced by 0. In
particular, ρ(A2) < 1, and as a consequence

∑_{t=0}^∞ A2^t = (I − A2)^{-1} = (I − A + M)^{-1}. (2.19)

Next, suppose v ∈ Rn satisfies µv = 0. Then it is easy to verify that Av = A2 v, and moreover,


µA2 v = 0. Repeated application of this relationship shows that A^t v = A2^t v for all t ≥ 1. Therefore, for every such v, we have that

∑_{t=0}^∞ A^t v = ∑_{t=0}^∞ A2^t v = (I − A + M)^{-1} v. (2.20)

Now in particular, choose


v = r − c∗ 1n = (I − M )r.
Then it follows from (2.14) that µv = 0. Hence (2.20) implies that

∑_{t=0}^∞ A^t (r − c∗ 1n) = (I − A + M)^{-1} (I − M) r.

To prove Statements 1 and 2, let ei denote the i-th elementary basis vector. Then X0 = xi is equivalent
to X0 ∼ ei^⊤. Then Xt ∼ ei^⊤ A^t, and

Ji∗ = ∑_{t=0}^∞ [ ei^⊤ A^t r − c∗ ],

J∗ = ∑_{t=0}^∞ (A^t r − c∗ 1n) = ∑_{t=0}^∞ A^t (r − c∗ 1n) = (I − A + M)^{-1} (I − M) r. (2.21)

Here we use the fact that c∗ 1n = c∗ A^t 1n for all t. This establishes Statements 1 and 2.
Now we come to Statement 3. From (2.15), we get

Ji∗ = ∑_{t=0}^∞ { E[R(Xt) | X0 = xi] − c∗ }
    = ri − c∗ + ∑_{t=1}^∞ { E[R(Xt) | X0 = xi] − c∗ }
    = ri − c∗ + ∑_{j=1}^n aij ∑_{t=1}^∞ { E[R(Xt) | X1 = xj] − c∗ }
    = ri − c∗ + ∑_{j=1}^n aij Jj∗,

which is just (2.17) written out in component form. Hence J∗ is a particular solution of (2.17).

Finally, observe that if J is another solution of (2.17), then (J∗ − J) = A(J∗ − J), which implies that
J = J∗ + α1n for some constant α. Thus {J∗ + α1n : α ∈ R} is the set of all solutions to (2.17). Now, since
µ(r − c∗ 1n ) = 0, it follows that

µJ∗ = µ ∑_{t=0}^∞ A^t (r − c∗ 1n) = ∑_{t=0}^∞ µ(r − c∗ 1n) = 0.

Moreover, if µ(J∗ +α1n ) = 0, then α = 0. Hence J∗ is the unique solution of (2.17) that satisfies µJ = 0.
It is possible to give an alternate proof of Statement 3, and we do so now. Suppose J∗ is given by (2.21).
Observe that
µ(I − A + M ) = µM = µ, or µ(I − A + M )−1 = µ.
Also, µ(I − M ) = 0. Therefore

µJ∗ = µ(I − A + M )−1 (I − M )r = µ(I − M )r = 0.

Next, (2.21) implies that


(I − A + M )J∗ = (I − M )r.
However, M J∗ = 1n µJ∗ = 0, and (I − M )r = r − c∗ 1n . Therefore

J∗ − AJ∗ = r − c∗ 1n .

This is just (2.17). The above derivation avoids infinite sums.
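
In the same spirit, the explicit formula (2.16) and the Poisson equation (2.17) can be checked numerically. The sketch below is again only an illustration assuming the numpy package, with an arbitrary primitive chain of my own choosing; it computes J∗ and verifies (2.17) and the normalization (2.18).

```python
import numpy as np

A = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])       # an arbitrary primitive chain (illustrative)
r = np.array([1.0, -2.0, 4.0])
n = len(r)

# Stationary distribution mu and the matrix M = 1_n mu.
w, V = np.linalg.eig(A.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu = mu / mu.sum()
M = np.outer(np.ones(n), mu)

c_star = float(mu @ r)                                    # c* as in (2.14)
J_star = np.linalg.solve(np.eye(n) - A + M, r - c_star)   # (2.16), since (I - M) r = r - c* 1_n

print("Poisson residual:", np.abs(J_star - (r - c_star + A @ J_star)).max())   # should be ~0 by (2.17)
print("mu J* =", float(mu @ J_star))                                           # should be ~0 by (2.18)
```
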


Problem 2.1. Suppose the reward vector r ≥ 0. Show that if we carry out value iteration with y0 = r,
then the sequence of iterations {yi } is componentwise nondecreasing, that is, yi+1 ≥ yi .

2.2 Markov Decision Processes


2.2.1 Markov Decision Processes: Problem Formulation
In a Markov process, the state Xt evolves on its own, according to a predetermined state transition matrix. In
contrast, in an MDP, there is also another variable called the “action” which affects the dynamics. Specifically,
in addition to the state space X , there is also a finite set of actions U. Each action uk ∈ U leads to a distinct
state transition matrix A^{uk} = [a_{ij}^{uk}]. So at time t, if the state is Xt, and an action Ut ∈ U is applied, then

Pr{Xt+1 = xj | Xt = xi, Ut = uk} = a_{ij}^{uk}. (2.22)

Obviously, for each fixed uk ∈ U, the corresponding state transition matrix A^{uk} is row-stochastic. In addition,
there is also a “reward” function R : X × U → R. Note that in a Markov reward process, the reward depends
only on the current state, whereas in a Markov decision process, the reward depends on both the current
state as well as the action taken. As in Markov reward processes studied in Section 2.1, it is possible to
permit R to be a random function of Xt and Ut as opposed to a deterministic function. Moreover, to be
consistent with the earlier convention, it is assumed that the reward R(Xt , Ut ) is paid at the next time
instant t + 1 and not at time t.
The most important aspect of an MDP is the concept of a “policy,” which is just a systematic way
of choosing Ut given Xt . One can make a distinction between deterministic and probabilistic policies. A
deterministic policy is just a map from X to U. A probabilistic policy is a map from X to the set of probability distributions on U, denoted by S(U). Let Πd, Πp denote respectively the set of deterministic, and the set of probabilistic, policies. Clearly the number of deterministic policies is |U|^{|X|}, while Πp is
uncountable. Observe that a policy π ∈ Πd can be represented by a |X | × |U | matrix P , where each row of
P contains a single one and the rest are zeros. Thus in row i, the one is in column π(xi ) and the rest are
zeros. If π ∈ Πp , then P need not be binary, but P must have only nonnegative elements, and the sum of
each row must equal one.
Now we make an important observation. Whether a policy π is deterministic or probabilistic, the resulting
stochastic process {Xt } is Markov, with the state transition matrix determined as follows: If π ∈ Πd , then

Pr{Xt+1 = xj | Xt = xi , π} = a^{π(xi)}_{ij} .   (2.23)

If π ∈ Πp and
π(xi ) = [ φi1 ··· φim ], (2.24)
where m = |U|, then

Pr{Xt+1 = xj | Xt = xi , π} = ∑_{k=1}^{m} φik a^{uk}_{ij} .   (2.25)

Equation (2.25) contains (2.23) as a special case, by setting φij = 1 if π(xi ) = uj , and zero otherwise. In a
similar manner, for every policy π, the reward function R : X × U → R can be converted into a reward map
Rπ : X → R, as follows: If π ∈ Πd , then

Rπ (xi ) = R(xi , π(xi )), (2.26)

whereas if π ∈ Πp , then

Rπ (xi ) = ∑_{k=1}^{m} φik R(xi , uk ).   (2.27)

Despite its simplicity, the above observation is very useful, because it states that a Markov process with
an action variable remains a Markov process under any choice of policy, deterministic or probabilistic. The
reason why this is so is that the policy has no memory. To illustrate, suppose π : X → U is a deterministic
policy, with (for example) π(xi ) = uk . Then the action variable U (t) equals uk , whenever the state X(t)
equals xi ; the time t at which this happens is immaterial. Similar remarks apply to the reward function.
For this reason, we define Aπ to be the state transition matrix that results from applying the policy π, and
Rπ to be the reward function that results from applying the policy π.
For a MDP, one can pose three questions:

1. Policy evaluation: For a given policy π, define Vπ (xi ) to be the “value” associated with the policy
π and initial state xi , that is, the expected discounted future reward with X0 = xi . How can Vπ (xi )
be computed for each xi ∈ X ?

2. Optimal Value Determination: For a specified initial state xi , define

V ∗ (xi ) := max_{π∈Πp} Vπ (xi ),   (2.28)

to be the optimal value over all policies for that initial state. How can V ∗ (xi ) be computed? Note
that in (2.28), the optimum is taken over all probabilistic policies. It is shown in Theorem 2.9 in the
sequel that the optimum can actually be achieved by a deterministic policy.

3. Optimal Policy Determination: Define the optimal policy map X → Πd via

π ∗ (xi ) := arg max_{π∈Πd} Vπ (xi ).   (2.29)

How can the optimal policy map π ∗ be determined? Note that we can restrict to π ∈ Πd because, as
stated above, the maximum over π ∈ Πp is not any larger. Moreover, it is again shown in Theorem 2.9
that there exists one common optimal policy for all initial states.

2.2.2 Markov Decision Processes: Solution


In this subsection we present answers to the three questions above.

Policy Evaluation:
Suppose a policy π ∈ Πd is specified. Then the corresponding state transition matrix and reward are given
by (2.23) and (2.26) respectively. Now suppose we define the vector vπ by

vπ = [ Vπ (x1 ) . . . Vπ (xn ) ], (2.30)

and the reward vector rπ by


rπ = [ Rπ (x1 ) . . . Rπ (xn ) ], (2.31)
where Rπ (xi ) is defined by (2.26) or (2.27) as appropriate. Then it readily follows from Theorem 2.1 that vπ
satisfies an equation analogous to (2.6), namely

vπ = rπ + γAπ vπ . (2.32)

As before, it is inadvisable to compute vπ via vπ = (I − γAπ )−1 rπ . Instead, one should use value iteration
to solve (2.32).
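To make this concrete, the following short Python sketch (mine, not part of the original text) evaluates a policy by iterating the affine map in (2.32) rather than inverting I − γAπ ; the array names A_pi and r_pi, the tolerance, and the small two-state example are illustrative assumptions.

import numpy as np

def evaluate_policy(A_pi, r_pi, gamma, tol=1e-10, max_iter=100_000):
    """Iterate v <- r_pi + gamma * A_pi @ v until convergence; see (2.32)."""
    v = np.zeros(len(r_pi))
    for _ in range(max_iter):
        v_new = r_pi + gamma * A_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Tiny illustration with an arbitrary two-state chain.
A_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
r_pi = np.array([1.0, 0.0])
print(evaluate_policy(A_pi, r_pi, gamma=0.9))

Because Tπ is a contraction with contraction constant γ, this iteration converges geometrically from any starting vector.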
For future use we introduce another function Q : X × U → R, known as the action-value function,
which is defined as follows:

Qπ (xi , uk ) := R(xi , uk ) + Eπ [ ∑_{t=1}^{∞} γ^t Rπ (Xt ) | X0 = xi , U0 = uk ].   (2.33)

Apparently this function was first defined in [47]. Note that Qπ is defined only for deterministic policies.
In principle it is possible to define it for probabilistic policies, but this is not commonly done. In the above
definition, the expectation Eπ is with respect to the evolution of the state Xt under the policy π. When the
reward is a random function of Xt and Ut , then inside the summation we would need to take the expected
value of R(Xt , π(Xt )) for a deterministic policy.
The way in which a MDP is set up is that at time t, the Markov process reaches a state Xt , based on the
previous state Xt−1 and the state transition matrix Aπ corresponding to the policy π. Once Xt is known,
the policy π determines the action Ut = π(Xt ), and then the reward Rπ (Xt ) = R(Xt , π(Xt )) is generated at
time t+1. In particular, when defining the value function Vπ (xi ) corresponding to a policy π, we start off the
MDP in the initial state X0 = xi , and choose the action U0 = π(xi ). However, in defining the action-value
function Q, we do not feel compelled to set U0 = π(X0 ) = π(xi ), and can choose an arbitrary action uk ∈ U.
From t = 1 onwards however, the action Ut is chosen as Ut = π(Xt ). This seemingly small change leads to
some simplifications. Specifically, it will be seen in later chapters that it is often easier to approximate (or
to “learn”) the action-value function than it is to approximate the value function.
Just as we can interpret V : X → R as a |X |-dimensional vector, we can interpret Q : X × U → R as an
|X | · |U|-dimensional vector. Consequently the Q-vector has higher dimension than the value vector.

Theorem 2.5. The function Qπ satisfies the recursive relationship

Qπ (xi , uk ) = R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} Qπ (xj , π(xj )).   (2.34)

Proof. Observe that at time t = 0, the state transition matrix is A^{uk}. So, given that X0 = xi and U0 = uk ,
the next state X1 has the distribution

X1 ∼ [a^{uk}_{ij} , j = 1, · · · , n].

Moreover, U1 = π(X1 ) because the policy π is implemented from time t = 1 onwards. Therefore

Qπ (xi , uk ) = R(xi , uk ) + ∑_{j=1}^{n} a^{uk}_{ij} Eπ [ γ R(xj , π(xj )) + ∑_{t=2}^{∞} γ^t Rπ (Xt ) | X1 = xj , U1 = π(xj ) ]
            = R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} ( R(xj , π(xj )) + Eπ [ ∑_{t=1}^{∞} γ^t Rπ (Xt ) | X1 = xj , U1 = π(xj ) ] )
            = R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} Qπ (xj , π(xj )).

This is the desired conclusion. In the last step we use the time-homogeneity of the process, so that the
conditional expectation given X1 = xj equals the corresponding expectation given X0 = xj in the definition
(2.33); as before, Rπ (Xt ) denotes the reward paid at time t + 1.

Theorem 2.6. The functions Vπ and Qπ are related via

Vπ (xi ) = Qπ (xi , π(xi )). (2.35)

Proof. If we choose uk = π(xi ), then (2.34) becomes

Qπ (xi , π(xi )) = Rπ (xi ) + γ ∑_{j=1}^{n} a^{π(xi)}_{ij} Qπ (xj , π(xj )).

This is (2.32) written out componentwise, with Qπ (xi , π(xi )) in place of Vπ (xi ). We know that (2.32) has a
unique solution. This shows that (2.35) holds.

In view of (2.35), the recursive equation for Qπ can be rewritten as

Qπ (xi , uk ) = R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} Vπ (xj ).   (2.36)

Optimal Value Determination:


For a policy π ∈ Πd or π ∈ Πp , define the associated map Tπ : Rn → Rn via

Tπ v = rπ + γAπ v. (2.37)

Then it follows from Theorem 2.2 that Tπ is monotone and is a contraction with respect to the `∞ -norm,
with contraction constant γ.
Now we introduce one of the key ideas in Markov Decision Processes. Define the Bellman iteration
map B : Rn → Rn via

(Bv)i := max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} vj ].   (2.38)
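In code, the map B takes only a few lines. The sketch below is my own, with my own array conventions: the matrices A^{uk} are stacked in a three-dimensional array A of shape (m, n, n), and the rewards are stored in an array R of shape (n, m).

import numpy as np

def bellman_map(v, A, R, gamma):
    """One application of the map B in (2.38).

    A : shape (m, n, n); A[k] is the transition matrix for action u_k.
    R : shape (n, m);    R[i, k] = R(x_i, u_k).
    """
    # q[i, k] = R(x_i, u_k) + gamma * sum_j a^{u_k}_{ij} v_j
    q = R + gamma * np.einsum('kij,j->ik', A, v)
    return q.max(axis=1)

Repeatedly applying bellman_map to any starting vector produces the value iteration discussed later in this section; its convergence is the content of Theorems 2.7 and 2.8.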

Theorem 2.7. The map B is monotone and a contraction with respect to the `∞ -norm.

Proof. The theorem has two claims: The first claim is that the map B is monotone, meaning that if v1 ≤ v2
componentwise, then B(v1 ) ≤ B(v2 ) componentwise. The second claim is that B is a contraction with
respect to the `∞ -norm. Note that, unlike the value iteration map Tπ defined in (2.37), the map B is not
affine.

Let us begin with the first claim. Suppose v1 ≤ v2 . Then

(B(v1 ))i = max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} v1j ]
         ≤ max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} v2j ]
         = (B(v2 ))i .

Here we use the fact that a^{uk}_{ij} ≥ 0 for all i, j. This establishes that B is monotone, which is the first claim.
The proof of the second claim is a bit more elaborate. We begin by establishing that

| max_{uk ∈U} g(xi , uk ) − max_{uk ∈U} h(xi , uk ) | ≤ max_{uk ∈U} |g(xi , uk ) − h(xi , uk )|, ∀xi ∈ X .   (2.39)

To prove (2.39), we begin with the obvious observation that, if α, β are real numbers, then

α − β ≤ |α − β| =⇒ α ≤ |α − β| + β.

Note that this inequality holds irrespective of the signs of α and β. Fix xi ∈ X , uk ∈ U and apply the above
inequality with α = g(xi , uk ), β = h(xi , uk ). This gives

g(xi , uk ) ≤ |g(xi , uk ) − h(xi , uk )| + h(xi , uk ).

Now take the maximum of both sides over uk ∈ U. This gives

max_{uk ∈U} g(xi , uk ) ≤ max_{uk ∈U} [ |g(xi , uk ) − h(xi , uk )| + h(xi , uk ) ]
                       ≤ max_{uk ∈U} |g(xi , uk ) − h(xi , uk )| + max_{uk ∈U} h(xi , uk ).

Rearranging gives

max_{uk ∈U} g(xi , uk ) − max_{uk ∈U} h(xi , uk ) ≤ max_{uk ∈U} |g(xi , uk ) − h(xi , uk )|.

By symmetry, we can interchange g and h, which gives

max_{uk ∈U} h(xi , uk ) − max_{uk ∈U} g(xi , uk ) ≤ max_{uk ∈U} |g(xi , uk ) − h(xi , uk )|.

Combining these two inequalities gives (2.39).


Now we make use of (2.39) to show that B is a contraction with respect to the `∞ -norm. Let v1 , v2 ∈ Rn
be arbitrary, and fix xi ∈ X . Then, by using the definition of B and (2.39), we get

|(B(v1 ))i − (B(v2 ))i | = | max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} v1j ] − max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} v2j ] |
    ≤ max_{uk ∈U} | R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} v1j − R(xi , uk ) − γ ∑_{j=1}^{n} a^{uk}_{ij} v2j |
    = max_{uk ∈U} γ | ∑_{j=1}^{n} a^{uk}_{ij} (v1j − v2j ) | ≤ max_{uk ∈U} γ ∑_{j=1}^{n} a^{uk}_{ij} |v1j − v2j |
    ≤ γ ‖v1 − v2 ‖∞ .   (2.40)

Here we use the facts

|v1j − v2j | ≤ ‖v1 − v2 ‖∞ ∀j, and ∑_{j=1}^{n} a^{uk}_{ij} = 1, ∀i, ∀uk ∈ U.

Because the inequality (2.40) holds for every index i, it follows that

kB(v1 ) − B(v2 )k∞ ≤ γkv1 − v2 k∞ .

This shows that the map B is a contraction with respect to the `∞ -norm, which is the second claim.
Theorem 2.8. Define v̄ ∈ Rn to be the unique fixed point of B, and define v∗ ∈ Rn to equal [V ∗ (xi ), xi ∈ X ],
where V ∗ (xi ) is defined in (2.28). Then v̄ = v∗ .
Proof. By definition, for every π ∈ Πd , we have that

[Tπ (v̄)]i = R(xi , π(xi )) + γ ∑_{j=1}^{n} a^{π(xi)}_{ij} V̄j
          ≤ max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} V̄j ] = (B(v̄))i = V̄i ,   (2.41)

because v̄ is a fixed point of the map B. If π ∈ Πp , say

π(xi ) = [ φi1 ··· φim ] ∈ Sm ,

then

[Tπ (v̄)]i = ∑_{l=1}^{m} φil [ R(xi , ul ) + γ ∑_{j=1}^{n} a^{ul}_{ij} V̄j ]
          ≤ max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} V̄j ]
          = (B(v̄))i = V̄i .   (2.42)

Because (2.41) and (2.42) hold for every index i, it follows that

Tπ (v̄) ≤ v̄.

Next, because Tπ is monotone as per Theorem 2.2, it follows that

Tπ2 (v̄) = Tπ (Tπ (v̄)) ≤ Tπ (v̄) ≤ v̄.

The reasoning can be repeated to show that

Tπl (v̄) ≤ v̄, ∀l.

Now let l → ∞. Then the left side approaches the fixed point of the map Tπ , which is vπ . Thus we conclude
that, for all policies in Πd or Πp , we have that

vπ ≤ v̄. (2.43)

Therefore, for each xi ∈ X , we infer that

V ∗ (xi ) = max_{π∈Πp} Vπ (xi ) ≤ V̄i , ∀i, or v∗ ≤ v̄.   (2.44)

To show that v̄ ≤ v∗ , define a deterministic policy π̄ ∈ Πd by

π̄(xi ) = arg max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} V̄j ].   (2.45)

In case of ties, choose any deterministic tie-breaking rule, e.g., choose the uk with the lowest index. Then,
since the right side of (2.45) equals (B(v̄))i = V̄i , we conclude that

V̄i = R(xi , π̄(xi )) + γ ∑_{j=1}^{n} a^{π̄(xi)}_{ij} V̄j , ∀i.   (2.46)

Hence Tπ̄ (v̄) = v̄. But since Tπ̄ is a contraction, it has a unique fixed point, which shows that V̄i = Vπ̄ (xi )
for all i. Therefore, for each index i, we have that

V̄i = Vπ̄ (xi ) ≤ V ∗ (xi ), ∀i, or v̄ ≤ v∗ .

Taken together with (2.43), this shows that v̄ = v∗ .

By replacing v̄ in Theorem 2.8 by v∗ (which equals v̄), we derive the following fundamental result for
Markov Decision Processes.

Theorem 2.9. Define the optimal value function V ∗ (xi ) as in (2.28). Then

1. The optimal value function V ∗ : X → R is the unique solution of the following recursive relationship,
known as the Bellman optimality equation:

V ∗ (xi ) = max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} V ∗ (xj ) ].   (2.47)

2. There is at least one deterministic policy π ∈ Πd such that

Vπ (xi ) = V ∗ (xi ), ∀i ∈ X . (2.48)

Specifically, the policy π ∗ obtained by restating (2.45) with V̄j replaced by V ∗ (xj ), namely

π ∗ (xi ) = arg max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} V ∗ (xj ) ],   (2.49)

satisfies (2.48) and is thus an optimal policy.

Note that Item 2 of the theorem states that enlarging the policy space to include probabilistic policies
does not increase the maximum value. Also, there is one common policy that achieves the optimal value for
every state xi . Perhaps neither of these statements is obvious on the surface.
In analogy with the optimal value function, we can also define an optimal action-value function.

Theorem 2.10. Define Q∗ : X × U → R by

Q∗ (xi , uk ) = R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} V ∗ (xj ).   (2.50)

Then Q∗ (·, ·) satisfies the following relationships:

Q∗ (xi , uk ) = R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} max_{wl ∈U} Q∗ (xj , wl ),   (2.51)

V ∗ (xi ) = max_{uk ∈U} Q∗ (xi , uk ).   (2.52)

Moreover, every policy π ∈ Πd such that

π(xi ) = arg max_{uk ∈U} Q∗ (xi , uk )   (2.53)

is optimal.
Proof. Since Q∗ (·, ·) is defined by (2.50), it follows that

max_{uk ∈U} Q∗ (xi , uk ) = max_{uk ∈U} [ R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} V ∗ (xj ) ] = V ∗ (xi ),

by (2.47). This establishes (2.52) and (2.53). Substituting from (2.52) into (2.50) gives (2.51).
Now we define an iteration on action-functions that is analogous to (2.38) for value functions. As with
the value function, the action-value function can either be viewed as a map Q : X × U → R, or as a vector
in R^{|X |·|U|}. Define F : R^{|X |·|U|} → R^{|X |·|U|} by

[F (Q)](xi , uk ) := R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} max_{wl ∈U} Q(xj , wl ).   (2.54)

Theorem 2.11. The map F is monotone and is a contraction. Therefore for all Q0 : X × U → R, the
sequence of iterations {F t (Q0 )} converges to Q∗ as t → ∞.
Proof. The proof is very similar to that of Theorem 2.7. Given a map Q : X × U → R, define the associated
map M(Q) : X → R by

[M(Q)](xi ) = max_{uk ∈U} Q(xi , uk ),

and rewrite (2.54) as

[F (Q)](xi , uk ) = R(xi , uk ) + γ ∑_{j=1}^{n} a^{uk}_{ij} [M(Q)](xj ).   (2.55)

Also, if Q, Q′ : X × U → R, let Q ≤ Q′ denote that Q(xi , uk ) ≤ Q′ (xi , uk ) for all xi , uk . Then it is clear that
if Q ≤ Q′ , then M(Q) ≤ M(Q′ ). Because a^{uk}_{ij} is always nonnegative, it follows that the map F is monotone.
Next, as in the proof of Theorem 2.7, for arbitrary maps Q1 , Q2 : X × U → R, we have

|[M(Q1 )](xi ) − [M(Q2 )](xi )| = | max_{uk ∈U} Q1 (xi , uk ) − max_{uk ∈U} Q2 (xi , uk ) |
                              ≤ max_{uk ∈U} |Q1 (xi , uk ) − Q2 (xi , uk )|, ∀xi ∈ X .

As a result,

‖M(Q1 ) − M(Q2 )‖∞ ≤ ‖Q1 − Q2 ‖∞ .

Substituting this into (2.55) gives

‖F (Q1 ) − F (Q2 )‖∞ ≤ γ‖Q1 − Q2 ‖∞ .   (2.56)

The desired conclusion now follows.
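In the same spirit as the earlier sketch of B, here is a minimal Python sketch (mine) of the map F in (2.54) and of the resulting Q-value iteration; the array conventions (A of shape (m, n, n), Q and R of shape (n, m)) are illustrative assumptions.

import numpy as np

def F_map(Q, A, R, gamma):
    """One application of F in (2.54); Q[i, k] approximates Q(x_i, u_k)."""
    v = Q.max(axis=1)                               # [M(Q)](x_j) = max_l Q(x_j, w_l)
    return R + gamma * np.einsum('kij,j->ik', A, v)

def q_value_iteration(A, R, gamma, tol=1e-10, max_iter=100_000):
    """Iterate F starting from Q = 0; converges to Q* by Theorem 2.11."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iter):
        Q_new = F_map(Q, A, R, gamma)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q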



If we were to rewrite (2.47) and (2.51) in terms of expected values, the advantages of the Q-function
would become apparent. We can rewrite (2.47) as

V ∗ (Xt ) = max_{Ut ∈U} { R(Xt , Ut ) + γ E[V ∗ (Xt+1 ) | Xt , Ut ] },   (2.57)

and (2.51) as

Q∗ (Xt , Ut ) = R(Xt , Ut ) + γ E[ max_{Ut+1 ∈U} Q∗ (Xt+1 , Ut+1 ) | Xt , Ut ].   (2.58)

Thus in the Bellman formulation and iteration, the maximization occurs outside the expectation, whereas
with the Q-formulation and F -iteration, the maximization occurs inside the expectation. As shown in later
chapters, learning Q∗ is easier than learning V ∗ .
The idea of learning Q∗ instead of learning V ∗ is introduced in [47].

Optimal Policy Determination:


Theorems 2.8 and 2.9 together show the following: Start with any initial guess v0 ∈ Rn , and apply the
Bellman iteration B defined in (2.38). Then the sequence {vk } with vk+1 = Bvk converges
to the optimal value v∗ . Once v∗ is determined, an optimal policy can be determined using (2.49).
This approach to determining v∗ is known as value iteration. While this is a useful result, a shortcoming
is that the intermediate vectors vk do not necessarily correspond to any policy. An easy remedy is to choose
the starting point of the iterations v0 to be the value of some policy π0 . Then each successive iteration
vk also corresponds to a policy πk . In this way, we generate a sequence of suboptimal policies πk with the
property that the associated value vector vk = vπk converges to the optimal value. This approach is known
as policy iteration. This is made precise as follows:
Theorem 2.12. Choose an arbitrary policy π0 ∈ Πd , and compute the corresponding value vπ0 . At the k-th
iteration, choose an updated policy πk+1 ∈ Πd according to

πk+1 (xi ) = arg max_{u∈U} [ R(xi , u) + γ ∑_{j=1}^{n} a^{u}_{ij} (vπk )j ].   (2.59)

Then
1. vπk+1 ≥ vπk , where the dominance is componentwise.
2. {vπk } ↑ v∗ as k → ∞.
The proof is quite straightforward. The key step is to verify that if the updated policy πk+1 is defined
according to (2.59), then Tπk+1 (vπk ) = Bvπk ≥ Tπk (vπk ) = vπk . Since Tπk+1 is monotone and vπk+1 is its
fixed point, iterating this inequality gives vπk+1 ≥ Bvπk ≥ vπk , and by induction vπk ≥ B^k (vπ0 ) → v∗ ,
which establishes both claims.
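The following Python sketch (mine, not the text's) implements the policy iteration of Theorem 2.12, with the same illustrative array conventions as before; for a small problem the policy evaluation step is done here by a direct linear solve, although the value iteration of (2.32) could be substituted, in line with the earlier advice.

import numpy as np

def policy_iteration(A, R, gamma, max_iter=1000):
    """A : (m, n, n) stack of transition matrices; R : (n, m) reward array."""
    m, n, _ = A.shape
    pi = np.zeros(n, dtype=int)                    # arbitrary initial policy pi_0
    v = np.zeros(n)
    for _ in range(max_iter):
        # Policy evaluation: solve v = r_pi + gamma * A_pi v
        A_pi = A[pi, np.arange(n), :]              # row i is row i of A^{pi(x_i)}
        r_pi = R[np.arange(n), pi]
        v = np.linalg.solve(np.eye(n) - gamma * A_pi, r_pi)
        # Policy improvement, as in (2.59)
        q = R + gamma * np.einsum('kij,j->ik', A, v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            break
        pi = pi_new
    return pi, v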
Example 2.2. Now we return to the game of Blackjack. A detailed discussion of the game is given in [33,
Example 5.1]. To describe the original game briefly, it is played between a player and the “House.” (It is
possible to have more than one player playing against the House, but we don’t study that problem in the
interests of simplicity.) At each turn, the player and the House have the option of drawing a card (“hit”) or
not drawing (“stick”). Each card is counted as its face value, with picture cards counted as 10. An ace can
count as either 1 or 11, at the player’s preference. The objective of the player is to exceed the total of the
House without going over 21.
From the description, it is obvious that if the player’s current total is eleven or less, then the best strategy
is to hit, because there is no chance of losing on the next draw. Hence the issue of what to do arises only
when the player’s total reaches 12 or higher. Indeed, if the target were to be changed to some number N ,
then it is clear that if the player’s total is N − 10 or less, then the correct solution is to hit. It can also be
assumed that the probability of any particular card being the next card drawn is the same, no matter what

cards have been drawn until then (infinitely many card decks being used). In the original Blackjack game,
only one card of the House is visible. In what follows, for the purposes of illustration, we eliminate all of
these complications, and introduce a simplified game.
Suppose that, instead of drawing a card, the player rolls a fair four-sided die. Since there are only four
possible outcomes, irrespective of what the target total might be, it is reasonable to suppose that the state Pt
of the player lies in the set {0, 1, 2, 3, W, L}, with 0 being the start state. It can be assumed that the current
state is in {0, 1, 2, 3}, while W and L are terminal states. To simplify the problem further, suppose that the
House adopts the strategy that it does not roll the die further once its state is in {1, 2, 3} (i.e., it does not try
for a win from any of these states). Therefore the state Ht of the house lies in the set {1, 2, 3}. The overall
state (Pt , Ht ) lies in the Cartesian product {0, 1, 2, 3, W, L} × {1, 2, 3}. Out of these, there are twelve possible
current states, namely {0, 1, 2, 3} × {1, 2, 3} where the first number is the state of the player and the second
is the state of the House. If the player rolls the die, the possible next states are {1, 2, 3, W, L} × {1, 2, 3},
or a total of fifteen states. In this game, as in the snakes and ladders game, the reward is random and is a
function of the next state.
As a part of the problem statement, we need to specify the dynamics of the Markov process. For the
House, it does not play, so its state transition matrix is the 3 × 3 identity matrix, which ensures that
Ht+1 = Ht . As for the player’s state Pt , if the action is to “stick,” then the state transition matrix AS is
the 5 × 5 identity matrix. If the action is to “hit,” then the state transition matrix AH is given by
0 1 2 3 W L
0 0 0.25 0.25 0.25 0.25 0
1 0 0 0.25 0.25 0.25 0.25
AH = 2 0 0 0 0.25 0.25 0.50 .
3 0 0 0 0 0.25 0.75
W 0 0 0 0 1 0
L 0 0 0 0 0 1
To complete the problem formulation, we need to specify the reward. Unlike the state transition matrix
above, which is based on nothing more than the assumption that all four outcomes of the die are equally
likely, the reward is to some extent arbitrary. Let us assign the following rewards:
Pt > Ht : 2
Pt = Ht : 1
Pt < Ht : 0
Pt = W : 5
Pt = L : −5
With this problem specification, we should strive to find an optimal policy. Note that the action space
U = {H, S} (for “hit” or “stick”) has cardinality two. Hence the number of policies is 2^{12} = 4,096, which is
already large enough that simply enumerating all possibilities is not practicable.3 Hence some kind of policy
iteration is the only way.
For evaluating a specific policy, it can be noted that the duration of the game cannot exceed four time
steps. This is because the player’s position has to advance by at least one at each time step. So discount
factors very close to 1 do not make sense. The discount γ should be chosen much smaller, say 0.5.
Problem 2.2. Suppose that a Markov decision problem has four states and two actions. Suppose further
that the two row-stochastic matrices corresponding to the two actions are as follows:

A^{u1} = [ 0.1 0.3 0.3 0.3        A^{u2} = [ 0.3 0.2 0   0.5
           0.3 0.4 0.1 0.2                   0.1 0.1 0.2 0.6
           0   0.4 0.4 0.2                   0.2 0.5 0.1 0.2
           0.4 0.2 0.2 0.2 ],                0   0.1 0.5 0.4 ].


3 For the full Blackjack game, the number of policies is 2^{200} , as shown in [33, Example 5.1].

Suppose further that the reward map R : X × U → R is as follows (note that we write e.g., (3, 1) instead of
(x3 , u1 ) to save space):

     (1, 1) (1, 2) (2, 1) (2, 2) (3, 1) (3, 2) (4, 1) (4, 2)
R =    2      5     −1      4      3      3      6     −1
• Suppose we define a deterministic policy π by

    π = [ 0 1 1 0
          1 0 0 1 ].

  In other words, π(x1 ) = u2 , π(x2 ) = u1 , π(x3 ) = u1 , π(x4 ) = u2 . Compute the corresponding state
  transition matrix Aπ and reward map Rπ .
• Suppose we define a probabilistic policy π by

    π = [ 0.3 0.4 0.2 0.6
          0.7 0.6 0.8 0.4 ].

  Compute the corresponding state transition matrix Aπ and reward map Rπ .

• How many deterministic policies can there be for this problem?

• With a discount factor of γ = 0.9, compute the optimal value and optimal policy using Theorem 2.12.

Problem 2.3. Prove Theorem 2.11.


Problem 2.4. Using the policy iteration method of Theorem 2.12, compute the optimal value function and
optimal policy for the Markov decision process of Problem 2.2.
Problem 2.5. Formulate the simplified blackjack game as a Markov Decision Problem with discount factor
γ = 0.5 and find the optimal policy using policy iteration.
Chapter 3

Stochastic Approximation

In this chapter, we discuss stochastic approximation (SA), which provides the mathematical foundation for
many of the reinforcement learning (RL) algorithms presented in subsequent chapters. In RL, the learner
attempts to identify an optimal (or nearly optimal) policy for an MDP with possibly unknown dynamics.
Even if the MDP dynamics were to be known, computing an optimal policy using the policy iteration
approach described in Theorem 2.12 would require, at each iteration, the exact computation of the value
function associated with the policy. One of the applications of SA is that the policy can be updated even as
the value function is being estimated. In the case where the MDP dynamics are not known, the learner has
two possible approaches:

1. Estimate the MDP parameters and use this estimate to formulate the corresponding optimal policy.
This is often referred to as “indirect” RL, drawing inspiration from the phrase “indirect adaptive
control” from control theory.

2. Directly estimate the optimal policy without attempting to estimate the MDP parameters. This is
often referred to as “direct” RL, again drawing upon the phrase “direct adaptive control.”

We will see that SA is very useful in analyzing direct RL.

3.1 An Overview of Stochastic Approximation


Stochastic approximation (SA) can be viewed as an iterative technique for finding a zero of a function
f : Rd → Rd using noisy measurements. It is not necessary for the function to be “known” (e.g., in closed
form). All that is required is that, given an argument θ ∈ Rd , an “oracle” gives us a noise-corrupted
measurement in the form
yt+1 = f (θ t ) + ξ t+1 , (3.1)
where {ξ t }t≥1 is a noise sequence. We defer a discussion of the nature of the noise sequence to a later point
in this chapter. For the moment, let us focus on how the noisy measurements could be used to construct a
sequence of approximations {θ t } that we hope would converge to a θ ∗ ∈ Rd such that f (θ ∗ ) = 0.
Observe that a solution to f (θ) = 0 is also a minimizer of h(θ) = (1/2)‖f (θ)‖2_2 . Moreover, the gradient
of h(·) is given by
∇h(θ) = [∇f (θ)]^⊤ f (θ),
which points along f (θ) itself whenever the Jacobian ∇f (θ) is (a positive multiple of) the identity. If we had
access to noise-free measurements of f (·), we could therefore apply a steepest-descent-type algorithm that
searches along −f (θ) in order to minimize h(·), as follows: Pick some θ 0 ∈ Rd , and at step t, define

θ t+1 = θ t − αt f (θ t ),   (3.2)


where {αt } is a predetermined sequence of step sizes.1 Note that αt is permitted to be a random number; in
this case, its probability distribution depends only on information up to and including time t. This is made
precise later on. If only noisy measurements of the form (3.1) are available, then (3.2) is replaced by

θ t+1 = θ t − αt yt+1 = θ t − αt [f (θ t ) + ξ t+1 ]. (3.3)

However, there is some ambiguity about the updating rule (3.3). Note that h(θ) also equals (1/2)‖−f (θ)‖2_2 . So
we could just as easily use the updating rule

θ t+1 = θ t + αt yt+1 = θ t + αt [f (θ t ) + ξ t+1 ]. (3.4)

Which one should we choose?


The answer depends on what we think the shape of the function f (·) is. As above, let θ ∗ satisfy f (θ ∗ ) = 0.
If f (·) satisfies

⟨θ − θ ∗ , f (θ) − f (θ ∗ )⟩ < 0   (3.5)

if θ 6= θ ∗ , then we should use the updating (3.4). However, if the sign is reversed in (3.5), then we should
use the updating rule (3.3). Going forward, we will use (3.4). The main reason is that, if the function f (·)
satisfies (3.5), then under mild additional conditions θ ∗ is a globally attractive equilibrium of the associated
differential equation
θ̇ = f (θ). (3.6)
We will refer to a function f : Rd → Rd satisfying (3.5) for all θ ≠ θ ∗ as a passive function, borrowing a
term from circuit theory.
Stochastic approximation theory is devoted to the study of conditions under which a sequence {θ t } defined
as in (3.3) converges to a zero of the function f (·). Due to the presence of the noise, the convergence can only
be probabilistic. The SA algorithm was introduced in [29], and some generalizations and/or simplifications
followed very quickly; see [49, 20, 15, 13]. An excellent survey can be found in [26]. Book-length treatments
of SA can be found in [24, 3, 25, 8].
The above formulation of SA can be used to address some related problems. For instance, suppose
g : Rd → Rd is some function, and it is desired to find a fixed point of the map g, that is, a vector θ ∗ such
that g(θ ∗ ) = θ ∗ . As shown in Chapter 2, computing the value of a Markov reward problem, or the value
of a policy in an MDP, both fall into this category. This problem can be formulated as that of finding a zero
of the function f (θ) = g(θ) − θ. If we were to substitute this expression into (3.4), we get what might be
called the “fixed point version” of SA, namely

θ t+1 = θ t + αt [g(θ t ) − θ t + ξ t+1 ] = (1 − αt )θ t + αt [g(θ t ) + ξ t+1 ]. (3.7)

Another application is that of finding a stationary point of a function J : Rd → R, that is, finding a θ ∗ ∈ Rd
such that ∇J(θ ∗ ) = 0. Again, the above problem can be formulated in the present framework by defining
f (θ) := −∇J(θ). Here again, one can ask why we choose f (θ) := −∇J(θ) and not f (θ) := ∇J(θ). The answer is that
if we write f (θ) := −∇J(θ), then under suitable conditions we can expect SA to find a local (or global)
minimum of J. If we wish to maximize J, then of course we should choose f (θ) := ∇J(θ). As in the previous
case, the learner has available only noisy measurements of the gradient, in the form yt+1 = −∇J(θ t ) + ξ t+1 .
For this reason, the above formulation is sometimes referred to as “stochastic gradient descent.” The reader is
cautioned that the same phrase is also used with an entirely different meaning in the deep learning literature.
Equation 3.3 describes what might be called the “standard” SA. Two variants of SA are germane to RL,
namely “asynchronous” SA and “two time-scale” SA. Each of these is briefly described next.
1 In earlier years, methods such as steepest descent used to consist of two parts: (i) a choice of the search direction, and

(ii) solution of a minimum along the search direction to determine the step size. However, in recent times, one-dimensional
minimization has been dispensed with. Instead, the step size is chosen according to a predetermined schedule.

We begin with a description of “asynchronous” SA. Let [d] denote the set {1, · · · , d}, and let there be a
rule I that maps Z+ , the set of nonnegative integers, into [d]. At time t ∈ Z+ , let I(t) be the corresponding
element of [d]. Then the update rule (3.3) is applied only to the I(t)-th component of θ. Thus

θt+1,j = { θt,j − αt yt+1,j ,   if j = I(t),
           θt,j ,              if j ≠ I(t).     (3.8)

From an implementation standpoint, the asynchronous update rule (3.8) presumably requires less storage.
However, convergence with the asynchronous update rule might be slower than with (3.3).
Note that there is a great deal of flexibility in the update rule, which could either be deterministic or
probabilistic. To illustrate, updating each component of θ sequentially would be an example of a deterministic
update rule, while choosing an index i from [d] according to some probability distribution would be an
example of a probabilistic rule. We shall see in subsequent chapters that, in some RL applications, the
process {I(t)} is itself a Markov process assuming values in [d]. For a given integer T , let T (i) denote the
number of times that component i ∈ [d] is chosen to be updated, until time T . Then we insist that there
exists a ν > 0 such that

lim inf_{T →∞} T (i)/T ≥ ν > 0, ∀i ∈ [d].   (3.9)
If i is chosen in a random fashion, for example in accordance with some probability distribution on [d], or
as a Markov process on [d], then we insist only that (3.9) must hold almost surely. The purpose of (3.9) is
to ensure that, though only one component of θ t is updated at time t, as time goes on, the fraction of times
that each component of θ t is updated is bounded below by some positive constant.
This variant of asynchronous SA is introduced in [39], and builds on “asynchronous optimization” intro-
duced in [40]. Another variant of asynchronous SA is introduced in [7], and studied further in [9]. In this
variant, the update rule (3.8) is changed to

θt+1,j = { θt,j − αν(t+1;i) yt+1,j ,   if j = I(t),
           θt,j ,                     if j ≠ I(t),     (3.10)

where

ν(t + 1; i) = ∑_{τ =1}^{t+1} I_{{I(τ )=i}} .

Thus ν(t+1; i) counts the number of time instants up to time t+1 when component i is selected for updating.
The difference between (3.8) and (3.10) is this: In (3.8), the step size is αt , which depends on the “global
counter” t. In contrast, in (3.10), the step size is αν(t+1;i) , which depends on the “local counter” ν(t + 1; i).
Establishing the convergence of this particular variant makes use of far more stringent requirements on the
noise {ξ t }, compared to the version in (3.8). This can be seen by comparing the contents of [39] with those
of [7]. Moreover, standard RL algorithms such as Q-learning, introduced in subsequent chapters, correspond
to the update rule (3.8) and not (3.10). For this reason, the update rule in (3.10) is not studied further.
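A hedged Python sketch of one step of the asynchronous update (3.8) is given below; all names are my own and the oracle is a placeholder. Here the index I(t) is drawn uniformly at random, a probabilistic rule for which (3.9) holds almost surely with ν = 1/d; a deterministic round-robin choice i = t mod d would serve equally well.

import numpy as np

def asa_step(theta, t, alpha, oracle, rng):
    """One asynchronous SA step: only coordinate I(t) is updated, as in (3.8).

    oracle(theta) returns a noisy measurement y_{t+1} = f(theta) + xi_{t+1};
    alpha(t) is the step size based on the global counter t.
    """
    d = len(theta)
    i = rng.integers(d)                  # I(t): probabilistic update rule
    y = oracle(theta)                    # noisy measurement of f(theta)
    theta_new = theta.copy()
    theta_new[i] = theta[i] - alpha(t) * y[i]   # sign as displayed in (3.8); use '+' for (3.4)
    return theta_new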
The third variant of SA studied here is “two time-scale” SA, in which we attempt to solve “coupled”
equations of the form
f (θ, φ) = 0, g(θ, φ) = 0, (3.11)
where θ ∈ Rd , φ ∈ Rl , f : Rd × Rl → Rd , and g : Rd × Rl → Rl . As before, for given “current” guesses
θ t , φt , we have access only to noise-corrupted measurements of the form

yt+1 = f (θ t , φt ) + ξ t+1 , zt+1 = g(θ t , φt ) + ζ t+1 . (3.12)

We update the current guesses in a manner analogous with (3.3), namely

θ t+1 = θ t − αt yt+1 , φt+1 = φt − βt+1 zt+1 . (3.13)



Note the provision to have different step sizes for updating θ t and φt . Now, if we were to choose the step
sizes in such a way that αt = βt , or that the ratio αt /βt is bounded above and also below away from zero,
then nothing much would be gained by permitting two different step sizes. However, suppose that αt /βt → 0
as t → ∞. Then φt is updated more rapidly than θ t (or θ t is updated more slowly than φt ). In this setting,
(3.13) is said to represent “two time-scale” SA.
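The loop below is a minimal sketch (mine) of the two time-scale updates (3.13); the oracles are placeholders, and the particular step sizes α_t = 1/(t+1) and β_t = 1/(t+1)^{0.6} are illustrative choices for which α_t /β_t → 0, so that φ_t is the fast variable.

import numpy as np

def two_time_scale(theta0, phi0, oracle_f, oracle_g, T):
    """Coupled updates (3.13); theta is updated on the slow time scale."""
    theta = np.array(theta0, dtype=float)
    phi = np.array(phi0, dtype=float)
    for t in range(T):
        alpha_t = 1.0 / (t + 1)          # slow step size
        beta_t = 1.0 / (t + 1) ** 0.6    # fast step size; alpha_t / beta_t -> 0
        y = oracle_f(theta, phi)         # noisy measurement of f(theta, phi)
        z = oracle_g(theta, phi)         # noisy measurement of g(theta, phi)
        theta = theta - alpha_t * y
        phi = phi - beta_t * z
    return theta, phi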
The final section of this chapter deals with “finite-time” SA. Traditional results in SA are asymptotic,
and assert that, under suitable conditions, the iterates converge to a solution of the problem under study as
t → ∞. In contrast, in “finite-time” SA, the emphasis is on providing (probabilistic) estimates of how far
the current guess is from a solution. The finite-time approach is applicable to each of the three variants of
SA mentioned above.
Throughout this chapter, the emphasis is on stating, and wherever possible proving, theorems about the
behavior of various SA algorithms. The application of these results to problems in RL is deferred to later
chapters.

3.2 Introduction to Martingales


Because martingale difference sequences play a central role in SA, in this subsection we quickly summarize
some of the key aspects. In particular, we state without proof a couple of very useful results on the conver-
gence of martingales. Further details about this topic can be found in [48, 11, 6, 14]. In particular, [48, Part
B] is a very good source of theorems and examples, while the corresponding exercises in [48, Part E] provide
additional useful material. Similarly, [14, Chapter 4] has a wealth of material, including several examples
and problems, that is relevant to the material below.
Suppose that (Ω, F, P ) is a probability space, as described in Section 8.1. A sequence of σ-algebras
{Ft }t≥0 on Ω is called a filtration if
Ft ⊆ Ft+1 ⊆ F, ∀t ≥ 0. (3.14)
Clearly (3.14) implies that
F0 ⊆ F1 ⊆ · · · ⊆ Ft ⊆ Ft+1 ⊆ F, ∀t ≥ 0. (3.15)
Now suppose that {Zt }t≥0 is an Rd -valued stochastic process on (Ω, F, P ). We say that {Zt } is adapted to
the filtration {Ft }, or that the pair ({Zt }, {Ft }) is adapted, if Zt is measurable with respect to (Ω, Ft ), (i.e.,
with F replaced by Ft ). Since the underlying set Ω and probability measure P are fixed, and the only thing
varying is Ft , we denote this by Zt ∈ M(Ft ). In view of (3.14), we can make the following observations:

1. Zt ∈ M(Fτ ) whenever τ ≥ t.

2. Let Z0t ∈ Rd(t+1) denote (Z0 , Z1 , · · · , Zt ). Then Z0t ∈ M(Ft ).

If {Zt }t≥0 is an Rd -valued stochastic process on (Ω, F, P ), then we can define the “natural filtration” by

Ft = σ(Z0t ),

where σ(Z0t ) ⊆ F is the σ-algebra generated by Z0t . However, we do not always use the natural filtration.
Suppose {Ft } is a filtration on (Ω, F), and that {Zt }t≥0 is an Rd -valued stochastic process on (Ω, F, P ).
Then the pair ({Zt }, {Ft }) is said to be a martingale if the following hold:

(M1.) E(|Zt |, P ) < ∞, for all t ≥ 0.

(M2.) The pair ({Zt }, {Ft }) is adapted.

(M3.) We have that


E(Zt+1 |Ft ) = Zt , a.s., ∀t ≥ 0. (3.16)

If we use the natural filtration Ft = σ(Z0t ), then (3.16) can be replaced by Condition (M3’), namely

E(Zt+1 |Z0t ) = Zt , a.s., ∀t ≥ 0. (3.17)

If (3.16) is replaced by
E(Zt+1 |Ft ) ≤ Zt , a.s., ∀t ≥ 0, (3.18)
then {Zt }t≥0 is called a supermartingale, whereas if (3.16) is replaced by

E(Zt+1 |Ft ) ≥ Zt , a.s., ∀t ≥ 0, (3.19)

then {Zt }t≥0 is called a submartingale.


Several useful consequences of the definition are obtained by applying Theorem 8.1. If {Zt } is a martin-
gale, then by the iterated conditioning property (Item 5 of Theorem 8.1), it follows that

E(Zτ |Ft ) = Zt , a.s., ∀τ ≥ t + 1, ∀t ≥ 0. (3.20)

The equality is replaced by ≤ for a supermartingale, and by ≥ for a submartingale. Next, by the expected
value preservation property (Item 3 of Theorem 8.1), it follows that2

E[Zt , P ] = E[Z0 , P ], ∀t ≥ 0. (3.21)

It similarly follows that if {Zt } is a supermartingale, then

E[Zt , P ] ≤ E[Z0 , P ], ∀t ≥ 0, (3.22)

where as if {Zt } is a submartingale, then

E[Zt , P ] ≥ E[Z0 , P ], ∀t ≥ 0. (3.23)

Thus, in a supermartingale, {E[Zt , P ]} is a nonincreasing sequence of real numbers, while in a submartingale,


{E[Zt , P ]} is a nondecreasing sequence of real numbers.
Next, let {ξt }t≥0 be a stochastic process adapted to a filtration {Ft }, such that E[|ξt |, P ] < ∞ for all
t, and define

Zt = ∑_{τ =0}^{t} ξτ .   (3.24)

Then it is obvious that {Zt } is also adapted to {Ft }. The sequence ({ξt }, {Ft }) is said to be a martingale
difference sequence if ({Zt }, {Ft }) is a martingale. It is easy to show using (3.16) that, if {ξt } is a
martingale difference sequence, then

E(ξt+1 |Ft ) = 0, a.s., ∀t ≥ 0. (3.25)

If ξ0 = 0 almost surely (that is, if ξ0 is a constant), then it follows that E[ξt , P ] = 0 for all t ≥ 1. The picture
is clearer if each ξt belongs to L2 (Ω, Ft ). Then, by the projection property (Item 9) of Theorem 8.1, (3.25)
is equivalent to the statement that ξt+1 is orthogonal to every element of L2 (Ω, Ft ).

Example 3.1. Suppose {ξt } is a sequence of independent (but not necessarily identically distributed) zero-
mean random variables. Then {ξt } is a martingale difference sequence, and the sequence {Zt } is a martingale.
If instead E[ξt , P ] ≥ 0 for all t, then {Zt } is a submartingale, whereas if E[ξt , P ] ≤ 0 for all t, then {Zt } is a
supermartingale.
2 The reader is reminded that, wherever possible, we use parentheses for the conditional expectation, which is a random

variable, and square brackets for the expected value, which is a real number.
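A small numerical illustration (mine, not from the text): partial sums of independent zero-mean Gaussian increments form a martingale as in Example 3.1, and when the variances are summable the martingale is bounded in L2, so its sample paths converge, consistent with Theorem 3.4 stated below.

import numpy as np

rng = np.random.default_rng(0)
T = 10_000
sigma = np.arange(1, T + 1, dtype=float) ** -0.75   # sigma_t^2 = t^{-1.5} is summable
xi = sigma * rng.standard_normal(T)                 # independent zero-mean increments
Z = np.cumsum(xi)                                   # Z_t = xi_1 + ... + xi_t is a martingale
print(Z[99], Z[999], Z[-1])                         # successive values settle down
print(np.sum(sigma ** 2))                           # bounds sup_t E[Z_t^2]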

Now we present some results on the convergence of martingales, which are useful in proving the conver-
gence of various SA algorithms. The material is taken from [48, Chapter 12] and/or [14, Chapter 4] and is
stated without proof. Citations from these sources are given for individual results stated below.
We begin with a preliminary concept. Given a filtration {Ft } and a stochastic process {At } that is
adapted to Ft , we say that {(At , Ft )} is predictable if At ∈ M(Ft−1 ) for all t ≥ 1. Note that there is no
A0 . Also, note that in [48], such processes are said to be “previsible.” However, the phrase “predictable”
is used in [14] and appears to be more commonly used. We say that a martingale {Zt } (adapted to Ft ) is
null at zero if Z0 = 0 a.s., and that a predictable process {At } is null at zero if A1 = 0 a.s..3 With these
preliminaries, we can now state the following:
Theorem 3.1. (Doob decomposition theorem. See [48, Theorem 12.11] or [14, Theorem 4.3.2].) Suppose
{Ft } is a filtration and {Yt } is a stochastic process adapted to {Ft }. Then Yt can be expressed as
Yt = Y0 + Zt + At , (3.26)
where {Zt }t≥0 is a martingale null at zero, and {At } is a predictable process null at zero. If {Zt0 } and {A0t }
also satisfy the above conditions, then
P {ω : Zt (ω) = Zt0 (ω)&At (ω) = A0t (ω), ∀t} = 1. (3.27)
Moreover, {Yt } is a submartingale if and only if {At } is an increasing process, that is
P {ω : At+1 (ω) ≥ At (ω), ∀t} = 1. (3.28)
Note that an “explicit” expression for At is given by

At = ∑_{τ =0}^{t−1} E((Yτ +1 − Yτ )|Fτ ) = ∑_{τ =0}^{t−1} [E(Yτ +1 |Fτ ) − Yτ ].   (3.29)

Next, suppose Yt = Mt2 , where {Mt } is a martingale in L2 (Ω, P ) null at zero. Then it is easy to show using
the conditional Jensen’s inequality (not covered here) that {Yt } is a submartingale null at zero. Therefore
the Doob decomposition of Yt = Mt2 is
Mt2 = Zt + At , (3.30)
where {Zt } is a martingale and {At } is an increasing predictable process, both null at zero. It is customary
to refer to {At } as the quadratic variation process and to denote it by ⟨Mt ⟩. Note that

At+1 − At = E((M^2_{t+1} − M^2_t )|Ft ) = E((Mt+1 − Mt )^2 |Ft ).   (3.31)
Define A∞ (ω) = limt→∞ At (ω) for (almost all) ω ∈ Ω. Then we have the following:
Theorem 3.2. (See [48, Theorem 12.13].) If A∞ (·) is bounded almost everywhere as a function of ω, then
{Mt (ω)} converges almost everywhere as t → ∞.
Actually [48, Theorem 12.13] is more powerful and gives “almost necessary and sufficient” conditions for
convergence. We have simply extracted what is needed for present purposes.
Theorem 3.3. (See [14, Theorem 4.2.12].) If {Zt } is a nonnegative (i.e., Zt ≥ 0 a.s.) supermartingale, then
there exists a Z ∈ L1 (Ω, P ) such that Zt → Z almost surely, and E[Z, P ] ≤ E[Z0 , P ].
Theorem 3.4. Suppose {Zt } is a martingale wherein Zt ∈ Lp (Ω, P ) for some p > 1, and suppose further
that the martingale is bounded in the p-th mean, that is,

sup_t E[|Zt |^p , P ] < ∞.   (3.32)

Then there exists a Z ∈ Lp (Ω, P ) such that Zt → Z as t → ∞, almost surely and in the p-th mean.
The above theorem is false if p = 1. The convergence is almost sure but need not be in the mean. See
[14, Example 4.2.13].
3 This is a slight inconsistency because actually At is “null at time one,” but the usage is common.

3.3 Standard Stochastic Approximation


Recall the problem under study: There is a function f : Rd → Rd for which only noisy measurements are
available, and it is desired to find a zero of this function, using these noisy measurements. Specifically, if at
time t the current guess is θ t , then we get a measurement
yt+1 = f (θ t ) + ξ t+1 , (3.33)
where ξ t+1 is the measurement noise. The guess is updated according to
θ t+1 = θ t + αt yt+1 = θ t + αt (f (θ t ) + ξ t+1 ), (3.34)
where {αt }t≥1 is a predetermined sequence of step sizes. It is desired to study the limit behavior of the
sequence {θ t }. More specifically, suppose θ ∗ is the only solution to the equation f (θ) = 0. We would like
to find suitable conditions to ensure that the iterates {θ t } converge almost surely to θ ∗ .
In this section, we provide two different approaches to analyzing the SA algorithm. First, we analyze the
situation where the function f is “passive,” and second, where there exists a “Lyapunov” function V .

3.3.1 Stochastic Approximation for Passive Functions


In this subsection we show that the stochastic approximation algorithm of (3.34) converges when the function
f (·) is “passive,” which is made precise in (3.35) below. The proof given here follows that in [16]. However,
the reader is cautioned that in [16], the SA algorithm uses a minus sign in front of αt , that is, it uses the
formulation (3.3).
Definition 3.1. Suppose f : Rd → Rd , and suppose further that θ ∗ is the only solution to the equation
f (θ) = 0. Then f is said to be passive with a zero at θ ∗ if

sup_{ε<‖θ−θ ∗ ‖2 <ε^{−1}} ⟨θ − θ ∗ , f (θ)⟩ < 0, ∀ε > 0.   (3.35)

Note that if f (·) is continuous, then (3.35) can be replaced by


⟨θ − θ ∗ , f (θ)⟩ < 0, ∀θ ≠ θ ∗ .   (3.36)
In circuit theory, a nonlinear characteristic that satisfies (3.36) with θ ∗ = 0 would be called “passive,” so
we borrow that terminology.
To determine θ ∗ , we use the iterations defined by (3.34). The step sizes αt are positive, and satisfy the
conditions (referred to hereafter as the Robbins-Monro or RM conditions)

∑_{t=1}^{∞} αt = ∞,   ∑_{t=1}^{∞} αt^2 < ∞.   (3.37)

Theorem 3.5. (See [16, Theorem 1].) Suppose the following assumptions hold:
1. f (·) is passive with a zero at θ ∗ , that is, (3.35) holds.
2. There exists a constant d1 such that
kf (θ)k22 ≤ d1 (1 + kθ − θ ∗ k22 ). (3.38)

3. Let Ft := σ(θ t0 , ξ t1 ).4 Then the noise sequence {ξt+1 } satisfies two conditions: First,
E(ξ t+1 |Ft ) = 0 a.s.. (3.39)
Second, there exists a constant d2 > 0 such that
E(kξ t+1 k22 |Ft ) ≤ d2 (1 + kθ t k22 ) a.s., ∀t. (3.40)
4 The reader is reminded that θ_0^t is a shorthand for (θ 0 , · · · , θ t ), and that σ(θ_0^t , ξ_1^t ) denotes the σ-algebra generated
by θ_0^t and ξ_1^t .

4. The step size sequence {αt } satisfies the RM conditions (3.37).


Under these assumptions, we have that
1. The sequence {θ t } is bounded almost surely.
2. Further,
θ t → θ ∗ w.p. 1 as t → ∞. (3.41)

Proof. For notational simplicity, translate the coordinates so that θ ∗ = 0, that is, replace θ by θ − θ ∗ and f (θ)
by f (θ + θ ∗ ); then the unique solution of f (θ) = 0 is θ = 0. With this change, (3.40) continues to hold with a
possibly different (but still finite) constant d2 . Next,

‖θ_{t+1}‖_2^2 = ‖θ_t‖_2^2 + α_t^2 ‖f (θ_t )‖_2^2 + α_t^2 ‖ξ_{t+1}‖_2^2 + 2α_t θ_t^⊤ ξ_{t+1} + 2α_t^2 (f (θ_t ))^⊤ ξ_{t+1} + 2α_t θ_t^⊤ f (θ_t ).

Now we use (3.38), (3.39), and (3.40), and define d = d1 + d2 . This gives

E(‖θ_{t+1}‖_2^2 |Ft ) = ‖θ_t‖_2^2 + α_t^2 ‖f (θ_t )‖_2^2 + α_t^2 E(‖ξ_{t+1}‖_2^2 |Ft ) + 2α_t θ_t^⊤ f (θ_t )
                    ≤ ‖θ_t‖_2^2 + α_t^2 d(1 + ‖θ_t‖_2^2 ) + 2α_t θ_t^⊤ f (θ_t )
                    ≤ (1 + dα_t^2 )‖θ_t‖_2^2 + dα_t^2 ,   (3.42)

where we use the fact that θ_t^⊤ f (θ_t ) ≤ 0 from (3.36).
Next we define a new stochastic process

Zt := at kθ t k22 + bt , (3.43)

and choose the constants at , bt recursively so as to ensure that Zt is a nonnegative supermartingale. For this
purpose, note that

E(Zt+1 |Ft ) = at+1 E(kθ t+1 k22 |Ft ) + bt+1


≤ at+1 [(1 + dαt2 )kθ t k22 + dαt2 ] + bt+1 .

Now substitute
‖θ_t‖_2^2 = (Z_t − b_t )/a_t .
This gives

E(Z_{t+1} |Ft ) ≤ [a_{t+1} (1 + dα_t^2 )/a_t ] Z_t − [a_{t+1} (1 + dα_t^2 )/a_t ] b_t + a_{t+1} dα_t^2 + b_{t+1} .
So we get the supermartingale property
E(Z_{t+1} |Ft ) ≤ Z_t
provided we define a_{t+1} and b_{t+1} in such a way that

a_{t+1} (1 + dα_t^2 )/a_t = 1, or a_{t+1} = a_t /(1 + dα_t^2 ),

−b_t + a_{t+1} dα_t^2 + b_{t+1} = 0, or b_{t+1} = b_t − dα_t^2 a_{t+1} .


The above recursive relationships define a_t and b_t completely once we specify a_0 and b_0 . To ensure that
Z_t ≥ 0 for all t, we choose

a_0 = ∏_{t=0}^{∞} (1 + dα_t^2 ),   b_0 = ∑_{t=0}^{∞} dα_t^2 a_{t+1} .

The square-summability of the step size sequence {αt } ensures that both a0 and b0 are well-defined. This
gives the closed-form expressions

a_t = ∏_{k=t}^{∞} (1 + dα_k^2 ),   b_t = ∑_{k=t}^{∞} dα_k^2 a_{k+1} = ∑_{k=t}^{∞} dα_k^2 ∏_{m=k+1}^{∞} (1 + dα_m^2 ).   (3.44)

Moreover, we have
1 < at+1 < at < a0 , 0 < bt+1 < bt < b0 , ∀t.
Therefore at ↓ a∞ ≥ 1 and bt ↓ b∞ ≥ 0.
With the above choices for a_t , b_t , {Z_t } is a nonnegative supermartingale. By Theorem 3.3, {Z_t } converges
almost surely to some random variable. Because a_t → a_∞ ≥ 1 and b_t → b_∞ , it follows from (3.43) that
‖θ_t‖_2^2 also converges almost surely to some random variable ζ. Thus the proof is complete once it is shown
that ζ = 0 a.s..
Since {Zt } is a nonnegative supermartingale, it follows that

E[Zt , P ] ≤ E[Z0 , P ], ∀t ≥ 0.

Moreover, since at > 1 for all t, we conclude that

E[kθ t k22 , P ] ≤ c < ∞ ∀t, (3.45)

for some constant c. This is the first desired conclusion. Now let us return to (3.42). After taking the
expected value with respect to P and noting that

E[E(kθ t+1 k22 |Ft ), P ] = E[kθ t+1 k22 , P ],

we get

E[‖θ_{t+1}‖_2^2 , P ] ≤ E[‖θ_t‖_2^2 , P ] + α_t^2 d(1 + E[‖θ_t‖_2^2 , P ]) + 2α_t E[θ_t^⊤ f (θ_t ), P ].

Applying this bound recursively leads to

E[‖θ_{t+1}‖_2^2 , P ] ≤ E[‖θ_0‖_2^2 , P ] + ∑_{k=0}^{t} dα_k^2 (1 + E[‖θ_k‖_2^2 , P ]) + 2 ∑_{k=0}^{t} α_k E[θ_k^⊤ f (θ_k ), P ]
                    ≤ c + d(1 + c) ∑_{k=0}^{t} α_k^2 + 2 ∑_{k=0}^{t} α_k E[θ_k^⊤ f (θ_k ), P ].   (3.46)

Recall that θ_k^⊤ f (θ_k ) ≤ 0. Hence −θ_k^⊤ f (θ_k ) = |θ_k^⊤ f (θ_k )|. Now we can rearrange (3.46) as

2 ∑_{k=0}^{t} α_k |E[θ_k^⊤ f (θ_k ), P ]| ≤ c + d(1 + c) ∑_{k=0}^{∞} α_k^2 − E[‖θ_{t+1}‖_2^2 , P ] ≤ c + d(1 + c) ∑_{k=0}^{∞} α_k^2 .

In short, we have

∑_{k=0}^{t} α_k |E[θ_k^⊤ f (θ_k ), P ]| ≤ (1/2) [ c + d(1 + c) ∑_{k=0}^{∞} α_k^2 ] =: c1 , ∀t.

Now let t → ∞ and also change the index of summation from k to t. This gives

∑_{t=0}^{∞} α_t |E[θ_t^⊤ f (θ_t ), P ]| ≤ c1 < ∞.   (3.47)
Now recall that ∑_t α_t = ∞. This implies that

lim inf_{t→∞} |E[θ_t^⊤ f (θ_t ), P ]| = 0.   (3.48)

Otherwise, there would exist an integer K and an ε > 0 such that

|E[θ_t^⊤ f (θ_t ), P ]| ≥ ε, ∀t ≥ K.

Also,

∑_{t=K}^{∞} α_t = ∞,

because dropping a finite number of terms in the summation does not affect the divergence of the sum.
These two observations taken together contradict (3.47). Therefore (3.48) holds. As a result, there is a
subsequence of {θ_t }, call it {θ_{t_k} }, such that θ_{t_k}^⊤ f (θ_{t_k} ) → 0 in probability. In turn this implies that there is
yet another subsequence, which is again denoted by {θ_{t_k} }, such that θ_{t_k}^⊤ f (θ_{t_k} ) → 0 almost surely as k → ∞.
Again, if

lim inf_{k→∞} ‖θ_{t_k}‖_2^2 > 0,

then applying (3.35) leads to a contradiction. Hence there exists a subsequence of times {t_k } such that
‖θ_{t_k}‖_2^2 → 0 almost surely as k → ∞. Now ‖θ_t‖_2^2 → ζ a.s., and a subsequence converges almost surely to 0.
This shows that ζ = 0 a.s..
As a concrete application of Theorem 3.5, we consider the convergence of the iterations of a contraction
map with noisy measurements. Though the result is a corollary of Theorem 3.5, it is stated separately as a
theorem in view of its importance.
First we describe the set-up. Suppose g : Rd → Rd is a global contraction with respect to the Euclidean
norm. Specifically, suppose there exists a constant ρ < 1 such that

kg(θ) − g(φ)k2 ≤ ρkθ − φk2 , ∀θ, φ ∈ Rd . (3.49)

To determine the unique fixed point θ ∗ of g(·), define the iterations as in (3.7), namely

θ t+1 = (1 − αt )θ t + αt (g(θ t ) + ξ t+1 ), (3.50)

where {ξ t } is the measurement noise sequence.


Theorem 3.6. Define Ft := σ(θ t0 , ξ t1 ), and suppose there exists a constant d such that (3.39) and (3.40)
hold. Suppose further that the step size sequence {αt } satisfies the RM conditions (3.37). Then θ t → θ ∗ a.s.
as t → ∞.
Proof. Define f : Rd → Rd as f : θ ↦ g(θ) − θ. Then, for all θ, φ ∈ Rd , we have

⟨θ − φ, f (θ) − f (φ)⟩ = −‖θ − φ‖_2^2 + ⟨θ − φ, g(θ) − g(φ)⟩ ≤ −‖θ − φ‖_2^2 + ρ‖θ − φ‖_2^2 = −(1 − ρ)‖θ − φ‖_2^2 .

In particular, with φ = θ ∗ , we have that

⟨θ − θ ∗ , f (θ)⟩ ≤ −(1 − ρ)‖θ − θ ∗ ‖_2^2 .

Hence f (·) is passive, and moreover, θ ∗ is the unique zero of f (·). Also, since g(θ ∗ ) = θ ∗ , we have that

‖f (θ)‖_2 ≤ ‖g(θ)‖_2 + ‖θ‖_2 ≤ ρ‖θ − θ ∗ ‖_2 + ‖θ ∗ ‖_2 + ‖θ − θ ∗ ‖_2 + ‖θ ∗ ‖_2
         = (1 + ρ)‖θ − θ ∗ ‖_2 + 2‖θ ∗ ‖_2 = a + b‖θ − θ ∗ ‖_2 ,

where the definitions of a and b are obvious. Now using the easily proven identity

(a + bx)2 ≤ (a2 + 2ab) + (b2 + 2ab)x2 ∀x ≥ 0,

we conclude that f (·) satisfies (3.38). Then the desired result follows from Theorem 3.5.
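As a numerical illustration of Theorem 3.6 (my own sketch, not part of the text), take the affine contraction g(θ) = 0.5θ + b, whose unique fixed point is 2b, i.i.d. zero-mean Gaussian noise (which satisfies (3.39)–(3.40)), and the step sizes α_t = 1/t, which satisfy the RM conditions (3.37). Convergence with these step sizes is slow but visible.

import numpy as np

rng = np.random.default_rng(1)
b = np.array([1.0, -2.0])
g = lambda th: 0.5 * th + b              # contraction with rho = 0.5; fixed point 2b

theta = np.zeros(2)
for t in range(1, 200_001):
    alpha_t = 1.0 / t                    # RM conditions (3.37) hold
    xi = 0.5 * rng.standard_normal(2)    # i.i.d. zero-mean measurement noise
    theta = (1 - alpha_t) * theta + alpha_t * (g(theta) + xi)   # iteration (3.50)
print(theta, "exact fixed point:", 2 * b)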

Another application of Theorem 3.5 is to minimizing a convex function h : Rd → R using noisy measure-
ments of the gradient ∇h. Recall that a function h : Rd → R is said to be convex if

h[λθ + (1 − λ)φ] ≤ λh(θ) + (1 − λ)h(φ), ∀θ, φ ∈ Rd , ∀λ ∈ [0, 1].

Moreover, if h ∈ C 1 , then (see [17, Theorem B.4.1.1])

h(φ) ≥ h(θ) + ⟨∇h(θ), φ − θ⟩, ∀θ, φ ∈ Rd .   (3.51)

Moreover, θ ∗ is a global minimum of h if and only if ∇h(θ ∗ ) = 0. Hence one can attempt to find θ ∗ by
finding a zero of the function f (θ) = −∇h(θ), using noisy measurements of the form

yt+1 = −∇h(θ t ) + ξ t+1 = f (θ t ) + ξ t+1 ,

and the updating rule


θ t+1 = θ t + αt yt+1 = θ t + αt [−∇h(θ t ) + ξ t+1 ]. (3.52)
This is sometimes referred to as “stochastic gradient descent.” The next result establishes the convergence
of this approach under very mild conditions. The theorem below is noteworthy because it is not assumed
that the Hessian ∇2 h is bounded. So the theorem applies to, for example, h(θ) = θ4 , whereas many existing
results do not.
Theorem 3.7. Define Ft = σ(θ t0 , ξ t1 ), and suppose that the conditions (3.39) and (3.40) hold. Suppose
further that the convex, C 1 function h has a unique global minimum at θ ∗ , and that there exists a finite
constant d1 such that
k∇h(θ)k22 ≤ d1 (1 + kθ − θ ∗ k22 ), ∀θ ∈ Rd .
Then θ t → θ ∗ almost surely as t → ∞, provided the RM conditions (3.37) hold.
The proof consists of showing that the function f (·) = −∇h(·) satisfies (3.35) and is thus passive in the
sense of Definition 3.1. Since θ ∗ is the unique global minimum of h, we have that h(θ) > h(θ ∗ ) whenever
θ 6= θ ∗ . Now apply (3.51) with φ = θ ∗ . This gives

⟨θ − θ ∗ , f (θ)⟩ = ⟨∇h(θ), θ ∗ − θ⟩ ≤ h(θ ∗ ) − h(θ) < 0 if θ ≠ θ ∗ .

Now the desired result follows from Theorem 3.5.


Corollary 3.1. The conclusions of Theorems 3.6 and 3.7 continue to hold if {ξ t } is an i.i.d. sequence with
zero mean and finite variance.
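The next few lines (again my own sketch) illustrate the stochastic gradient iteration (3.52) on the strongly convex quadratic h(θ) = (1/2)(θ − c)⊤D(θ − c), for which the hypotheses of Theorem 3.7 and Corollary 3.1 plainly hold; the vector c, the matrix D, and the noise level are illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
c = np.array([1.0, -1.0])
D = np.diag([1.0, 3.0])
grad_h = lambda th: D @ (th - c)          # gradient of h(theta) = 0.5 (theta-c)' D (theta-c)

theta = np.array([5.0, 5.0])
for t in range(1, 100_001):
    alpha_t = 1.0 / t
    y = -grad_h(theta) + rng.standard_normal(2)   # noisy measurement of f = -grad h
    theta = theta + alpha_t * y                   # update (3.52)
print(theta, "minimizer:", c)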

3.3.2 Stochastic Approximation via Lyapunov Functions


In this subsection, we establish the convergence of the stochastic approximation algorithm (3.34) under a
different set of assumptions than passivity as defined in Definition 3.1. Specifically, the proof of Theorem
3.8 below is based on the existence of a “Lyapunov function” V that satisfies certain assumptions. At first
glance these assumptions might appear to be rather restrictive. However, it can be shown using “converse”
Lyapunov theory that if an associated differential equation θ̇ = f (θ) possesses certain stability properties,
then the existence of a suitable Lyapunov function V is guaranteed; see Section 8.4.
Though the proof of Theorem 3.8 below is broadly similar to that of Theorem 3.5, the assumptions
themselves are quite distinct in that neither set implies the other.
Let us reprise the problem under study. Suppose f : Rd → Rd and that θ ∗ is the unique solution to
f (θ) = 0. We begin with an initial guess θ 0 and update θ t via

θ t+1 = θ t + αt (f (θ t ) + ξ t+1 ), (3.53)

where {ξ t }t≥1 is the sequence of measurement noises.



Now the following assumptions are made about the function f . The first assumption is that the function
f is globally Lipschitz continuous with constant L. Thus

kf (θ) − f (φ)k2 ≤ Lkθ − φk2 , ∀θ, φ ∈ Rd . (3.54)

The next several assumptions concern the existence of a function V satisfying various assumptions. Though
it is not immediately obvious, these are actually assumptions about the function f . See Section 8.4 to see
the connection. It is now assumed that there exists a C 2 function V : Rd → R+ , and constants a, b, c > 0
and M < ∞ such that
akθ − θ ∗ k22 ≤ V (θ) ≤ bkθ − θ ∗ k22 , ∀θ ∈ Rd , (3.55)
This implies, among other things, that V (θ ∗ ) = 0 and that V (θ) > 0 for all θ 6= θ ∗ . Next

V̇ (θ) ≤ −ckθ − θ ∗ k22 , ∀θ ∈ Rd , (3.56)

where V̇ : Rd → R is defined as
V̇ (θ) := ∇V (θ)f (θ). (3.57)
Note that the gradient ∇V (θ) is taken as a row vector. The last assumption on V is that

k∇2 V (θ)kS ≤ M, ∀θ ∈ Rd , (3.58)

where ∇2 V is the Hessian matrix of V , and k · kS denotes the spectral norm of a matrix, that is, its largest
singular value.
As before, let Ft := σ(θ t0 , ξ t1 ). Then the noise sequence {ξt+1 } is assumed to satisfy the same two
conditions as before, reprised here for convenience: First,

E(ξ t+1 |Ft ) = 0 a.s.. (3.59)

Second, there exists a constant d > 0 such that

E(kξ t+1 k22 |Ft ) ≤ d(1 + kθ t k22 ) a.s., ∀t. (3.60)

Finally, the step size sequence {αt } satisfies the RM conditions (3.37), restated here:

X ∞
X
αt = ∞, αt2 < ∞. (3.61)
t=1 t=1

Theorem 3.8. Under the above set of assumptions, we have that


1. The sequence {θ t } is bounded almost surely.
2. Further,
θ t → θ ∗ w.p. 1 as t → ∞.

Proof. The proof is analogous to that of Theorem 3.5, with kθ t k22 replaced by V (θ t ). To begin with, we
translate coordinates so that θ ∗ = 0. This may cause the constant d in (3.60) to change, but it would still
be finite. Note that
V (θ t+1 ) = V (θ t + αt (f (θ t ) + ξ t+1 )).
Now by Taylor’s expansion around θ t , we get

V (θ t+1 ) = V (θ t ) + αt ∇V (θ t )[f (θ t ) + ξ t+1 ] + rt+1 ,

where the remainder term r_{t+1} satisfies

r_{t+1} = (1/2) α_t^2 [f (θ_t ) + ξ_{t+1} ]^⊤ ∇^2 V (z_t )[f (θ_t ) + ξ_{t+1} ]



for some z_t belonging to the line segment from θ_t to θ_{t+1} . Therefore

|r_{t+1} | ≤ α_t^2 M ‖f (θ_t ) + ξ_{t+1} ‖_2^2 = α_t^2 M [ ‖f (θ_t )‖_2^2 + ‖ξ_{t+1} ‖_2^2 + 2 f (θ_t )^⊤ ξ_{t+1} ].

Now using Equations (3.56), (3.59) and (3.60) gives

E(V (θ_{t+1} )|Ft ) = V (θ_t ) + α_t ∇V (θ_t )f (θ_t ) + E(r_{t+1} |Ft ) ≤ V (θ_t ) + α_t^2 M [ ‖f (θ_t )‖_2^2 + d(1 + ‖θ_t‖_2^2 ) ].

Now apply

‖f (θ_t )‖_2^2 ≤ L^2 ‖θ_t‖_2^2 ≤ (L^2 /a) V (θ_t ),   ‖θ_t‖_2^2 ≤ (1/a) V (θ_t ).
This leads to

E(V (θ_{t+1} )|Ft ) ≤ [ 1 + (α_t^2 M/a)(L^2 + d) ] V (θ_t ) + dM α_t^2 .
This is entirely analogous to (3.42). By replicating earlier reasoning, we conclude that V (θ t ) is bounded
almost surely and converges to some random variable ζ. In the last part of the proof, we restore the term
αt ∇V (θ t )f (θ t ) which is neglected earlier, and observe that

∇V (θ t )f (θ t ) ≤ −ckθ t k22 ≤ −(c/b)V (θ t ).

In fact, this part of the proof is simpler than that of Theorem 3.5, because in that case there is no lower bound
on how small the quantity |θ_t^⊤ f (θ_t )| can be, whereas here the term |∇V (θ_t )f (θ_t )| is bounded below by
(c/b)V (θ_t ). This allows us to conclude that ζ = 0 a.s., so that V (θ_t ) is bounded almost surely and
approaches zero as t → ∞. Finally, using (3.55) gives that {θ_t } is also bounded almost surely and approaches
zero as t → ∞. These details are simple and are left to the reader.
Next we present two examples: one where convergence follows from Theorem 3.5 but Theorem 3.8 does not apply, and another where the situation is reversed.
Example 3.2. As shown in Section 8.4, the hypotheses of Theorem 3.8 imply that the origin is a globally
exponentially stable (GES) equilibrium of the ODE θ̇ = f (θ). Now suppose d = 1 (scalar problem), and
define f (θ) = − tanh(θ). Then |f (θ)| is bounded as a function of θ. Thus the ODE θ̇ = − tanh(θ) cannot be
GES. Hence Theorem 3.8 does not apply. However, since θ tanh(θ) > 0 for every θ 6= 0, Theorem 3.5 applies.
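To make Example 3.2 concrete, here is a minimal numerical sketch (our own illustration, not part of the original text) of the stochastic approximation iteration θ_{t+1} = θ_t + α_t(f(θ_t) + ξ_{t+1}) applied to f(θ) = −tanh(θ), with an illustrative step-size choice α_t = 1/t and i.i.d. Gaussian noise.

import numpy as np

rng = np.random.default_rng(0)
theta = 5.0                          # arbitrary initial guess
for t in range(1, 200_001):
    alpha = 1.0 / t                  # satisfies the Robbins-Monro conditions (3.61)
    xi = rng.normal()                # zero-mean measurement noise
    theta += alpha * (-np.tanh(theta) + xi)

print(theta)                         # close to the unique zero theta* = 0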
Example 3.3. In the other direction, suppose we wish to solve the fixed point equation

v = Gv + r

for some vector r ∈ Rd . We can cast this into the standard stochastic approximation framework by defining

f (θ) = r + (G − Id )θ.

This is like the standard relationship of a discounted reward Markov reward process. Now choose a matrix
G ∈ Rd×d that satisfies kGkS > 1, and ρ(G) < 1, where ρ(·) denotes the spectral radius. Then choose a
vector x ∈ R^d such that x^⊤ x < x^⊤ G x. For example, let d = 2, λ ∈ (0.5, 1), and

G = ( λ  1 ; 0  λ ),   x = (1, 1)^⊤.

Then ρ(G) = λ < 1 whenever λ < 1. Moreover x> x = 2, whereas

x> Gx = 2λ + 1 > 2 if λ > 0.5.



Because ρ(G) < 1, the matrix G − I_d is nonsingular, so that for each vector r, there is a unique solution θ^∗ to the equation f(θ) = 0, namely θ^∗ = (I_d − G)^{−1} r. Now let θ = θ^∗ + x where x is chosen as above. Then

(θ − θ^∗)^⊤ (f(θ) − f(θ^∗)) = x^⊤ (G − I_d) x > 0.

Hence (3.5) does not hold, and Theorem 3.5 does not apply.
On the other hand, because ρ(G) < 1, the eigenvalues of the matrix G − I_d all have negative real parts, and as a result, the Lyapunov matrix equation

(G − I_d)^⊤ P + P (G − I_d) = −I_d

has a unique solution P, which is positive definite. If we define V(θ) = (θ − θ^∗)^⊤ P (θ − θ^∗), then Theorem 3.8 applies, and implies that the iterates converge to θ^∗.
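The claims of Example 3.3 are easy to check numerically. The following sketch (our own, assuming numpy and scipy are available) verifies that ρ(G) < 1 < ‖G‖_S, that x^⊤x < x^⊤Gx, and that the Lyapunov equation for the Hurwitz matrix G − I_d has a positive definite solution.

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

lam = 0.8
G = np.array([[lam, 1.0], [0.0, lam]])
x = np.array([1.0, 1.0])

print(max(abs(np.linalg.eigvals(G))))    # spectral radius rho(G) = lam < 1
print(np.linalg.norm(G, 2))              # spectral norm ||G||_S > 1
print(x @ x, x @ G @ x)                  # 2 < 2*lam + 1, so the condition of Theorem 3.5 fails

# The matrix G - I is Hurwitz, so (G - I)^T P + P (G - I) = -I has a
# positive definite solution P, and V(theta) = (theta - theta*)^T P (theta - theta*)
# satisfies the hypotheses of Theorem 3.8.
A = G - np.eye(2)
P = solve_continuous_lyapunov(A.T, -np.eye(2))
print(np.linalg.eigvals(P))              # both eigenvalues positive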

3.4 Batch Asynchronous Stochastic Approximation


In this section, we state and prove a very general result that covers not only the asynchronous stochastic
approximation (ASA) algorithms of (3.8) and (3.10), but also a much more general situation that we call
“batch asynchronous stochastic approximation” (BASA). Specifically, in (3.8) and (3.10), only one component
of θ t is updated at a given instant t. However, in BASA, it is possible to update multiple components at a
given time. Naturally, these results apply also to the traditional asynchronous versions of (3.8) and (3.10).
Throughout we consider vectors θ t ∈ Rd where d is fixed. We use θt,i to denote the i-th component of
θ, which belongs to R. kθk∞ denotes the `∞ -norm of θ. For s ≤ t, θ ts denotes (θ s , · · · , θ t ). Note that

‖θ_s^t‖_∞ = max_{s≤τ≤t} ‖θ_τ‖_∞.

The symbol N denotes the set of natural numbers plus zero, so that N = {0, 1, · · · }. If {Ft } is a filtration
with t ∈ N, then M(Ft ) denotes the set of functions that are measurable with respect to Ft . Recall that the
symbol [d] denotes the set {1, · · · , d}.
In Section 3.3, the objective is to compute the zero of a function f : Rd → Rd . Theorems 3.5 and 3.8
provide sufficient conditions under which “synchronous” stochastic approximation can be used to solve this
problem. In particular, if g : Rd → Rd is a contraction map with respect to the `2 -norm, then Theorem 3.6
shows that a fixed point of g can be found using the iterative scheme (3.5). However, as seen in Chapter 2,
often one has to determine the fixed point of a map g : Rd → Rd which is a contraction in the `∞ -norm.
Theorem 3.5 does not apply to this case. Proving a version that works in this case is the main motivation
for introducing “asynchronous” stochastic approximation in [39]. The treatment below basically extends
those arguments to the case where more than one component of θ t is updated at any time. An alternate
version of ASA is introduced in [7] using local clocks. However, the assumptions on the measurement noise
process {ξ t } are fairly restrictive, in that the process is assumed to be an i.i.d. sequence, an assumption that
does not hold in RL problems. In contrast, in [39], the noise process is assumed only to be a martingale
difference sequence, which is satisfied in RL problems. Thus, even if only one component of θ t is updated
at any one time t, the treatment here combines the best features of both [39] (noise process is a martingale
difference sequence) and [7] (updating can use a local clock). The possibility of “batch” updating is a bonus.
One important difference between standard SA and BASA is that even the step sizes can now be random
variables. In principle random step sizes can be permitted even in standard SA, but it is not common.
The proof is divided into two parts. In the first, it is shown that iterations using a very general updating
rule are bounded almost surely. This result would appear to be of independent interest; it is referred to
as “stability” by some authors. Then this result is applied to show convergence of the iterations to a fixed
point of a contractive map. In contrast with both [39] and [7], for the moment we do not permit delayed
information.

For the first theorem on almost sure boundedness, we set up the problem as follows: We consider a
function h : N × (Rd )N → (Rd )N , and say that it is nonanticipative if, for each t ∈ N,
θ_0^∞, φ_0^∞ ∈ (R^d)^N, θ_0^t = φ_0^t =⇒ h(τ, θ_0^∞) = h(τ, φ_0^∞), 0 ≤ τ ≤ t.

Note that such functions are also referred to as “causal” in control and system theory. There are also d
distinct “update processes” {νt,i }t≥0 for each i ∈ [d]. Finally, there is a “step size” sequence {βt }t≥0 , which
for convenience we take to be deterministic, though this is not essential. Also, it is assumed that βt ∈ (0, 1)
for all t.
The “core” stochastic processes are the parameter sequence {θ t }t≥0 , and the noise sequence {ξ t }t≥1 .
Note the mismatch in the initial values of t. Often it is assumed that θ 0 is deterministic, but this is not
essential. We define the filtration
F0 = σ(θ 0 ), Ft = σ(θ t0 , ξ t1 ) for t ≥ 1,
where σ(·) denotes the σ-algebra generated by the random variables inside the parentheses.
Now we can begin to state the problem set-up.
(U1). The update processes νt,i ∈ M(Ft ) for all t ≥ 0, i ∈ [d].
(U2.) The update processes satisfy the conditions that ν0,i equals either 0 or 1, and that νt,i equals either
νt−1,i or νt−1,i + 1. In other words, the process can only increment by at most one at each time instant
t, for each index i. This automatically guarantees that νt,i ≤ t for all i ∈ [d].
(S1). For “batch asynchronous updating” with a “global clock,” the step size αt,i for each index i is defined
as

α_{t,i} = β_t if ν_{t,i} = ν_{t−1,i} + 1, and α_{t,i} = 0 if ν_{t,i} = ν_{t−1,i}.   (3.62)
Therefore αt,i equals βt for those indices i that get incremented at time t, and zero for other indices.
(S2.) For “batch asynchronous updating” with a “local clock,” the step size αt,i for each index i is defined
as

α_{t,i} = β_{ν_{t,i}} if ν_{t,i} = ν_{t−1,i} + 1, and α_{t,i} = 0 if ν_{t,i} = ν_{t−1,i}.   (3.63)
Note that, with a local clock, we can also write

ν_{t,i} = ν_{0,i} + Σ_{τ=1}^{t} I_{{ν_{τ,i} = ν_{τ−1,i} + 1}},   (3.64)

where I denotes the indicator function: It equals 1 if the subscripted statement is true, and equals 0
otherwise. Note that with a global clock, the step size for each index that has a nonzero αt,i is the
same, namely βt . However, with a local clock, this is not necessarily true.
With this set-up, we can define the basic asynchronous iteration scheme.
θt+1,i = θt,i + αt,i (ηt,i − θt,i + ξt+1,i ), i ∈ [d], t ≥ 0, (3.65)
where
η t = h(t, θ t0 ). (3.66)
Note that in our view it makes no sense to define

η_{t,i} = h_i(ν_{t,i}, θ_0^{ν_{t,i}}).

We cannot think of any application for such an updating rule.
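The following sketch (our own illustration; the contraction map g, the random update pattern, and the step sizes are all hypothetical choices, not taken from the text) shows how the iteration (3.65) can be implemented with the local-clock step sizes of (3.63); replacing the indicated line with β(t) gives the global-clock rule (3.62).

import numpy as np

rng = np.random.default_rng(1)
d = 4
A = 0.5 * np.eye(d)                       # g(theta) = A @ theta + b is an l_inf-contraction (gamma = 0.5)
b = rng.normal(size=d)
theta_star = np.linalg.solve(np.eye(d) - A, b)   # unique fixed point of g

theta = np.zeros(d)
nu = np.zeros(d, dtype=int)               # local clocks nu_{t,i}
beta = lambda k: 1.0 / (k + 1)            # deterministic step-size sequence beta_t

for t in range(100_000):
    updated = rng.random(d) < 0.3         # random "batch" of components updated at time t
    eta = A @ theta + b                   # eta_t = h(t, theta_0^t) = g(theta_t)
    xi = rng.normal(size=d)               # martingale-difference measurement noise
    for i in np.flatnonzero(updated):
        nu[i] += 1
        alpha = beta(nu[i])               # local clock (3.63); use beta(t) for a global clock (3.62)
        theta[i] += alpha * (eta[i] - theta[i] + xi[i])

print(np.max(np.abs(theta - theta_star))) # small: the iterates approach the fixed point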
The question under study in the first part of this section is the almost sure boundedness of the iterates {θ_t}. For this purpose we introduce a few assumptions.

(A1.) There exist constants γ < 1 and c0 ≥ 0 such that

kh(t, θ t0 ) − h(t, φt0 )k∞ ≤ γkθ t0 − φt0 k∞ , ∀θ, φ ∈ (Rd )N , t ≥ 0, (3.67)

kh(t, 0)k∞ ≤ c0 , ∀t ≥ 0. (3.68)


Note that in (3.68), we use the generic symbol 0 to denote a vector with all zero components, whose dimension is determined by the context. If we choose any ρ ∈ (γ, 1) and define

c_1 := c_0 / (ρ − γ),   (3.69)
ρ−γ

then it is easy to verify that

kh(t, θ t0 )k∞ ≤ ρ max{c1 , kθ t0 k∞ }, ∀θ, t. (3.70)

(A2.) The step sizes αt,i satisfy the analogs of the Robbins-Monro conditions of [29], namely

Σ_{t=0}^{∞} α_{t,i} = ∞ a.s., ∀i ∈ [d],   (3.71)

Σ_{t=0}^{∞} α_{t,i}² < ∞ a.s., ∀i ∈ [d].   (3.72)

Note that if Σ_{t=0}^{∞} β_t² < ∞, then the square-summability of the α_{t,i} is automatic. However, the divergence of the sum of the α_{t,i} requires additional conditions on both the step sizes β_t and the update processes ν_{t,i}.

(A3.) The noise process {ξ t } satisfies the following:

E(ξt+1,i |Ft ) = 0, ∀t ≥ 0, i ∈ [d], (3.73)


E(‖ξ_{t+1}‖₂² | F_t) ≤ c_2 (1 + ‖θ_t‖₂²), ∀t ≥ 0,   (3.74)

for some constant c_2.

Now we can state the result on the almost sure boundedness of the iterates.

Theorem 3.9. With the set up above, and subject to assumptions (A1), (A2) and (A3), we have that the
sequence {θ t } is bounded almost surely.

Next, we state a result on convergence. For this purpose, we significantly reduce the generality of the
function h(·).

Theorem 3.10. Suppose that, in addition to the conditions of Theorem 3.9, we have that the function h(·)
is defined by
h(t, θ t0 ) = g(θ t ),
where the function g : Rd → Rd satisfies

|gi (θ) − gi (φ)| ≤ γkθ − φk∞ , ∀i ∈ [d], ∀θ, φ ∈ Rd ,

for some constant γ < 1.⁵ Let θ^∗ denote the unique fixed point of the map g. Then θ_t → θ^∗ almost surely
as t → ∞.
5 In other words, the function g is a contraction with respect to the `∞ -norm.

Now we prove the two theorems. The proof of Theorem 3.9 is very long and involves several auxiliary
lemmas. Before proceeding to the lemmas, a matter of notation is cleared up. Throughout we are dealing
with stochastic processes. So in principle we should be writing, for instance, θ t (ω), where ω is the element
of the probability space that captures the randomness. We do not do this in the interests of brevity, but the
presence of the argument ω is implicit throughout.
Lemma 3.1. Define a real-valued stochastic process {Ui (0; t)}t≥0 by

Ui (0; t + 1) = (1 − αt,i )Ui (0; t) + αt,i ξt+1,i , (3.75)

where U_i(0; 0) ∈ M(F_0), {α_{t,i}} satisfy (3.71) and (3.72), and {ξ_t} satisfies (3.73) and (3.74). Then U_i(0; t) → 0 almost surely as t → ∞.
This lemma is a ready consequence of Theorem 3.5: the recursion (3.75) is the stochastic approximation iteration for finding the zero of the function f(U) = −U, and U f(U) < 0 whenever U ≠ 0. Now apply Theorem 3.5 with θ_t = U_i(0; t).
Lemma 3.2. Define a doubly-indexed real-valued stochastic process W_i(s; t) by

W_i(s; t + 1) = (1 − α_{t,i}) W_i(s; t) + α_{t,i} ξ_{t+1,i},   (3.76)

where W_i(s; s) = 0, {α_{t,i}} satisfy (3.71) and (3.72), and {ξ_t} satisfies (3.73) and (3.74). Then, for each δ > 0, there exists a t_0 such that |W_i(s; t)| ≤ δ for all t_0 ≤ s ≤ t.
Proof. It is easy to prove using induction that W_i(s; t) satisfies the relation

U_i(0; t) = [ Π_{r=s}^{t−1} (1 − α_{r,i}) ] U_i(0; s) + W_i(s; t), 0 ≤ s ≤ t.

Note that the product in the square brackets is no larger than one. Hence

|Wi (s; t)| ≤ |Ui (0; t)| + |Ui (0; s)|.

Now, given δ > 0, choose a t0 such that |Ui (0; t)| ≤ δ/2 whenever t ≥ t0 . Then choosing t0 ≤ s ≤ t and
applying the triangle inequality leads to the desired conclusion.
Now we come to the proof of Theorem 3.9.
Proof. For t ≥ 0, define

Γ_t := max{‖θ_0^t‖_∞, c_1},   (3.77)

where c_1 = c_0/(ρ − γ) as before. With this definition, it follows from (3.70) and (3.66) that

‖η_t‖_∞ ≤ ρ Γ_t, ∀t.   (3.78)

Next, choose any ε ∈ (0, 1) such that ρ(1 + ε) < 1.⁶ Now we define a sequence of constants recursively. Let Λ_0 = Γ_0, and define

Λ_{t+1} = Λ_t if Γ_{t+1} ≤ Λ_t(1 + ε), and Λ_{t+1} = Γ_{t+1} if Γ_{t+1} > Λ_t(1 + ε).   (3.79)

Define λ_t = 1/Λ_t. Then it is clear that {Λ_t} is a nondecreasing sequence, starting at Λ_0 = Γ_0. Consequently, {λ_t} is a bounded, nonnegative, nonincreasing sequence. Further, λ_t Γ_t ≤ 1 + ε for all t. Moreover, either Λ_{t+1} = Λ_t, or else Λ_{t+1} > Λ_t(1 + ε). Hence, saying that Λ_{t+1} > Λ_t is the same as saying that Λ_{t+1} > Λ_t(1 + ε). Let us refer to this as an “updating” of Λ at time t + 1.
Next, observe that
6 In the proof of Theorem 3.9, it is sufficient that ρ(1 + ε) ≤ 1. However, in the proof of Theorem 3.10, we require that ρ(1 + ε) < 1. To avoid proliferating symbols, we use the same ε in both proofs.

Γ_t = max{‖θ_t‖_∞, Γ_{t−1}}.   (3.80)

It is a ready consequence of (3.79) that Γ_t ≤ Λ_t(1 + ε) for all t, whence ‖θ_t‖_∞ λ_t ≤ 1 + ε for all t. Moreover, if Λ is updated at time t, then

Γ_t > Λ_{t−1}(1 + ε) ≥ Γ_{t−1}.

This, coupled with (3.80), shows that ‖θ_t‖_∞ = Γ_t = Λ_t. Hence ‖θ_t‖_∞ λ_t = 1 at any time at which Λ is updated (and at other times ‖θ_t‖_∞ λ_t ≤ 1 + ε). In the same vein, if λ_t ‖θ_{t+1}‖_∞ ≤ 1 + ε, or equivalently ‖θ_{t+1}‖_∞ ≤ Λ_t(1 + ε), then (3.80) implies that Γ_{t+1} ≤ Λ_t(1 + ε), and there is no update at time t + 1.
Now we make the following claim:
Claim 1: If {θ_t} is unbounded, then Λ_t is updated infinitely often (i.e., for infinitely many values of t).
To establish the claim, suppose {θ_t} is unbounded, and let T be arbitrary. It is shown that there exists a τ ≥ T such that Λ_τ > Λ_{τ−1}(1 + ε), i.e., that Λ gets updated at time τ. Since this argument can again be repeated, it would show that Λ_t gets updated infinitely often if {θ_t} is unbounded, establishing the claim.
We prove the existence of such a τ as follows. Since {θ_t} is unbounded, there exists a t > T such that ‖θ_t‖_∞ > Λ_T(1 + ε). If there already exists a τ between T and t − 1 such that Λ gets updated at time τ, then we are done. So suppose this is not the case, i.e., that

Λ_T = Λ_{T+1} = · · · = Λ_{t−2} = Λ_{t−1}.

This implies in particular that Λ is not updated at time t − 1, i.e., that

Γ_{t−1} ≤ Λ_{t−2}(1 + ε) = Λ_{t−1}(1 + ε) = Λ_T(1 + ε).

On the other hand, by assumption ‖θ_t‖_∞ > Λ_T(1 + ε). Therefore

Γ_t = ‖θ_t‖_∞ > Λ_{t−1}(1 + ε).

Therefore Λ is updated at time t, i.e.,

Λ_t = Γ_t = ‖θ_t‖_∞ > Λ_{t−1}(1 + ε).

Hence we can take τ = t.


The contrapositive of the claim just proven is the following: If there is a time T such that Λ is not updated after time T, then {θ_t} is bounded. From the above discussion, a sufficient condition for this is the following: If there exists a T such that

‖θ_{t+1}‖_∞ λ_t ≤ 1 + ε, ∀t ≥ T,   (3.81)

then Λ is not updated after time T, and as a result, {θ_t} is bounded. So henceforth our efforts are focused on establishing (3.81).
Now we derive a “closed form” expression for λ_T θ_{T+1,i}. Towards this end, observe the following: A closed-form solution to the recursion

a_{t+1} = (1 − x_t) a_t + b_t   (3.82)

is given by

a_{T+1} = [ Π_{r=s}^{T} (1 − x_r) ] a_s + Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − x_r) ] b_t,   (3.83)

for every 0 ≤ s ≤ T. The proof by induction is easy and is left to the reader. Observe that

λ_{T+1} θ_{T+1,i} = θ_{T+1,i}(λ_{T+1} − λ_T) + θ_{T+1,i} λ_T
= θ_{T+1,i}(λ_{T+1} − λ_T) + [(1 − α_{T,i}) θ_{T,i} + α_{T,i} v_{T,i}] λ_T,   (3.84)

where we use the shorthand

v_{T,i} = η_{T,i} + ξ_{T+1,i}.

Now we partition λ_{T+1} θ_{T+1,i} as F_{T+1,i} + G_{T+1,i}, and find recursive formulas for F and G such that (3.84) holds. Thus we must have

F_{T+1,i} + G_{T+1,i} = (1 − α_{T,i})(F_{T,i} + G_{T,i}) + θ_{T+1,i}(λ_{T+1} − λ_T) + α_{T,i} v_{T,i} λ_T.

This equation holds if

F_{T+1,i} = (1 − α_{T,i}) F_{T,i} + θ_{T+1,i}(λ_{T+1} − λ_T),
G_{T+1,i} = (1 − α_{T,i}) G_{T,i} + α_{T,i} v_{T,i} λ_T.   (3.85)

The solutions to (3.85) follow readily from (3.83), and are

F_{T+1,i} = [ Π_{r=s}^{T} (1 − α_{r,i}) ] F_{s,i} + Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] θ_{t+1,i}(λ_{t+1} − λ_t),   (3.86)

G_{T+1,i} = [ Π_{r=s}^{T} (1 − α_{r,i}) ] G_{s,i} + Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t v_{t,i}.   (3.87)

Adding these two equations, noting that F_{s,i} + G_{s,i} = λ_s θ_{s,i}, and expanding v_{t,i} as η_{t,i} + ξ_{t+1,i} gives

λ_{T+1} θ_{T+1,i} = Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] θ_{t+1,i}(λ_{t+1} − λ_t)
+ Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} (λ_t η_{t,i} + λ_t ξ_{t+1,i}) + [ Π_{r=s}^{T} (1 − α_{r,i}) ] λ_s θ_{s,i}.   (3.88)

Next, let us expand the right side of (3.88) as

λ_{T+1} θ_{T+1,i} = θ_{T+1,i}(λ_{T+1} − λ_T) + Σ_{t=s}^{T−1} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] θ_{t+1,i}(λ_{t+1} − λ_t) + [ Π_{r=s}^{T} (1 − α_{r,i}) ] λ_s θ_{s,i}
+ Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t η_{t,i} + Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t ξ_{t+1,i}.   (3.89)

Cancelling the common term λT +1 θT +1,i and moving λT θT +1,i to the left side gives the desired expression,
namely
λT θT +1,i = Ai (s, T ) + Bi (s, T ) + Ci (s, T ) + Di (s, T ), (3.90)
where for 0 ≤ s < T

A_i(s, T) = Σ_{t=s}^{T−1} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] θ_{t+1,i}(λ_{t+1} − λ_t),   (3.91)

B_i(s, T) = [ Π_{r=s}^{T} (1 − α_{r,i}) ] λ_s θ_{s,i},   (3.92)

C_i(s, T) = Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t η_{t,i},   (3.93)

D_i(s, T) = Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t ξ_{t+1,i}.   (3.94)

Now we carry on with the proof of Theorem 3.9. First, we find an upper bound for C_i(s, T). Observe that, by (3.78), ‖η_t‖_∞ ≤ ρ Γ_t, while (3.79) gives Γ_t λ_t ≤ 1 + ε for all t. Hence

λ_t ‖η_t‖_∞ ≤ ρ λ_t Γ_t ≤ ρ(1 + ε) ≤ 1, ∀t,

by the manner in which ε is chosen. Substituting this into (3.93) gives

|C_i(s, T)| ≤ Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} |λ_t η_{t,i}|   (3.95)
≤ Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i}
= 1 − Π_{r=s}^{T} (1 − α_{r,i}).   (3.96)

Next, we derive a recursion for D_i(s, T):

D_i(s, T) = Σ_{t=s}^{T} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t ξ_{t+1,i}
= α_{T,i} λ_T ξ_{T+1,i} + Σ_{t=s}^{T−1} [ Π_{r=t+1}^{T} (1 − α_{r,i}) ] α_{t,i} λ_t ξ_{t+1,i}
= α_{T,i} λ_T ξ_{T+1,i} + (1 − α_{T,i}) Σ_{t=s}^{T−1} [ Π_{r=t+1}^{T−1} (1 − α_{r,i}) ] α_{t,i} λ_t ξ_{t+1,i}
= α_{T,i} λ_T ξ_{T+1,i} + (1 − α_{T,i}) D_i(s, T − 1).   (3.97)

Now we make the following claim:

Claim 2: Define a stochastic process {M_i(0, t)}_{t≥0} by the recursion

M_i(0, t + 1) = (1 − α_{t,i}) M_i(0, t) + α_{t,i} λ_t ξ_{t+1,i},

with the initial condition M_i(0, 0) = 0. Then M_i(0, t) → 0 almost surely as t → ∞.

The proof of Claim 2 follows readily from Lemma 3.1 by replacing ξ_{t+1,i} by λ_t ξ_{t+1,i}, and observing that λ_t ξ_{t+1,i} satisfies the analogs of (3.73) and (3.74), because λ_t ∈ M(F_t) for all t and is bounded.
Next, we make another claim.
Claim 3: For every δ > 0, there exists a t_0 such that the solution D_i(s, T) of the recursion (3.97) with the initial condition D_i(s, s − 1) = 0 satisfies |D_i(s, T)| ≤ δ for all t_0 ≤ s ≤ T.
The proof follows readily from Lemma 3.2 with minor changes.
We continue with the proof of Theorem 3.9. Choose a t_0 such that

|D_i(s, T)| ≤ ε/2, ∀t_0 ≤ s ≤ T.   (3.98)

If Λ is not updated at any time after t_0, then it follows from the earlier discussion that {θ_t} is bounded. Otherwise, define s to be any time after t_0 + 1 at which Λ is updated. Thus ‖θ_s‖_∞ λ_s = 1 due to the updating. Next, suppose that, for some T ≥ s, we have that

‖θ_t‖_∞ λ_t ≤ 1 + ε, ∀s ≤ t ≤ T.   (3.99)

Then it is shown that

‖θ_{T+1}‖_∞ λ_{T+1} ≤ 1 + ε.

The proof is by induction on T. Note that (3.99) holds with T = s to start the induction. Now (3.99) implies that Λ is not updated between times s and T, so that λ_s = λ_t = λ_T for s ≤ t ≤ T. Hence it follows from (3.91) that A_i(s, T) = 0. Next, (3.92) gives

|B_i(s, T)| ≤ [ Π_{r=s}^{T} (1 − α_{r,i}) ] λ_s ‖θ_s‖_∞ ≤ Π_{r=s}^{T} (1 − α_{r,i}).

Combining this bound with (3.96) gives

|B_i(s, T) + C_i(s, T)| ≤ |B_i(s, T)| + |C_i(s, T)| ≤ 1.

Finally, from Claim 3, we have that |D_i(s, T)| ≤ ε. Combining all these shows that

|λ_T θ_{T+1,i}| ≤ 1 + ε ∀i ∈ [d], or λ_T ‖θ_{T+1}‖_∞ ≤ 1 + ε.

As observed earlier, this means that Λ is not updated at time T + 1, so that λ_{T+1} = λ_T and λ_{T+1} ‖θ_{T+1}‖_∞ ≤ 1 + ε. This completes the inductive step and proves the theorem.

Proof. Now we come to the proof of Theorem 3.10. We study the common situation where

h(t, θ_0^t) = g(θ_t),

where g : R^d → R^d is a contraction with respect to the ℓ_∞-norm. In other words, there exists a γ < 1 such that

|g_i(θ) − g_i(φ)| ≤ γ ‖θ − φ‖_∞, ∀i ∈ [d], ∀θ, φ ∈ R^d.

In addition, define c_0 = ‖g(0)‖_∞. This notation is consistent with (3.67) and (3.68). These hypotheses imply that there is a unique fixed point θ^∗ ∈ R^d of g(·). Theorem 3.10 states that θ_t → θ^∗ almost surely as t → ∞.
For notational convenience, it is assumed that θ^∗ = 0. The modifications required to handle the case where θ^∗ ≠ 0 are obvious, and can be incorporated at the expense of messier notation. If θ^∗ = 0, then g(0) = 0, so that c_0 = 0. Hence we can take c_1 = 0, ρ ∈ (γ, 1), and (3.70) becomes

‖h(t, θ_0^t)‖_∞ ≤ ρ ‖θ_t‖_∞ ≤ ρ ‖θ_0^t‖_∞, so that ‖η_t‖_∞ ≤ ρ ‖θ_t‖_∞.

Let Ω_0 ⊆ Ω be the subset of Ω on which U_i(0; t)(ω) → 0 as t → ∞ for all i ∈ [d], and on which the conclusion of Theorem 3.9 holds. So for each ω ∈ Ω_0, there exists a bound H_0 = H_0(ω) such that

‖θ_0^∞(ω)‖_∞ ≤ H_0(ω), or |θ_{t,i}(ω)| ≤ H_0(ω), ∀i ∈ [d], ∀t ≥ 0.

Hereafter we suppress the dependence on ω. Now choose ε ∈ (0, 1) such that ρ(1 + ε) < 1. Note that in the proof of Theorem 3.9, we required only that ρ(1 + ε) ≤ 1. For such a choice of ε, Theorem 3.9 continues to apply. Therefore ‖θ_0^∞‖_∞ ≤ H_0. Now define H_{k+1} = ρ(1 + ε) H_k for each k ≥ 0. Then clearly H_k → 0 as k → ∞.
Now we show that there exists a sequence of times {t_k} such that ‖θ_{t_k}^∞‖_∞ ≤ H_k for each k. This is enough to show that θ_t(ω) → 0 for all ω ∈ Ω_0, which is the claimed almost sure convergence to the fixed point of g. The proof of this claim is by induction. It is shown that if there exists a t_k such that ‖θ_{t_k}^∞‖_∞ ≤ H_k, then there exists a t_{k+1} such that ‖θ_{t_{k+1}}^∞‖_∞ ≤ H_{k+1}. The statement holds for k = 0: take t_0 = 0, because H_0 is a bound on ‖θ_0^∞‖_∞. To prove the inductive step, suppose that ‖θ_{t_k}^∞‖_∞ ≤ H_k. Choose a τ_k ≥ t_k such that the solution to (3.76) satisfies

|W_i(τ_k; t)| ≤ (ρε/2) H_k, ∀i ∈ [d], ∀t ≥ τ_k.   (3.100)

Define a sequence {Yt,i } by

Yt+1,i = (1 − αt,i )Yt,i + αt,i ρHk , Yτk ,i = Hk , i ∈ [d], t ≥ τk . (3.101)

Then clearly Yt,i → ρHk as t → ∞ for each i ∈ [d]. Now it is claimed that

− Yt,i + Wi (τk ; t) ≤ θt,i ≤ Yt,i + Wi (τk ; t), ∀i ∈ [d], t ≥ τk . (3.102)

The proof of (3.102) is also by induction on t. The bound (3.102) holds for t = τ_k because W_i(τ_k; τ_k) = 0, and because the inductive assumption on k implies that

|θ_{τ_k,i}| ≤ ‖θ_{τ_k}^∞‖_∞ ≤ ‖θ_{t_k}^∞‖_∞ ≤ H_k = Y_{τ_k,i}.

Now note that if (3.102) holds for a specific value of t, then for that t ≥ τ_k we have

θ_{t+1,i} = (1 − α_{t,i}) θ_{t,i} + α_{t,i} η_{t,i} + α_{t,i} ξ_{t+1,i}
≤ (1 − α_{t,i}) θ_{t,i} + α_{t,i} ρ H_k + α_{t,i} ξ_{t+1,i}
≤ (1 − α_{t,i}) [Y_{t,i} + W_i(τ_k; t)] + α_{t,i} ρ H_k + α_{t,i} ξ_{t+1,i}
= (1 − α_{t,i}) Y_{t,i} + α_{t,i} ρ H_k + (1 − α_{t,i}) W_i(τ_k; t) + α_{t,i} ξ_{t+1,i}
= Y_{t+1,i} + W_i(τ_k; t + 1).

This proves the upper bound in (3.102) for t + 1. So the inductive step is established, showing that the upper bound in (3.102) is true for all t ≥ τ_k. Now note that Y_{t,i} → ρH_k as t → ∞. Hence there exists a t'_{k+1} ≥ τ_k such that

Y_{t,i} ≤ ρH_k + (ρε/2)H_k = ρ(1 + ε/2)H_k, ∀t ≥ t'_{k+1}.
This, combined with (3.100), shows that

Y_{t,i} + W_i(τ_k; t) ≤ ρ(1 + ε)H_k = H_{k+1}, ∀t ≥ t'_{k+1}.

Hence
θ_{t,i} ≤ H_{k+1}, ∀i ∈ [d], t ≥ t'_{k+1}.
A parallel argument gives a lower bound: There exists a t''_{k+1} such that

−H_{k+1} ≤ θ_{t,i}, ∀i ∈ [d], t ≥ t''_{k+1}.

If we define t_{k+1} = max{t'_{k+1}, t''_{k+1}}, then

|θ_{t,i}| ≤ H_{k+1}, ∀i ∈ [d], t ≥ t_{k+1}.

This establishes the inductive step for k and completes the proof.

3.5 Two Time Scale Stochastic Approximation


3.6 Finite-Time Stochastic Approximation
Chapter 4

Approximate Solution of MDPs via Simulation

The contents of Chapter 2 are based on the assumption that the parameters of the Markov Decision Process
are all known. In other words, the |U| possible state transition matrices Auk , as well as the reward map
R : X × U → R (or its random version), are all available to the agent to aid in the choice of an optimal
policy. One can say that the distinction between MDP theory and reinforcement learning (RL) theory is
that in the latter, it is not assumed that the parameters of the MDP are known. Thus, in RL, one attempts
to learn these parameters based on observations.
In the RL literature, a couple of phrases are widely used without always being defined precisely. The
first phrase is “tabular methods.” As we will see, the methods presented in this chapter attempt to form
estimates of the value function, or the action-value function, for a specific policy. These estimates are almost
invariably iterative, in that the next estimate is based on the previous one. To elaborate this point, let V̂t (xi )
denote the estimate, at step t, of the value of the state xi . In many if not most iterative procedures, V̂t (xi )
depends on the estimates for all states V̂t−1 (x1 ) through V̂t−1 (xn ) at step t−1. When we attempt to estimate
the action-value function, the estimate Qt (xi , uk ) depends on all previous estimates of Qt−1 (xj , wl ). This
requires that the problems under study be sufficiently small that these estimates will fit into the computer
storage. The second phrase that is used is “on-policy” simulation and its twin, “off-policy” simulation. What
this means is the following: Suppose we wish to approximate the value function Vπ for a particular policy
π, known as the target policy. For this purpose, we have available a sample path of the MDP under some
other policy θ, which is known as the behavior policy. If the behavior policy is the same as the target
policy, that is, if we are able to observe a sample path of the Markov process under the same policy that we
wish to evaluate, then that situation is referred to as on-policy; otherwise the situation is called off-policy.

4.1 Monte-Carlo Methods


The phrase “Monte Carlo” methods is used nowadays to refer to almost any technique wherein an expected
value of a random variable is approximated by its empirical average, that is, an average of its observed
values. The main idea is that, as the number of observations increases, the empirical estimate converges
(in probability and almost surely) to its true value. The original “Monte Carlo” method for estimating
probabilities of discrete random variables, and for estimating the mean value of real-valued random variables,
dates back to the 1940s; Section 6.1 presents this original Monte Carlo method. When it is desired to use a common set of samples to estimate, simultaneously, the probabilities of infinitely many sets, or the mean values of infinitely many random variables, one enters the realm of “statistical learning theory.” There are several book-length treatments of statistical learning theory, including [45]. This is among the problems studied in Chapter 6.


4.1.1 MC Methods for Estimating the Value Function


One application of the Monte Carlo approach is to approximate the value of a policy for a MDP where the
underlying parameters are unknown. While Monte Carlo methods are not always well-suited to address this
situation, many of the philosophical approaches introduced here are applicable to other techniques as well.
Accordingly, it is assumed that the state space X and action space U are known, but everything else is unknown. The discussion here assumes that some policy π has been chosen and implemented, and that we observe a time trajectory of triplets {(X_t, U_t, W_{t+1})}_{t≥0}, where U_t = π(X_t) if the policy π is deterministic, W_{t+1} = R_π(X_t) where the policy reward R_π is unknown and possibly random, and the state transition matrix A_π resulting from the policy is also unknown. Because π is fixed throughout, we drop the subscript and superscript π. Note that we permit the reward R to be random; however, initially we restrict the study to the case where the policy is deterministic. Thus, if the same state x_i occurs more than once in the trajectory, so that X_t = X_τ = x_i for two different t, τ, then we will have U_t = U_τ = π(x_i); however, if the reward is random, we may not have W_{t+1} = W_{τ+1}. At the end of the exercise, we will generate an estimate for the value vector v_π associated with the policy π. Note that, if the objective is to find an optimal policy (either exactly or approximately), then one would have to compute such estimates for each possible policy, of which there are |U|^{|X|}. This is one of the advantages of Q-learning, studied in Section 4.2.3, in which the approximations converge to the optimal action-value function, and hence to an optimal policy.
It is also worth pointing out that, because the policy π is fixed throughout, one can actually think of the
process under study as a Markov reward process, and not a Markov decision process. This is the viewpoint
advocated in [35, Chapter 3].
The discussion in this section applies to the case where the underlying Markov process contains one or
more absorbing, or terminal, states. Recall that a state xi is said to be “absorbing” if

Pr{Xt+1 = xi |Xt = xi } = 1,

or equivalently, the row of the state transition matrix corresponding to the state xi consists of a 1 in column
i and zeros in other columns. The Markov process can have more than one absorbing state. While the
dynamics of the MDP are otherwise assumed to be unknown, it is assumed that the learner knows which
states are absorbing. By tradition, it is assumed that the reward R(x_i) = 0 whenever x_i is an absorbing state.
Observe too that the state transition matrix Aπ depends on the policy under study. So different policies
could lead to different absorbing states. However, since we study one policy at a time, it is acceptable to
assume that the set of absorbing states under the policy being studied is known.
In this setting, an episode refers to any sample path {(X_t, U_t, W_{t+1})}_{t≥0} that terminates in an absorbing
state. Since the policy π is chosen by the learner and is deterministic, it is always the case that Ut = π(Xt ).
Therefore Ut does not add any new information. Once Xt reaches an absorbing state, the episode terminates.
The underlying assumption is that, once the Markov process reaches an absorbing state, it can be restarted
with the initial state distributed according to its stationary (or some other) distribution. This assumption
may not always hold in practice.
Now we discuss how to generate an estimate for the discounted future value, based on a single episode.
The assumption mentioned above implies that we can repeat the estimation process over multiple episodes,
and then average all of those estimates, to arrive finally at an overall estimate. Define

G_t = Σ_{i=0}^{∞} γ^i W_{t+i}.   (4.1)

If an absorbing state is reached after a finite time, say T , then the summation can be truncated at time T ,
because Wt+1 = 0 for t ≥ T . In this connection, recall Theorem 8.12, which gives a formula for the average
time needed to hit an absorbing state. Now by definition, for a state xi ∈ X , we have

V (xi ) = E[Gt |Xt = xi ].



Accordingly, suppose that an episode contains the state of interest xi at time τ , that is, Xτ = xi . Let us
also suppose that the episode terminates at time T . In such a case, we note that
"T −τ # "∞ #
X X
E γ i Wτ +i = E γ i Wτ +i = V (xi ).
i=0 i=0

Therefore the quantity


T
X −τ
GTτ := γ i Wτ +i
i=0

provides an unbiased estimate for V (xi ). It is therefore one method of estimating V (xi ). Now suppose we
have L episodes, call them E_1, · · · , E_L. Let k denote the number of these episodes in which the state of interest x_i
occurs. Without loss of generality, renumber the episodes so that these are E1 through Ek . It is of course
possible that the state of interest xi occurs more than once in the sample path. Therefore there are a couple
of different ways of estimating V (xi ) using this collection of episodes. For each such episode, let τ denote
the first time at which xi appears in the state sequence, and T the time at which the episode terminates.1
Further, define

H_l := Σ_{i=0}^{T−τ} γ^i W_{τ+i}.

Then

(1/k) Σ_{l=1}^{k} H_l   (4.2)

provides an estimate for V(x_i), known as the first-time estimate. If the state of interest x_i occurs multiple times within the same episode, then one can form multiple estimates H_l, one for each time the state of interest x_i occurs in the trajectory, and then average them. This is called the everytime estimate.
Example 4.1. The objective of this example is to illustrate the difference between a first-time estimate
and an everytime estimate. Suppose n = 3, and for convenience label the states as A, B, C, where A is
an absorbing state and B, C are nonabsorbing. Suppose further that R(C) = 3, R(B) = 2 and of course
R(A) = 0. Suppose L = 3 and that the three episodes (all terminating at A) are:

E1 = CBCBBA, E2 = BBA, E3 = BCCBA.

Now suppose we wish to estimate the value V (C). Then the episode E2 does not interest us because C does
not occur in it. If the discount factor γ equals 0.9, then we can form the following quantities:

H11 = 3 + 2 · (0.9) + 3 · (0.9)2 + 2 · (0.9)3 + 2 · (0.9)4 ,

H12 = 3 + 2 · (0.9) + 2 · (0.9)2 ,


H31 = 3 + 3 · (0.9) + 2 · (0.9)2 , H32 = 3 + 2 · (0.9).
Then (H11 + H31 )/2 is the first-time estimate for V (C), while (H11 + H12 + H31 + H32 )/4 is the everytime
estimate for V (C).
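The first-time and everytime estimates of Example 4.1 can be computed with a few lines of code. The following sketch (our own, not part of the original text) reproduces the quantities (H11 + H31)/2 and (H11 + H12 + H31 + H32)/4.

GAMMA = 0.9
R = {"A": 0.0, "B": 2.0, "C": 3.0}
episodes = ["CBCBBA", "BBA", "BCCBA"]

def disc_return(ep, start):
    """Discounted return accumulated from position `start` to the end of the episode."""
    return sum(GAMMA ** i * R[s] for i, s in enumerate(ep[start:]))

first_visit, every_visit = [], []
for ep in episodes:
    hits = [t for t, s in enumerate(ep) if s == "C"]
    if hits:
        first_visit.append(disc_return(ep, hits[0]))
        every_visit.extend(disc_return(ep, t) for t in hits)

print(sum(first_visit) / len(first_visit))    # (H11 + H31)/2, the first-time estimate
print(sum(every_visit) / len(every_visit))    # (H11 + H12 + H31 + H32)/4, the everytime estimate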
The convergence of the first-time estimate to the true value function depends on the following fact:
In a Markov process with absorbing states, each episode starting from a specified state xi is statistically
independent of every other episode. Thus every first-time estimate for a value V (xi ) can be thought of as
providing an independent sample. By averaging these first-time estimates, it is possible to form an estimate
of V (xi ). Therefore, if the number of episodes in which the state of interest xi occurs approaches infinity,
1 Strictly speaking we should use the notation τ1 , T1 etc., but we do not do this in the interests of clarity.

it can be stated that the first-time estimate converges to the true value V (xi ). Moreover, one can invoke
Hoeffding’s inequality of Theorem 8.13 to generate a bound on just how reliable this estimate is, in terms of
accuracy and confidence. The everytime estimates are not statistically independent. Therefore the analysis
of their convergence is more delicate. It is shown in [32] that even the everytime estimate converges to
the true value V (xi ), if the number of episodes in which the state of interest xi occurs approaches infinity.
However, unlike the first-time estimate, the everytime estimate is biased, and has higher variance.

Example 4.2. This example, taken from [32], illustrates the bias of the everytime estimate. Consider a Markov process with just two states, called S (for “state”) and A (for “absorbing”) respectively. Suppose the state transition matrix is, with the rows and columns ordered as (S, A),

        S       A
S     1 − p     p
A       0       1
So all trajectories starting in S look like SS · · · SA, where there are say l occurrences of S. If we were to
attach a reward R(S) = 1, and set the discount factor γ to 1, then the first-time estimate for V (S) would be
l, the length of the sample path before hitting A. Moreover, the analysis of hitting times given in Theorem
8.12 shows that the average length of this sample path is 1/p (which may not be an integer, but which is
the correct answer). On the other hand, there are l everytime estimates of V for such a trajectory, and their
sum is l(l + 1)/2. So the everytime estimate is (l + 1)/2 for a trajectory that consists of l occurrences of S
followed by A. Hence the expected value of the everytime estimate is ((1/p) + 1)/2, which is erroneous by a
factor of 2 if p is very small.
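A quick simulation (our own sketch, with p = 0.1 as an illustrative value) confirms the bias identified in Example 4.2: the first-time estimate averages to about 1/p, while the everytime estimate averages to about (1/p + 1)/2.

import numpy as np

rng = np.random.default_rng(2)
p = 0.1
first, every = [], []
for _ in range(50_000):
    l = rng.geometric(p)            # number of visits to S before absorption
    first.append(l)                 # the single first-time return equals l
    every.append((l + 1) / 2)       # average of the l everytime returns l, l-1, ..., 1
print(np.mean(first))               # approximately 1/p = 10, the true value V(S)
print(np.mean(every))               # approximately (1/p + 1)/2 = 5.5, i.e. biased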

The Monte Carlo method for estimating the value of a policy suffers from many drawbacks. In the case
of first-time estimates, the total number of samples of a particular value V (xi ) equals the total number
of episodes that contain the state xi . However, each episode can be quite long. Thus the total number
of time steps elapsed can be far larger than the number of episodes. This makes the estimation process
very slow, with a long observation record yielding only a small number of samples. Also, because each sample is a return accumulated over a possibly long episode, the variance of the estimate can be very high. Another requirement is that,
in order to be able to form an estimate for V (xi ) for a particular state xi , that state must occur in a large
fraction of episodes. In turn this requires that when the Markov process with state transition matrix Aπ
is started from a randomly selected initial state, the resulting sample path must pass through the state xi
with high probability before hitting an absorbing state. Moreover, the method is based on the assumption
that, once the Markov process reaches an absorbing state, it can be restarted with a specified probability
distribution for the initial state. This assumption often does not hold in practice. Finally, note that “partial
episodes,” that is to say, trajectories of the Markov process that do not terminate in an absorbing state,
are of no use in forming an estimate of V (xi ). Everytime estimates provide a larger number of samples for
V (xi ) than first-time estimates. This is because every episode provides only one first-time estimate, but can
provide multiple everytime estimates. The difficulty is that these everytime estimates are not statistically
independent. Moreover, all of the above comments regarding the drawbacks of first-time estimates apply
also to everytime estimates.

4.1.2 MC Methods for Estimating the Action-Value Function


The same Monte Carlo approach can also be used to construct estimates of the action-value function Q(xi , uk )
instead of the value function V (xi ). The idea is the same as before: We follow several complete episodes, and
in each episode, we keep track of how many times a particular pair (xi , uk ) occurs. Note that in estimating
V (xi ), we just keep track of how many times a particular state xi occurs. This raises a very specific issue.
In the case of estimating the value function, it may be justifiable to assume that every state xi ∈ X occurs
in sufficiently many sample paths to permit a reasonable estimation of Vπ (xi ). On the other hand, when it
comes to estimating an action-value function, the only pairs (xi , uk ) that occur will be (xi , π(xi )). Therefore,
if π ∈ Πd , i.e., is a deterministic policy, then the vast majority of state-action pairs will not occur. One way

to overcome this problem is to restrict π to be a probabilistic policy, with the additional requirement that
each state-action pair (xi , uk ) has a positive probability under π. To put it another way, for each xi ∈ X ,
the probability distribution π(xi , ·) on U has all positive elements. Even in this case however, given that
the number of state-action pairs is much larger than the number of states alone, the number of episodes
required for reliably estimating the Q-function would be far larger than the number of episodes required
for reliably estimating the V -function. Another possible approach is to use “off-policy” sampling, that is,
generate samples using a policy that is different from the policy that we wish to evaluate. In this case,
the policy π for which we wish to estimate Qπ is called the “target” policy, which can be deterministic,
and/or have several missing pairs (xi , uk ). In contrast, the policy φ that is used to generate the samples is
called the “behavior” policy. One adjusts for the fact that φ may be different from π using a method called
“importance sampling,” which is described next.
Suppose we wish to estimate vπ for a policy π, but all we have is a sample path under another policy φ.
Let the sequence of observations be {(X_t, U_t, W_{t+1})}_{t=0}^{T}. It is assumed that the policy φ is probabilistic and
“dominates” π. That is,
Pr{π(xi ) = uk } > 0 =⇒ Pr{φ(xi ) = uk } > 0. (4.3)
The easiest way to satisfy (4.3) is to choose φ ∈ Πp (a probabilistic policy) such that the right side of the
equation is always positive, that is
Pr{φ(xi ) = uk } > 0 ∀xi ∈ X , uk ∈ U.
If we had a sample path under the policy π, then we could estimate Vπ (xi ) for each fixed state xi ∈ X
as follows: Take all episodes that contain xi , and discard the initial part of the episode that happens before
the first occurrence of xi . For example, suppose a Markov process has the state space {B, C, D, A} where A
is an absorbing state, and B is the state of interest. Suppose there are three episodes, namely
E1 = CBDBCDA, E2 = CDDCA, E3 = BCDBCA.
Then we ignore E2 and discard the first part of E1 . This gives
Ē1 = BDBCDA, Ē2 = BCDBCA.
Now back to the discussion. After discarding sample paths that do not contain xi and the initial parts of
the sample paths that do contain xi , we concatenate them while keeping markers of where one sample path
ends and the next one begins. Let J(x_i) denote the set of distinct time instants at which X_t = x_i. For first-time estimates, we include only the instants at which a concatenated sample path starts with x_i, whereas for everytime estimates, we include all instants at which X_t = x_i. For instance, in the example above, with x_i = B, for first-time estimates we have J(B) = {1, 7}, while for everytime estimates, we have J(B) = {1, 3, 7, 10}.
In either case, we compute

V̂_π(x_i) = (1/|J(x_i)|) Σ_{t∈J(x_i)} G_t,   (4.4)

where

G_t = Σ_{l=0}^{T−t} γ^l W_{t+l}

is the discounted reward.
Now we describe importance sampling. In case π is a probabilistic policy, the likelihood of the state-action sequence is

P_π := Pr{U_t, X_{t+1}, U_{t+1}, · · · , X_T | X_t ; U_t^{T−1} ∼ π},

where U_t^{T−1} ∼ π means that Pr{U_τ | X_τ} has the distribution π(X_τ, ·), for t ≤ τ ≤ T − 1. This quantity can be expressed as

P_π = Π_{τ=t}^{T−1} π(U_τ | X_τ) Pr{X_{τ+1} | X_τ, U_τ}.

However, because the sample path is generated using the policy φ, what we can actually measure is

P_φ = Π_{τ=t}^{T−1} φ(U_τ | X_τ) Pr{X_{τ+1} | X_τ, U_τ}.

Now note that there is a simple formula for the ratio of the two likelihoods. Specifically,

ρ_{[t,T−1]} := P_π / P_φ = Π_{τ=t}^{T−1} π(U_τ | X_τ) / φ(U_τ | X_τ).   (4.5)

In other words, the unknown transition probabilities Pr{ Xτ +1 |Xτ , Uτ } simply cancel out. Therefore the
quantity ρ[t,T −1] can be computed because π, φ are known policies, and Uτ , Xτ can be observed.
With this background, we can modify the estimate in (4.4) in one of two possible ways. The estimate
V̂_π(x_i) = (1/|J(x_i)|) Σ_{t∈J(x_i)} ρ_{[t,T−1]} G_t   (4.6)

is called the “ordinary importance sampling” estimate, whereas

V̂_π(x_i) = [ Σ_{t∈J(x_i)} ρ_{[t,T−1]} G_t ] / [ Σ_{t∈J(x_i)} ρ_{[t,T−1]} ]   (4.7)

is called the “weighted importance sampling” estimate. In each case, the term ρ_{[t,T−1]} compensates for off-policy
sampling. The ordinary one is unbiased but can have very large variance, whereas the weighted one is biased
but consistent, and has lower variance. For further details, see [33, p. 105].
This approach can also be used to estimate Qπ on the basis of a sample path run on another policy φ;
see [33, p. 110].
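The following sketch (our own; the policies and numerical values are hypothetical) computes the ratio (4.5) and the two estimators (4.6) and (4.7) from a list of (ratio, return) pairs.

def importance_ratio(states, actions, pi, phi):
    """Compute rho_[t,T-1] of (4.5) along one (partial) trajectory."""
    rho = 1.0
    for x, u in zip(states, actions):
        rho *= pi[x][u] / phi[x][u]
    return rho

def ordinary_is(samples):
    """Ordinary importance sampling estimate (4.6)."""
    return sum(r * g for r, g in samples) / len(samples)

def weighted_is(samples):
    """Weighted importance sampling estimate (4.7)."""
    den = sum(r for r, _ in samples)
    return sum(r * g for r, g in samples) / den if den > 0 else 0.0

# Hypothetical two-action example: the target policy pi concentrates on u1 in state B,
# while the behaviour policy phi explores both actions uniformly.
pi = {"B": {"u1": 0.9, "u2": 0.1}, "C": {"u1": 0.5, "u2": 0.5}}
phi = {"B": {"u1": 0.5, "u2": 0.5}, "C": {"u1": 0.5, "u2": 0.5}}
rho = importance_ratio(["B", "C"], ["u1", "u2"], pi, phi)
samples = [(rho, 4.0), (1.0, 7.0)]          # (ratio, discounted return) per visit
print(ordinary_is(samples), weighted_is(samples))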

4.1.3 MC Methods for Greedy Policy Optimization


Assume that the rather restrictive conditions required to estimate the action-value function for a specific
policy π are satisfied, so that we have an estimate for Qπ (xi , uk ) for each state-action pair (xi , uk ). In this
subsection we show how such an estimate can be used to construct a “greedy” policy improvement procedure.
This material is taken from [33, Sections 5.3 and 5.4]. Before presenting it, we state and prove the “policy
improvement theorem,” from [33, Section 4.2].
Recall the definition of the action-value function Qπ from (2.33), and its equivalent characterization in
(2.36), namely
Qπ (xi , uk ) = E[R(Xt , Ut ) + γVπ (Xt+1 )|Xt = xi , Ut = uk ]. (4.8)
Theorem 4.1. (Policy Improvement Theorem) Suppose π, φ ∈ Πd , and moreover

Qπ (xi , φ(xi )) ≥ Qπ (xi , π(xi )) = Vπ (xi ), ∀xi ∈ X . (4.9)

Then
Vφ (xi ) ≥ Vπ (xi ), ∀xi ∈ X . (4.10)
Moreover, suppose there is a state xi ∈ X such that (4.9) holds with strict inequality, that is

Qπ (xi , φ(xi )) > Qπ (xi , π(xi )).

Then there is a state xj ∈ X such that (4.10) holds with strict inequality, that is,

Vφ (xj ) > Vπ (xj ).



Proof. We reason as follows:

V_π(x_i) = Q_π(x_i, π(x_i)) ≤ Q_π(x_i, φ(x_i))
= E[R(X_t, U_t) + γ V_π(X_{t+1}) | X_t = x_i, U_t = φ(x_i)]
= E_φ[R(X_t, U_t) + γ Q_π(X_{t+1}, π(X_{t+1})) | X_t = x_i, U_t = φ(x_i)]
≤ E_φ[R(X_t, U_t) + γ Q_π(X_{t+1}, φ(X_{t+1})) | X_t = x_i, U_t = φ(x_i)].   (4.11)

Now look at the second term on the right side, without the γ. This equals

E_φ[Q_π(X_{t+1}, φ(X_{t+1})) | X_t = x_i] = E_φ[R(X_{t+1}, U_{t+1}) | X_t = x_i]
+ γ E_φ{ E_φ[ V_π(X_{t+2}) | X_{t+1}, U_{t+1} = φ(X_{t+1}) ] | X_t = x_i }.

Noting that
Vπ (Xt+2 ) = Qπ (Xt+2 , π(Xt+2 )) ≤ Qπ (Xt+2 , φ(Xt+2 )),
we can repeat the above reasoning. This gives, for every integer l
" l−1 #
X
Vπ (xi ) ≤ Eφ γ i R(Xt+i )|Xt = xi + γ l Eφ [Qπ (Xt+l , φ(Xt+l ))|Xt = xi ].
i=0

As l → ∞, the second term approaches zero, while the first term approaches Vφ (xi ). Hence (4.10) follows.
The statements about strict inequalities are easy to prove and are left as an exercise.
The policy improvement theorem suggests the following “greedy” approach to finding an optimal policy.
Suppose that we have an initial policy and corresponding action-value function Qπ . We can define a new
“greedy” policy via
k^∗ := arg max_{u_k∈U} Q_π(x_i, u_k), φ(x_i) = u_{k^∗}, ∀x_i ∈ X.   (4.12)

Then (4.9) holds by construction, and it follows from Theorem 4.1 that V_φ(x_i) ≥ V_π(x_i) for all x_i ∈ X, or equivalently, v_φ ≥ v_π. Now we can compute the action-value function corresponding to φ and repeat. A consequence of Theorem 4.1 is that, unless equality holds for all x_i ∈ X in (4.9), at least one component of v_φ exceeds the corresponding component of v_π. Moreover, it is easy to show that if the above greedy policy update terminates with (4.9) holding with equality for all x_i ∈ X, then not only is v_φ = v_π, but both π and φ are optimal policies. Another noteworthy point is that the incremental update in (4.12) can be implemented for just one index i; in other words, the update can be done asynchronously.
Now we show how to use Theorem 4.1 to construct “ε-greedy” policies. The update rule (4.12) can initially perform poorly when the current guess π is far from being optimal. Moreover, in reinforcement learning, there is always a trade-off between exploration and exploitation. One way to achieve both exploration and exploitation simultaneously is to use probabilistic policies, in which, for a given state x_i, every action is applied with positive probability.
Suppose π ∈ Πp is a probabilistic policy. Then we use the notation

π(uk |xi ) := Pr{Ut = uk |Xt = xi }.

A policy π ∈ Π_p is said to be ε-soft if

π(u_k | x_i) ≥ ε/|U|, ∀u_k ∈ U, x_i ∈ X.

Suppose π ∈ Π_p is the current ε-soft policy. We can generate an updated ε-soft policy φ as follows: Define a deterministic policy ψ ∈ Π_d by

k^∗ := arg max_{u_k∈U} Q_π(x_i, u_k), ψ(x_i) = u_{k^∗}, ∀x_i ∈ X.

Now define the ε-soft policy φ ∈ Π_p by

φ(u_k | x_i) = ε/|U| if k ≠ k^∗, and φ(u_k | x_i) = ε/|U| + (1 − ε) if k = k^∗.   (4.13)

Theorem 4.2. With φ defined as in (4.13), the inequality (4.9) holds.

Proof. Note that, for each fixed x_i ∈ X, we have

Q_π(x_i, φ(x_i)) = Σ_{u_k∈U} φ(u_k | x_i) Q_π(x_i, u_k)
= (ε/|U|) Σ_{u_k∈U} Q_π(x_i, u_k) + (1 − ε) max_{u_k∈U} Q_π(x_i, u_k)
≥ (ε/|U|) Σ_{u_k∈U} Q_π(x_i, u_k) + (1 − ε) Σ_{u_k∈U} [ (π(u_k | x_i) − ε/|U|) / (1 − ε) ] Q_π(x_i, u_k)
= Σ_{u_k∈U} π(u_k | x_i) Q_π(x_i, u_k),

which is (4.9). To justify the inequality above, we reason as follows: Define nonnegative constants λ_k that add up to one by

λ_k := (1/(1 − ε)) [ π(u_k | x_i) − ε/|U| ].

Then

Σ_{u_k∈U} λ_k Q_π(x_i, u_k) ≤ max_{u_k∈U} Q_π(x_i, u_k).

Because (4.9) holds, we can apply the ε-greedy updating rule to improve the policy.
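The ε-soft improvement step (4.13) is straightforward to implement. The following sketch (our own, with a hypothetical array of action-value estimates) returns the new policy φ as a matrix of probabilities.

import numpy as np

def epsilon_greedy_policy(Q, eps):
    """Return phi(u_k | x_i) as an |X| x |U| array, following (4.13)."""
    n, m = Q.shape
    phi = np.full((n, m), eps / m)               # every action gets at least eps/|U|
    phi[np.arange(n), Q.argmax(axis=1)] += 1.0 - eps
    return phi

Q = np.array([[1.0, 0.5, 0.2],
              [0.1, 0.9, 0.3]])                  # hypothetical action-value estimates
print(epsilon_greedy_policy(Q, eps=0.2))         # each row sums to one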

The import of Theorem 4.2 is that the above ε-greedy updating rule will eventually converge to the optimal ε-soft policy. The ε-greedy policy and updating rule can be combined with a “schedule” for reducing ε to zero, which would presumably converge to the optimal policy.

4.2 Temporal Difference Methods


4.2.1 Basic Temporal Difference Method
Temporal difference (TD) methods are another way to approximate the value vπ corresponding to a specific
policy π. Unlike Monte Carlo methods that wait until an entire episode is completed before computing
an estimate for vπ , Temporal Difference (TD) methods update various estimates at each time step. Since
each estimate depends on an earlier estimate, this is known as “bootstrapping.” As before it is assumed
that a particular policy π has been chosen and implemented, and that a sample path {(Xt , Ut , Wt+1 )}t≥0 is
observed. So hereafter we drop the subscript π on V .
Suppose now that xi is the state of interest, and that Xt = xi . There are two equivalent formulas for the
value function associated with a state x_i, one “explicit” and the other recursive. The explicit formula is

V(x_i) = E[ Σ_{τ=0}^{∞} γ^τ R_{t+τ+1} | X_t = x_i ],   (4.14)

where
Rt+τ +1 = R(Xt+τ , Ut+τ )

is the reward, but paid at time t + τ + 1. If the reward is random, then the expectation includes this
randomness also. Now (4.14) can be expanded as

V(x_i) = E[R_{t+1} | X_t = x_i] + E[ Σ_{τ=1}^{∞} γ^τ R_{t+τ+1} | X_t = x_i ].

This can now be rewritten as the recursion (cf. (2.32))

V(x_i) = E[ R_{t+1} + γ V(X_{t+1}) | X_t = x_i ].   (4.15)

With this background, we can think of Monte Carlo simulation as a way to approximate V(x_i) using the formula (4.14). Specifically, suppose an episode passes through the state of interest x_i at time t and terminates in an absorbing state at time T. Then the quantity

V̂(x_i) = Σ_{τ=0}^{T−t} γ^τ R_{t+τ+1}

provides an approximation to V(x_i).


Temporal Difference (TD) learning takes a different approach. One can express (4.15) as

V(X_t) ∼ R_{t+1} + γ V(X_{t+1}),   (4.16)

where the symbol ∼ signifies that the random variables on both sides have the same conditional expectation given X_t. So, if v̂ ∈ R^n is a current guess for the value vector at time t, then we can take V̂(X_t) as a proxy for V(X_t), W_{t+1} as a proxy for R_{t+1}, and V̂(X_{t+1}) as a proxy for V(X_{t+1}). Therefore the error

δ_{t+1} = W_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t)   (4.17)
gives a measure of how erroneous the estimate V̂(x_i) is; this quantity is often referred to as the TD (or “Bellman”) error, since it measures the extent to which the current estimate fails to satisfy the recursion (4.15). However, it does not tell us how far off the estimates V̂(x_j), x_j ≠ x_i, are. So we choose a predetermined sequence of step sizes {α_t}, and adjust the estimate v̂ as follows:

V̂_{t+1}(x_j) = V̂_t(x_j) + α_t δ_{t+1} if X_t = x_j, and V̂_{t+1}(x_j) = V̂_t(x_j) otherwise.   (4.18)

Note that only the component of v̂ corresponding to the current state at time t is adjusted, while the
remainder are left unaltered.
Note that if v̂t = v, the true value vector, then

E[δt+1 |Xt = xi ] = E[Wt+1 + γ V̂ (Xt+1 ) − V̂ (Xt )|Xt = xi ] = 0. (4.19)

Thus δt+1 is the “temporal difference” at time t + 1 between what we expect to see and what we actually
see. If that difference is not zero, we correct the corresponding component of V̂ by moving in that direction.
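The following sketch (our own; the sample path and the step-size choice α_t = 1/(t + 1) are illustrative) implements the TD(0) update (4.17)–(4.18) along a single observed trajectory.

import numpy as np

def td0(path, n, gamma):
    V = np.zeros(n)                              # current estimate of the value vector
    for t, (x, w, x_next) in enumerate(path):
        alpha = 1.0 / (t + 1)                    # illustrative Robbins-Monro step size
        delta = w + gamma * V[x_next] - V[x]     # temporal difference (4.17)
        V[x] += alpha * delta                    # update only the component for X_t (4.18)
    return V

# Hypothetical sample path of (X_t, W_{t+1}, X_{t+1}) triples over two states 0 and 1.
path = [(0, 1.0, 1), (1, 0.0, 0), (0, 1.0, 1), (1, 0.0, 0)]
print(td0(path, n=2, gamma=0.9))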

Theorem 4.3. Suppose

Σ_{t=0}^{∞} α_t = ∞, Σ_{t=0}^{∞} α_t² < ∞.   (4.20)

Then the estimated value vector v̂_t converges to the true value vector v almost surely.

Proof. (See [35, Section 3.1.1].) To analyze the behavior of TD learning, define the map F : Rn → Rn as

F y := r + γAy − y = r + (γA − In )y, (4.21)



where A is the state transition matrix of the unknown Markov process. Then (4.15) can be rewritten in
vector notation as
v = r + γAv,
where v is the unknown value vector. With this reformulation, it can be seen that TD learning is just
asynchronous stochastic approximation with the function f : Rn → Rn equal to F . The conditions for the
asynchronous version of stochastic approximation to converge are broadly similar to those in Theorems 3.8–3.10.
Specifically, it suffices to establish that the differential equation

ẏ = F y = r + (γA − In )y (4.22)

is globally asymptotically stable around the equilibrium y = v, the true value function, and moreover, there
is a Lyapunov function that satisfies the conditions (3.55) and (3.56). (Note that the V in those equations is
the Lyapunov function, and not the value function.) But this is immediate. Note that the equation (4.22) is
linear. Moreover, because ρ(γA) = γ < 1, all eigenvalues of γA − In have negative real parts. So the global
asymptotic stability of this equilibrium can be established using a quadratic Lyapunov function, so that (3.55) and (3.56) are automatically satisfied. Therefore v̂_t → v as t → ∞ almost surely.
Note that the recursion formula for TD learning does not require the trajectory to be an episode. There-
fore, for a given sample path of length T , the TD updates operate T times, whereas MC updates operate
only as many times as the number of episodes contained in the sample path. Indeed, aside from the fact that
v̂ is updated at every time instant, this is one more advantage, namely, that partial episodes are also useful.
Moreover, TD learning can also (apparently) be used with Markov processes that do not have an absorbing
state. However, in case the Markov process does have an absorbing state and the sample path corresponds
to an episode, it is possible to derive a useful formula that can be used to relate Monte Carlo simulation to
TD learning. Specifically, the Monte Carlo return over an episode can be expressed as a sum of temporal
differences. Define, as before
G_t := Σ_{τ=0}^{T−t} γ^τ W_{t+τ+1}   (4.23)

to be the total return over an episode starting at time t and ending at T . Then Gt satisfies the recursion

Gt = Wt+1 + γGt+1 .

So we can write

G_t − V̂(X_t) = W_{t+1} + γ G_{t+1} − V̂(X_t)
= W_{t+1} + γ G_{t+1} − V̂(X_t) − γ V̂(X_{t+1}) + γ V̂(X_{t+1})
= δ_{t+1} + γ (G_{t+1} − V̂(X_{t+1})).   (4.24)

Now repeat until the end of the episode, when both G_{T+1} and V̂(X_{T+1}) are zero (because X_T is an absorbing state). This leads to

G_t − V̂(X_t) = Σ_{τ=0}^{T−t} γ^τ δ_{t+τ+1}.   (4.25)

In TD learning, it is not necessary to look only one step ahead. It is possible to derive an “l-step look-ahead” TD predictor. Note that if we define

G_t^{t+l} := Σ_{τ=0}^{l−1} γ^τ R_{t+τ+1},   (4.26)

then with l = 1 we get

G_t^{t+1} = R_{t+1}.

(Note that the one-step look-ahead predictor corresponds to l = 1.) For general l ≥ 1, the analog of (4.16) is

V(X_t) ∼ G_t^{t+l} + γ^l V(X_{t+l}).   (4.27)

So, given an estimated value vector v̂, we observe that if v̂ = v, the true value vector, then

E[ G_t^{t+l} + γ^l V̂(X_{t+l}) − V̂(X_t) | X_t = x_i ] = 0.   (4.28)

Let us define

δ_{t+1}^{t+l+1} := G_t^{t+l} + γ^l V̂(X_{t+l}) − V̂(X_t).   (4.29)

The superscript is chosen so that, when l = 1, δ_{t+1}^{t+2} coincides with δ_{t+1} as defined in (4.17). In view of (4.28), we can think of δ_{t+1}^{t+l+1} as an l-step temporal difference. The analogous updating rule for this l-step TD learning is

V̂_{t+1}(x_j) = V̂_t(x_j) + α_t δ_{t+1}^{t+l+1} if X_t = x_j, and V̂_{t+1}(x_j) = V̂_t(x_j) otherwise.   (4.30)

The limiting behavior of first-time Monte Carlo and TD estimates is quite different. Suppose that the duration of the data is T, and that it consists of K episodes, of durations T_1, · · · , T_K respectively. For each episode k, form the first-time estimate G_k(x_i) for each state x_i ∈ X. If the state x_i does not occur in episode k, then this term is set equal to zero, which is functionally equivalent to omitting it. Then

V̂_MC(x_i) → arg min_{V(x_i)} Σ_{k=1}^{K} (G_k(x_i) − V(x_i))².   (4.31)

In contrast, the TD estimate converges to the value estimate associated with the maximum likelihood model of the Markov process, in which the transition probabilities are estimated by their empirical frequencies,

â_{ij}^{u_k} = (1 / n(x_i, u_k)) Σ_{t=1}^{T} I_{(X_t, X_{t+1}, U_t) = (x_i, x_j, u_k)},   (4.32)

where n(x_i, u_k) denotes the number of times the pair (x_i, u_k) occurs in the data. As a result, TD methods converge more quickly than Monte Carlo methods. These statements are made in [33, p. 128].

4.2.2 SARSA: On-Policy Control


In this section we introduce SARSA which is an on-policy method for estimating the action-value function
for a given policy. Then we show how the policy itself can be improved upon in an iterative fashion. Note
that SARSA stands for State, Action, Reward, State, Action.
We begin with the observation that if a policy π is chosen, whether π ∈ Πd or π ∈ Πp , the resulting
process {Xt } is Markovian. Equally, the joint state-action process {(Xt , Ut )} is also Markovian. So, if our
aim is to estimate the action-value function Qπ : X × U → R, then Qπ can be viewed as a reward for this
joint Markov process. Recall from Theorem 2.5 that if π ∈ Πd , then Qπ satisfies the recursion (2.34), namely
Q_π(x_i, u_k) = R(x_i, u_k) + γ Σ_{j=1}^{n} a_{ij}^{u_k} Q_π(x_j, π(x_j)).

However, for the case of probabilistic policies, it is advantageous to write the recursion as

Qπ (xi , uk ) = R(xi , uk ) + γE[Qπ (Xt+1 , π(Xt+1 ))|(Xt , Ut ) = (xi , uk )]. (4.33)

Because this is a recursion, it is amenable to learning via a temporal difference method.



Given a sample path {(Xt , Ut , Wt+1 )}t≥0 , start with some initial guess Q̂0 for the action-value function.
Observe that, if at time t, the estimate Q̂t,π were to be correct, then

E[Rt+1 + γ Q̂t,π (Xt+1 , π(Xt+1 )) − Q̂t,π (Xt , Ut )|Xt , Ut ] = 0.

Therefore the term


θt+1 := Wt+1 + γ Q̂t,π (Xt+1 , Ut+1 ) − Q̂t,π (Xt , Ut ) (4.34)
can serve as a temporal difference. Hence we can update the guess Q̂ as

Q̂_{t+1,π}(x_i, u_k) = Q̂_{t,π}(x_i, u_k) + α_{t+1} θ_{t+1} if (x_i, u_k) = (X_t, U_t), and Q̂_{t+1,π}(x_i, u_k) = Q̂_{t,π}(x_i, u_k) otherwise.   (4.35)

It can be surmised that, if {α_t} approaches zero as per the conditions (4.20), then Q̂_{t,π} converges almost
surely to Q_π. However, this requires that every possible pair (x_i, u_k) ∈ X × U be visited infinitely often
by the sample path, which is impossible if π is a deterministic policy. Hence, in order to apply the above
method to a deterministic policy, one should use an ε-soft version of π.
All this produces the action-value function for one fixed policy. However, it is possible to combine this with
an ε-greedy improvement approach to converge to an optimal policy. Specifically, we can update the policy
π using an ε-greedy approach on the current policy, as before: Define

k^* = arg max_{u_k ∈ U} Q̂_π(x_i, u_k),    ψ(x_i) = u_{k^*},

φ(u_k | x_i) = ε/|U|  if k ≠ k^*,    φ(u_k | x_i) = ε/|U| + (1 − ε)  if k = k^*,

where all policies are functions of t, but the dependence is not explicitly displayed in the interests of clarity.
Now we can reduce ε to zero over time. Then perhaps φ → π^* and Q̂ → Q^* as t → ∞. However, this is just
a hope. The Q-learning approach given next turns this hope into a reality by adopting a slightly different
approach to updating Q.
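
As an illustration, here is a minimal sketch of one SARSA step together with an ε-soft action selection. The tabular array Q_hat and the function names are hypothetical, and the greedy action is taken to be the one maximizing the current estimate, on the convention that larger estimated action values are preferred.

    import numpy as np

    def epsilon_soft_action(Q_hat, x, eps, rng):
        """Sample an action from an eps-soft distribution built from Q_hat(x, .)."""
        m = Q_hat.shape[1]
        probs = np.full(m, eps / m)
        probs[np.argmax(Q_hat[x])] += 1.0 - eps
        return rng.choice(m, p=probs)

    def sarsa_step(Q_hat, x, u, w, x_next, u_next, gamma, alpha):
        """One SARSA update of the action-value estimate, in the spirit of (4.34)-(4.35)."""
        theta = w + gamma * Q_hat[x_next, u_next] - Q_hat[x, u]   # temporal difference
        Q_hat[x, u] += alpha * theta
        return Q_hat
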

4.2.3 Q-Learning
A substantial advance in RL came in the paper [47]. In this paper, the authors propose an iterative scheme
for learning the optimal action-value function Q∗ (xi , uk ), by starting with an arbitrary initial guess, and
then updating the guess at each time instant.
Recall the following definitions and facts from Section 2.2. The action-value function is defined as
follows, for a given policy π (cf. (2.33)):
"∞ #
X
Qπ (xi , uk ) := Eπ γ t Rπ (Xt )|X0 = xi , U0 = uk .
t=0

As shown in (2.34), the function Qπ satisfies the recursion


Q_π(x_i, u_k) = R(x_i, u_k) + γ Σ_{j=1}^{n} a_{ij}^{u_k} Q_π(x_j, π(x_j)).

If the reward R is a random function of X_t, U_t, then the first term on the right side would be the expected
value of the reward R(X_t, U_t). The optimal action-value function Q^* satisfies the relationships
Q^*(x_i, u_k) = R(x_i, u_k) + γ Σ_{j=1}^{n} a_{ij}^{u_k} max_{w_l ∈ U} Q^*(x_j, w_l),

V^*(x_i) = max_{u_k ∈ U} Q^*(x_i, u_k).

The recursion relationship for Q∗ suggests the following iterative scheme for estimating this function.
Start with some arbitrary function Q0 : X × U → R. Choose a sequence of positive step sizes {αt }t≥0
satisfying
Σ_{t=0}^{∞} α_t = ∞,    Σ_{t=0}^{∞} α_t^2 < ∞.                                (4.36)

As the time series {(Xt , Ut , Wt+1 )}t≥0 is observed, update function Qt as follows:

Q_{t+1}(x_j, w_l) = Q_t(x_j, w_l) + α_{t+1}[W_{t+1} + γ V_t(x_j) − Q_t(x_j, w_l)]  if (X_{t+1}, U_{t+1}) = (x_j, w_l),    Q_{t+1}(x_j, w_l) = Q_t(x_j, w_l)  otherwise,       (4.37)

where

V_t(x_j) = max_{u_l ∈ U} Q_t(x_j, u_l).                                        (4.38)

In other words, if (X_{t+1}, U_{t+1}) = (x_j, w_l), then the corresponding estimate Q_{t+1}(x_j, w_l) is updated, while the
estimates Q_t(x_i, u_k) for all other pairs (x_i, u_k) ≠ (x_j, w_l) are left unchanged.
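
A minimal sketch of one tabular update in the spirit of (4.37)-(4.38) is given below, assuming (as in the standard form of Q-learning) that the entry being corrected is the one for the state-action pair observed at time t; all names are hypothetical.

    import numpy as np

    def q_learning_step(Q, x, u, w, x_next, gamma, alpha):
        """One tabular Q-learning update (a minimal sketch).

        Q       : (n, m) array of current estimates Q_t(x_i, u_k)
        (x, u)  : observed state-action pair at time t
        w       : observed reward W_{t+1}
        x_next  : observed next state X_{t+1}
        """
        V_next = Q[x_next].max()                      # V_t at the next state, as in (4.38)
        Q[x, u] += alpha * (w + gamma * V_next - Q[x, u])
        return Q
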
The convergence of the above scheme is analyzed in the next theorem.
Theorem 4.4. (See [47, p. 282].) If there exists a finite constant RM such that |R(Xt , Ut )| ≤ RM , then

Qt (xi , uk ) → Q∗ (xi , uk ) as t → ∞, ∀xi ∈ X , uk ∈ U, w.p. 1. (4.39)

One of the main advantages of Q-learning over learning the value function V_π is the following: If (X_t, U_t) =
(x_i, u_k), then the quantity

R(x_i, u_k) + γ max_{w_l ∈ U} Q(X_{t+1}, w_l)

is an unbiased sample of the quantity

R(x_i, u_k) + γ Σ_{j ∈ [n]} a_{ij}^{u_k} max_{w_l ∈ U} Q(x_j, w_l).

In contrast, the quantity

max_{u_k ∈ U} [R(x_i, u_k) + γ V(X_{t+1})]

is not an unbiased sample of the quantity

max_{u_k ∈ U} [ R(x_i, u_k) + γ Σ_{j ∈ [n]} a_{ij}^{u_k} v_j ].

Problem 4.1. The objective of this problem is to analyze the l-step look-ahead Temporal Difference Learn-
ing rule proposed in (4.30). Suppose we have a Markov decision process with a prespecified policy π (deter-
ministic, or probabilistic, it does not matter).
• Start with (2.32) for the value vector associated with the policy π, namely

v = r + γ A v,

where we have omitted π as it is anyway fixed. Show that the following l-step look-ahead version of
the above equation is also true:

v = ( Σ_{i=0}^{l−1} γ^i A^i ) r + γ^l A^l v.

• Using this fact, show that (4.27) is true.

• Show that (4.30) is a stochastic approximation approach to solving the above equation.

• Hence establish that if the conditions in (4.20) hold, then the l-step look-ahead TD learning converges
almost surely to the true value function.

Hint: If A is a row-stochastic matrix, γ < 1, and l is an integer, show that all eigenvalues of the matrix

Σ_{i=0}^{l} γ^i A^i − I

have negative real parts.

Problem 4.2. Generate a 20 × 20 row-stochastic matrix A, and a 20 × 1 reward vector r, in any manner
you wish.
• Compute the actual value function for the A and r you have generated, using (2.32).

• Generate a sample path of length 200 time steps starting from some arbitrary initial state. Compute
the associated rewards as a part of the sample path. (Note that the action is not required, as the
policy remains constant.)

• Choose the discount factor γ = 0.95 and the look-ahead step length l = 5 or l = 10. Using the sample
path {(X_t, W_{t+1})} generated above, estimate the value function using conventional TD learning, and
l-step TD learning with l = 5 and l = 10. Plot the Euclidean norm of the error (difference between
the true and estimated value vectors) as a function of the iteration number, for each of these three
approaches. What conclusions do you draw, if any? A sketch of one possible implementation is given below.
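
The following sketch (not part of the problem statement) shows one possible way to set up this experiment, assuming the convention that the reward W_{t+1} = r(X_t) is paid for the state occupied at time t and using a constant step size; the error history could be plotted with any plotting library instead of being printed.

    import numpy as np

    rng = np.random.default_rng(0)
    n, gamma, T = 20, 0.95, 200

    # Random 20 x 20 row-stochastic matrix and 20 x 1 reward vector
    A = rng.random((n, n)); A /= A.sum(axis=1, keepdims=True)
    r = rng.random(n)

    # True value vector from (2.32): v = (I - gamma*A)^{-1} r
    v_true = np.linalg.solve(np.eye(n) - gamma * A, r)

    # Simulate a sample path of length T
    states = [rng.integers(n)]
    for _ in range(T):
        states.append(rng.choice(n, p=A[states[-1]]))
    rewards = [r[s] for s in states]        # W_{t+1} = r(X_t)

    def l_step_td(l, alpha=0.1):
        """Tabular l-step TD estimate of the value vector from the sample path."""
        V = np.zeros(n)
        errors = []
        for t in range(T - l):
            G = sum(gamma**tau * rewards[t + tau] for tau in range(l))
            delta = G + gamma**l * V[states[t + l]] - V[states[t]]
            V[states[t]] += alpha * delta
            errors.append(np.linalg.norm(V - v_true))
        return V, errors

    for l in (1, 5, 10):                    # l = 1 is conventional TD
        _, err = l_step_td(l)
        print(f"l = {l:2d}: final error {err[-1]:.3f}")
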
Chapter 5

Parametric Approximation Methods

Until now we have presented various iterative methods for computing the value function associated with a
prespecified policy, and for recursively improving a starting policy towards an optimum. These methods
converge asymptotically to the right answer, as the number of episodes (for Monte Carlo methods) or the
number of samples (for Temporal Difference and Q-learning methods) approaches infinity. So in principle
one could truncate these methods when some termination criterion is met, such as the change from one
iteration to the next being smaller than some prespecified threshold. However, the emphasis is still finding
the exact value function, or the exact optimal policy.
This approach can be used to find the value function if the size n of the state space |X | is sufficiently
small. Similarly, this approach can be used to find the action-value function if the product nm of the size
of the state space |X | and the size of the action space |U| is sufficiently small. However, in many realistic
applications, these numbers are too large. Note that the value function V : X → R is an n-dimensional
vector. Hence we can identify the “value space” with Rn . Similarly, the policy π can be associated with a
matrix B ∈ R^{n×m}, where b_{ij} = Pr{π(x_i) = u_j}. The associated action-value function Q : X × U → R
can be thought of as an nm-dimensional vector. In principle, every vector in R^n can be the value vector
of some Markov reward process, and every vector in Rnm can be the action-value function of some Markov
Decision Process.
Therefore, when these numbers are too large, it becomes imperative to approximate the value or action-
value functions using a smaller number of parameters (smaller than n for value functions and smaller than
nm for action-value functions and policies). The present chapter is devoted to a study of such approximation
methods. We begin with methods for approximating the value function. Then we can ask: Is it possible to
find directly an approximation to the optimal policy using a suitable set of basis functions? This question lies at
the heart of “policy gradient” methods and is studied in Section 5.3. The solution is significantly facilitated
by a result known as the “policy gradient theorem.” Proceeding further, we can attempt simultaneously to
approximate both the policy function and the value function over a common set of basis functions. This is
known as the “actor-critic” approach and is studied in Section 5.3.

5.1 Value Approximation Methods


5.1.1 Preliminaries
The object of study in the present section is the approximation of the value function under a fixed policy π.
Therefore, though we will start off with the notation vπ, we will drop the subscript after some time.
In Chapter 4, we proposed a few methods for estimating the value Vπ (xi ) of a state of interest xi under a
fixed policy π. In the Monte Carlo approach, multiple episodes of the process starting at a state of interest
xi could be used to estimate Vπ (xi ). In the Temporal Difference approach, it is possible to bootstrap and
update the estimate Vπ (xi ) after each time instant, and not just after each episode. Now let vπ denote the


true n-dimensional value vector corresponding to the policy π. Suppose we try to approximate vπ by another
vector of the form v̂π(θ), where θ ∈ R^d is a vector of adjustable parameters, and d ≪ n. Thus v̂π : R^d → R^n
where n = |X |. Depending on the context, we can also think of v̂ as a collection of n maps, indexed by the
state xi , where each V̂ (xi , ·) : Rd → R. In general it is not possible to choose the parameter vector θ so
as to make v̂π (θ) equal to vπ . In the approaches suggested in Chapter 4, the estimate of Vπ (xi ) did not
depend on the estimate of Vπ (xj ) for xj 6= xi . However, in the present instance, since our aim is to estimate
all values Vπ (xi ) simultaneously, we need to define a suitable error criterion.
Suppose we have a sample path {X_t}_{t=0}^{T} that is used to estimate V_π. A natural choice is the least-squares
criterion given by

E(θ) := (1/(2(T+1))) Σ_{t=0}^{T} [V_π(X_t) − V̂(X_t, θ)]^2.
The above error criterion suggests that an error at any one time t is as significant as an error at any other
time τ . The above error criterion can be rewritten as
E(θ, µ) := (1/2) Σ_{x_i ∈ X} µ_i [V_π(x_i) − V̂_π(x_i, θ)]^2,                   (5.1)

where

µ_i := (1/(T+1)) Σ_{t=0}^{T} I{X_t = x_i}

is the fraction of times that X_t = x_i, the state of interest, in the sample path. With this definition of the
coefficients µi , the error criterion can be further rewritten as
E(θ, µ) := (1/2) [v_π − v̂_π(θ)]^⊤ M [v_π − v̂_π(θ)],                            (5.2)
where M ∈ Rn×n is the diagonal matrix with the µi as its diagonal elements.
This raises the question as to what the coefficients µi should be. Clearly, in a Markov process, the
seriousness of an error in estimating V_π(x_i) should depend on how likely the state x_i is to occur in a
sample path. Thus it would be reasonable to choose the coefficient vector µ as the stationary distribution
of the Markov process. However, there are two difficulties with this approach. The first difficulty is that
we do not know what this stationary distribution is. So we try to construct an estimate of the stationary
distribution, before we start the process of approximating the unknown value vector vπ . The second difficulty
is that if the Markov process has one or more absorbing states, then the above approach would be meaningless
as shown below.
If we assume that, under the policy π, the resulting Markov process is irreducible (see Section 8.2), then
there is a unique stationary distribution µ of the process. Moreover, µ > 0, i.e., every component µi is
positive. The question is how to estimate this stationary distribution. Theorem 8.8 and specifically (8.32)
provide an approach. For any function f : X → R, we have that
E[f, µ] = lim_{t→∞} (1/t) Σ_{τ=0}^{t−1} f(X_τ).                                (5.3)

Now define a function f_i : X → R by f_i(x_i) = 1, and f_i(x_j) = 0 if j ≠ i. Then it is easy to see that the
expected value E[f_i, µ] = µ_i. Moreover, (5.3) implies that

µ_i = E[f_i, µ] = lim_{t→∞} (1/t) Σ_{τ=0}^{t−1} I{X_τ = x_i}.                   (5.4)

Thus, in a sample path, we count what fraction of the visited states are equal to the state of interest x_i. As t → ∞, this fraction
converges almost surely to µi . Hence, if we take any sample path (or collection of sample paths), then the

fraction of occurrences of xi in the sample path is a good approximation to µi . These estimates can be used
in defining the error criterion in (5.1).
Much of reinforcement learning is based on Markov processes with absorbing states, and using episodes
that terminate in an absorbing state. It is obvious that a Markov process with an absorbing state is not
irreducible. Specifically, there is no path from an absorbing state to any other state. Therefore Theorem 8.8
does not apply. Let us change notation and suppose that a Markov process has nonabsorbing states S and
absorbing states A. Then it is easy to see that the state transition matrix A looks like

A = [ A_{11}  A_{12} ; 0  I_{|A|} ].                                           (5.5)
For example, if there is only one absorbing state, the bottom right corner is just a single element equal to 1. In this
case it can be shown that ρ(A_{11}) < 1 (see [46, Chapter 4]). Therefore there is a unique stationary distribution,
with a 1 in the last component and zeros elsewhere. If |A| > 1, then it is again true that ρ(A_{11}) < 1, but now
there are infinitely many stationary distributions; see Problem 5.1. Even if it is assumed that there is only
one absorbing state (for the policy under study), it is obvious that trying to use the stationary distribution
to assign the weights µ_i would be meaningless, because the weights would equal zero for all nonabsorbing
states. A reasonable approach is to set µ_i equal to the fraction of states X_t in a "typical" sample path that
equal x_i, before the sample path hits an absorbing state. For that purpose, the following theorem is useful.
Theorem 5.1. Suppose a reducible Markov process with absorbing states has a state transition matrix of
the form (5.5). Suppose an initial state X0 ∈ S is chosen in accordance with a probability distribution φ.
Then the fraction of states in S along all nonterminal paths (that is, paths until the time when they hit an
absorbing state) has the distribution µ given by
µ^⊤ = φ^⊤ (I_{|S|} − A_{11})^{−1}.                                             (5.6)
Proof to be supplied later.
This theorem still leaves open the question of how to choose the initial distribution φ.
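
As a quick sanity check of (5.6), the following sketch (with a hypothetical three-state transient block) compares the normalized right side of (5.6) with the empirical visit fractions over simulated episodes. Note that the right side of (5.6) sums to the expected episode length, so it is normalized here before the comparison.

    import numpy as np

    rng = np.random.default_rng(1)

    # A small chain with 3 transient states and 1 absorbing state (hypothetical numbers)
    A11 = np.array([[0.2, 0.5, 0.1],
                    [0.3, 0.2, 0.3],
                    [0.1, 0.1, 0.2]])
    A12 = 1.0 - A11.sum(axis=1, keepdims=True)   # remaining mass goes to the absorbing state
    phi = np.array([0.5, 0.3, 0.2])              # initial distribution over transient states

    # Right side of (5.6), normalized to a probability distribution
    mu = phi @ np.linalg.inv(np.eye(3) - A11)
    mu_normalized = mu / mu.sum()

    # Empirical visit fractions over many simulated episodes
    counts = np.zeros(3)
    for _ in range(20000):
        x = rng.choice(3, p=phi)
        while x < 3:                             # state 3 is the absorbing state
            counts[x] += 1
            x = rng.choice(4, p=np.append(A11[x], A12[x]))
    print(mu_normalized, counts / counts.sum())
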
Either way, once the weight vector µ has been chosen, the next step is to minimize the error function
E defined in (5.1). This is known as “value approximation.” Before discussing how value approximation is
achieved, let us digress to discuss whether this is the “right” problem to solve.
Once we have minimized the error in (5.1), we have an approximate value function v̂π (θ) for a fixed
policy π. However, the objective in MDPs is to choose an optimal or nearly optimal policy. Therefore, in
order to use the approximate value vectors v̂π to guide this choice, we must construct such approximations
for all policies π. Moreover, we must ensure that v̂π is a uniformly good approximation over all possible
policies. In other words, the error in (5.1) needs to be minimized for all policies π. Even if we restrict to
deterministic policies, the cardinality of Π_d is m^n = |U|^{|X|}, which can become very large very quickly. Even
otherwise, just because v̂π is uniformly close to vπ for all π, it does not follow that the maximizer of v̂π over
π is anywhere close to the maximizer of vπ with respect to π. Therefore we can ask: Is it possible to find
directly an approximation to the optimal policy using a suitable set of basis functions? This question lies at
the heart of “policy gradient” methods and is studied in Section 5.3. The solution is significantly facilitated
by a result known as the “policy gradient theorem.” Proceeding further, we can attempt simultaneously to
approximate both the policy function and the value function over a common set of basis functions. This is
known as the “actor-critic” approach and is studied in Section 5.3.

5.1.2 Stochastic Gradient and Semi-Gradient Methods


As discussed above, in the objective function of (5.1), the policy π is fixed, and can thus be omitted.
Similarly, once a sample path (a set of episodes or otherwise) is given, the coefficient vector µ is fixed,
whence it too can be omitted from the argument, and we can just write E(θ). Thus the objective is to
minimize the function
E(θ) := (1/2) [v − v̂(θ)]^⊤ M [v − v̂(θ)].                                       (5.7)

To compute the gradient of E with respect to θ, which is the variable of optimization, set θ ← θ + ∆θ.
Then

v̂(θ + ∆θ) ≈ v̂(θ) + ∇_θ v̂(θ) ∆θ.

Substituting this into the expression for E and retaining only first-order terms shows that

∇_θ E(θ) = [∇_θ v̂(θ)]^⊤ M [v̂(θ) − v].                                          (5.8)

Now let us discuss what form the function v̂(θ) may take. There are two natural categories, namely:
linear and nonlinear. In the linear case, there is a set of linearly independent vectors ψl ∈ Rn for l = 1, . . . , d
where d is the number of weights. (Recall that θ ∈ Rd .) Define

Ψ := [ψ1 | · · · |ψd ] ∈ Rn×d , (5.9)

and suppose that


v̂(θ) = Ψθ ∈ Rn . (5.10)
The assumption that the columns of Ψ are linearly independent means that Ψ has full column rank. It
means also that there are no redundant weights in θ.
Let V ⊆ R^n denote the span of the d vectors that comprise the columns of Ψ, that is, the range of Ψ
in R^n. Then it is clear that, for every choice of the weight vector θ, the vector v̂(θ) belongs to V. Hence
minimizing E(θ) is equivalent to finding the element of V closest to the true value vector v, where the
distance between two vectors a, b ∈ Rn is measured by

d(a, b) = [(a − b)> M (a − b)]1/2 .

In this case (5.8) becomes

∇_θ E(θ) = Ψ^⊤ M Ψ θ − Ψ^⊤ M v.                                                (5.11)
As pointed out earlier, the function v̂ can also be viewed as a family of functions indexed by x_i. Thus
we can define V̂_{x_i} : R^d → R by

V̂(x_i, θ) = [v̂(θ)]_{x_i},

that is, the x_i-th component of the map v̂ : R^d → R^n. As an illustration, consider the linear map v̂ defined
in (5.10). Then the map V̂_{x_i} is defined by

V̂(x_i, θ) = Ψ_{x_i} θ,

where Ψ_{x_i} denotes the x_i-th row of the matrix Ψ. With this in mind, let us define the error
E(x_i, θ) = (1/2) µ_i [V(x_i) − V̂(x_i, θ)]^2.                                  (5.12)
Now suppose, as we have been doing, that a sample path {(X_t, U_t, W_{t+1})}_{t≥0} is given. We wish to find the
weight vector θ that (nearly) minimizes the error criterion in (5.1). In what follows, we make a distinction
between so-called gradient methods (or updates), and so-called semi-gradient methods (or updates).
We begin with gradient methods, which are based on Monte Carlo simulation, and thus require the
Markov process to have one or more absorbing states. The sample path is also required to terminate in an
absorbing state at some time T. As before, define the cumulated discounted return

G_{t+1} = Σ_{τ=0}^{T−t−1} γ^τ W_{τ+t+1},    t ≤ T.

The iterations are started off at t = 0 with some initial weight vector θ_0. Then for each time t = 0, 1, ..., T, we adjust θ_t as follows:

θ t+1 = θ t + αt+1 [Gt+1 − V̂ (Xt , θ t )]∇θ V̂ (Xt , θ t ). (5.13)



Here {α_t} is a predetermined sequence of step sizes that can either be constant or decrease slowly to zero.
However, if we are to interpret (5.13) as moving θ_t in the direction of the gradient of the error function
E(X_t, θ) defined in (5.12), then the presence of the coefficient µ_{X_t} is essential.
The semi-gradient updating method is reminiscent of the Temporal Difference method. We proceed as
follows: Start at time t = 0 at any initial state X0 . Generate a sample path {(Xt , Ut , Wt+1 )} by choosing
Ut ∼ π(·|Xt ) in case the policy is probabilistic, and Ut = π(Xt ) if the policy is deterministic. Observe Wt+1
and Xt+1 . Repeat. Start with any initial vector θ 0 . On the basis of this sample path, update the weight
vector θ t at each time instant as follows:
θ t+1 = θ t + αt+1 [Rt+1 + γ V̂ (Xt+1 , θ t ) − V̂ (Xt , θ t )]∇θ V̂ (Xt , θ t ). (5.14)
If the value approximation function v̂(θ) is linear as in (5.10), then it is easy to see that
V̂ (Xt , θ) = ΨXt θ.
Therefore
∇θ V̂ (Xt , θ) = (ΨXt )> ,
where ΨXt denotes the Xt -th row of the matrix Ψ. If we define
yt = (ΨXt )> , (5.15)
that is, the transpose of row Xt of the matrix Ψ, then we can write
V̂ (Xt , θ) = hyt , θi = yt> θ, ∇θ V̂ (Xt , θ) = yt .
Therefore the updating rule (5.14) becomes

θ_{t+1} = θ_t + α_{t+1} (W_{t+1} + γ y_{t+1}^⊤ θ_t − y_t^⊤ θ_t) y_t.            (5.16)
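
For the linear case, one semi-gradient step (5.16) can be written compactly as in the following sketch; the feature matrix Psi and the remaining argument names are hypothetical.

    import numpy as np

    def semi_gradient_td_step(theta, Psi, x, w, x_next, gamma, alpha):
        """One semi-gradient TD(0) update (5.16) for a linear approximation v_hat = Psi @ theta.

        Psi : (n, d) feature matrix; row x is the feature vector y of state x
        """
        y, y_next = Psi[x], Psi[x_next]
        delta = w + gamma * y_next @ theta - y @ theta   # temporal difference
        return theta + alpha * delta * y
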
If the approximation function v̂(θ) is a linear map of the form Ψθ, we can establish the convergence of
the semi-gradient method by viewing it as an implementation of stochastic approximation.
Theorem 5.2. Suppose the value approximation function v̂(θ) is as in (5.10). Suppose the state transition
matrix A is irreducible, and that µ is its unique stationary distribution. Define the error function E(θ) as
in (5.1), and use the update rule (5.16). Finally, suppose the usual conditions on {αt } hold, namely

Σ_{t=1}^{∞} α_t = ∞,    Σ_{t=1}^{∞} α_t^2 < ∞.                                 (5.17)

Then the sequence of estimates {θ t } converges almost surely to θ ∗ , which is the unique solution of the
equation
C θ^* + b = 0,                                                                 (5.18)
where
C = Ψ^⊤ M (γA − I_n) Ψ,    b = E[R_{t+1} y_t],    M = Diag(µ_{x_i}).            (5.19)
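
Before turning to the proof, the conclusion that the matrix C in (5.19) has eigenvalues with negative real parts can be checked numerically, as in the following sketch with randomly generated A and Ψ (the sizes are hypothetical).

    import numpy as np

    rng = np.random.default_rng(2)
    n, d, gamma = 12, 3, 0.9

    A = rng.random((n, n)); A /= A.sum(axis=1, keepdims=True)   # strictly positive, hence irreducible
    Psi = rng.standard_normal((n, d))                           # full column rank with probability one

    # Stationary distribution: left eigenvector of A for eigenvalue 1, normalized to sum to one
    vals, vecs = np.linalg.eig(A.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    mu = mu / mu.sum()
    M = np.diag(mu)

    C = Psi.T @ M @ (gamma * A - np.eye(n)) @ Psi
    print(np.linalg.eigvals(C).real)    # all negative, as required in the proof of Theorem 5.2
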
To prove this theorem, we state and prove a couple of preliminary results.
Definition 5.1. A matrix F ∈ R^{d×d} is said to be row diagonally dominant if

f_{ii} > Σ_{j ≠ i} |f_{ij}|,                                                   (5.20)

which automatically implies that all diagonal elements of F are strictly positive. The matrix F is said to be
column diagonally dominant if F^⊤ is row diagonally dominant, or equivalently

f_{ii} > Σ_{j ≠ i} |f_{ji}|.                                                   (5.21)

Finally, the matrix F is said to be diagonally dominant if F is both row and column diagonally dominant.

Lemma 5.1. Suppose F is diagonally dominant. Then so is (F + F > )/2, the symmetric part of F .
The proof is omitted as it is obvious.
Lemma 5.2. Suppose F is diagonally dominant. Then the symmetric matrix F + F > is positive definite.
Proof. The proof is a ready consequence of Lemma 5.1 and the Gerschgorin circle theorem. Recall that, for
any matrix G ∈ R^{d×d}, the eigenvalues of G are contained in the union of the d Gerschgorin circles

C_i := {z ∈ C : |z − g_{ii}| ≤ Σ_{j ≠ i} |g_{ij}|},    i = 1, ..., d.

Now let G = F + F^⊤. Then G is symmetric and thus has only real eigenvalues. So the Gerschgorin circles
turn into Gerschgorin intervals. The diagonal dominance of G implies that each interval is a subset of
R_+. Therefore all eigenvalues of G are positive, and G is positive definite.
Lemma 5.3. (Lyapunov Matrix Equation) Suppose C ∈ Rd×d . Then all eigenvalues of C have negative real
parts if there is a positive definite matrix P such that

−(C > P + P C) =: Q

is positive definite.
This is a standard result in linear control theory and the proof can be found in many places, for example
[44, Theorem 5.4.42].
At last we come to the proof of the main theorem.
Proof. (Of Theorem 5.2.) Note that y_{t+1}^⊤ equals Ψ_{X_{t+1}}, which is row X_{t+1} of the matrix Ψ. Thus the
conditional expectation satisfies

E[y_{t+1}^⊤ | y_t] = y_t^⊤ A,

where A is the state transition matrix. Therefore, averaging over the stationary distribution of {X_t}, we obtain

E[(W_{t+1} + γ y_{t+1}^⊤ θ_t − y_t^⊤ θ_t) y_t] = b + C θ_t.

Now the update rule (5.16) is just stochastic approximation used to solve the linear equation

C θ^* + b = 0.

If all eigenvalues of the matrix C have negative real parts, then the differential equation

θ̇ = C θ + b

is globally asymptotically stable around its unique equilibrium θ ∗ . Thus the proof is completed once it is
established that all eigenvalues of C have negative real parts.
From Lemma 5.3, a sufficient condition for this is that −(C + C^⊤) is positive definite. Now note that

−(C + C^⊤) = Ψ^⊤ [H + H^⊤] Ψ,

where

H = M (I_n − γA).                                                              (5.22)

It is now shown that H is both row and column diagonally dominant. Observe first that H has µ_i(1 − γ a_{ii}) on
the diagonal (which are all positive because a_{ii} ≤ 1 and γ < 1), and has −µ_i γ a_{ij} as its off-diagonal elements.
Since the off-diagonal elements are nonpositive, to establish the diagonal dominance of H it is enough to show that

H 1_n > 0,    1_n^⊤ H > 0.

Let us begin with the first inequality. We have that

H 1_n = M (I_n − γA) 1_n = (1 − γ) M 1_n > 0.

Here we make use of the fact that A 1_n = 1_n due to the row-stochasticity of A. For the second inequality,
note that

1_n^⊤ M = µ^⊤,

where µ is the stationary distribution and thus satisfies µ^⊤ A = µ^⊤. Therefore

1_n^⊤ H = 1_n^⊤ M (I_n − γA) = µ^⊤ (I_n − γA) = (1 − γ) µ^⊤ > 0.

Now, by Lemma 5.2, the diagonal dominance of H implies that H + H^⊤ is positive definite. The fact that
Ψ has full column rank d implies that

−(C + C^⊤) = Ψ^⊤ [H + H^⊤] Ψ

is also positive definite. Finally, it follows from Lemma 5.3 that all eigenvalues of C have negative real
parts.

Problem 5.1. Consider a Markov process with two or more absorbing states, so that its state transition
matrix looks like

A = [ A_{11}  A_{12} ; 0  I_s ],

where s = |A|. Show that the set of all stationary distributions of this Markov process consists of the vectors

[ 0_{|S|} ; φ ],

where

φ_i ≥ 0,  i = 1, ..., s,    Σ_{i=1}^{s} φ_i = 1.

Problem 5.2. Give an example of a Markov process that is reducible but does not have any absorbing
states.

Problem 5.3. Show that the proof of Theorem 5.2 remains valid if µ is sufficiently close to the stationary
distribution of A. State and prove a theorem to this effect.

5.2 Value Approximation via TD(λ) -Methods


Until now we have studied the application of Monte Carlo methods for episodic Markov processes, and
Temporal Difference (TD) methods for not necessarily episodic Markov processes. In this section we introduce
a variant of TD-learning, known as TD(λ) -learning. In this model, λ ∈ [0, 1) is an adjustable parameter.
The choice λ = 0 leads to conventional TD-learning.
In this section we examine two classes of value approximation problems: The case where the value is
the discounted future reward (which is the case we have been studying until now), and the case where the
value is the average of future rewards. While the two cases are broadly similar, there are a few complicating
factors in the case of average reward processes.

5.2.1 Discounted Returns


The reference for the contents of this section is [42].
Suppose as before that we have a Markov reward process with state process {Xt } over a finite state space
X , and reward process {Rt+1 }, where the reward could be a random (but bounded) function of the state.
Choose a discount factor γ ∈ (0, 1). Define
"∞ #
X
t
V (xi ) := E γ Rt+1 |X0 = xi , ∀xi ∈ X . (5.23)
t=0

Note that V : X → R, so that V can be identified with a vector in R^n, where n = |X|. If n is too large, we
approximate V by another function V̂ : X × R^d → R. Therefore, for each parameter θ ∈ R^d, the quantity
V̂(x_i, θ) is an approximation for the true value V(x_i).
Suppose we have a sample path {(X_t, W_{t+1})}_{t≥0}. Using this sample path we wish to estimate the value
function V . For this purpose, define the temporal difference

δt+1 := Wt+1 + γ V̂ (Xt+1 , θ) − V̂ (Xt , θ).

This is the same as δt+1 defined in (5.14). The conventional TD updating rule is

θ t+1 = θ t + αt+1 δt+1 ∇θ V̂ (Xt , θ t ),

which is the same as (5.16). To define TD(λ) -updating, choose a number λ ∈ [0, 1], and define
θ_{t+1} = θ_t + α_{t+1} δ_{t+1} ( Σ_{τ=0}^{t} (γλ)^{t−τ} ∇_θ V̂(X_τ, θ_τ) ).     (5.24)

If we choose λ = 0, with the convention that (γλ)^0 = 1, then it is obvious that (5.24) becomes (5.16), the standard
TD updating rule.
In principle the TD(λ) -updating rule (5.24) can be applied with any function V̂ . However, in the remain-
der of this section, we focus on the case of linear approximation, where there is a matrix Ψ ∈ Rn×d , and
V̂ (xi ) = Ψxi θ, for each xi ∈ X . Assume that Ψ has full column rank, so that none of the components of θ
is redundant. Also, define as before the vector

yτ := [ΨXτ ]> ∈ Rd .

Then the eligibility vector z_t ∈ R^d can be defined as

z_t = Σ_{τ=0}^{t} (γλ)^{t−τ} y_τ.                                              (5.25)

With this new notation, the TD(λ) -updating rule (5.24) can be written as

θ t+1 = θ t + αt+1 δt+1 zt , (5.26)

with z0 = 0. Note that if λ = 0 then zt = yt and (5.26) becomes (5.16). Also note that zt satisfies the
recursion
zt = γλzt−1 + yt . (5.27)
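
Using the eligibility recursion (5.27), one TD(λ) step (5.26) for a linear approximation Ψθ can be sketched as follows; all argument names are hypothetical.

    import numpy as np

    def td_lambda_step(theta, z, Psi, x, w, x_next, gamma, lam, alpha):
        """One TD(lambda) update, (5.26)-(5.27), for a linear approximation Psi @ theta."""
        y, y_next = Psi[x], Psi[x_next]
        z = gamma * lam * z + y                          # eligibility vector recursion (5.27)
        delta = w + gamma * y_next @ theta - y @ theta   # temporal difference
        return theta + alpha * delta * z, z
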
In [42], the convergence properties of TD(λ) -updating are studied. As one might expect, the updating
rule is interpreted as the stochastic approximation algorithm applied to solve a linear equation of the form

Cθ ∗ = b.

If the ODE
θ̇ = Cθ − b
is globally asymptotically stable, then the stochastic approximation converges to a solution of the algebraic
equation. However, the matrix C is more complicated than in Section 5.1. We give a brief description of the
results and direct the reader to [42] for full details.
We begin with an alternative to Theorem 2.2. That theorem states that, whenever the discount factor
γ < 1, the map y ↦ T y := r + γ A y is a contraction with respect to the ℓ_∞-norm, with contraction constant
γ. The key to the proof is the fact that if A is row-stochastic, then the induced norm ‖A‖_{∞→∞} = 1. Now
it is shown that a similar statement holds for an entirely different norm.
Suppose A is row-stochastic and irreducible, and let µ denote its stationary distribution. Note that, by
Theorem 8.6, µ is uniquely defined and has all positive elements. Define M = Diag(µi ) and define a norm
k · kM on Rd by
kvkM = (v> M v)1/2 . (5.28)
Then the corresponding distance between two vectors v1 , v2 is given by

kv1 − v2 kM = ((v1 − v2 )> M (v1 − v2 ))1/2 .

Lemma 5.4. Suppose A ∈ [0, 1]^{n×n} is row-stochastic and irreducible. Let µ be the stationary distribution
of A. Then

‖A v‖_M ≤ ‖v‖_M,    ∀ v ∈ R^n.                                                 (5.29)

Consequently, the map v ↦ r + γ A v is a contraction with respect to ‖ · ‖_M.

In order to prove Lemma 5.4, we make use of a result known as “Jensen’s inequality,” of which only a
very simple special case is presented here.

Lemma 5.5. (Jensen’s Inequality) Suppose Y is a real-valued random variable taking values in a finite set
Y = {y1 , . . . , yn }, and let p denote the probability distribution of Y. Suppose f : R → R is a convex function.
Then
f (E[Y, p]) ≤ E[f (Y ), p]. (5.30)

It can be shown that Jensen’s inequality holds for arbitrary real-valued random variables. But what is
stated above is adequate for present purposes.

Proof. (Of Lemma 5.5.) Note that, because f(·) is convex, we have

f(E[Y, p]) = f( Σ_{i=1}^{n} p_i y_i ) ≤ Σ_{i=1}^{n} p_i f(y_i) = E[f(Y), p].

This is the desired bound.

Proof. (Of Lemma 5.4.) We will show that

‖A v‖_M^2 ≤ ‖v‖_M^2,    ∀ v ∈ R^n,

which is clearly equivalent to (5.29). Now

‖A v‖_M^2 = Σ_{i=1}^{n} µ_i (A v)_i^2 = Σ_{i=1}^{n} µ_i ( Σ_{j=1}^{n} A_{ij} v_j )^2.

However, for each fixed index i, the row Ai is a probability distribution, and the function f (Y ) = Y 2 is
convex. If we apply Jensen’s inequality with f (Y ) = Y 2 , we see that
 2
Xn n
X
 Aij vj  ≤ Aij vj2 , ∀i.
j=1 j=1

Therefore  
n n n n
! n
X X X X X
kAvk2M ≤ µi  Aij vj2  = µi Aij vj2 = µj vj2 = kvk2M ,
i=1 j=1 j=1 i=1 j=1

where in the last step we use the fact that µA = µ.

Next, to analyze the behavior of TD(λ)-updating, we define a map T^{(λ)} : R^n → R^n via

[T^{(λ)} f]_i := (1 − λ) Σ_{l=0}^{∞} λ^l E[ Σ_{τ=0}^{l} γ^τ R_{τ+1} + γ^{l+1} f_{X_{l+1}} | X_0 = x_i ].   (5.31)

Note that T^{(λ)} f can be written explicitly as

T^{(λ)} f = (1 − λ) Σ_{l=0}^{∞} λ^l [ Σ_{τ=0}^{l} γ^τ A^τ r + γ^{l+1} A^{l+1} f ],   (5.32)

where, as before,

r = [ R(x_1)  · · ·  R(x_n) ]^⊤,

and R(x_i) = E[R_1 | X_0 = x_i].

Lemma 5.6. The map T^{(λ)} is a contraction with respect to ‖ · ‖_M, with contraction constant [γ(1 − λ)]/(1 − γλ).

Proof. Note that the first term on the right side of (5.32) does not depend on f. Therefore

T^{(λ)} f_1 − T^{(λ)} f_2 = γ(1 − λ) Σ_{l=0}^{∞} (γλ)^l A^{l+1} (f_1 − f_2).

However, it is already known from Lemma 5.4 that

‖A (f_1 − f_2)‖_M ≤ ‖f_1 − f_2‖_M.

By repeatedly applying the above, it follows that

‖A^l (f_1 − f_2)‖_M ≤ ‖f_1 − f_2‖_M,    ∀ l.

Therefore

‖T^{(λ)} f_1 − T^{(λ)} f_2‖_M ≤ γ(1 − λ) Σ_{l=0}^{∞} (γλ)^l ‖f_1 − f_2‖_M = [γ(1 − λ)/(1 − γλ)] ‖f_1 − f_2‖_M.

This is the desired bound.



Let us define v ∈ Rn as the unique solution of

T (λ) v = v.

Then it is easy to verify that v is in fact the value function, because the value function also satisfies the
above equation, and the equation has a unique solution because T (λ) is a contraction. Presumably we could
find v by running a stochastic approximation type of iteration, without value approximation. However, our
interest is in what happens with value approximation. Towards this end, define a projection Π : Rn → Rn
by
Πa := Ψ (Ψ^⊤ M Ψ)^{−1} Ψ^⊤ M a.                                                (5.33)

Then

Πa = arg min_{b ∈ Ψ(R^d)} ‖a − b‖_M.                                           (5.34)

Thus Πa is the closest point to a in the subspace Ψ(R^d). With this definition, we can state the following:
Theorem 5.3. Suppose the sequence {θ t } is defined by (5.24). Suppose further that the standard conditions

Σ_{t=0}^{∞} α_t = ∞,    Σ_{t=0}^{∞} α_t^2 < ∞                                  (5.35)

hold. Then the sequence {θ t } converges almost surely to θ ∗ , where θ ∗ is the unique solution of

ΠT (λ) (Ψθ ∗ ) = Ψθ ∗ . (5.36)

Moreover,

‖Ψ θ^* − v‖_M ≤ [(1 − γλ)/(1 − γ)] ‖Π v − v‖_M.                                (5.37)
Let us understand what the above bound states. Because we are approximating the true value vector v
by a vector of the form Ψθ, the best possible approximation in terms of the distance ‖ · ‖_M is given by Πv,
and ‖Πv − v‖_M is the smallest possible approximation error. Now (5.37) states that the limit of the TD(λ)
iterations satisfies an error bound that is larger than the lowest possible error by an "expansion" factor of
(1 − γλ)/(1 − γ). In order to make this expansion factor smaller, we should choose λ closer to one. However,
the closer λ is to one, the more slowly the infinite series in (5.32) converges. These tradeoffs are still not
well understood.
The proof as usual consists of constructing a matrix C whose eigenvalues all have negative real parts,
and showing that θ_t converges to the solution of Cθ^* = b. The details can be found in the paper.
To summarize what the bound (5.37) means: since Ψθ ∈ Ψ(R^d), the quantity ‖Πv − v‖_M is the best that
we could hope to achieve, and the distance between the limit Ψθ^* and the true value vector v is bounded by
a factor (1 − γλ)/(1 − γ) times this minimum; note that (1 − γλ)/(1 − γ) > 1. This is the extent to which
the TD(λ) iterations miss the optimal approximation.

5.2.2 Average Returns


Until now we have studied discounted Markov Decision Processes. But we can also study average reward
processes. The key reference for this subsection is [43].
Suppose {Xt } is a Markov process on a finite set X , with state transition matrix A (which is unknown).
Suppose further that A is irreducible and aperiodic, and let µ denote the stationary distribution of A.
Suppose R : X → R is the reward function. Define
T
1 X
c∗ := lim R(Xt ). (5.38)
T →∞ T + 1
t=0

Then by the assumptions on A, it follows that

c^* = Σ_{x_i ∈ X} µ_i R(x_i) = E[R, µ].                                        (5.39)

Note that c^* is a constant that is independent of the initial state. Now define

J(x_i) := lim_{T→∞} E[ Σ_{t=0}^{T} (R(X_t) − c^*) | X_0 = x_i ],               (5.40)

the expected cumulative reward starting in state x_i, measured relative to the overall average reward c^*. So we can think of J(x_i)
as the relative advantage or disadvantage of starting in state x_i. Define J as the vector of the values J(x_i) as x_i varies
over X . Then J satisfies the so-called Poisson equation

J = r − c∗ 1n + AJ. (5.41)

The vector J is called the differential reward. Note that if we replace J by J + α1n for some constant α,
then because A1n = 1n , the vector J + α1n also satisfies (5.41). Therefore there is a unique vector J∗ ∈ Rn
that satisfies (5.41) and also µ> J∗ = 0. Then the set of all solutions to (5.41) is given by {J∗ +α1n : α ∈ R}.
Further, J^* satisfies

J^* = Σ_{t=0}^{∞} A^t (r − c^* 1_n).                                           (5.42)

Now the problem studied in this subsection is how to approximate J^*. Choose a matrix Ψ ∈ R^{n×d}, and
approximate J^* by Ψθ. A key assumption is that 1_n ∉ Ψ(R^d). In other words, the vector of all ones does
not belong to the range of Ψ.
Suppose we have a sample path {(Xt , Wt+1 )}t≥0 . As before, for Xt ∈ X , define

yt = [ΨXt ]> ∈ Rd . (5.43)

In this case the approximating function is defined by

Ĵ(X_t, θ) := ⟨y_t, θ⟩ = y_t^⊤ θ = θ^⊤ y_t.                                      (5.44)

There is also another sequence {c_t} in R that tries to approximate the constant c^*. At time t, define the
temporal difference

δ_{t+1} := W_{t+1} − c_t + Ĵ(X_{t+1}, θ_t) − Ĵ(X_t, θ_t).                       (5.45)
Next, choose a constant λ ∈ [0, 1) and use a TD(λ) update

ct+1 = ct + ηt+1 (Wt+1 − ct ), (5.46)


θ_{t+1} = θ_t + α_{t+1} δ_{t+1} ( Σ_{τ=0}^{t} λ^{t−τ} y_τ ),                    (5.47)

where ηt , αt are step sizes. We will choose these step sizes in tandem, so that ηt = βαt for each time t; we
can also just choose β = 1 so that η_t = α_t. Also, y_τ is defined as per (5.43). Define the eligibility vector

z_t := Σ_{τ=0}^{t} λ^{t−τ} y_τ ∈ R^d,                                           (5.48)

and observe that z_t satisfies the recursion

z_t = λ z_{t−1} + y_t.                                                          (5.49)
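
A sketch of one joint update of (5.46)-(5.47), using the recursion (5.49) and the coupling η_t = βα_t, is given below; the argument names are hypothetical.

    import numpy as np

    def average_reward_td_lambda_step(c, theta, z, Psi, x, w, x_next, lam, alpha, beta=1.0):
        """One update of (5.46)-(5.47) with the eligibility recursion (5.49)."""
        y, y_next = Psi[x], Psi[x_next]
        z = lam * z + y                                   # eligibility vector, (5.49)
        delta = w - c + y_next @ theta - y @ theta        # temporal difference, (5.45)
        c = c + beta * alpha * (w - c)                    # running estimate of c*, (5.46)
        theta = theta + alpha * delta * z                 # parameter update, (5.47)
        return c, theta, z
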

Next, define the projection matrix

Π := Ψ (Ψ^⊤ M Ψ)^{−1} Ψ^⊤ M,                                                   (5.50)

where M = Diag(µ_i). Then, as we have seen before, for every a ∈ R^n, we have that

Πa = arg min_{b ∈ Ψ(R^d)} ‖a − b‖_M,

where ‖ · ‖_M is the weighted Euclidean distance as before. For each λ ∈ [0, 1), define T^{(λ)} : R^n → R^n by

T^{(λ)} v := (1 − λ) Σ_{l=0}^{∞} λ^l ( Σ_{τ=0}^{l} A^τ (r − c^* 1_n) + A^{l+1} v ).   (5.51)

Just as before, the map T^{(λ)} is a contraction with respect to ‖ · ‖_M, with constant 1 − λ. Note that, for each
finite l, we have

Σ_{τ=0}^{l} A^τ (r − c^* 1_n) + A^{l+1} J^* = J^*.                              (5.52)
With these observations, we can state the following theorem:

Theorem 5.4. Suppose η_t = β α_t for some fixed constant β. Suppose further that

Σ_{t=0}^{∞} α_t = ∞,    Σ_{t=0}^{∞} α_t^2 < ∞.                                 (5.53)

Then, as t → ∞, almost surely we have that (i) c_t → c^*, and (ii) θ_t → θ^*, where θ^* is the unique solution
of

Π T^{(λ)} (Ψ θ^*) = Ψ θ^*.                                                      (5.54)
Define

φ_t := [ c_t ; θ_t ] ∈ R^{d+1}.                                                 (5.55)
The idea behind the proof is the standard one, namely to use stochastic approximation to solve a linear
equation of the form
Cφ = b.
However, there are a few wrinkles. The matrix C turns out to be

C = [ −β , 0 ; −(1/(1−λ)) Ψ^⊤ M 1_n , Ψ^⊤ M (A^{(λ)} − I_n) Ψ ],                (5.56)
where

A^{(λ)} := (1 − λ) Σ_{l=0}^{∞} λ^l A^{l+1}.                                     (5.57)

Now note that, since λ < 1, we have that ρ(λA) = λ < 1. Therefore

Σ_{l=0}^{∞} λ^l A^l = (I_n − λA)^{−1},

and hence

A^{(λ)} = (1 − λ) (I_n − λA)^{−1} A.                                            (5.58)
It is easy to verify that A^{(λ)} 1_n = 1_n. Therefore, without additional assumptions, the matrix C is not
asymptotically stable in the sense of having all of its eigenvalues in the open left half-plane. However,
(A^{(λ)} − I_n) v ≠ 0 if v is not a multiple of 1_n. Now the assumption that 1_n ∉ Ψ(R^d) guarantees that the
successive approximations Ψθ_t are not multiples of 1_n. On the complement of the one-dimensional subspace
generated by 1_n, the matrix C is asymptotically stable.

5.3 Policy Gradient and Actor-Critic Methods


The contents of Sections 5.1 and 5.2 are aimed at value approximation. Thus, given a policy π, the objective
is to compute an approximation v̂π for the true value function vπ corresponding to this policy. Since the
true value vector vπ ∈ R|X | , if the size of the state space |X | =: n is too large, we choose v̂π = v̂π (θ), where
θ ∈ R^d and d ≪ n. However, ultimately our objective is to choose good policies. So for this purpose we
can choose some suitable objective function, and directly choose π so as to optimize that objective function.
The question then arises as to what this objective function should be. For technical reasons, in policy
optimization, one studies the average reward, as opposed to the discounted cumulative reward that has been
the main focus until now. Recall that we briefly introduced average reward Markov processes in Section 5.2.
We shall build on that in the present section to go beyond just reward processes to decision processes.

5.3.1 Preliminaries
The set-up is the standard Markov Decision Process (MDP) with a finite state space X of cardinality n,
a finite action space U of cardinality m, and state transition matrices Auk for each uk ∈ U. Until now,
much of the focus has been on deterministic policies π : X → U, where the current state Xt determines
the current action Ut uniquely as π(Xt ). However, policy approximation methods that have been widely
studied do not work very well with deterministic policies. Instead, the focus is on probabilistic policies.
Given the action set U, let S(U) denote the set of probability distributions on U. Since U is a finite set,
S(U) consists of m-dimensional nonnegative vectors whose components add up to one, that is, the simplex in
Rm+ . A probabilistic policy π : X → S(U) is a map from the current state Xt to a corresponding probability
distribution on U. Until now there is no difference with earlier notation. Now we introduce a parameter
vector θ ∈ Rd , and denote the policy as π(xi , uk , θ). Thus
π(xi , uk , θ) = Pr{Ut = uk |Xt = xi , θ}. (5.59)
In order for policy approximation theory to work, it is usually assumed that some or all of the following
assumptions hold:
P1. We have that
π(xi , uk , θ) > 0, ∀xi ∈ X , uk ∈ U, θ ∈ Rd . (5.60)
Note that this assumption rules out deterministic policies.
P2. The quantity π(x_i, u_k, θ) is twice continuously differentiable with respect to θ. Moreover, there exists
a finite constant M such that

|∂ ln π(x_i, u_k, θ) / ∂θ_l| ≤ M,    ∀ x_i ∈ X, u_k ∈ U, θ ∈ R^d.               (5.61)

Note that a ready consequence of (5.61) is that

|∂ π(x_i, u_k, θ) / ∂θ_l| ≤ M,    ∀ x_i ∈ X, u_k ∈ U, θ ∈ R^d.                  (5.62)

See Problem 5.4.
See Problem 5.4.
In policy approximation, the objective is to identify an optimal choice of θ that would maximize a chosen
objective function. As stated above, this objective function is usually the average reward under the policy.
In principle, the value approximation methods studied earlier in this chapter could be used for this purpose,
as follows: For each possible policy π, compute an approximate value v̂π . These approximations are a
function of an auxiliary parameter vector θ which is suppressed in the interests of clarity. Ensure that these
approximations are uniformly close in the sense that
kvπ − v̂π k∞ ≤ , ∀π ∈ Π, (5.63)

where Π is the class of policies under study. Note that (5.63) is equivalent to

|V_π(x_i) − V̂_π(x_i)| ≤ ε,    ∀ x_i ∈ X, π ∈ Π.

Now fix a state x_i ∈ X, and choose a policy π̂_i^* ∈ Π such that

π̂_i^* = arg max_{π ∈ Π} V̂_π(x_i).                                              (5.64)

Then π̂_i^* is an approximately optimal policy when starting from the state x_i. Define π_i^* to be the "true"
optimal policy, that is,

π_i^* = arg max_{π ∈ Π} V_π(x_i).                                               (5.65)

Then it is a ready consequence of (5.63) that

|V̂_{π̂_i^*}(x_i) − V_{π_i^*}(x_i)| ≤ ε.                                          (5.66)

In other words, the optimum of the approximate value function (for a given starting state) is within ε of the
true optimal value starting from x_i. The proof is left as an exercise; see Problem 5.5. Thus it is possible,
at least in principle, to compute a good approximation to the optimal value. However, there is no reason to
assume that the policy π̂_i^* is in any way "close" to the true optimal policy π_i^*.
To address this issue, in policy optimization one directly parametrizes the policy π as π(xi , uk , θ), and
then chooses θ so as to optimize an appropriate objective function. As mentioned above, the preferred choice
is the average reward. In order to carry out this process, it is highly desirable to have an expression for the
gradient of this objective function with respect to the parameter vector θ. This is provided by the “policy
gradient theorem” given below. However, there are other considerations that need to be taken into account.
Numerical examples have shown that if a nearly optimal policy is chosen on the basis of value approxi-
mation, the resulting set of policies may not converge; see for example [41]. To quote from the article:
This analysis is particularly interesting, since the algorithm is closely related to Q-learning
(Watkins and Dayan, 1992) and temporal-difference learning (T D(λ)) (Sutton, 1988), with λ
set to 0. The counter-example discussed demonstrates the short-comings of some (but not all)
variants of Q-learning and temporal-difference learning that are employed in practice.
This means the following: For a fixed tolerance ε, choose an approximately optimal policy π̂_i^*(ε) as in
(5.64). As the tolerance ε is reduced towards zero, we will get a sequence of nearly optimal policies, one
for each ε. The question is: As ε → 0^+, does the corresponding policy π̂_i^*(ε) approach the true optimal
policy π_i^*? In general, the answer is no. So this would seem to rule out value approximation as a way
of determining nearly optimal policies, though it does allow us to determine nearly optimal values. On
the other side, choosing nearly optimal policies using policy parametrization leads to slow convergence and
high variance. A class of algorithms known as “actor-critic” consist of simultaneously carrying out policy
optimization (approximately) coupled with value approximation. Numerical experiments show that actor-
critic algorithms lead to faster convergence and lower variance than pure policy optimization alone. It must
be emphasized that “actor-critic” refers to a class of algorithms (or a philosophy), and not one specific
algorithm. Thus, within the umbrella actor-critic, there are multiple variants currently in use.
As stated above, actor-critic algorithms update the policy and the value in parallel. Here “actor” refers
to policy updating, while “critic” refers to value updating. Some assumptions about actor-critic algorithms
are often understated or even unstated, so we mention them now.
• The behavior of actor-critic algorithms has been analyzed for the most part only when the objective
function to be maximized is the average reward.

• In order to prove the convergence of actor-critic algorithms, it is usually assumed that the actor (policy)
is updated more slowly than the critic (value).

5.3.2 Policy Gradient Formula


In this section we present an important theorem on evaluating the gradient of a cost function with respect to
the parameter that determines a policy. The formula for the gradient is not entirely "explicit," but it has the
ingredients that permit one to "learn" this gradient and to use it in order to maximize an objective function.
We begin with the problem set-up. As before, we have a finite state space X , a finite action space U,
and a family of probabilistic policies π(θ), where each π(θ) : X → S(U), where S(U) is the set of probability
distributions on U. Thus, as stated in (5.59), we define
π(xi , uk , θ) = Pr{Ut = uk |Xt = xi , θ}.
Assume that conditions (P1) and (P2) hold regarding the differentiability of the policy with respect to the
parameter θ. There is also a reward function R : X × U → R. Under the policy π(θ), the state transition
matrix of the corresponding Markov process is given by

Pr{X_{t+1} = x_j | X_t = x_i, π(θ)} = [A_{π(θ)}]_{ij} =: [A_θ]_{ij} = Σ_{u_k ∈ U} π(x_i, u_k, θ) a_{ij}^{u_k}.

In what follows, to simplify notation, we denote various quantities with the subscript or superscript θ,
instead of π(θ). To illustrate, above we have used Aθ to denote Aπ(θ) .
Assume that A_θ is irreducible and aperiodic for each θ, and let µ_θ denote the corresponding stationary
distribution. Therefore µ_θ satisfies

Σ_{x_i ∈ X} µ_θ(x_i) [ Σ_{u_k ∈ U} π(x_i, u_k, θ) a_{ij}^{u_k} ] = µ_θ(x_j).    (5.67)

The above equation illustrates another notational convention. For a vector such as µθ , we use bold-faced
letters, while for a component of the vector such as µθ (xi ), we don’t use bold-faced letters.
Associated with each policy π is an average reward c^*, defined in analogy with (5.38) and (5.39), namely

c^*(θ) := lim_{T→∞} (1/(T+1)) Σ_{t=0}^{T} R_θ(X_t) = E[R_θ, µ_θ].               (5.68)

Because the policy π is probabilistic, we have from (2.27) that

R_θ(x_i) = Σ_{u_k ∈ U} π(x_i, u_k, θ) R(x_i, u_k).                              (5.69)

(As per our convention, if X_t = x_i, the reward R(x_i) is paid at time t + 1.)

Therefore an equivalent characterization of c^*(θ) is

c^*(θ) = Σ_{x_i ∈ X} µ_θ(x_i) Σ_{u_k ∈ U} π(x_i, u_k, θ) R(x_i, u_k).           (5.70)

In addition to the constant c^*, which depends only on θ and nothing else, we also define the function
Q_θ : X × U → R by

Q_θ(x_i, u_k) = Σ_{t=1}^{∞} E[R_θ(X_t) − c^*(θ) | X_0 = x_i, U_0 = u_k, π(θ)].   (5.71)

Note that, because c^*(θ) is the "weighted average" of the various rewards, the summation in (5.71) is
well-defined even though there is no discount factor. Also, Q_θ(x_i, u_k) satisfies the recursion

Q_θ(x_i, u_k) = R(x_i, u_k) − c^*(θ) + Σ_{x_j ∈ X} a_{ij}^{u_k} V_θ(x_j),        (5.72)

where the value Vθ (xi ) is defined as


X
Vθ (xi ) = π(xi , uk , θ)Qθ (xi , uk ). (5.73)
uk ∈U

Now the objective of policy approximation is to choose the parameter θ so as to maximize c∗ (θ). For
this purpose, it is highly desirable to have an expression for the gradient ∇θ c∗ (θ). This is precisely what the
policy gradient theorem gives us. The expression for the gradient can be combined with various methods to
look for a maximizing choice of θ.
With all this notation, the policy gradient theorem can be stated very simply. The source for this proof
is [34]. Note that in that paper, the policy gradient theorem is proved not only for the average reward but
for another type of reward as well. We do not present that result here and the interested reader can consult
the original paper.
Theorem 5.5. (Policy Gradient Theorem) We have that

∇_θ c^*(θ) = Σ_{x_i ∈ X} µ_θ(x_i) Σ_{u_k ∈ U} ∇_θ π(x_i, u_k, θ) Q_θ(x_i, u_k).   (5.74)

Remark: As we change the parameter θ, the corresponding stationary distribution µθ also changes.
Moreover, it is difficult to compute this distribution, and even more difficult to compute the gradient with
respect to θ. Therefore the main benefit of the policy gradient theorem is that the only function whose gradient
appears is π(xi , uk , θ), and this is a function that is chosen by the learner; therefore it is easy to compute
the gradient ∇θ π(xi , uk , θ).
For convenience we will write ∂/∂θ instead of ∇_θ, even though θ is a vector. Recall the expression (5.73)
for V_θ(x_i). This leads to

∂V_θ(x_i)/∂θ = Σ_{u_k ∈ U} [ (∂π(x_i, u_k, θ)/∂θ) Q_θ(x_i, u_k) + π(x_i, u_k, θ) (∂Q_θ(x_i, u_k)/∂θ) ]

             = Σ_{u_k ∈ U} (∂π(x_i, u_k, θ)/∂θ) Q_θ(x_i, u_k) + Σ_{u_k ∈ U} π(x_i, u_k, θ) (∂/∂θ)[ R(x_i, u_k) − c^*(θ) + Σ_{x_j ∈ X} a_{ij}^{u_k} V_θ(x_j) ].

However, R(x_i, u_k) does not depend on θ, so its gradient is zero. Therefore the above equation can be
expressed as

∂V_θ(x_i)/∂θ = Σ_{u_k ∈ U} (∂π(x_i, u_k, θ)/∂θ) Q_θ(x_i, u_k) + Σ_{u_k ∈ U} π(x_i, u_k, θ) [ −∂c^*(θ)/∂θ + Σ_{x_j ∈ X} a_{ij}^{u_k} (∂V_θ(x_j)/∂θ) ]

             = Σ_{u_k ∈ U} (∂π(x_i, u_k, θ)/∂θ) Q_θ(x_i, u_k) − ∂c^*(θ)/∂θ + Σ_{u_k ∈ U} π(x_i, u_k, θ) Σ_{x_j ∈ X} a_{ij}^{u_k} (∂V_θ(x_j)/∂θ),   (5.75)

because ∂c^*(θ)/∂θ does not depend on u_k, and

Σ_{u_k ∈ U} π(x_i, u_k, θ) = 1.

Now multiply both sides of (5.75) by µ_θ(x_i) and sum over x_i ∈ X. This gives

Σ_{x_i ∈ X} µ_θ(x_i) (∂V_θ(x_i)/∂θ) = Σ_{x_i ∈ X} µ_θ(x_i) Σ_{u_k ∈ U} (∂π(x_i, u_k, θ)/∂θ) Q_θ(x_i, u_k) − ∂c^*(θ)/∂θ
                                     + Σ_{x_i ∈ X} µ_θ(x_i) Σ_{u_k ∈ U} π(x_i, u_k, θ) Σ_{x_j ∈ X} a_{ij}^{u_k} (∂V_θ(x_j)/∂θ).   (5.76)

Here again we use the fact that c^*(θ) does not depend on x_i, so that

Σ_{x_i ∈ X} µ_θ(x_i) (∂c^*(θ)/∂θ) = ∂c^*(θ)/∂θ.

Now we invoke (5.67), which involves µ_θ. This shows that the last term on the right side of (5.76) can be
expressed as

Σ_{x_i ∈ X} µ_θ(x_i) Σ_{u_k ∈ U} π(x_i, u_k, θ) Σ_{x_j ∈ X} a_{ij}^{u_k} (∂V_θ(x_j)/∂θ) = Σ_{x_j ∈ X} µ_θ(x_j) (∂V_θ(x_j)/∂θ).

Therefore (5.76) can be expressed as

Σ_{x_i ∈ X} µ_θ(x_i) (∂V_θ(x_i)/∂θ) = Σ_{x_i ∈ X} µ_θ(x_i) Σ_{u_k ∈ U} (∂π(x_i, u_k, θ)/∂θ) Q_θ(x_i, u_k) − ∂c^*(θ)/∂θ + Σ_{x_j ∈ X} µ_θ(x_j) (∂V_θ(x_j)/∂θ).

Obviously the left side is the same as the last term on the right side. Cancelling these two terms and
rearranging gives

∂c^*(θ)/∂θ = Σ_{x_i ∈ X} µ_θ(x_i) Σ_{u_k ∈ U} (∂π(x_i, u_k, θ)/∂θ) Q_θ(x_i, u_k),

which is the desired result.
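
In practice, (5.74), in the equivalent form involving ∇_θ ln π that is derived later in (5.84), suggests estimating the gradient by averaging the product of the score and the action value along a long trajectory generated under π(θ). The following sketch (not taken from [34]) assumes such a trajectory is available together with unbiased estimates of Q_θ at the visited pairs; all names are hypothetical.

    import numpy as np

    def policy_gradient_estimate(states, actions, q_estimates, score_fn):
        """Sample-based estimate of the policy gradient: the trajectory average of
        score(x, u) * Q_theta(x, u), where score_fn(x, u) returns grad_theta ln pi(x, u, theta).
        """
        grads = [score_fn(x, u) * q for x, u, q in zip(states, actions, q_estimates)]
        return np.mean(grads, axis=0)
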


Now we present a related result known as the “policy gradient theorem with functional approximation.”
While the formula (5.74) is very neat, its applicability is limited by the fact that it involves the unknown
function Qθ . To get around this difficulty, we replace the true function Qθ by an approximation Q̂θ .
Specifically, suppose Q̂θ (Xt , Ut ) is an unbiased estimator of Qθ (Xt , Ut ), for example R(Xt , Ut ). In turn
suppose that f : X × U × Ω → R is an approximator to Q̂θ . Note that Q̂θ (Xt , Ut ) can be observed at each
time instant. So we try to choose the parameter ω ∈ Ω so as to minimize the average error
J(ω) := lim_{T→∞} (1/(2(T+1))) Σ_{t=0}^{T} [f(X_t, U_t, ω) − Q̂(X_t, U_t)]^2.

So at time t, we can update the current guess via

(ω_{t+1} − ω_t) ∝ ∇_ω J(ω_t) = [f(X_t, U_t, ω_t) − Q̂(X_t, U_t)] ∇_ω f(X_t, U_t, ω_t),

where we use only the last term in the summation (as it is the only one that depends on ω_t). Note that we
use the symbol ∝ because for the moment we are not worried about the choice of the step size.
As is by now familiar practice, we can write J(ω) equivalently as

J(ω) = (1/2) Σ_{x_i ∈ X} Σ_{u_k ∈ U} η(x_i, u_k) [f(x_i, u_k, ω) − Q̂(x_i, u_k)]^2,

where η is the stationary distribution of the joint Markov process {(Xt , Ut )}. It is easy to see that

Pr{Xt = xi , Ut = uk |π(θ)} = Pr{Xt = xi |π(θ)} · Pr{Ut = uk |Xt = xi , π(θ)} = µθ (xi )π(xi , uk , θ).

We will encounter the above formula again. Also, since Q̂θ is an unbiased estimate of Qθ , we can replace Q̂θ
by Qθ when we take the expectation (i.e., sum over xi and uk ). So when ω is chosen such that ∇ω J(ω) = 0
(i.e., at a stationary point of J(·)), we have that
Σ_{x_i ∈ X} Σ_{u_k ∈ U} µ_θ(x_i) π(x_i, u_k, θ) [f(x_i, u_k, ω) − Q_θ(x_i, u_k)] ∇_ω f(x_i, u_k, ω) = 0.   (5.77)

Now suppose that the Q̂_θ-approximating function f is "compatible" with the policy-approximating function π in the sense that

∇_ω f(x_i, u_k, ω) = (1/π(x_i, u_k, θ)) ∇_θ π(x_i, u_k, θ).                      (5.78)

One possibility is simply to choose f(·, ·, ω) to be linear in ω, with the gradient given by the right side of
(5.78). Now (5.78) can be rewritten as

π(x_i, u_k, θ) ∇_ω f(x_i, u_k, ω) = ∇_θ π(x_i, u_k, θ).

Hence, under (5.78), (5.77) implies that

Σ_{x_i ∈ X} Σ_{u_k ∈ U} µ_θ(x_i) ∇_θ π(x_i, u_k, θ) [f(x_i, u_k, ω) − Q_θ(x_i, u_k)] = 0.

Combining the above with (5.74) gives an alternate formula for the gradient of c^*(θ), namely

∇_θ c^*(θ) = Σ_{x_i ∈ X} Σ_{u_k ∈ U} µ_θ(x_i) ∇_θ π(x_i, u_k, θ) f(x_i, u_k, ω).   (5.79)

The significance of this formula is that the unknown function Qθ can be replaced by the approximating
function f (xi , uk , ω). So if we use a gradient search method (for example) to update θ t , we can use (5.79)
instead of (5.74). Note that this introduces a “coupling” between θ t and ω t . In other words, as the policy
approximation gets updated, so does the value approximation. This is a precursor to actor-critic methods
which we study next.

5.3.3 Actor-Critic Methods


In this subsection, we elaborate upon the policy gradient theorem to study simultaneous approximation of
both policy and value, often known as actor-critic methods. The actor is the policy, which is controlled by
a parameter θ ∈ Rd , while the critic is an approximator of the value corresponding to the policy, and is
controlled by a parameter ζ ∈ R^s. In [23, 22], which is the main reference for this subsection, it is assumed
that s > d, that is, there are more adjustable parameters in the critic than there are in the actor. Moreover,
the actor is of the form Ψθ while the critic is of the form Φζ. Not only does Φ have more columns than Ψ,
but the range of Ψ is a subset of the range of Φ. In other words, the critic's parameterization contains the
actor's parametrization as a subset.
The initial ideas in [23] are similar to those in [34], but with different notation. So we begin by restating
the first part of [23] in the current notation. If πθ is a policy, then under this policy not only is {Xt } a
Markov process, but so is the joint process {(Xt , Ut )}. If µθ denotes the stationary distribution of Xt , then
the joint distribution of (Xt , Ut ) is given by

Pr{Xt = xi , Ut = uk } = Pr{Xt = xi } Pr{Ut = uk |Xt = xi , πθ } = µθ (xi )π(xi , uk , θ) =: ηθ (xi , uk ). (5.80)

With this definition, (5.70) can also be written as

c^*(π(θ)) = Σ_{x_i ∈ X} Σ_{u_k ∈ U} η_θ(x_i, u_k) R(x_i, u_k) = E[R, η_θ].        (5.81)

Thus c∗ (θ) is the expected value of the reward R under the joint stationary distribution of (Xt , Ut ). Moreover,
Vθ as defined in (5.73) satisfies the Poisson equation
 
c^*(θ) + V_θ(x_i) = Σ_{u_k ∈ U} π(x_i, u_k, θ) [ R(x_i, u_k) + Σ_{x_j ∈ X} a_{ij}^{u_k} V_θ(x_j) ].   (5.82)

We continue to have the characterization (5.72) for Q_θ.

Now we make a few assumptions regarding the policy. Define the vector

ω(x_i, u_k, θ) = (1/π(x_i, u_k, θ)) ∇_θ π(x_i, u_k, θ).                           (5.83)

Note that its l-th component is

ω_l(x_i, u_k, θ) = ∂ ln π(x_i, u_k, θ) / ∂θ_l.

We can also write

∇_θ π(x_i, u_k, θ) = π(x_i, u_k, θ) ω(x_i, u_k, θ).
So (5.74) becomes

∇_θ c^*(θ) = Σ_{x_i ∈ X} Σ_{u_k ∈ U} µ_θ(x_i) π(x_i, u_k, θ) ω(x_i, u_k, θ) Q_θ(x_i, u_k)
           = Σ_{x_i ∈ X} Σ_{u_k ∈ U} η_θ(x_i, u_k) ω(x_i, u_k, θ) Q_θ(x_i, u_k).   (5.84)

Now we give a very brief description of the actor-critic methods in [23]. Suppose q1 , q2 are two functions
that map X × U into R. In other words, q1 , q2 are |X | · |U |-dimensional vectors. Suppose we define an inner
product hq1 , q2 iη as
hq1 , q2 iη = Σ_{uk∈U} Σ_{xi∈X} ηθ (xi , uk ) q1 (xi , uk ) q2 (xi , uk ).

This is similar to Lemma 5.4, where we defined an inner product making use of a stationary distribution.
Then
∂c∗ (θ)/∂θl = hqθ , ω^θ_l iη ,
where qθ is the function Qθ written out as a vector of dimension nm, and ω^θ_l ∈ Rnm is the vector whose
(xi , uk )-th component is ωl (xi , uk , θ), defined in (5.83). Hence, in order to compute this partial derivative,
it is not necessary to know the vector qθ ; it is enough to know its projection onto the space spanned by the
vectors ω^θ_1 , · · · , ω^θ_d . Now suppose we use a “linear”
parametrization of the type
π(θ) = Σ_{l=1}^d ψ_l θl ,
where each ψ_l ∈ R^{nm} and its components can be thought of as ψl (xi , uk ). Recall that π(θ) is a probability
distribution on U, so the set of policies consists of summations of the above form that are in S(U). One
possibility is to choose the columns to correspond to policies, and to restrict θ to vary over the unit simplex
in Rd , so that the policy πθ is a convex combination of the columns of Ψ, as opposed to a linear combination
of the columns. In any case, we have
∂π(θ)/∂θl = ψ_l ,
and as a result
ω_l = (1/π(θ)) ψ_l ∝ ψ_l .
Therefore the span of the vectors {ω l } is the same as the span of the vectors ψ l , that is, Ψ(Rd ). So all we
need to find is a projection of the nm-dimensional vector qθ onto Ψ(Rd ). The projection is computed using
the weighted inner product h·, ·iη defined above; nevertheless, the projection still belongs to Ψ(Rd ).
In [23], the authors slightly over-parametrize by taking
q̂θ = Σ_{l=1}^s φ_l ζl = Φζ,

as an approximation for the projection of q, where Φ ∈ Rnm×s and ζ ∈ Rs . Here s ≪ nm so that the above
approximation results in a reduction in the dimension of q. At the same time, one chooses s > d, so that ζ
has more parameters than θ. Moreover, one chooses the vectors in such a way that

span{ψ 1 , · · · , ψ d } ⊆ span{φ1 , · · · , φs }.

Therefore, the approximant q̂ may not belong to Ψ(Rd ). However, the projection of q̂ onto Ψ(Rd ) can be
used to determine the gradient of c∗ (θ).
Now we present the actor-critic methods in [23]. Unlike in earlier situations where we had a fixed policy,
here the policy also varies with t. Nevertheless, we will get a sample path {(Xt , Ut , Wt+1 )}. That sample
path is used below.
The critic tries to form an estimate of the average reward process. The equations are analogous to (5.45),
(5.46) and (5.47). There are two estimates, one for the value of the process c∗ (θ), and another for the
parameter ζt that gives an approximation q̂ζ t = Φζ t . These updates are as follows:

ct+1 = ct + βt+1 (Wt+1 − ct ),

ζ t+1 = ζ t + βt+1 δt+1 zt ,


where βt is a step size, and
δt+1 = Wt+1 − ct + Q̂ζ t (Xt+1 , Ut+1 ) − Q̂ζ t (Xt , Ut )
is the temporal difference. The eligibility vector zt is defined as follows: Let

yτ := [Φ(Xτ ,Uτ ) ]>

denote the (Xτ , Uτ )-th row of the matrix Φ, transposed. Then the TD(λ) update is defined via
zt = Σ_{τ=0}^t λ^{t−τ} yτ .

As before zt satisfies the recursion


zt = λzt−1 + yt .
Konda and Tsitsiklis study the case λ = 1, so that

zt = zt−1 + yt .

This is at the opposite end of the spectrum from TD(0), which is standard TD-learning.
For updating the policy, we choose a globally Lipschitz-continuous function Γ : Rs → R+ with the property
that, for some finite constant C > 0, we have
Γ(ζ) ≤ C/(1 + kζk), ∀ζ ∈ Rs .
Then the policy update is just a gradient update, given by

θ t+1 = θ t + αt+1 Γ(ζ t ) Q̂ζ t (Xt+1 , Ut+1 ) ω θt (Xt+1 , Ut+1 ).

Note that the update of θ t depends also on ζ t .


To study the convergence properties of the above actor-critic method, we make the following assumption
called “uniform positive definiteness”: For each θ ∈ Rd , the s × s matrix G(θ) satisfies
G(θ) = Σ_{xi∈X} Σ_{uk∈U} η(xi , uk , θ)[Φ(xi ,uk ) ]> Φ(xi ,uk ) ≥ ε Is ,

for some positive constant ε. With this assumption we can state the following theorem:

Theorem 5.6. Suppose that



Σ_{t=0}^∞ αt = ∞, Σ_{t=0}^∞ αt^2 < ∞, Σ_{t=0}^∞ βt = ∞, Σ_{t=0}^∞ βt^2 < ∞,

and in addition
αt /βt → 0 as t → ∞.
If {θ t } is bounded almost surely, then
lim_{t→∞} k∇θ c∗ (θ t )k = 0.

Note that the policy parameter θ t is updated more slowly than the value parameter ζ t .
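The following Python sketch illustrates the overall structure of these coupled updates on a randomly generated finite MDP. The softmax policy built on the actor features, the specific step-size sequences, and the use of λ = 0.9 in the eligibility trace (rather than the λ = 1 case analyzed in [23], which is typically combined with resetting the trace at visits to a reference state) are all simplifications made for illustration; they are not part of the development above.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, d, s = 5, 3, 4, 8                     # states, actions, actor and critic dimensions (s > d)
    P = rng.dirichlet(np.ones(n), size=(n, m))  # P[i, k, :] = transition probabilities under action k
    R = rng.uniform(size=(n, m))                # reward R(x_i, u_k)
    Psi = rng.normal(size=(n * m, d))           # actor features (columns psi_l)
    Phi = rng.normal(size=(n * m, s))           # critic features (columns phi_l)

    def policy(theta):                          # softmax policy built on the actor features
        logits = (Psi @ theta).reshape(n, m)
        z = np.exp(logits - logits.max(axis=1, keepdims=True))
        return z / z.sum(axis=1, keepdims=True)

    def omega(theta, i, k):                     # grad_theta log pi(x_i, u_k, theta) for this softmax policy
        pi_i = policy(theta)[i]
        Psi3 = Psi.reshape(n, m, d)
        return Psi3[i, k] - pi_i @ Psi3[i]

    Gamma = lambda zeta: 1.0 / (1.0 + np.linalg.norm(zeta))   # satisfies the growth condition with C = 1

    theta, zeta, c = np.zeros(d), np.zeros(s), 0.0
    z, lam = np.zeros(s), 0.9                   # eligibility trace; the text's case is lam = 1
    x = 0
    u = rng.choice(m, p=policy(theta)[x])
    for t in range(20000):
        alpha, beta = 1.0 / (1 + t) ** 0.9, 1.0 / (1 + t) ** 0.6   # alpha_t / beta_t -> 0
        x1 = rng.choice(n, p=P[x, u])
        w = R[x, u]                                                # observed reward W_{t+1}
        u1 = rng.choice(m, p=policy(theta)[x1])
        qhat = lambda i, k: Phi[i * m + k] @ zeta
        delta = w - c + qhat(x1, u1) - qhat(x, u)                  # temporal difference
        c += beta * (w - c)                                        # average-reward estimate
        z = lam * z + Phi[x * m + u]                               # eligibility vector update
        zeta += beta * delta * z                                   # critic update
        theta += alpha * Gamma(zeta) * qhat(x1, u1) * omega(theta, x1, u1)   # actor update
        x, u = x1, u1
    print(c)      # estimate of the average reward under the final policy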
Problem 5.4. Show that (5.62) is a consequence of (5.61).
Problem 5.5. Prove (5.66).

5.4 Zap Q-Learning


Chapter 6

Introduction to Empirical Processes

6.1 Concentration Inequalities


6.2 Vapnik-Chervonenkis and Pollard Dimensions
6.3 Uniform convergence of Empirical Means
6.4 PAC Learning
6.5 Mixing Stochastic Processes
6.5.1 Beta-Mixing Stochastic Processes
6.5.2 UCEM and PAC Learning with Beta-Mixing Inputs

Chapter 7

Finite-Time Bounds

In previous chapters, the convergence results presented are mostly asymptotic in nature. By and large, they
do not tell us what happens after a particular learning algorithm has been run a finite number of times. In
contrast, in the present chapter, several results are presented that are “finite time” in nature.

7.1 Finite Time Bounds on Regret


Suppose there are K machines (“bandits”) with random payoffs in [0, 1], with unknown means µ1 , · · · , µK ,
and unknown probability distributions. Each time a machine is played, it returns a payoff distributed
according to its unknown probability distribution. Moreover, each return is independent of all other returns
of the same machine. It is evident that, as a machine is played more and more times, the learner knows more
and more about the unknown probability distribution. In turn, this information can be used to make quasi-
informed decisions about which machine to play next. The objective is to develop a strategy for choosing
the next machine to play in such a way that an appropriate objective function is optimized. The objective
function studied in this section is the “regret” associated with a strategy. Define

µ∗ := max µi .
1≤i≤K

Thus µ∗ is the best possible return. If the learner knew which machine has the return µ∗ , then s/he would just
keep playing that machine, which would lead to the maximum possible expected return per play. On the
other hand, there is some amount of “exploration” of all the machines, during the course of which many
suboptimal choices will be made. The regret attempts to capture the return foregone, which is seen as the
cost of exploration. Specifically, suppose that some strategy for choosing the machine to be played at each
time generates a sequence of indices {Mt }t≥1 where each Mt lies between 1 and K. After n plays, let Ti (n)
denote the number of times that machine i is played. Where convenient, we also denote this as ni (n) or
just ni if n is obvious from the context. Now, the actual returns of machine i at these ni time instants are
random, and can be denoted by {Xi,1 , · · · , Xi,ni }. By playing machine i a total of ni times until time n, the
reward foregone is
µ∗ n − Σ_{i=1}^K µi ni .

However, this is a random number, because the number of plays ni is random. Therefore, corresponding to
a policy for choosing the next machine to be played, the regret is defined as
C = µ∗ n − Σ_{i=1}^K µi E[ni ]. (7.1)


There are several equivalent ways to express the regret. For example, define the empirical return of
machine i after n plays as
µ̂i := (1/ni ) Σ_{t=1}^{ni} Xi,t . (7.2)

Due to the assumption that the returns are i.i.d., it is obvious that E[µ̂i ] = µi . We can write
C = E[ Σ_{i=1}^K ni (µ∗ − µ̂i ) ].

In this section we follow [2]. We study two strategies, and derive upper bounds for the regret of each
strategy. In the second case, our result is a slight generalization of the contents of [2].
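To make the regret definitions concrete, the following Python sketch simulates K Bernoulli machines and computes the realized counterpart of (7.1) for a simple ε-greedy strategy. This strategy is only a placeholder for illustration; it is not one of the two strategies analyzed in this chapter.

    import numpy as np

    rng = np.random.default_rng(0)
    K, n_plays, eps = 5, 10000, 0.1
    mu = rng.uniform(size=K)                   # unknown means of the K machines
    mu_star = mu.max()

    counts = np.zeros(K, dtype=int)            # n_i: number of times machine i is played
    sums = np.zeros(K)                         # running sums of observed payoffs

    for t in range(n_plays):
        if t < K or rng.uniform() < eps:       # explore: play every arm once, then explore with prob. eps
            i = t % K if t < K else rng.integers(K)
        else:                                  # exploit: play the arm with the best empirical mean
            i = int(np.argmax(sums / counts))
        payoff = float(rng.uniform() < mu[i])  # Bernoulli payoff in [0, 1] with mean mu[i]
        counts[i] += 1
        sums[i] += payoff

    mu_hat = sums / counts                     # empirical returns (7.2)
    regret = mu_star * n_plays - (mu * counts).sum()   # realized version of (7.1); (7.1) averages over runs
    print(regret, counts)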

7.2 Finite Time Bounds for Reinforcement Learning


7.3 Probably Approximately Correct Markov Decision Processes
7.4 Unification of Regret and RL Bounds
7.5 Empirical Dynamic Programming
Chapter 8

Background Material

The objective of this chapter is to collect in one place the background material required to understand the
main body of the notes. While the exposition is rigorous, and several references are given throughout, these
notes are not, by themselves, sufficient to gain a mastery over these topics. A reader who is encountering
these topics for the first time is strongly encouraged to consult the various references in order to understand
the present discussion thoroughly. To avoid tedious repetition, we do not include the phrase “Introduction
to” in the title of each section, but that should be assumed.

8.1 Random Variables and Stochastic Processes


In this section we give a very cursory introduction to the topics of measure, probability, random variables, and
allied concepts. The topic can be said to have been started by Kolmogorov, and his very brief monograph [21]
gives a good motivation for the subject. For concepts from measure theory, the reader can consult [4] which
is a thorough treatment of these topics. Another good source with greater emphasis on probability theory
is [27]. Topics such as conditional expectation and martingales are briefly discussed in [6, 11]. However, a
more detailed introduction can be found in [48].

8.1.1 Random Variables


Definition 8.1. Suppose Ω is a set and that F is a collection of subsets of Ω. Then F is said to be a
σ-algebra1 if F satisfies the following axioms:

S1. Ω ∈ F.

S2. If A ∈ F then Ac ∈ F, where Ac denotes the complement of A in Ω.2

S3. If {Ai }i≥1 is any countable sequence of sets belonging to F, then



∪_{i=1}^∞ Ai ∈ F. (8.1)

The pair (Ω, F) is called a measurable space.

Definition 8.2. Suppose (Ω, F) is a measurable space. A function P : F → [0, 1] is called a probability
measure if it satisfies the following axioms:
1 The term σ-field is more popular, but this terminology is preferred here.
2 Note that [S1] and [S2] together imply that ∅ ∈ F .


P1. P (Ω) = 1.

P2. P is countably additive, that is: Whenever {Ai } are pairwise disjoint sets from F, we have that
P ( ∪_{i=1}^∞ Ai ) = Σ_{i=1}^∞ P (Ai ). (8.2)

The triple (Ω, F, P ) is called a probability space.

Note that if Ω is a finite or countable set, it is customary to take F to be the “power set” of Ω, that is,
the collection of all subsets of Ω, often denoted by 2Ω . If a nonnegative weight pi is assigned to each element
i ∈ Ω, and if P is defined as
P (A) = Σ_{i∈A} pi = Σ_{i∈Ω} pi I{i∈A} , (8.3)

then it is easy to verify that (Ω, 2Ω , P ) is a probability space. However, if Ω is an uncountable set, e.g., the
real numbers, then the above approach of assigning weights to individual elements does not work. Note that
in (8.3), I{i∈A} is the indicator function that equals 1 if i ∈ A and 0 if i ∉ A.

Definition 8.3. Suppose (Ω, F) and (X , G) are measurable spaces. Then a map f : Ω → X is said to be
measurable if f −1 (S) ∈ F for all S ∈ G.

Thus a map from Ω into X is measurable if the preimage of every set in G under f belongs to F.

Definition 8.4. Suppose (Ω, F, P ) is a probability space, and (X , G) is a measurable space. A function
X : Ω → X is said to be a random variable if it is measurable, that is,

X −1 (S) ∈ F, ∀S ∈ G. (8.4)

In such a case, for each set S ∈ G, the quantity

P (X −1 (S)) =: PX (S)

is called the probability that X ∈ S.

In the above definition, (X , G) is called the “event space,” and the sets belonging to the σ-algebra G are
called “events,” because each such set has a probability associated with it (via (8.4)). The triple (Ω, F, P )
is called the “sample space.”

Example 8.1. Suppose we wish to capture the notion of a two-sided coin that comes up H for heads 60% of
the time, and T for tails 40% of the time. In such a case, the event space (the set of possible outcomes) is just
X = {H, T }. Because the set X is finite, the corresponding σ-algebra G can be just 2X = {∅, {H}, {T }, X }.
The sample space (Ω, F, P ) can be anything, as can the map f : Ω → X , provided only that two conditions
hold: First,

f −1 ({H}) = {ω ∈ Ω : f (ω) = H} ∈ F, f −1 ({T }) = {ω ∈ Ω : f (ω) = T } ∈ F.

(Actually, either one of the conditions would imply the other.) Second,

P (f −1 ({H})) = 0.6, P (f −1 ({T })) = 0.4.

Definition 8.5. Suppose X is a random variable defined on the sample space (Ω, F, P ) taking values in
(X , G). Then the σ-algebra generated by X is defined as the smallest σ-algebra contained in F with
respect to which X is measurable, and is denoted by σ(X).

Example 8.2. Consider again the random variable studied in Example 8.1. Thus X = {H, T } and G = 2X .
Now suppose X is a measurable map from some (Ω, F, P ) into (X , G). Then all possible preimages of sets
in G are:
∅ = X −1 (∅), Ω = X −1 (X ), A := X −1 ({H}), B := X −1 ({T }) = Ac = Ω \ A.
Thus the smallest possible σ-algebra on Ω with respect to which X is measurable consists of {∅, A, Ac , Ω}.
Therefore this is the σ-algebra of Ω generated by X. Any other sets in F are basically superfluous. We can
carry this argument further and simply take the sample space Ω to be the same as the event space X , and X
to be the identity operator on (Ω, F, P ) into (Ω, F). Thus Ω = {H, T }, and F = {∅, {H}, {T }, Ω}. Further,
we can define P ({H}) = 0.6, P ({T }) = 0.4. This is sometimes called the canonical representation of the
random variable X. Usually we can do this whenever the event space is finite or countable.
Originally, the phrase “random variable” was used only for the case where the event space X = R, and
the σ-algebra is the so-called Borel σ-algebra, which is defined as the smallest σ-algebra of subsets of R
that contains all closed subsets of R. Random quantities such as the outcomes of coin-toss experiments were
called something else (depending on the author). Subsequently, the phrase “random variable” came to be
used for any situation where the outcome is uncertain, as defined above. In the context of Markov Decision
Processes, often the Markov process evolves over a finite set, and the action space is also finite. So a lot of
the heavy machinery above is not needed to describe the evolution of an MDP. However, in reinforcement
learning, the parameters of the MDP need to be estimated, including the reward, and these are real-valued
quantities. So it is desirable to deal with random variables that assume values in a continuum. When we
do that, the sample set Ω equals R or some subset thereof, and F equals the Borel σ-algebra. Finally, the
probability measure P is defined on Ω.

8.1.2 Independence, Joint and Conditional Probabilities


Kolmogorov, who laid down the foundations of probability theory, remarks on [21, p. 8] (in English transla-
tion) that
Historically, the independence of experiments and random variables represents the very mathe-
matical concept that has given the theory of probability its peculiar stamp.
This statement, together with the text that precedes it, can be paraphrased as: Without the concept of
independence, there is essentially no difference between measure theory and probability theory. Thus the
concept of independence is fundamental (and unique) to probability theory.
Definition 8.6. Suppose (Ω, F, P ) is a probability space. Then two events S, T ∈ F are said to be inde-
pendent if
P (S ∩ T ) = P (S)P (T ).
Suppose now that F1 , F2 are sub-σ-algebras of F. Then F1 and F2 are said to be independent if

P (S ∩ T ) = P (S)P (T ), ∀S ∈ F1 , T ∈ F2 . (8.5)

Two random variables X1 , X2 defined on (Ω, F, P ) are said to be independent if the corresponding σ-
algebras σ(X1 ), σ(X2 ) are independent.
It is easy to show that two events S, T are independent if and only if the corresponding σ-algebras
F1 = {∅, S, S c , Ω} and F2 = {∅, T, T c , Ω} are independent.
The extension of the above definition to any finite number of events, or σ-algebras, or random variables,
is quite obvious. For more details, see [11, Section 3.1] or [48, Chapter 4].
Until now we have discussed what might be called “individual” random variables. Now we discuss the
concept of joint random variables, and the associated notion of joint probability. The definition below is
for two joint variables, but it is obvious that a similar definition can be made for any finite number of joint
random variables. In turn this leads to the concept of conditional probability.

Definition 8.7. Suppose (X , G) and (Y, H) are measurable spaces. Then the product of these two spaces
is (X × Y, G ⊗ H) where G ⊗ H is the smallest σ-algebra of subsets of X × Y that contains all products of
the form S × T, S ∈ G, T ∈ H.
Note that G ⊗ H is called the “product” σ-algebra, and is not to be confused with G × H, the Cartesian
product of the two collections G and H. In fact one can write G ⊗ H = σ(G × H), where σ(S) denotes the
smallest σ-algebra containing all sets in the collection S. Previously we had defined σ(X), the σ-algebra
generated by a random variable X. The two usages are consistent. Suppose X is a random variable on
(Ω, F, P ) mapping Ω into (X , G), and let S consist of all preimages in Ω of sets in G. Then σ(X) and σ(S)
are the same.
Suppose (Ω, F, P ) is a probability space, and that (X , G) and (Y, H) are measurable spaces. Let (X ×
Y, G ⊗ H) denote their product. Suppose further that Z : Ω → X × Y is measurable and thus a random
variable taking values in X × Y. Express Z as (X, Y ) where X, Y are the components of Z, so that
X : Ω → X , Y : Ω → Y. Then it can be shown that X and Y are themselves measurable and are thus
random variables in their own right. The probability measures associated with these two random variables
are as follows:
PX (S) := P (Z −1 (S × Y)), ∀S ∈ G, PY (T ) := P (Z −1 (X × T )), ∀T ∈ H. (8.6)
We refer to Z = (X, Y ) as a joint random variable with joint probability measure PZ , and to PX and PY
as the marginal probability measures (or just marginal probabilities) of PZ for X and Y respectively.
Definition 8.8. Suppose (Ω, F, P ) is a probability space. Suppose (X , G) and (Y, H) are measurable spaces,
and let Z = (X, Y ) : Ω → X × Y be a joint random variable. Finally, suppose S ∈ G, T ∈ H are events
involving X and Y respectively. Then the conditional probability Pr{X ∈ S|Y ∈ T } is defined as
Pr{Z = (X, Y ) ∈ S × T } PZ (S × T )
Pr{X ∈ S|Y ∈ T } = = . (8.7)
Pr{Y ∈ T } PY (T )
Further, X and Y are said to be independent if
PZ (S × T ) = PX (S) × PY (T ), ∀S ∈ G, T ∈ H. (8.8)
In the definition of the conditional probability (8.7), it is assumed that PY (T ) > 0.
A common application of conditional probabilities arises when both X and Y are finite sets. In this
case X, Y, Z are random variables assuming values in finite sets X , Y, X × Y. Suppose to be specific that
X = {x1 , · · · , xn } and Y = {y1 , · · · , ym }. Then it is convenient to represent the joint probability distribution
of Z = (X, Y ) as an n × m matrix Θ, where
θij = Pr{Z = (xi , yj )} = Pr{X = xi &Y = yj }.
Let us denote the marginal probabilities as
φi = Pr{X = xi }, ψj = Pr{Y = yj }.
Then it is easy to infer that
φ> = Θ1m , ψ = 1>n Θ,
where 1k denotes a column vector of k ones. Note that we follow the convention that a probability distribution
is a row vector. Also, in this simple situation, it can be assumed without any loss of generality that φi > 0
for all i, and ψj > 0 for all j. If φi = 0 for some index i, it means that θij = 0 for all j; therefore the element
xi can be deleted from X without affecting anything. Similar remarks apply to ψ as well. With these
notational conventions, it is easy to see that
Pr{X ∈ S|Y ∈ T } = Pr{Z = (X, Y ) ∈ S × T } / Pr{Y ∈ T } = ( Σ_{xi∈S, yj∈T} θij ) / ( Σ_{yj∈T} ψj ).

All of the above definitions can be extended to more than two random variables.
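For the finite case just described, these quantities are easy to compute. The following Python sketch, with an arbitrary made-up joint matrix Θ, computes the marginals φ, ψ and a conditional probability Pr{X ∈ S|Y ∈ T }.

    import numpy as np

    Theta = np.array([[0.10, 0.20, 0.05],      # joint distribution of (X, Y), with n = 2, m = 3
                      [0.30, 0.25, 0.10]])
    assert abs(Theta.sum() - 1.0) < 1e-12

    phi = Theta.sum(axis=1)                    # marginal of X (row sums of Theta)
    psi = Theta.sum(axis=0)                    # marginal of Y (column sums of Theta)

    S = [1]                                    # event {X = x_2}
    T = [0, 2]                                 # event {Y in {y_1, y_3}}
    pr_X_given_Y = Theta[np.ix_(S, T)].sum() / psi[T].sum()
    print(phi, psi, pr_X_given_Y)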

8.1.3 Conditional Expectations


The concept of conditional probability discussed above can be applied to even “abstract” random variables,
that is, random variables assuming values in some abstract set. In contrast, concepts such as expected value
(both unconditional and conditional) are meant to be used with real-valued random variables. The ideas
extend readily to vector-valued random variables by applying them componentwise. The objective of this
subsection is to introduce these concepts. The discussion below requires an understanding of integration
with respect to a probability measure. We do not go into too many details regarding the abstract concept
of integration with respect to a measure, because that would be rather tangential to the main discussion.
Instead we refer the interested reader to [4] for details.
Throughout this subsection, we deal with real-valued random variables. For this purpose, on the set
R of real numbers, we define the Borel σ-algebra, which is denoted by B and consists of the smallest
σ-algebra of subsets of R that contains all closed sets.3 Thus, when we say that X is a real random variable
on (Ω, F, P ), we mean that X is a measurable map from (Ω, F, P ) to (R, B).
Note that a probability space can be thought of as a measure space where the underlying set (Ω) has
measure one. So in principle we can attempt to integrate the function X(ω), ω ∈ Ω using the measure P .
Therefore, if it exists, the quantity
E[X, P ] := ∫Ω X(ω)P (dω) (8.9)

is called the mean or the expected value of the random variable X.4 Next, for 1 ≤ p < ∞, we define the
function space Lp (Ω, P ) as the set of functions whose p-th powers are absolutely integrable, that is
Lp (Ω, P ) := { f : Ω → R s.t. ∫Ω |f (ω)|p P (dω) < ∞ }. (8.10)

The Lp -norm of a function f ∈ Lp (Ω, P ) is defined as


kf kp := ( ∫Ω |f (ω)|p P (dω) )^{1/p} . (8.11)

If p = ∞, we define L∞ (Ω, P ) to be the set of functions that are essentially bounded, that is, bounded
except on a set of measure zero, and define the corresponding norm as the “essential supremum” of f (·),
that is
kf k∞ = inf{c : P {|f (ω)| ≥ c} = 0}. (8.12)
Next, for a given p ∈ [1, ∞], define the conjugate index q ∈ [1, ∞] as the unique solution of the
equation
1/p + 1/q = 1. (8.13)
In particular, if p ∈ (1, ∞), then q = p/(p − 1). If p = 1, then q = ∞ and vice versa. Then Hölder’s
inequality [4, p. 113] states that if f ∈ Lp (Ω, P ) and g ∈ Lq (Ω, P ) where p and q are conjugate indices, then
the product f g ∈ L1 (Ω, P ), and
kf gk1 ≤ kf kp · kgkq , or ∫Ω |f (ω)g(ω)|P (dω) ≤ ( ∫Ω |f (ω)|p P (dω) )^{1/p} · ( ∫Ω |g(ω)|q P (dω) )^{1/q} . (8.14)

In particular, choosing p = q = 2 leads to Schwarz’ inequality, namely, if f, g ∈ L2 (Ω, P ), then f g ∈ L1 (Ω, P ),


and
kf gk1 ≤ kf k2 · kgk2 , or ∫Ω |f (ω)g(ω)|P (dω) ≤ ( ∫Ω |f (ω)|2 P (dω) )^{1/2} · ( ∫Ω |g(ω)|2 P (dω) )^{1/2} . (8.15)
3 Or open sets, or semi-open sets–they all generate the same σ-algebra.
4 Note that some authors also use the phrase “expectation” to mean “expected value.” In such a case, this phrase will be
doing double duty, first to denote the real number defined above, and second to denote the random variable defined in Definition
8.9 below. This dual usage is by now pervasive in the probability theory literature.

By using Hölder’s inequality and the fact that P (Ω) = 1, it is easy to show that

Lq (Ω, P ) ⊆ Lp (Ω, P ) whenever p < q. (8.16)

In particular, if a real random variable X is square-integrable, it is also absolutely integrable, and thus has
a well-defined mean. Moreover, with µ := E[X, P ], we can write
∫Ω [X(ω) − µ]2 P (dω) = ∫Ω X 2 (ω)P (dω) − µ2 =: V (X, P ),

often called the “variance” of X.


In the discussion below, we often deal with two random variables X and X ′ that differ only on a set of
measure zero, that is,
P {ω : X(ω) ≠ X ′ (ω)} = 0.
In such a case, we write X = X ′ a.e., or X = X ′ a.s..
The concept of a conditional expectation is defined next.
Definition 8.9. (See [11, Definition 4.16] or [48, Section 9.2].) Suppose (Ω, F, P ) is a probability space,
and that X is a real random variable with the additional property that X ∈ L1 (Ω, F, P ). Suppose that
G ⊆ F is another σ-algebra on Ω. Then the conditional expectation of X with respect to G, denoted by
E(X|G), is any random variable Y such that (i) Y is measurable with respect to (Ω, G), and (ii)
∫D X(ω)P (dω) = ∫D Y (ω)P (dω), ∀D ∈ G. (8.17)

Any two conditional expectations E(X|G) agree almost surely, and each is called a “version” of the conditional
expectation.
Note that E(X|G) is a (Ω, G)-measurable approximation to X such that, when restricted to sets in G,
E(X|G) is functionally equivalent to X, as stated in (8.17). Note that (8.17) can also be expressed as
∫Ω X(ω)ID (ω)P (dω) = ∫Ω Y (ω)ID (ω)P (dω), ∀D ∈ G,

where ID (·) is the indicator function of the set D.


To make the discussion below easier to follow, we employ the notation Y ∈ M(G) to indicate that Y
maps Ω into R, and is measurable with respect to (Ω, G) and (R, B).
In the above definition, it is not clear that such a conditional expectation exists. Any Y that satisfies
(8.17) is called a “version” in [48, 14]. The next theorem summarizes, without proof, some key properties of
the conditional expectation. These details can be found in [48, Chapter 9] and/or [14, Section 4.1].
Theorem 8.1. Suppose X ∈ L1 (Ω, F, P ) and that G ⊆ F is another σ-algebra on Ω. Then
1. (Existence) There is at least one Y ∈ M(G) such that (8.17) holds.
2. (Uniqueness) If Y, Y ′ ∈ M(G) both satisfy (8.17), then Y (ω) = Y ′ (ω) a.s..
3. (Expected Value Preservation) Every conditional expectation Y = E(X|G) belongs to L1 (Ω, F, P ).
Moreover
E[Y, P ] = E[X, P ], or ∫Ω Y (ω)P (dω) = ∫Ω X(ω)P (dω). (8.18)

4. (Self-Replication) If X ∈ M(G), then E(X|G) = X a.s..


5. (Idempotency) If p, q are conjugate indices, and Z ∈ Lq (Ω, G, P ), X ∈ Lp (Ω, F, P ), then

E[(ZX)|G] = ZE(X|G) a.s.. (8.19)



6. (Iterated Conditioning) If H ⊆ G ⊆ F are σ-algebras, then

E[E(X|G)|H] = E(X|H). (8.20)

7. (Linearity) If X1 , X2 ∈ L1 (Ω, F, P ) and a1 , a2 ∈ R, then

E[(a1 X1 + a2 X2 )|G] = a1 E(X1 |G) + a2 E(X2 |G) a.s.. (8.21)

8. (Nonnegativity) If X(ω) ≥ 0 a.s., then E(X|G)(ω) ≥ 0 a.s..


9. (Projection Property) If X ∈ L2 (Ω, F, P ) (and not just L1 (Ω, F, P )), then

E(X|G) = arg min_{Y ∈ L2 (Ω,G,P )} kY − Xk22 a.s.. (8.22)

Now we interpret some of the statements in the theorem. The obvious ones are not discussed. Item 3
states that the expected value of the conditional expectation is the same as the expected value of the original
random variable.5 Item 5 states that if X is multiplied by a bounded random variable Z ∈ M(G),6 then
the multiplier Z just passes through the conditional expectation operation. Item 6 states that if we were
to first take the conditional expectation of X with respect to G, and then take the conditional expectation
of the resulting random variable with respect to a smaller σ-algebra H, then the answer would be the same
as if we had directly taken the conditional expectation with respect to H. Note that this property is called
the “tower property” on [48, p. 88]. A ready consequence of Items 7 and 8 is that, if X1 ≥ X2 almost
surely, then E(X1 |G) ≥ E(X2 |G) almost surely. Finally, Item 9 states that if X belongs to the smaller space
L2 (Ω, F, P ) which is an inner product space, then its conditional expectation also belongs to L2 (Ω, F, P ), and
can be computed using the projection theorem.
Example 8.3. In this example, we illustrate the concept of a conditional expectation in a very simple case,
namely, that of a random variable assuming only finitely many values. Suppose X = {x1 , · · · , xn } and
V = {v1 , · · · , vm } are finite sets, and that Z = (X, V ) is a joint random variable assuming values in X × V.
Let Θ ∈ [0, 1]n×m denote the joint probability distribution of Z written out as a matrix, and let φ, ψ denote
the marginal probability distributions of X and V respectively, written out as row vectors. Finally, suppose
f : X × V → R is a given function. Then f (Z) is a real-valued random variable assuming values in some
finite set.
Because both X and V are finite-valued, we can use the canonical representation, and choose Ω = X × V,
F = 2Ω , and P = Θ. Now suppose we define G to be the σ-algebra generated by V alone. Thus G =
{∅, X } ⊗ 2V . Again, because f (X, V ) assumes only finitely many values over a finite set, it is a bounded
random variable. Therefore E(f |G) is the best approximation to f (X, V ) using a function of V alone. From
Item 9 of the theorem this conditional expectation can be determined using projections.
As pointed out after Definition 8.8, it can be assumed without loss of generality that every component
of ψ is positive. Therefore the ratio
θij /ψj = Pr{X = xi |V = vj }
is well-defined, though it could be zero.
In order to determine E(f |G), we should find a function g : V → R such that the error E[(f − g)2 , Θ] is
minimized. Let g1 , · · · , gm denote the values of g(·), and define the objective function
J = (1/2) Σ_{j=1}^m Σ_{i=1}^n (gj − fij )2 θij .
5 To streamline the notation wherever possible, we write E[X, P ] to denote the expected value, which is a real number, and

E(X|G) to denote the conditional expectation, which is a random variable.


6 Hereafter we follow the probabilists’ convention and say “bounded” when we actually mean “bounded except on a set of

measure zero,” that is, “essentially bounded.”



Then the objective is to choose the constants g1 , · · · , gm so as to minimize J. This happens when
0 = ∂J/∂gj = Σ_{i=1}^n (gj − fij )θij .

This expression can be rewritten as


0 = gj Σ_{i=1}^n θij − Σ_{i=1}^n fij θij = gj ψj − Σ_{i=1}^n fij θij ,
or
gj = Σ_{i=1}^n (θij /ψj ) fij = E[f (X, V )|V = vj ].
This formula explains the terminology “conditional expectation.” gj equals the expected value of f condi-
tioned on the event that V = vj . The same expression also shows that
E[g, ψ] = Σ_{j=1}^m gj ψj = Σ_{j=1}^m Σ_{i=1}^n fij θij = E[f, Θ].
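The calculation in Example 8.3 is easily carried out numerically. The following Python sketch, with randomly generated Θ and f , computes gj = E[f (X, V )|V = vj ] and verifies that E[g, ψ] = E[f, Θ].

    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 4, 3
    Theta = rng.dirichlet(np.ones(n * m)).reshape(n, m)   # joint distribution of (X, V)
    f = rng.normal(size=(n, m))                           # an arbitrary function f(x_i, v_j)

    psi = Theta.sum(axis=0)                               # marginal of V
    g = (Theta * f).sum(axis=0) / psi                     # g_j = sum_i (theta_ij / psi_j) f_ij

    print(g)                                              # conditional expectation E(f | V = v_j)
    print((g * psi).sum(), (f * Theta).sum())             # E[g, psi] equals E[f, Theta]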

For future use, we introduce definitions of what it means for a sequence of real-valued random variables to
converge. Three commonly used notions of convergence are convergence in probability, almost sure convergence,
and convergence in the mean. All are defined here.
Definition 8.10. Suppose {Xn }n≥0 is a sequence of real-valued random variables, and X ∗ is a real-valued
random variable, on a common probability space (Ω, F, P ). Then the sequence {Xn }n≥0 is said to converge
to X ∗ in probability if
P ({ω ∈ Ω : |Xn (ω) − X ∗ (ω)| > }) → 0 as n → ∞, ∀ > 0. (8.23)
The sequence {Xn }n≥0 is said to converge to X ∗ almost surely (or almost everywhere) if
P ({ω ∈ Ω : Xn (ω) → X ∗ (ω) as n → ∞}) = 1. (8.24)
Now suppose that Xn , X ∗ ∈ L1 (Ω, P ). Then the sequence {Xn }n≥0 is said to converge to X ∗ in the mean
if
kXn − X ∗ k1 → 0 as n → ∞. (8.25)
Note that, for any p ∈ (1, ∞), “convergence in the p-th mean” can be defined in the space Lp (Ω, P ), as
kXn − X ∗ kp → 0 as n → ∞. Also, the extension of Definition 8.10 to random variables assuming values in
a vector space Rd is obvious and is left to the reader.
The relationship between the various types of convergence is as follows:
Theorem 8.2. Suppose {Xn }, X ∗ are random variables defined on some probability space (Ω, F, P ). Suppose
Xn → X ∗ in probability as n → ∞. Then every subsequence of {Xn } contains a subsequence that converges
almost surely to X ∗ .
The converse of Theorem 8.2 is also true. See [4, Section 21, Problem 10(c)].
Theorem 8.3. Suppose {Xn }, X ∗ ∈ L1 (Ω, P ). Then
1. Xn → X ∗ a.s. implies that Xn → X ∗ in probability.
2. Xn → X ∗ in the mean implies that Xn → X ∗ in probability
3. Suppose there is a nonnegative random variable Z ∈ L1 (Ω, P ) such that |Xn | ≤ Z a.e., and suppose
that Xn → X ∗ a.s.. Then Xn → X ∗ in the mean.
These statements also apply to Rd -valued random variables.
Problem 8.1. Show that a consequence of Definition 8.1 is that ∅ ∈ F.
Problem 8.2. Show that a consequence of Definition 8.2 is that P (∅) = 0.

8.2 Markov Processes


In this section, we introduce the concept of Markov processes, which plays a central role in Reinforcement
Learning. As a prelude, we introduce the concept of a stochastic process.
Definition 8.11. Suppose (Ω, F, P ) is a probability space, and that (X , G) is a measurable space. A stochastic
process on (X , G) is a sequence of random variables {Xt }t≥0 where each Xt takes values in X .
For each finite index T , let X0T denote the tuple (X0 , . . . , XT ). We can view this as a random variable
taking values in the (T + 1)-fold product ( ∏_{i=0}^T X , ⊗_{i=0}^T G ), with its own probability distribution PX0T . In
principle, in order to talk about stochastic processes precisely, we should define the infinite product, but this
leads to too many technicalities. So we avoid that. As a result, there are minor imprecisions in the discussion
below. Two types of stochastic processes make their appearance in Reinforcement Learning, namely: Those
where Xt takes its values in some finite set, which could be an abstract set of labels, and those where Xt ∈ Rd
for some integer d. In this section, we deal only with stochastic processes where the “alphabet” X , which is
the set to which Xt belongs, is some finite set.

8.2.1 Markov Processes: Basic Properties


Suppose X is a set of finite cardinality, say X = {x1 , · · · , xn }, and suppose that {Xt }t≥0 is a stochastic
process assuming values in X , that is, {Xt }t≥0 is a sequence of random variables assuming values in X . Let
the symbol X0t denote the (finite) collection of random variables (X0 , · · · , Xt ).
Definition 8.12. The process {Xt }t≥0 is said to possess the Markov property, or to be a Markov
process, if
Pr{Xt+1 |X0t } = Pr{Xt+1 |Xt }, ∀t ≥ 0. (8.26)
Because all random variables assume values in the finite set X , we can make the abstract equation
(8.26) more explicit. Equation (8.26) is a shorthand for the following statement: Suppose u ∈ X and
(y0 , · · · , yt ) ∈ X t+1 are arbitrary. Then (8.26) is equivalent to

Pr{Xt+1 = u|X0t = (y0 , · · · , yt )} = Pr{Xt+1 = u|Xt = yt }, ∀u ∈ X , (y0 , · · · , yt ) ∈ X t+1 .

In other words, the conditional probability of the state Xt+1 depends only on the most recent value of Xt ;
adding information about the past values of Xτ for τ < t does not change the conditional probability. One
can also say that Xt+1 is independent of X0t−1 given Xt . This property is sometimes paraphrased as “the
future is conditionally independent of the past, given the present.”
A Markov process over a finite set X is completely characterized by its state transition matrix A,
where
aij := Pr{Xt+1 = xj |Xt = xi }, ∀xi , xj ∈ X .
Thus in aij , i denotes the current state and j the future state. The reader is cautioned that some authors
interchange the roles of i and j in the above definition. If the transition probability does not depend on t,
then the Markov process is said to be stationary; otherwise it is said to be nonstationary. We do not
deal with nonstationary Markov processes in these notes.
Note that aij ∈ [0, 1] for all i, j. Also, at any time t + 1, it must be the case that Xt+1 ∈ X , no matter
what Xt is. Therefore, the sum of each row of A equals one, i.e.,
Σ_{j=1}^n aij = 1, i = 1, . . . , n. (8.27)

The above equation can be expressed compactly as

A1n = 1n , (8.28)

Figure 8.1: Snakes and Ladders Game

where 1n denotes the column vector consisting of n ones. For future purposes, let us refer to a matrix
A ∈ [0, 1]n×n that satisfies (8.27) as a row-stochastic matrix, and denote by Sn×n the set of all row-
stochastic matrices of dimension n × n.
The matrix A is often called the “one-step” transition matrix, because row i of A gives the probability
distribution of Xt+1 if Xt = xi . So we can ask: What is the k-step transition matrix? In other words, what
is the probability distribution of Xt+k if Xt = xi ? It is not difficult to show that this conditional probability
is just the i-th row of Ak . Thus the k-step transition matrix is just Ak .
Example 8.4. A good example of a Markov process is the “snakes and ladders” game. Take for example the
board shown in Figure 8.1. In this case, we can let Xt denote an integer between 1 and 100, corresponding to
the square on which the player is. Thus X = {1, · · · , 100}. Suppose the player throws a four-sided die with
each of the outcomes (1, 2, 3, 4) being equally probable. Then the resulting sequence of positions {Xt }t≥0
is a stochastic process. Suppose for example that the player is on square 60. Note that what happens next
after a player has reached square 60 (or any other square) does not depend on how the player reached that
square. That is why the sequence of positions is a Markov process. Now, if the player is on square 60 so that
Xt = 60, then with probability of 1/4, the position at time t + 1 will be 61, 19 (snake on 62), 81 (ladder on
63) and 60 (snake on 64). Hence, in row 60 of the 100 × 100 state transition matrix, there are elements of 1/4
in columns 19, 60, 61, 81 and zeros in the remaining 96 columns. In the same manner, the entire 100 × 100
state transition matrix can be determined.
Let us suppose that the snakes and ladders game always starts with the player being in square 1. Thus
X0 is not random, but is deterministic, and the “probability distribution” of X0 , viewed as a row vector,
has a 1 in column 1 and zeros elsewhere. If we multiply this row vector by Ak for any integer k, we get the
probability distribution of the player’s position after k moves.
An application of the Gerschgorin circle theorem [18, Theorem 6.1.1] shows that, whenever A is row-
stochastic, the spectral radius ρ(A) ≤ 1. Moreover, the relationship (8.27) shows that λ = 1 is an eigenvalue
of A with column eigenvector 1n , so that in fact ρ(A) = 1. Thus one can ask: What does the row eigenvector
corresponding to λ = 1 look like? If there is a nonnegative row eigenvector µ ∈ Rn+ , then it can be scaled
so that µ1n = 1. Such a µ is called a stationary distribution of the Markov process, because if Xt has
the probability distribution µ, then so does Xt+1 . More generally, if X0 has the probability distribution µ,
then so does Xt for all t ≥ 0.
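Numerically, a stationary distribution can be obtained as a row eigenvector of A corresponding to the eigenvalue λ = 1, normalized so that its entries sum to one. A minimal Python sketch, for a small made-up row-stochastic matrix, is given below.

    import numpy as np

    A = np.array([[0.0, 0.5, 0.5],             # a small row-stochastic matrix
                  [0.5, 0.0, 0.5],
                  [0.5, 0.5, 0.0]])

    # Row eigenvector for eigenvalue 1: solve mu A = mu with mu 1_n = 1,
    # i.e. find the eigenvector of A^T for eigenvalue 1 and normalize it.
    eigvals, eigvecs = np.linalg.eig(A.T)
    k = np.argmin(np.abs(eigvals - 1.0))
    mu = np.real(eigvecs[:, k])
    mu = mu / mu.sum()
    print(mu)                                  # here mu = [1/3, 1/3, 1/3]
    print(mu @ A)                              # equals mu, as required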

Theorem 8.4. (See [5, Theorem 3.2, p. 8].) Every row-stochastic matrix A has a nonnegative row eigen-
vector corresponding to the eigenvalue λ = 1.

Note that Theorem 8.4 is a very weak statement. It states only that there exists a stationary distribution;
nothing is said about whether this is unique or not. To proceed further, it is helpful to make some assumptions
about A.

Definition 8.13. A row-stochastic matrix A is said to be irreducible if it is not possible to permute the
rows and columns symmetrically (via a permutation matrix Π) such that
Π−1 AΠ = [ B11 0 ; B21 B22 ].

Thus a row-stochastic matrix is irreducible if it is not possible to turn it into a block-triangular matrix
through symmetric row and column permutations. The notion of irreducibility plays a crucial role in the
theory of Markov processes. So it is worthwhile to give an alternate characterization of irreducibility.

Lemma 8.1. A row-stochastic matrix A is irreducible if and only if, for any pair of states ys , yf ∈ X , there
exists a sequence of states y1 , · · · , yl ∈ X such that, with y0 = ys and yl+1 = yf , we have that

ayk yk+1 > 0, k = 0, . . . , l.

Thus the matrix A is irreducible if and only if, for every pair of states ys and yf , there is a path from
ys to yf such that every step in the path has a positive probability. In such a case we can say that yf
is reachable from ys . There are several equivalent characterizations of irreducibility, and for nonnegative
matrices in general, not necessarily satisfying (8.27); see [46, Chapter 3]. Another useful reference is [5],
which is devoted entirely to the study of nonnegative matrices. One such characterization is given next.

Theorem 8.5. (See [46, Corollary 3.8].) A row-stochastic matrix A is irreducible if and only if
Σ_{l=0}^{n−1} Al > 0,

where A0 = I and the inequality is componentwise.

So we can start with M0 = I and define recursively Ml+1 = I + AMl . If Ml > 0 for any l, then A is
irreducible. If we get up to Mn−1 and this matrix is not strictly positive, then A is not irreducible.
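This recursion is straightforward to implement; a minimal Python sketch, with two small made-up matrices used only as test cases, is given below.

    import numpy as np

    def is_irreducible(A):
        """Check irreducibility of a row-stochastic matrix A via Theorem 8.5,
        using the recursion M_0 = I, M_{l+1} = I + A M_l."""
        n = A.shape[0]
        M = np.eye(n)
        for _ in range(n - 1):
            if np.all(M > 0):
                return True
            M = np.eye(n) + A @ M
        return bool(np.all(M > 0))

    A1 = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
    A3 = np.array([[1.0, 0.0], [0.5, 0.5]])    # reducible: state 1 cannot reach state 2
    print(is_irreducible(A1), is_irreducible(A3))   # True False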

Theorem 8.6. (See [46, Theorem 3.25].) Suppose A is an irreducible row-stochastic matrix. Then

1. λ = 1 is a simple eigenvalue of A.

2. The corresponding row eigenvector of A has all positive elements.

3. Thus A has a unique stationary distribution, whose elements are all positive.

4. There is an integer p, called the period of A, such that the spectrum of A is invariant under rotation
by exp(i2π/p).

5. In particular, exp(i2lπ/p), l = 0, · · · , p − 1 are all eigenvalues of A.

Now we introduce a concept that is stronger than irreducibility.

Definition 8.14. A row-stochastic matrix A is said to be primitive if there exists an integer l such that
Al > 0.

Definition 8.15. An irreducible row-stochastic matrix A is said to be aperiodic if λ = 1 is the only


eigenvalue of A with magnitude one.

Theorem 8.7. (See [46, Theorem 3.15].) A row-stochastic matrix A is primitive if and only if it is irreducible
and aperiodic.

Example 8.5. Suppose
A1 = [ 0 0.5 0.5 ; 0.5 0 0.5 ; 0.5 0.5 0 ], A2 = [ 0 1 0 ; 0 0 1 ; 1 0 0 ].
Then A1 is primitive, while A2 is irreducible but not primitive; it has a period p = 3.

In some situations, the following result is useful.

Theorem 8.8. (See [46, Lemma 4.12].) Suppose A is an irreducible row-stochastic matrix, and let µ denote
the corresponding stationary distribution. Then
lim_{T→∞} (1/T ) Σ_{t=0}^{T−1} At = 1n µ. (8.29)

Therefore, the average of I, A, · · · , AT −1 approaches the rank one matrix 1n µ. So, if φ is any probability
distribution on X , and the Markov process is started off with the initial distribution φ, then the distribution
of the state Xt is φAt . Note that, because φ is a probability distribution, we have that φ1n = 1. Therefore
(8.29) implies that
lim_{T→∞} (1/T ) Σ_{t=0}^{T−1} φAt = φ1n µ = µ, ∀φ. (8.30)

The above relationship holds for every φ and forms the basis for the so-called Markov chain Monte
Carlo (MCMC) algorithm. Suppose {Xt }t≥0 is a Markov process evolving over the state space X , with
an irreducible state transition matrix A and stationary distribution µ. Suppose further that f : X → R is a
real-valued function defined on the state space X . We wish to compute the expected value of the random
variable f (Xt ) with respect to the stationary distribution µ, namely
E[f (X), µ] = Σ_{xi∈X} f (xi )µi . (8.31)

While we may know A, often we may not know µ or may not wish to spend the effort to compute it due to
the high dimension of A. In such a case, we start off the Markov process with an arbitrary initial probability
distribution φ, let it run for some time t0 , and then compute the quantity
fˆT = (1/T ) Σ_{t=t0+1}^{t0+T} f (Xt ). (8.32)

Because this quantity is based on the observed state Xt which is random, fˆT is also random. However, the
expected value of fˆT is precisely E[f (X), µ]. Moreover, its sample-path average fˆT converges to E[f (X), µ]
as T → ∞, and is a good approximation for the expected value for finite T .
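A minimal Python sketch of this Markov chain Monte Carlo estimate is given below; the chain, the function f , the burn-in time t0 and the run length T are all arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)
    A = np.array([[0.1, 0.6, 0.3],             # an irreducible row-stochastic matrix
                  [0.4, 0.2, 0.4],
                  [0.3, 0.3, 0.4]])
    f = np.array([1.0, -2.0, 0.5])             # f(x_i) for the three states

    t0, T = 1000, 100000
    x = 0
    total = 0.0
    for t in range(t0 + T):
        x = rng.choice(3, p=A[x])              # simulate one step of the chain
        if t >= t0:
            total += f[x]
    f_hat = total / T                          # the estimate (8.32)

    # Compare with the exact value E[f(X), mu] computed from the stationary distribution.
    eigvals, eigvecs = np.linalg.eig(A.T)
    mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    mu = mu / mu.sum()
    print(f_hat, f @ mu)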
The next result is analogous to Theorem 8.8 for primitive matrices.

Theorem 8.9. (See [46, Corollary 4.13].) Suppose A is a primitive row-stochastic matrix, and let µ denote
the corresponding stationary distribution. Then

Al → 1n µ as l → ∞. (8.33)

Now we prove a couple of useful lemmas about irreducible and primitive matrices respectively. These are
useful when we study so-called Markov Decision Processes.
Theorem 8.10. Suppose A is a nontrivial convex combination of row stochastic matrices A1 , · · · , Ak , and
that at least one Ai is irreducible. Then A is irreducible.
Proof. Without loss of generality, write
A = Σ_{i=1}^k γi Ai ,

where γ1 > 0 and A1 is irreducible. Then


A^l ≥ γ1^l A1^l ∀l,
where the inequality holds componentwise, because all other “cross-product” terms in the expansion of Al
are nonnegative matrices. Because A1 is irreducible, it follows from Theorem 8.5 that
Σ_{l=0}^{n−1} A1^l > 0,

where again the inequality is componentwise. Combining this with the above inequality shows that

Σ_{l=0}^{n−1} A^l ≥ Σ_{l=0}^{n−1} γ1^l A1^l ≥ γ1^{n−1} Σ_{l=0}^{n−1} A1^l > 0.

Therefore A is irreducible.
Corollary 8.1. The set of irreducible matrices is convex.
Theorem 8.11. Suppose A is a nontrivial convex combination of row stochastic matrices A1 , · · · , Ak , and
that at least one Ai is primitive. Then A is primitive.
The proof is similar to that of Theorem 8.10, except that Theorem 8.5 is replaced by Definition 8.14.
Corollary 8.2. The set of primitive matrices is convex.

8.2.2 Stopping Times and Hitting Probabilities


The contents of this subsection are very useful in Section 4.1.1, when we study reinforcement learning using
“episodes.”
Definition 8.16. A state xi ∈ X is said to be an absorbing state if Xt = xi implies that Xt+1 = xi , or
equivalently, that Xτ = xi for all τ ≥ t. Another equivalent definition is that row i of the state transition
matrix A consists of a 1 in column i and zeros elsewhere. More generally, a subset S ⊆ X is said to be a set
of absorbing states if Xt ∈ S =⇒ Xτ ∈ S for all τ > t.
Now we illustrate the concepts of absorbing states, and of absorbing sets. For convenience, we change
notation slightly. Assume that the state space X of a Markov process can be partitioned as T ∪ S, where
T denotes the set of “transient” states, and S is an absorbing set. Suppose further that T = {x1 , · · · , xm },
and S = {a1 , . . . , as }. It is a ready consequence of Definition 8.16 that the state transition matrix M of the
Markov process has the form (note the change in notation):
M = [ A B ; 0 C ], (8.34)

where C ∈ Ss×s is a row stochastic matrix in itself, and the matrix B has at least one nonzero element. Note
too that the set S can be absorbing, even if no individual state in S is absorbing. For example, suppose C

is a permutation matrix over s indices. However, if C = Is , the identity matrix, then not only is the set S
absorbing, but every individual state in S is absorbing. In this case the matrix M looks like
 
A B
M= . (8.35)
0 Is

An illustration of an absorbing state is provided by the snakes and ladders game. If the player’s position
hits 100, then the game is over. So 100 is an absorbing state. In other games like Blackjack, there are two
absorbing states, namely W and L (for win and lose). In the Markov process literature, any sample path
X0l such that Xl is an absorbing state is called an episode.
It can be shown that if the state Xt of the Markov process enters the absorbing set S with probability
one as t → ∞, then B ≠ 0, that is, B contains at least one nonzero element, and further, ρ(A) < 1. See
specifically Items 3 and 6 of [46, Theorem 4.7]. More details can be found in [46, Section 4.2.2]. (Note that
notation in [46] is different.) For the purposes of RL, it is useful to go beyond these facts, and to compute
the probability distribution of the time at which the state trajectory enters S. In turn this gives the average
number of time steps needed to reach the absorbing set. In case there are multiple absorbing states, it is also
possible to compute the probability of hitting an individual absorbing state ai within the overall absorbing
set S. To be specific, define θiS to be the first time that a sample path {X0∞ } hits the set S, starting at
X0 = xi . Further, if M is of the form (8.35) so that each state in S is absorbing, define θik to be the first
time that a sample path {X0∞ } hits the absorbing state ak , starting at X0 = xi . Then we have the following
result:

Theorem 8.12. With the above notation, we have that

Pr{θiS = l} = ei> A^{l−1} B1s ∀l ≥ 1, (8.36)

where ei denotes the i-th elementary column vector with a 1 in row i and zeros elsewhere. If M has the form
(8.35), then for each k ∈ [s], we have

Pr{θik = l} = ei> A^{l−1} bk ∀l ≥ 1, (8.37)

where bk denotes the k-th column of B. The probability that a sample path X0∞ with X0 = xi terminates in
the absorbing state ak is given by
pik = ei> (I − A)^{−1} bk . (8.38)
Moreover,
Σ_{k=1}^s pik = 1, ∀i ∈ [m].

The vector of probabilities that a sample path X0∞ terminates in the absorbing state ak is given by

pk = (I − A)−1 bk . (8.39)

For each transient initial state xi ∈ T , define the average hitting time to reach the absorbing set S starting
from the initial state xi to be the expected value of θiS , that is

θ̄iS = Σ_{l=1}^∞ l Pr{θiS = l},

and the vector of average hitting times as θ̄ S ∈ Rm . Then

θ̄ S = (I − A)−1 B1s . (8.40)



Proof. We begin by deriving the expressions for the probability distributions. For each pair of indices
i, j ∈ [m] and each integer l, the value (Al )ij is the probability that, starting in state xi at time t = 0, the
state at time l equals xj , while staying within the set T . Thus the probability that θiS = l is given by
Pr{θiS = l} = Σ_{j=1}^m (A^{l−1})ij (B1s )j = ei> A^{l−1} B1s .

This is (8.36). If S consists of individual absorbing states, and we wish to determine the probability distri-
bution that Xl = ak given that X0 = xi , then we simply replace B1s by the corresponding k-th column of
B. This is (8.37). Equation (8.38) is obtained by observing that, since ρ(A) < 1, we have that

X
Al−1 = (I − A)−1 .
l=1

Therefore the probability that a trajectory starting at xi terminates in state ak is given by
Σ_{l=1}^∞ ei> A^{l−1} bk = ei> [ Σ_{l=1}^∞ A^{l−1} ] bk = ei> (I − A)^{−1} bk .

This is (8.38). Stacking these probabilities as i varies over [m] gives (8.39).
Next we deal with the hitting times. Define the vector b = B1s , and consider the modified Markov
process with the state transition matrix
M = [ A b ; 0 1 ].
In effect, we have aggregated the set of absorbing states into one “virtual state.” From the standpoint of
computing θ̄, this is permissible, because once the trajectory hits the set S, or the virtual “last state” in the
modified formulation, the time counter stops. To prove (8.40), suppose the Markov process starts in state
xi . Then there are two possibilities: First, with probability bi , the trajectory hits the last virtual state. In
this case the counter stops, and we can say that the hitting time is 1. Second, with probability aij for each
j, the trajectory hits the state xj . In this case, the hitting time is now 1 + θ̄j . Therefore we have
θ̄i = bi + Σ_{j=1}^m aij (1 + θ̄j ).

Observe however that


n
X
bi = 1 − aij .
j=1

Substituting in the previous equation gives


θ̄i = 1 + Σ_{j=1}^m aij θ̄j ,

or in matrix form
(I − A)θ̄ = 1m .
Clearly this is equivalent to (8.40).
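Formulas (8.38)–(8.40) are easy to evaluate numerically once M has been partitioned as in (8.34) or (8.35). A minimal Python sketch, with a small made-up pair (A, B), is given below; Example 8.6 below works out a larger instance explicitly.

    import numpy as np

    # Transient block A (m x m) and absorption block B (m x s) of a matrix M as in (8.35).
    A = np.array([[0.2, 0.3],
                  [0.4, 0.1]])
    B = np.array([[0.5, 0.0],
                  [0.2, 0.3]])
    m, s = B.shape
    assert np.allclose(np.hstack([A, B]).sum(axis=1), 1.0)   # rows of [A B] sum to one

    N = np.linalg.inv(np.eye(m) - A)           # (I - A)^{-1}
    P_abs = N @ B                              # absorption probabilities p_{ik}, as in (8.38)/(8.39)
    mean_hit = N @ B.sum(axis=1)               # average hitting times, as in (8.40)

    print(P_abs)                               # each row sums to one
    print(mean_hit)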
Example 8.6. Consider the “toy” snakes and ladders game with two extra states, called W and L for win
and lose respectively. The rules of the game are as follows:
• Initial state is S.

• A four-sided, fair die is thrown at each stage.


• Player must land exactly on W to win and exactly on L to lose.
• If implementing a move causes crossing of W and L, then the move is not implemented.
There are twelve possible states in all: S, 1, . . . , 9 , W , L. However, 2, 3, 9 can be omitted, leaving nine
states, namely S, 1, 4, 5, 6, 7, 8, W , L. At each step, there are at most four possible outcomes. For example,
from the state S, the four outcomes are 1, 7, 5, 4. From state 6, the four outcomes are 7, 8, 1, and W. From
state 7, the four outcomes are 8, 1, W, 7. From state 8, the four possible outcomes are 1, W , L and 8,
each with probability 1/4, because if the die comes up with 4, then the move cannot be implemented. It is
time-consuming but straightforward to compute the state transition matrix as

S 1 4 5 6 7 8 W L
S 0 0.25 0.25 0.25 0 0.25 0 0 0
1 0 0 0.25 0.50 0 0.25 0 0 0
4 0 0 0 0.25 0.25 0.25 0.25 0 0
5 0 0.25 0 0 0.25 0.25 0.25 0 0
6 0 0.25 0 0 0 0.25 0.25 0.25 0
7 0 0.25 0 0 0 0 0.25 0.25 0.25
8 0 0.25 0 0 0 0 0.25 0.25 0.25
W 0 0 0 0 0 0 0 1 0
L 0 0 0 0 0 0 0 0 1

The average duration of a game, which is the expected time before hitting one of the two absorbing states
W or L, is given by (8.40), and is
θ̄ = [ 5.5738, 5.4426, 4.7869, 4.9180, 3.9344, 3.1475, 3.1475 ]> ,
where the entries correspond to the starting states S, 1, 4, 5, 6, 7, 8 in that order.
To compute the probabilities of reaching the absorbing states W or L from any nonabsorbing state, define
A to be the 7 × 7 submatrix on the top left, and B to be the 7 × 2 submatrix on the top right. Then the
probabilities of hitting W and L are given by (8.39), and are
[PW PL ] = (I − A)^{−1} B =
[ 0.5433 0.4567
  0.5457 0.4543
  0.5574 0.4426
  0.5550 0.4450
  0.6440 0.3560
  0.5152 0.4848
  0.5152 0.4848 ].

Not surprisingly, the two columns add up to one in each row, showing that, irrespective of the starting
state, the sample path will surely hit either W or L. Also not surprisingly, the probability of hitting W
is maximum in state 6, because it is possible to win in one throw of the die, but impossible to lose in one
throw.
Problem 8.3. Suppose the last rule of the toy snakes and ladders game is modified as follows: If imple-
menting a move causes the player to go past L, then the player moves to L (and loses the game). With this
modification, compute the average length of a game and the probability of winning / losing from each state.

Figure 8.2: Toy Snakes and Ladders Game

8.2.3 Maximum Likelihood Estimate of Markov Processes


Suppose {Xt }t≥0 is a Markov process evolving over a finite state space (or alphabet) X = {x1 , . . . , xn },
with an unknown state transition matrix. We are able to observe a sample path y0l := {y0 , y1 , . . . , yl } of the
process, where each yi ∈ X . From this observation, we wish to determine the most likely state transition
matrix A, that is, the matrix A that maximizes the likelihood of the observed sample path. As it turns out,
the solution is very simple.
Suppose A is a row-stochastic matrix. In other words, A ∈ [0, 1]n×n and satisfies
Σ_{j=1}^n aij = 1, i = 1, . . . , n. (8.41)

For a given sample path y0l , the likelihood that this sample path is generated by a Markov process with state
transition matrix A is given by
L(y0l |A) = Pr{y0 } ∏_{t=1}^l Pr{Xt = yt |Xt−1 = yt−1 , A}
          = Pr{y0 } ∏_{t=1}^l a_{yt−1 yt} . (8.42)

The formula becomes simpler if we take the logarithm of the above. Clearly, maximizing the log-likelihood
of observing y0l is equivalent to maximizing the likelihood of observing y0l . Thus
LL(y0l |A) = log Pr{y0 } + Σ_{t=1}^l log a_{yt−1 yt} . (8.43)

A further simplification is possible. For each pair (xi , xj ) ∈ X 2 , let νij denote the number of times that the
string xi xj occurs (in that order) in the sample path y0l . Next, define
ν̄i := Σ_{j=1}^n νij . (8.44)

It is easy to see that, instead of summing over strings yt−1 yt , we can sum over strings xi xj . Thus yt−1 yt =
xi xj precisely νij times. Therefore
LL(y0l |A) = log Pr{y0 } + Σ_{i=1}^n Σ_{j=1}^n νij log aij . (8.45)

We can ignore the first term as it does not depend on A. Also, A needs to satisfy the stochasticity constraint
(8.28). So we want to maximize the right side of (8.45) (without the term log Pr{y0 }) subject to (8.41). For
this purpose we form the Lagrangian
J = Σ_{i=1}^n Σ_{j=1}^n νij log aij + Σ_{i=1}^n λi ( 1 − Σ_{j=1}^n aij ) ,

where λ1 , . . . , λn are the Lagrange multipliers. Next, observe that

∂J/∂aij = νij /aij − λi .

Setting the partial derivatives to zero gives


λi = νij /aij , or aij = νij /λi .

The value of λi can be determined from (8.41), which gives


Σ_{j=1}^n aij = (1/λi ) Σ_{j=1}^n νij = ν̄i /λi = 1 =⇒ λi = ν̄i .

Therefore the maximum likelihood estimate for the state transition matrix of a Markov process, based on
the sample path y0l , is given by
aij = νij /ν̄i . (8.46)
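The estimate (8.46) amounts to counting transitions along the observed sample path. A minimal Python sketch, including a sanity check against a known transition matrix, is given below.

    import numpy as np

    def mle_transition_matrix(path, n):
        """Maximum likelihood estimate (8.46) of the state transition matrix
        from a sample path of states in {0, ..., n-1}."""
        nu = np.zeros((n, n))
        for a, b in zip(path[:-1], path[1:]):
            nu[a, b] += 1                      # count occurrences of the string x_i x_j
        nu_bar = nu.sum(axis=1, keepdims=True)
        return nu / np.maximum(nu_bar, 1)      # rows of states never visited are left as zero

    # Sanity check: simulate a chain with a known A and compare.
    rng = np.random.default_rng(4)
    A = np.array([[0.9, 0.1], [0.3, 0.7]])
    path = [0]
    for _ in range(50000):
        path.append(rng.choice(2, p=A[path[-1]]))
    print(mle_transition_matrix(path, 2))      # close to A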

8.3 Contraction Mapping Theorem


In this section we introduce a very powerful theorem known as the contraction mapping theorem (also known
as the Banach fixed point theorem), which provides an iterative technique for solving noninear equations. It
holds in extremely general settings. We present a version that is sufficient for the present purposes.

Theorem 8.13. Suppose f : Rn → Rn and that there exists a constant ρ < 1 such that

kf (x) − f (y)k ≤ ρkx − yk, ∀x, y ∈ Rn , (8.47)

where k · k is some norm on Rn . Then there is a unique x∗ ∈ Rn such that

f (x∗ ) = x∗ . (8.48)

To find x∗ , choose an arbitrary x0 ∈ Rn and define xl+1 = f (xl ). Then {xl } → x∗ as l → ∞. Moreover, we
have the explicit estimate
kx∗ − xl k ≤ (ρ^l /(1 − ρ)) kx1 − x0 k. (8.49)

Proof. By definition, we have that

kxl+1 − xl k ≤ ρkxl − xl−1 k ≤ · · · ≤ ρl kx1 − x0 k. (8.50)



Suppose m > l, say m = l + r with r > 0. Then


kxm − xl k = kxl+r − xl k ≤ Σ_{i=0}^{r−1} kxl+i+1 − xl+i k
           ≤ Σ_{i=0}^{r−1} ρ^{l+i} kx1 − x0 k
           ≤ Σ_{i=0}^∞ ρ^{l+i} kx1 − x0 k
           = (ρ^l /(1 − ρ)) kx1 − x0 k. (8.51)

Therefore ‖xm − xl ‖ → 0 as min{m, l} → ∞. Such a sequence is called a Cauchy sequence. In Rn , a Cauchy sequence always converges to a limit. Denote this limit by x∗ , so that x∗ = liml→∞ xl . Now (8.47) makes it clear that the function f is continuous. Therefore

$$f(x^*) = \lim_{l \to \infty} f(x_l) = \lim_{l \to \infty} x_{l+1} = x^*.$$

Therefore x∗ satisfies (8.48). To show that x∗ is unique, suppose f (y ∗ ) = y ∗ . Then it follows from (8.47) that

$$\| x^* - y^* \| = \| f(x^*) - f(y^*) \| \leq \rho \| x^* - y^* \|.$$

Since ρ < 1, the only way in which the above inequality can hold is if ‖x∗ − y ∗ ‖ = 0, i.e., if x∗ = y ∗ . Finally, let m → ∞ in (8.51), so that xm → x∗ and ‖xm − xl ‖ → ‖x∗ − xl ‖. Then (8.51) becomes (8.49).

The bound (8.49) is extremely useful. Note that ‖x1 − x0 ‖ = ‖f (x0 ) − x0 ‖, so ‖x1 − x0 ‖ is a measure of how far the initial guess x0 is from being a fixed point of f . Then (8.49) gives an explicit estimate of how far x∗ is from xl , at each iteration l. Note that the bound on the right side of (8.49) decreases by a factor of ρ at each iteration.
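As an illustration of the iteration in Theorem 8.13, here is a minimal Python sketch (our own example; the affine map f (x) = M x + b is chosen only because its contraction constant and exact fixed point are easy to compute). It runs the iteration xl+1 = f (xl ) and compares the true error with the a priori bound (8.49).

import numpy as np

# A contractive affine map f(x) = M x + b on R^2 (illustrative choice of M and b).
M = np.array([[0.2, 0.1],
              [0.0, 0.3]])
b = np.array([1.0, -1.0])
rho = np.linalg.norm(M, 2)      # induced 2-norm of M; here rho < 1, so f is a contraction
assert rho < 1

def f(x):
    return M @ x + b

x = np.zeros(2)                                  # arbitrary initial guess x0
first_step = np.linalg.norm(f(x) - x)            # ||x1 - x0|| = ||f(x0) - x0||
x_star = np.linalg.solve(np.eye(2) - M, b)       # exact fixed point, for comparison only

for l in range(1, 21):
    x = f(x)                                     # x is now x_l
    bound = rho**l / (1 - rho) * first_step      # a priori estimate (8.49)
    error = np.linalg.norm(x_star - x)           # true error ||x* - x_l||
    print(f"l = {l:2d}   bound = {bound:.3e}   true error = {error:.3e}")

In every iteration the printed true error stays below the bound, and both shrink geometrically, consistent with the theorem.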

8.4 Some Elements of Lyapunov Stability Theory


