Stochastic Calculus, Filtering, and Stochastic Control
Lecture Notes
Spring 2007
Preface
These lecture notes were written for the course ACM 217: Advanced Topics in Stochas-
tic Analysis at Caltech; this year (2007), the topic of this course was stochastic calcu-
lus and stochastic control in continuous time. As this is an introductory course on the
subject, and as there are only so many weeks in a term, we will only consider stochas-
tic integration with respect to the Wiener process. This is sufficient to develop a large
class of interesting models, and to develop some stochastic control and filtering theory
in the most basic setting. Stochastic integration with respect to general semimartin-
gales, and many other fascinating (and useful) topics, are left for a more advanced
course. Similarly, the stochastic control portion of these notes concentrates on veri-
fication theorems, rather than the more technical existence and uniqueness questions.
I hope, however, that the interested reader will be encouraged to probe a little deeper
and ultimately to move on to one of several advanced textbooks.
I have no illusions about the state of these notes—they were written rather quickly,
sometimes at the rate of a chapter a week. I have no doubt that many errors remain
in the text; at the very least many of the proofs are extremely compact, and should be
made a little clearer as is befitting of a pedagogical (?) treatment. If I have another
opportunity to teach such a course, I will go over the notes again in detail and attempt
the necessary modifications. For the time being, however, the notes are available as-is.
If you have any comments at all about these notes—questions, suggestions, omis-
sions, general comments, and particularly mistakes—I would love to hear from you. I
can be contacted by e-mail at [email protected].
Required background. I assume that the reader has had a basic course in probabil-
ity theory at the level of, say, Grimmett and Stirzaker [GS01] or higher (ACM 116/216
should be sufficient). Some elementary background in analysis is very helpful.
Layout. The LaTeX layout was a bit of an experiment, but appears to have been
positively received. The document is typeset using the memoir package and the
daleif1 chapter style, both of which are freely available on the web.
Introduction
This course is about stochastic calculus and some of its applications. As the name
suggests, stochastic calculus provides a mathematical foundation for the treatment
of equations that involve noise. The various problems which we will be dealing with,
both mathematical and practical, are perhaps best illustrated by considering some sim-
ple applications in science and engineering. As we progress through the course, we
will tackle these and other examples using our newly developed tools.
Figure 0.1. Randomly generated sample paths x_t(N) for the Brownian motion model in the
text, with (from left to right) N = 20, 50, 500 collisions per unit time. The displacements ξ_n
are chosen to be random variables which take the values ±N^{−1/2} with equal probability.
has zero mean). Then at time t, the position x_t(N) of the pollen particle is given by

    x_t(N) = x_0 + ∑_{n=1}^{⌊Nt⌋} ξ_n.
We want to consider the limit where the number of bombardments N is very large,
but where every individual water molecule only contributes a tiny displacement to the
pollen particle—this is a reasonable assumption, as the pollen particle, while being
small, is extremely large compared to a single water molecule. To be concrete, let us
define a constant γ by var(ξ_n) = γN^{−1}. Note that γ is precisely the mean-square
displacement of the pollen particle per unit time:

    E((x_1(N) − x_0)^2) = var(∑_{n=1}^{N} ξ_n) = N var(ξ_n) = γ.
The physical regime in which we are interested now corresponds to the limit N → ∞,
i.e., where the number of collisions N is large but the mean-square displacement per
unit time γ remains fixed. Writing suggestively
    x_t(N) = x_0 + √(γt) · (∑_{n=1}^{⌊Nt⌋} Ξ_n) / √(Nt),

where Ξ_n = ξ_n √(N/γ) are i.i.d. random variables with zero mean and unit variance,
we see that the limiting behavior of xt (N ) as N → ∞ is described by the central limit
theorem: we find that the law of xt (N ) converges to a Gaussian distribution with zero
mean and variance γt. This is indeed the result of Einstein’s analysis.
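The plots in figure 0.1 are easy to reproduce. The following sketch (in Python with numpy, neither of which is assumed anywhere in these notes) generates sample paths of x_t(N) with ±N^{−1/2} displacements and checks numerically that var(x_1(N) − x_0) ≈ γ; the function names and parameter values are ours.

    import numpy as np

    def sample_path(N, gamma=1.0, T=1.0, x0=0.0, rng=None):
        """Simulate x_t(N) on [0, T]: a random walk with floor(N*T) i.i.d.
        displacements of variance gamma/N (here the values +/- sqrt(gamma/N))."""
        rng = np.random.default_rng() if rng is None else rng
        steps = rng.choice([-1.0, 1.0], size=int(N * T)) * np.sqrt(gamma / N)
        return x0 + np.concatenate(([0.0], np.cumsum(steps)))

    rng = np.random.default_rng(0)
    # Empirical check: the variance of x_1(N) - x_0 should be close to gamma.
    endpoints = np.array([sample_path(500, gamma=1.0, rng=rng)[-1] for _ in range(2000)])
    print(endpoints.var())   # approximately gamma = 1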
The limiting motion of the pollen particle as N → ∞ is known as Brownian mo-
tion. You can get some idea of what xt (N ) looks like for increasingly large N by
having a look at figure 0.1. But now we come to our first significant mathematical
problem: does the limit of the stochastic process t ↦ x_t(N) as N → ∞ even exist
in a suitable sense? This is not at all obvious (we have only shown convergence in
distribution for fixed time t), nor is the resolution of this problem entirely straightfor-
ward. If we can make no sense of this limit, there would be no mathematical model of
Brownian motion (as we have defined it); and in this case, these lecture notes would
come to an end right about here. Fortunately we will be able to make mathematical
sense of Brownian motion (chapter 3), which was first done in the fundamental work
of Norbert Wiener [Wie23]. The limiting stochastic process xt (with γ = 1) is known
as the Wiener process, and plays a fundamental role in the remainder of these notes.
where p and q are some positive constants. The first term in this expression is the
time-average (on some time interval [0, T ]) mean square distance of the particle from
the focus of the microscope: clearly we would like this to be small. The second term,
on the other hand, is the average power in the control signal, which should also not
be too large in any realistic application (our electric motor will only take so much).
The goal of the optimal control problem is to find the feedback strategy u_t which
minimizes the cost JT [u]. Many variations on this problem are possible; for example,
if we are not interested in a particular time horizon [0, T ], we could try to minimize
" Z # " Z #
1 T 1 T
J∞ [u] = p lim sup E (xt + zt )2 dt + q lim sup E u2 dt .
T →∞ T 0 T →∞ T 0 t
The tradeoff between the conflicting goals of minimizing the distance of the particle
to the focus of the microscope and minimizing the feedback power can be selected by
modifying the constants p, q. Optimal control theory further allows us to study
this tradeoff explicitly: for example, we can calculate the quantity
( " Z # " Z # )
1 T 1 T
C(U ) = inf lim sup E (xt + zt )2 dt : lim sup E u2 dt ≤ U ,
T →∞ T 0 T →∞ T 0 t
i.e., C(U ) is the smallest time-average tracking error that is achievable using con-
trols whose time-average power is at most U . This gives a fundamental limit on the
performance of our tracking system under power constraints. The solution of these
problems, in a much more general context, is the topic of chapter 6.
Figure 0.2. Market price of McDonald’s stock on the New York Stock Exchange (NYSE) over
the period 1999–2007 (upper plot) and 1970–2007 (lower plot). The stock price data for this
figure was obtained from Yahoo! Finance at finance.yahoo.com.
This is essentially a stochastic control problem, which was tackled in a famous paper
by Merton [Mer71]. A different class of questions concerns the pricing of options—a
sort of “insurance” issued on stock—and similar financial derivatives. The modern
theory of option pricing has its origins in the pioneering work of Black and Scholes
[BS73] and is an important problem in practice. There are many variations on these
and other topics, but they have at least one thing in common: their solution requires a
healthy dose of stochastic analysis.
For the time being, let us consider how to build a mathematical model for the stock
prices—a first step for further developments. Bachelier used Brownian motion for this
purpose. The problem with that model is that the stock prices are not guaranteed to
be positive, which is unrealistic; after all, nobody pays money to dispose of his stock.
Another issue to take into account is that even though this is not as visible on shorter
time scales, stock prices tend to grow exponentially in the long run: see figure 0.2.
Often this exponential rate will be larger than the interest rate we can obtain by putting
our money in the bank, which is a good reason to invest in stock (investing in stock
is not the same as gambling at a casino!) This suggests the following model for stock
prices, which is widely used: the price St at time t of a single unit of stock is given by
    S_t = S_0 exp( (µ − σ^2/2) t + σ W_t ),
where Wt is a Wiener process, and S0 > 0 (the initial price), µ > 0 (the return
rate), and σ > 0 (the volatility) are constants. A stochastic process of this form is
called geometric Brownian motion. Note that St is always positive, and moreover
E(St ) = S0 eµt (exercise: you should already be able to verify this!) Hence evidently,
on average, the stock makes money at rate µ. In practice, however, the price may
fluctuate quite far away from the average price, as determined by the magnitude of the
volatility σ. This means that there is a probability that we will make money at a
rate much faster than µ, but we can also make money slower or even lose money. In
essence, a stock with large volatility is a risky investment, whereas a stock with small
volatility is a relatively sure investment. When methods of mathematical finance are
applied to real-world trading, parameters such as µ and σ are often estimated from
real stock market data (like the data shown in figure 0.2).
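As a quick sanity check of the exercise above, one can estimate E(S_t) by Monte Carlo: since W_t is Gaussian with mean zero and variance t, samples of S_t can be drawn directly. A minimal sketch (Python with numpy; the parameter values are arbitrary illustrations, not taken from the text):

    import numpy as np

    rng = np.random.default_rng(1)
    S0, mu, sigma, t = 1.0, 0.1, 0.3, 2.0

    # W_t is Gaussian with mean 0 and variance t, so S_t can be sampled directly.
    W = rng.normal(0.0, np.sqrt(t), size=200_000)
    S = S0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

    print(S.mean())             # Monte Carlo estimate of E(S_t)
    print(S0 * np.exp(mu * t))  # the claimed value S_0 exp(mu t)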
Beside investing in the stock St , we will also suppose that we have the option of
putting our money in the bank. The bank offers a fixed interest rate r > 0 on our
money: that is, if we initially put R0 dollars in the bank, then at time t we will have
Rt = R0 exp(rt)
dollars in our bank account. Often it will be the case that r < µ; that is, investing in
the stock will make us more money, on average, than if we put our money in the bank.
On the other hand, investing in stock is risky: there is some finite probability that we
will make less money than if we had invested in the bank.
Now that we have a model for the stock prices and for the bank, we need to be
able to calculate how much money we make using a particular investment strategy.
Suppose that we start initially with a capital of X0 dollars. We are going to invest some
fraction θ_0 of this money in stock, and the rest in the bank (i.e., we put (1 − θ_0)X_0
dollars in the bank, and we buy θ_0 X_0/S_0 units of stock). Then at time t, our total
wealth X_t (in dollars) amounts to

    X_t = θ_0 X_0 e^{(µ−σ^2/2)t + σW_t} + (1 − θ_0) X_0 e^{rt}.
Now suppose that at this time we decide to reinvest our capital; i.e., we now invest a
fraction θt of our newly accumulated wealth Xt in the stock (we might need to either
buy or sell some of our stock to ensure this), and put the remainder in the bank. Then
at some time t′ > t, our total wealth becomes

    X_{t′} = θ_t X_t e^{(µ−σ^2/2)(t′−t) + σ(W_{t′}−W_t)} + (1 − θ_t) X_t e^{r(t′−t)}.
In principle, however, we should be able to decide at any point in time what fraction
of our money to invest in stock; i.e., to allow for the most general trading strategies
we need to generalize these expressions to the case where θt can vary continuously in
time. At this point we are not equipped to do this: we are missing a key mathematical
ingredient, the stochastic calculus (chapters 4 and 5). Once we have developed the
latter, we can start to pursue the answers to some of our basic questions.
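Even without stochastic calculus we can already simulate the wealth obtained from a strategy that is rebalanced only at a discrete set of times, simply by iterating the two-period expression above; the continuously rebalanced case is exactly what the machinery of chapters 4 and 5 will make precise. A rough sketch (Python with numpy; the strategy, names, and numbers are arbitrary choices of ours):

    import numpy as np

    def wealth(theta, X0, mu, sigma, r, dt, rng):
        """Iterate the two-period wealth expression from the text: on each interval
        of length dt a fraction theta[k] of the current wealth is held in stock and
        the remainder is put in the bank."""
        X = X0
        for th in theta:
            dW = rng.normal(0.0, np.sqrt(dt))
            X = th * X * np.exp((mu - 0.5 * sigma**2) * dt + sigma * dW) \
                + (1.0 - th) * X * np.exp(r * dt)
        return X

    rng = np.random.default_rng(2)
    theta = np.full(250, 0.5)   # keep half of the wealth in stock at every rebalancing time
    print(wealth(theta, X0=1.0, mu=0.08, sigma=0.2, r=0.03, dt=1 / 250, rng=rng))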
How, then, should we define white noise ξt in a meaningful way? Having learned
our lesson from the previous example, we might try to insist that the time average of
ξt is well defined. Inspired by AWGN, where in unit time the corrupting noise is a
zero mean Gaussian random variable (with unit variance, say), we could require that
the average white noise in unit time Ξ1 is a zero mean Gaussian random variable with
unit variance. We also want to insist that ξt retains its independence property: ξt and
ξ_s are i.i.d. for t ≠ s. This means, in particular, that
    ∫_0^{1/2} ξ_t dt   and   ∫_{1/2}^1 ξ_t dt
must both be Gaussian random variables with mean zero and variance 1/2 (after all,
their sum equals Ξ1 and they must be i.i.d.), etc. Proceeding along these lines (con-
vince yourself of this!), it is not difficult to conclude that
    ∫_0^t ξ_s ds = W_t   must be a Wiener process.
Hence we conjecture that the correct “definition” of white noise is: ξ t is the time
derivative dWt /dt of a Wiener process Wt . Unfortunately for us, the Wiener process
turns out to be non-differentiable for almost every time t. Though we cannot prove
it yet, this is easily made plausible. Recall that Wt is a Gaussian random variable
with variance t; to calculate dWt /dt at t = 0, for example, we consider Wt /t and
let t → 0. But W_t/t is a Gaussian random variable with variance t^{−1}, so clearly
something diverges as t → 0. Apparently, we are as stuck as before.
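The divergence is easy to observe numerically. The sketch below (ours; Python with numpy) simulates a Wiener path on a fine grid, which we may take on faith for the moment, and evaluates the difference quotients W_h/h for decreasing h: they fluctuate ever more wildly instead of settling down.

    import numpy as np

    rng = np.random.default_rng(3)
    n, T = 2**20, 1.0
    dt = T / n
    # A Wiener path on a fine grid, built from independent Gaussian increments.
    W = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), size=n))))

    for h in [1e-1, 1e-2, 1e-3, 1e-4]:
        k = int(round(h / dt))
        print(h, W[k] / h)   # difference quotient (W_h - W_0)/h: no sign of convergence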
Let us explore a little bit further. First, note that the covariance of the Wiener
process is³ E(W_s W_t) = s ∧ t. To see this, it suffices to note that for s ≤ t, W_t − W_s
and W_s are independent (why?), so E(W_s W_t) = E(W_s^2) + E(W_s(W_t − W_s)) = s.
Let us now formally compute the covariance of white noise:

    E(ξ_s ξ_t) = (d/dt)(d/ds) E(W_s W_t) = (d/dt)(d/ds) [(s + t − |t − s|)/2] = (d/dt) [(1 + sign(t − s))/2] = δ(t − s),
where δ(t) is the Dirac delta “function”. This is precisely the defining property of
white noise as it is used in the engineering literature and in physics. Of course, the
non-differentiability of the Wiener process is driven home to us again: as you well
know, the Dirac delta “function” is not actually a function, but a distribution (general-
ized function), an object that we could never get directly from the theory of stochastic
processes. So the bad news is that despite the widespread use of white noise,
a mathematical model for white noise does not exist,
at least within the theory of stochastic processes.4 Fortunately there is also some good
news: as we will see below and throughout this course,
3 The lattice-theoretic notation a ∧ b = min(a, b) and a ∨ b = max(a, b) is very common in the probability literature.
4 There does exist a theory of generalized stochastic processes, whose sample paths can be, for example, tempered distributions. In the class of generalized stochastic processes one can make sense of white noise (see [Hid80, HØUZ96], or [Arn74, sec. 3.2] for a simple introduction). This is not particularly helpful, however, in most applications, and we will not take this route.
almost anything we would like to do with white noise, including its ap-
plications in science and engineering, can be made rigorous by working
directly with the Wiener process.
For example, consider the transmission of a continuous-time signal a_t through white
noise ξ_t: i.e., the corrupted signal is formally given by x_t = a_t + ξ_t. This quantity is
not mathematically meaningful, but we can integrate both sides and obtain
    X_t = ∫_0^t x_s ds = ∫_0^t a_s ds + ∫_0^t ξ_s ds = ∫_0^t a_s ds + W_t.
The right-hand side of this expression is mathematically meaningful and does not
involve the notion of white noise. At least formally, the process X_t should contain
the same information as xt : after all, the latter is obtained from the former by formal
differentiation. If we want to estimate the signal at from the observations xt , we might
as well solve the same problem using Xt instead—the difference being that the latter
is a mathematically well-posed problem.
Why do we insist on using white noise? Just like in mathematics, true white noise
does not exist in nature; any noise encountered in real life has fairly regular sample
paths, and as such has some residual correlations between the value of the process at
different times. In the majority of applications, however, the correlation time of the
noise is very short compared to the other time scales in the problem: for example,
the thermal fluctuations in an electric wire are much faster than the rate at which we
send data through the wire. Similar intuition holds when we consider a dynamical
system, described by a differential equation, which is driven by noise whose random
fluctuations are much faster than the characteristic timescales of the dynamics (we will
return to this below). In such situations, the idea that we can approximate the noise
by white noise is an extremely useful idealization, even if it requires us to scramble a
little to make the resulting models fit into a firm mathematical framework.
The fact that white noise is (formally) independent at different times has far-
reaching consequences; for example, dynamical systems driven by white noise have
the Markov property, which is not the case if we use noise with a finite correlation
time. Such properties put extremely powerful mathematical tools at our disposal, and
allow us to solve problems in the white noise framework which would be completely
intractable in models where the noise has residual correlations. This will become
increasingly evident as we develop and apply the necessary mathematical machinery.
Tracking revisited
Let us return for a moment to the problem of tracking a diffusing particle through a
microscope. Previously we tried to keep the particle in the field of view by modifying
the position of the microscope slide based on our knowledge of the location of the par-
ticle. In the choice of a feedback strategy, there was a tradeoff between the necessary
feedback power and the resulting tracking error.
The following is an interesting variation on this problem. In biophysics, it is of sig-
nificant interest to study the dynamical properties of individual biomolecules—such
as single proteins, strands of DNA or RNA—in solution. The dynamical properties
Figure 0.3. Optimal feedback control strategies with full (left) and partial (right) information.
In the full information case, the control signal is a function of the state of the system. In the
partial information case, the observation process is first used to form an estimate of the state of
the system; the control signal is then a function of this estimate.
of these molecules provide some amount of insight into protein folding, DNA repli-
cation, etc. Usually, what one does is to attach one or several fluorescent dyes to the
molecules of interest. Using a suitably designed microscope, a laser beam is focused
on a dilute sample containing such molecules, and the fluorescent light is captured
and detected using a photodetector. The molecules perform Brownian motion in the
solution, and occasionally one of the molecules will randomly drift into the focus of
the laser beam. During the period of time that the molecule spends in the focus of the
beam, data can be collected which is subsequently analyzed to search for signatures of
the dynamical behavior of interest. The problem is that the molecules never stay in the
beam focus very long, so that not much data can be taken from any single molecule
at a time. One solution to this problem is to anchor the molecules to a surface so
that they cannot move around in solution. It is not clear, however, that such anchoring
does not itself significantly modify the dynamical properties of interest.
A different solution—you guessed it—is to try to follow the molecules around
in the solution by moving around the microscope slide (see [BM04] and references
therein). Compared to our previous discussion of this problem, however, there is now
an additional complication. Previously we assumed that we could see the position of
the particle under the microscope; this information determined how we should choose
the control signal. When we track a single molecule, however, we do not really “see”
the molecule; the only thing available to us is the fluorescence data from the laser,
which is inherently noisy (“shot noise”). Using a suitable modulation scheme [BM04],
we can engineer the system so that to good approximation the position data at our
disposal is given by yt = βxt + ξt , where ξt is white noise, xt is the distance of the
molecule to the center of the slide, and β is the signal-to-noise ratio. As usual, we
make this rigorous by considering the integrated observation signal
    Y_t = ∫_0^t y_s ds = ∫_0^t β x_s ds + W_t.
Our goal is now to minimize a cost function of the form J∞ [u], for example, but with
an additional constraint: our feedback strategy ut is only allowed to depend on the
Figure 0.4. Setup for the transmission of a message over a noisy channel. We get to choose the
encoder and the decoder; moreover, the encoder has access to the corrupted signal observed by
the decoder. How should we design the system to minimize the transmission error?
past observations, i.e., on the process Ys on the interval s ∈ [0, t]. Compared to our
previous discussion, where we were allowed to base our feedback strategy directly on
the particle position xs , we have apparently lost information, and this will additionally
limit the performance of our control strategy.
At first sight, one could expect that the optimal control strategy in the case of
observation-based feedback would be some complicated functional of the observation
history. Fortunately, it turns out that the optimal control has a very intuitive structure,
see figure 0.3. The controller splits naturally into two parts. First, the observations are
used to form an estimate of the state of the system (i.e., the position of the molecule).
Then the control signal is chosen as a function of this estimate. This structure is quite
universal, and is often referred to as the separation principle of stochastic control.
It also highlights the fact that filtering—the estimation of a stochastic process from
noisy observations—is intimately related with stochastic control. Filtering theory is
an interesting and important topic in its own right; it will be studied in detail in chapter
7, together with its connection to the control of systems with partial observations.
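To make the separation structure of figure 0.3 a little more concrete, here is a rough discrete-time analogue (our own sketch, not the continuous-time theory of chapter 7): the state is a controlled random walk, the observations are noisy as above, a Kalman filter turns the observations into an estimate, and the control is a function of that estimate only. All model parameters below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    beta, Q, R, c = 2.0, 0.01, 0.1, 0.5   # observation gain, noise variances, feedback gain

    x, xhat, P = 0.0, 0.0, 1.0            # true state, estimate, estimation variance
    for k in range(200):
        u = -c * xhat                     # the control depends on the data only through xhat
        x = x + u + rng.normal(0.0, np.sqrt(Q))        # controlled random walk (unobserved)
        y = beta * x + rng.normal(0.0, np.sqrt(R))     # noisy observation
        # Kalman filter: predict, then correct using the new observation
        xhat, P = xhat + u, P + Q
        K = P * beta / (beta**2 * P + R)
        xhat, P = xhat + K * (y - beta * xhat), (1.0 - K * beta) * P

    print(x, xhat)   # the estimate tracks the true (never directly observed) position

The structural point is exactly that of the right-hand diagram in figure 0.3: the observations enter only through the filter, and the controller acts on the estimate.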
After all, if we do not do this then we could transmit an arbitrarily strong signal
through the channel, and an optimal solution would be to transmit the message θ t
directly through the channel with an infinite signal-to-noise ratio. With a power con-
straint in place, however, there will be a fundamental limitation on the achievable
performance. We will see that these problems can be worked out in detail, for exam-
ple, if the message θt is modelled as a Gaussian process (see, e.g., [LS01b]).
The goal is then to choose ϑ to minimize J[ϑ], and the choice of p and q determines
the relative importance of achieving a low false alarm rate or a short detection delay.
Alternatively, we could ask: given that we tolerate a fixed false alarm rate, how should
we choose ϑ to minimize the detection delay? Note that these considerations are
very similar to the ones we discussed in the tracking problem, where the tradeoff was
between achieving good tracking performance and low feedback power. Indeed, in
many ways this type of problem is just like a control problem, except that the control
action in this case is the time at which we decide to intervene rather than the choice
of a feedback strategy. Similarly, the separation principle and filtering theory play an
important role in the solution of this problem (due to Shiryaev [Shi73]).
A related idea is the problem of hypothesis testing. Here we are given noisy data,
and our job is to decide whether there is a signal buried in the noise. To be more
precise, we have two hypotheses: under the null hypothesis H 0 , the observations have
the form yt = ξt where ξt is white noise; under the alternative hypothesis H1 there is
a given signal θt buried in the noise, i.e., yt = θt + ξt . Such a scenario occurs quite
often, for example, in surveillance and target detection, gravitational wave detection,
etc. It is our goal to determine whether the hypothesis H0 or H1 is correct after
observing y_t for some time ϑ. Once again we have conflicting interests; on the one hand,
we would like to make a pronouncement on the matter as soon as possible (ϑ should
be small), while on the other hand we would like to minimize the probability that we
obtain the wrong answer (we choose H0 while H1 is true, and vice versa). The rest of
the story is much as before; for example, we can try to find the hypothesis test which
takes the least time ϑ under the constraint that we are willing to tolerate at most some
given probability α of selecting the wrong hypothesis.
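In discrete time, a classical test of this type is Wald's sequential probability ratio test, which stops as soon as the accumulated log-likelihood ratio of H1 versus H0 leaves an interval determined by the error probabilities one is willing to tolerate. The following sketch (ours, in Python with numpy, for a constant signal θ and Gaussian noise with arbitrary numbers) is only meant to illustrate the idea:

    import numpy as np

    def sprt(observations, theta, alpha=0.05, beta=0.05):
        """Sequential test of H0: y_k = xi_k against H1: y_k = theta + xi_k, with
        xi_k i.i.d. standard normal. Returns the decision and the number of samples used."""
        A, B = np.log((1 - beta) / alpha), np.log(beta / (1 - alpha))
        L, n = 0.0, 0
        for y in observations:
            n += 1
            L += theta * y - 0.5 * theta**2    # increment of the log-likelihood ratio
            if L >= A:
                return "H1", n
            if L <= B:
                return "H0", n
        return "undecided", n

    rng = np.random.default_rng(5)
    theta = 0.5
    data = theta + rng.normal(size=10_000)     # data generated under H1
    print(sprt(data, theta))                   # typically decides H1 after a few dozen samples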
Both these problems are examples of optimal stopping problems: control prob-
lems where the goal is to select a suitable stopping time ϑ. Optimal stopping theory
has important applications not only in statistics, but also in mathematical finance and
in several other fields. We can also combine these ideas with more traditional con-
trol theory as follows. Suppose that we wish to control a system (for example our
favorite tracking system) not by applying continuous feedback, but by applying feed-
back impulses at a set of discrete times. The question now becomes: at which times
can we best apply the control, and what control should we apply at those times? Such
problems are known as impulse control problems, and are closely related to optimal
stopping problems. Optimal stopping and impulse control are the topics of chapter 8.
Stochastic differential equations
Almost all of the models we have encountered above ultimately involve a differential equation driven by noise, i.e., an equation of the form

    dx_t/dt = b(t, x_t) + σ(t, x_t) ξ_t,
where b and σ are given (sufficiently smooth) functions and ξt is white noise. Such
an equation could be interesting for many reasons; in particular, one would expect to
obtain such an equation from any deterministic model
    dx_t/dt = b(t, x_t) + σ(t, x_t) u_t,
where ut is some input to the system, when the input signal is noisy (but we will see
that there are some subtleties here—see below). If σ = 0, our stochastic equation is
just an ordinary differential equation, and we can establish existence and uniqueness
of solutions using standard methods (notably Picard iteration for the existence ques-
tion). When σ ≠ 0, however, the equation as written does not even make sense: that
infernal nuisance, the formal white noise ξt , requires proper interpretation.
Let us first consider the case where σ(t, x) = σ is a constant, i.e.,
    dx_t/dt = b(t, x_t) + σ ξ_t.
This additive noise model is quite common: for example, if we model a particle with
mass m in a potential V (x), experiencing a noisy force Ft = σξt (e.g., thermal noise)
and a friction coefficient k, then the particle’s position and momentum satisfy
    dx_t/dt = m^{−1} p_t,    dp_t/dt = −(dV/dx)(x_t) − k p_t + σ ξ_t.
How should we interpret such an equation? Let us begin by integrating both sides:
    x_t = x_0 + ∫_0^t b(s, x_s) ds + σ W_t,
where W_t is a Wiener process. Now this equation makes sense! We will say that the
stochastic differential equation

    dx_t = b(t, x_t) dt + σ dW_t

has a unique solution, if there is a unique stochastic process x_t that satisfies the associ-
ated integral equation. The differential notation dxt , etc., reminds us that we think of
this equation as a sort of differential equation, but it is important to realize that this is
just notation: stochastic differential equations are not actually differential equations,
but integral equations like the one above. We could never have a “real” stochastic
differential equation, because clearly xt cannot be differentiable!
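Even before the existence and uniqueness questions are settled, the integral equation suggests an obvious simulation recipe: discretize time, replace the integral by a left-endpoint sum, and replace W_t by a sum of independent Gaussian increments. A sketch of this Euler-type scheme (ours, in Python with numpy; the drift b(t, x) = −x is an arbitrary example):

    import numpy as np

    def euler_additive(b, sigma, x0, T, n, rng):
        """Approximate x_t = x_0 + int_0^t b(s, x_s) ds + sigma W_t on a grid of n steps,
        replacing the time integral by a left-endpoint sum and W by summed increments."""
        dt = T / n
        x = np.empty(n + 1)
        x[0] = x0
        for k in range(n):
            dW = rng.normal(0.0, np.sqrt(dt))
            x[k + 1] = x[k] + b(k * dt, x[k]) * dt + sigma * dW
        return x

    rng = np.random.default_rng(6)
    path = euler_additive(lambda t, x: -x, sigma=0.5, x0=1.0, T=5.0, n=5000, rng=rng)
    print(path[-1])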
For additive noise models, it is not difficult to establish existence and uniqueness
of solutions; in fact, one can more or less copy the proof in the deterministic case
(Picard iteration, etc.) However, even if we can give meaning to such an equation,
we are lacking some crucial analysis tools. In the deterministic theory, we have a key
tool at our disposal that allows us to manipulate differential equations: undergraduate
calculus, and in particular, that wonderful chain rule! Here, however, the chain rule
will get us in trouble; for example, let us naively calculate the equation for x_t^2:

    (d/dt) x_t^2  ?=  2 x_t (dx_t/dt) = 2 x_t b(t, x_t) + 2σ x_t ξ_t.
This is no longer an additive noise model, and we are faced with the problem of giving
meaning to the rather puzzling object
    ∫_0^t x_s ξ_s ds = ??
The resolution of this issue is key to almost all of the theory in this course! Once we
have a satisfactory definition of such an integral (chapter 4), we are in a position to
define general stochastic differential equations (chapter 5), and to develop a stochastic
calculus that allows us to manipulate stochastic differential equations as easily as
their deterministic counterparts. Here we are following in the footsteps of Kiyosi Itô
[Itô44], whose name we will encounter frequently throughout this course.
In chapter 4 we will define a new type of integral, the Itô integral
    ∫_0^t x_s dW_s,
which will play the role of a white noise integral in our theory. We will see that this
integral has many nice properties; e.g., it has zero mean, and will actually turn out to
be a martingale. We will also find a change of variables formula for the Itô integral,
just like the chain rule in ordinary calculus. The Itô change of variables formula,
however, is not the same as the ordinary chain rule: for example, for any f ∈ C^2,

    df(W_t) = f′(W_t) dW_t + (1/2) f″(W_t) dt,
while the usual chain rule would only give the first term on the right. This is not
surprising, however, because the ordinary chain rule cannot be correct (at least if we
insist that our stochastic integral has zero mean). After all, if the chain rule were
correct, the variance of Wt could be calculated as
    var(W_t) = E(W_t^2) = E( 2 ∫_0^t W_s dW_s ) = 0,
which is clearly untrue. The formula above, however, gives the correct answer
    var(W_t) = E(W_t^2) = E( 2 ∫_0^t W_s dW_s + ∫_0^t ds ) = t.
Evidently things work a little differently in the stochastic setting than we are used to;
but nonetheless our tools will be almost as powerful and easy to use as their determin-
istic counterparts—as long as we are careful!
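This variance computation can be checked numerically: approximating the Itô integral by left-endpoint sums over the increments of a simulated Wiener path, the quantity 2∫_0^t W_s dW_s averages to zero, while adding t recovers var(W_t). A minimal sketch (ours, Python with numpy):

    import numpy as np

    rng = np.random.default_rng(7)
    t, n, reps = 1.0, 1000, 10_000
    dt = t / n

    vals = np.empty(reps)
    for i in range(reps):
        dW = rng.normal(0.0, np.sqrt(dt), size=n)
        W = np.concatenate(([0.0], np.cumsum(dW)))
        vals[i] = 2.0 * np.sum(W[:-1] * dW)   # left-endpoint (Ito) sum approximating 2 int W dW

    print(vals.mean())        # close to 0: the Ito integral has zero mean
    print(vals.mean() + t)    # close to t = var(W_t), as in the corrected formula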
The reader is probably left wondering at this point whether we did not get a little
carried away. We started from the intuitive idea of an ordinary differential equation
driven by noise. We then concluded that we can not make sense of this as a true
differential equation, but only as an integral equation. Next, we concluded that we
didn’t really know what this integral is supposed to be, so we proceeded to make one
up. Now we have finally reduced the notion of a stochastic differential equation to a
mathematically meaningful form, but it is unclear that the objects we have introduced
bear any resemblance to the intuitive picture of a noisy differential equation.
where σ′(t, x) = dσ(t, x)/dx. The second term on the right is known as the Wong-
Zakai correction term, and our naive interpretation of stochastic differential equations
cannot account for it! Nonetheless it is not so strange that it is there. To convince
yourself of this, note that x^ε_t must satisfy the ordinary chain rule: for example,

    dx^ε_t/dt = A x^ε_t + B x^ε_t ξ^ε_t,    d(x^ε_t)^2/dt = 2A (x^ε_t)^2 + 2B (x^ε_t)^2 ξ^ε_t.
If we take the limit as ε → 0, we get using the Wong-Zakai correction term
    dx_t = (A + ½B^2) x_t dt + B x_t dW_t,    d(x_t)^2 = (2A + 2B^2)(x_t)^2 dt + 2B (x_t)^2 dW_t.
If the ordinary chain rule held for x_t as well, then we would be in trouble: the latter
equation has an excess term B^2(x_t)^2 dt. But the ordinary chain rule does not hold for
x_t, and the additional term in the Itô change of variables formula gives precisely the
additional term B^2(x_t)^2 dt. Some minor miracles may or may not have occurred, but
at the end of the day everything is consistent—as long as we are sufficiently careful!
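The appearance of the correction term is also easy to see numerically. In the sketch below (ours; Python with numpy, arbitrary constants), white noise is replaced by the piecewise-constant noise ξ^ε built from Brownian increments on a grid; on each grid interval the ODE dx/dt = (A + Bξ^ε)x can be solved exactly, and the sample mean of the result approaches x_0 e^{(A+B²/2)t}, not x_0 e^{At}.

    import numpy as np

    rng = np.random.default_rng(8)
    A, B, x0, t, m, reps = 0.5, 1.0, 1.0, 1.0, 500, 20_000

    # Piecewise-constant noise xi = dW_j / delta on each of m grid intervals; on such an
    # interval the ODE dx/dt = (A + B xi) x multiplies x by exp(A delta + B dW_j), so the
    # solution at time t is x0 exp(A t + B W_t).
    delta = t / m
    dW = rng.normal(0.0, np.sqrt(delta), size=(reps, m))
    x = x0 * np.exp(np.sum(A * delta + B * dW, axis=1))

    print(x.mean())                            # sample mean of the smoothed-noise ODE solution
    print(x0 * np.exp((A + 0.5 * B**2) * t))   # prediction with the Wong-Zakai correction
    print(x0 * np.exp(A * t))                  # what one would guess without the correction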
Regardless of how we arrive at our stochastic differential equation model—be it
through some limiting procedure, through an empirical modelling effort, or by some
other means—we can now take such an equation as our starting point and develop
stochastic control and filtering machinery in that context. Almost all the examples that
we have discussed require us to use stochastic differential equations at some point in
the analysis; it is difficult to do anything without these basic tools. If you must choose
to retain only one thing from this course, then it is this: remember how stochastic
calculus and differential equations work, because they are ubiquitous.
Basic Probability Theory
This chapter is about basic probability theory: probability spaces, random variables,
limit theorems. Much of this material will already be known to you from a previous
probability course. Nonetheless it will be important to formalize some of the topics
that are often treated on a more intuitive level in introductory courses; particularly the
measure-theoretic apparatus, which forms the foundation for mathematical probabil-
ity theory, will be indispensable. If you already know this material, you can skip to
chapter 2; if not, this chapter should contain enough material to get you started.
Why do we need the abstraction provided by measure theory? In your undergrad-
uate probability course, you likely encountered mostly discrete or real-valued random
variables. In the former case, we can simply assign to every possible outcome of a
random variable a probability; taking expectations is then easy! In the latter case, you
probably worked with probability densities, i.e.,
    Prob(X ∈ [a, b]) = ∫_a^b p_X(x) dx,    E(X) = ∫_{−∞}^{∞} x p_X(x) dx,    (1.0.1)
where pX is the density of the real-valued random variable X. Though both of these
are special cases of the general measure-theoretic framework, one can often easily
make do without the general theory.
Unfortunately, this simple form of probability theory will simply not do for our
purposes. For example, consider the Wiener process W_t. The map t ↦ W_t is not a
random number, but a random sample path. If we wanted to describe the law of this
random path by a probability density, the latter would be a function on the space of
continuous paths. But how can we then make sense of expressions such as eq. (1.0.1)?
What does it mean to integrate a function over the space of continuous paths, or to take
limits of such functions? Such questions have to be resolved before we can move on.
1.1. Probability spaces and events

Definition 1.1.5. A σ-algebra F is a collection of subsets of Ω with the following properties:
1. If {A_n} is a countable collection of sets in F, then ∪_n A_n ∈ F.
2. If A ∈ F, then its complement A^c = Ω\A ∈ F.
3. Ω ∈ F.
An element A ∈ F is called an (F-)measurable set or an event.
1 A curiosity: in quantum mechanics, this is not true—this is a major difference between quantum
probability and classical probability. Now that you have read this footnote, be sure to forget it.
The second and third condition are exactly as discussed above. In the first condi-
tion, we have allowed not only A or B?, but also A1 or A2 or A3 or . . .?, as long as
the number of questions An are countable. This is desirable: suppose, for example,
that Ω = N, and that {n} ∈ F for any n ∈ N (is it three? is it six? . . .); then it would
be a little strange if we could not answer the question {2n : n ∈ N} (is it an even
number?). Note that the fact that F is closed under countable intersections (ands)
follows from the definition: after all, ∩_n A_n = (∪_n A_n^c)^c.
Example 1.1.6. Let Ω be any set. Then the power set F = {A : A ⊂ Ω} (the
collection of all subsets of Ω) is a σ-algebra.
We can make more interesting σ-algebras as follows.
Definition 1.1.7. Let {Ai } be a (not necessarily countable) collection of subsets of
Ω. Then F = σ{Ai } denotes the smallest σ-algebra that contains every set Ai , and is
called the σ-algebra generated by {Ai }.
It is perhaps not entirely obvious that σ{Ai } exists or is uniquely defined. But
note that the power set contains all Ai ⊂ Ω, so that there exists at least one σ-algebra
that contains all A_i. For uniqueness, note that if {F_j} is a (not necessarily countable)
collection of σ-algebras, then ∩_j F_j is also a σ-algebra (check this!) So σ{A_i} is
uniquely defined as the intersection of all σ-algebras that contain all A_i.
Example 1.1.8. Let Ω = {1, 2, 3, 4, 5, 6}. Then the σ-algebra generated by {1} and
{4} is σ{{1}, {4}} = {∅, {1}, {4}, {1}^c, {4}^c, {1, 4}, {1, 4}^c, Ω}. Interpretation: if
I can answer the questions did we throw a one? and did we throw a four?, then I can
immediately answer all the questions in σ{{1}, {4}}. We think of σ{{1}, {4}} as
encoding the information contained in the observation of {1} and {4}.
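For a finite Ω one can even let a computer perform the closure: start from the generating sets, and keep adding complements and (pairwise, hence all finite) unions until nothing new appears. A small sketch (ours, in Python) reproduces the eight sets of example 1.1.8:

    from itertools import combinations

    def generated_sigma_algebra(omega, generators):
        """Smallest collection of subsets of a finite omega that contains the generators
        and is closed under complements and (finite) unions."""
        omega = frozenset(omega)
        sets = {frozenset(), omega} | {frozenset(g) for g in generators}
        while True:
            new = {omega - A for A in sets} | {A | B for A, B in combinations(sets, 2)}
            if new <= sets:
                return sets
            sets |= new

    F = generated_sigma_algebra({1, 2, 3, 4, 5, 6}, [{1}, {4}])
    print(len(F))                          # 8, as in example 1.1.8
    print(sorted(sorted(A) for A in F))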
This example demonstrates that even if our main σ-algebra is large—in the ex-
ample of throwing a die, one would normally choose the σ-algebra F of all sensible
questions to be the power set—it is natural to use subalgebras of F to specify what
(limited) information is actually available to us from making certain observations.
This idea is very important and will come back again and again.
Example 1.1.9. Let Ω be a topological space. Then σ{A ⊂ Ω : A is an open set} is
called the Borel σ-algebra on Ω, denoted as B(Ω).
When we work with continuous spaces, such as Ω = R or Ω = C([0, ∞[; R^3)
(with its natural topology of uniform convergence on compact sets), we will usually
choose the σ-algebra F of sensible events to be the Borel σ-algebra.
Remark 1.1.10. This brings up a point that has probably puzzled you a little. What
is all this fuss about “sensible” events (yes-no questions)? If we think of Ω as the
set of all possible fates of the system, then why should any event A ⊂ Ω fail to be
sensible? In particular, why not always choose F to be the power set? The answer
to this question might not be very satisfying. The fact of the matter is that, as was
learned the hard way, it is essentially impossible to build a consistent theory if F
contains too many sets. We will come back to this very briefly below and give a
slightly more satisfying answer. This also provides an excuse for another potentially
puzzling aspect: why do we only allow countable unions in the definition of the σ-
algebra, not uncountable unions? Note that if F had to be closed under uncountable
unions, and contained all individual points of Ω (surely a desirable state of affairs),
then F would be the power set and we would be in trouble. If you are interested
in this sort of thing, you will find plenty written about this in the literature. We will
accept it as a fact of life, however, that the power set is too large; fortunately, the Borel
σ-algebra is an extremely rich object and is more than sufficient for most purposes.
It remains to complete the final point on our agenda: we need to assign a proba-
bility to every event in F. Of course, this has to be done in a consistent way. If A
and B are two mutually exclusive events (A ∩ B = ∅), then it must be the case that
the probability of A or B? is the sum of the individual probabilities. This leads to the
following definition, which should look very familiar.
Definition 1.1.11. A probability measure is a map P : F → [0, 1] such that
1. For countable {A_n} s.t. A_n ∩ A_m = ∅ for n ≠ m, P(∪_n A_n) = ∑_n P(A_n).
2. P(∅) = 0, P(Ω) = 1.
The first property is known as countable additivity. It is fundamental to the inter-
pretation of the theory but also to its mathematical structure: this property will allow
us to take limits, and we will spend a lot of time taking limits in this course.
Definition 1.1.12. A probability space is a triple (Ω, F, P).
The simplest examples are the point mass and a finite probability space.
Example 1.1.13. Let Ω be any set and F be any σ-algebra. Fix some ω̃ ∈ Ω. Define
P as follows: P(A) = 1 if ω̃ ∈ A, and P(A) = 0 otherwise. Then P is a probability
measure, called the point mass on ω̃. Intuitively, this corresponds to the situation
where the fate ω̃ always happens (P is a “deterministic” measure).
Example 1.1.14. Let Ω be a finite set, and F be the power set of Ω. Then any proba-
bility measure on Ω can be constructed as follows. First, specify for every point ω ∈ Ω
a probability P({ω}) ∈ [0, 1], such that ∑_{ω∈Ω} P({ω}) = 1. We can now extend this
map P to all of F by using the additivity property: after all, any subset of Ω is the dis-
joint union of a finite number of sets {ω}, so we must have P(A) = ∑_{ω∈A} P({ω}).
This example demonstrates a basic idea: in order to define P, it is not necessary
to go through the effort of specifying P(A) for every element A ∈ F; it is usually
enough to specify the measure on a much smaller class of sets G ⊂ F, and if G is
large enough there will exist only one measure that is consistent with the information
provided. For example, if Ω is finite, then G = {{ω} : ω ∈ Ω} is a suitable class.
When Ω is continuous, however, specifying the probability of each point {ω} is
clearly not enough. Consider, for example, the uniform distribution on [0, 1]: the
probability of any isolated point {ω} should surely be zero! Nonetheless a similar
idea holds also in this case, but we have to choose G a little more carefully. For the
case when Ω = R and F = B(R), the Borel σ-algebra on R, the appropriate result
is stated as the following theorem. The proof of this theorem is far beyond our scope,
though it is well worth the effort; see [Bil86, Theorem 12.4].
Theorem 1.1.15. Let P be a probability measure on (R, B(R)). Then the function
F (x) = P(]−∞, x]) is nondecreasing, right-continuous, and F (x) → 0 as x → −∞,
F (x) → 1 as x → ∞. Conversely, for any function F (x) with these properties, there
exists a unique probability measure P on (R, B(R)) such that F (x) = P(] − ∞, x]).
You must have encountered the function F (x) in your introductory probability
course: this is the cumulative distribution function (CDF) for the measure P on R.
Theorem 1.1.15 forms a link between introductory probability, which centers around
objects such as F (x), and more advanced probability based on measure spaces.
In this course we will never need to construct probability measures directly on
more complicated spaces than R. As we will see, various techniques allow us to
construct more complicated probability spaces from simpler ones.
Example 1.1.16. The simple Gaussian probability space with mean µ ∈ R and vari-
ance σ^2 > 0 is given by (Ω, F, P) with Ω = R, F = B(R), and P is constructed
through Theorem 1.1.15 using F(x) = 1/2 + (1/2) erf((x − µ)/(σ√2)).
Remark 1.1.17. We can now say a little more about the discussion in remark 1.1.10.
Suppose we took Ω = R, say, and we took F to be the power set. What would go
wrong? It turns out that there do not exist any probability measures on the power set
of R such that P({x}) = 0 for all x ∈ R. This is shown by Banach and Kuratowski
[BK29]; for more information, see [Dud02, Appendix C] or [Bir67, sec. XI.7]. This
means that if we wanted to work with the power set, the probability mass could at best
concentrate only on a countable number of points; but then we might as well choose
Ω to be the set of those points, and discard the rest of R. The proof of Banach and
Kuratowski assumes the continuum hypothesis, so might be open to some mathemat-
ical bickering; but at the end of the day it seems pretty clear that we are not going to
be able to do anything useful with the power set. For us, the case is now closed.
1.2. Some elementary properties

Lemma 1.2.1. Let (Ω, F, P) be a probability space. Then
1. P(A^c) = 1 − P(A) for every A ∈ F.
2. A ⊂ B ∈ F =⇒ P(A) ≤ P(B).
3. For any countable collection {A_n} ⊂ F, P(∪_n A_n) ≤ ∑_n P(A_n).
4. A_1 ⊂ A_2 ⊂ · · · ∈ F =⇒ lim_{n→∞} P(A_n) = P(∪_n A_n).
5. A_1 ⊃ A_2 ⊃ · · · ∈ F =⇒ lim_{n→∞} P(A_n) = P(∩_n A_n).
Proof.
1. Ω = A ∪ Ac and A ∩ Ac = ∅, so 1 = P(Ω) = P(A) + P(Ac ).
2. B = A ∪ (B\A) and A ∩ (B\A) = ∅, so P(B) = P(A) + P(B\A) ≥ P(A).
3. Assume without loss of generality that n ∈ N ({A_n} is a sequence). We construct sets
{B_n} that are disjoint, i.e., B_n ∩ B_m = ∅ for m ≠ n, but such that ∪_k A_k = ∪_k B_k:
choose B_1 = A_1, B_2 = A_2\A_1, B_3 = A_3\(A_1 ∪ A_2), . . . But note that B_k ⊂ A_k for
any k. Hence P(B_k) ≤ P(A_k), and we obtain

    P(∪_k A_k) = P(∪_k B_k) = ∑_k P(B_k) ≤ ∑_k P(A_k).

4. With B_n as above, A_n = B_1 ∪ · · · ∪ B_n, so P(A_n) = ∑_{k≤n} P(B_k) → ∑_k P(B_k) = P(∪_k A_k).
5. Use ∩_n A_n = (∪_n A_n^c)^c and the previous result.
The rightmost set is called lim sup A_k, in analogy with the limit superior of a sequence
of numbers. We have thus established:

    {ω ∈ Ω : ω ∈ A_n i.o.} = ∩_{n≥1} ∪_{k≥n} A_k = lim sup A_k,

and its probability can be computed as lim_{n→∞} P(∪_{k≥n} A_k), where we have used that
∪_{k≥n} A_k is a nonincreasing sequence of sets (and lemma 1.2.1).
encounter notation such as P(X > 0), P(|Xn | > ε i.o.), etc. Such notation is very
intuitive, but keep in mind that this is actually short-hand notation for well-defined
mathematical objects: the probabilities of certain events in F.
We will take another notational liberty. If we make some statement, for example,
if we claim that X ∈ A (i.e., we claim to have proved that X ∈ A) or that |Xn | > ε
infinitely often, we generally mean that that statement is true with probability one,
e.g., P(X ∈ A) = 1 or P(|Xn | > ε i.o.) = 1. If we wanted to be precise, we would
say explicitly that the statement holds almost surely (abbreviated as a.s.). Though
sets of probability zero do not always play a negligible role (see section 2.4), we are
ultimately only interested in proving results with unit probability, so it is convenient
to interpret all intermediate statements as holding with probability one.
Now you might worry (as well you should!) that this sort of sloppiness could get
us in big trouble; but we claim that as long as we make only countably many almost
sure statements, we have nothing to worry about. You should revisit Corollary 1.2.2
at this point and convince yourself that this logic is air-tight.
It is not always entirely trivial to prove that a map is measurable. The following
simple facts are helpful and not difficult to prove; see, for example, [Wil91, Ch. 3].
Lemma 1.3.3. Let (Ω, F, P) be a probability space and (S, S) be a measurable space.
1. If h : Ω → S and f : S → S′ are measurable, then f ◦ h is measurable.
2. If {hn } is a sequence of measurable functions hn : Ω → S, then inf n hn ,
supn hn , lim inf n hn , and lim supn hn are measurable.
One could use this as a method to generate a σ-algebra on Ω, if we did not have
one to begin with, starting from the given σ-algebra S. However, usually this concept
is used in a different way. We start with a probability space (Ω, F, P) and consider
some collection {Xi } of random variables (which are already F-measurable). Then
σ{Xi } ⊂ F is the sub-σ-algebra of F which contains precisely those yes-no ques-
tions that can be answered by measuring the Xi . In this sense, σ{Xi } represents the
information that is obtained by measuring the random variables X i .
Example 1.3.5. Suppose we toss two coins, so we model Ω = {HH, HT, T H, T T },
F is the power set of Ω, and we have some measure P which is irrelevant for this
discussion. Suppose we only get to observe the outcome of the first coin flip, i.e., we
We will not need this lemma in the rest of this course; it is included here to help
you form an intuition about measurability and generated σ-algebras. The point is
that if {Xi } is a collection of random variables and X is σ{Xi }-measurable, then
you should think of X as being a function of the Xi . It is possible to prove analogs
of lemma 1.3.6 for most situations of interest (even when the collection {Xi } is not
finite), if one so desires, but there is rarely a need to do so.
The rest of this section is devoted to the concept of expectation. For a random
variable that takes a finite number of values, you know very well what this means: it
is the sum of the values of the random variable weighted by their probabilities.
Definition 1.3.7. Let (Ω, F, P) be a probability space. A simple random variable
X : Ω → R is a random variable that takes only a finite number of values, i.e.,
X(Ω) = {x_1, . . . , x_n}. Its expectation is defined as E(X) = ∑_{k=1}^{n} x_k P(X = x_k).
Remark 1.3.8. Sometimes we will be interested in multiple probability measures on
the same σ-algebra F (P and Q, say). The notation E(X) can then be confusing: do
we mean ∑_k x_k P(X = x_k) or ∑_k x_k Q(X = x_k)? Whenever necessary, we will
denote the former by EP and the latter by EQ to avoid confusion. Usually, however, we
will be working on some fixed probability space and there will be only one measure
of interest P; in that case, it is customary to write E(X) to lighten the notation.
We want to extend this definition to general random variables. The simplest ex-
tension is to the case where X does not take a finite number of values, but rather a
countable number of values. This appears completely trivial, but there is an issue here
of the elementary calculus type: suppose that X takes the values {x_k}_{k∈N} and we de-
fine E(X) = ∑_{k=1}^{∞} x_k P(X = x_k). It is not obvious that this sum is well behaved: if
x_k is an alternating sequence, it could well be that the series ∑_{k=1}^{n} x_k P(X = x_k) is
not absolutely convergent and the expectation would thus depend on the order of sum-
mation! Clearly that sort of thing should not be allowed. To circumvent this problem
we introduce the following definition, which holds generally.
Definition 1.3.9. Let us define X + = max(X, 0) and X − = − min(X, 0), so that
X = X + − X − . The expectation E(X) is defined only if either E(X + ) < ∞ or
E(X − ) < ∞. If this is the case, then by definition E(X) = E(X + ) − E(X − ).
As such, we should concentrate on defining E(X) for nonnegative X. We have
got this down for simple random variables and for random variables with countable
values; what about the general case? The idea here is very simple. For any nonnega-
tive random variable X, we can find a sequence Xn of simple random variables that
converges to X; actually, it is most convenient to choose Xn to be a nondecreasing
sequence X_n ↗ X so that E(X_n) is guaranteed to have a limit (why?).
Definition 1.3.10. Let X be any nonnegative random variable. Then we define the
expectation E(X) = limn→∞ E(Xn ), where Xn is any nondecreasing sequence of
simple random variables that converges to X.
It remains to prove (a) that we can find such a sequence Xn ; and (b) that any
such sequence gives rise to the same value for E(X). Once these little details are
established, we will be convinced that the definition of E(X) makes sense. If you are
already convinced, read the following remark and then skip to the next section.
Remark 1.3.11. The idea of approximating a function by a piecewise constant func-
tion, then taking limits should look very familiar—remember the Riemann integral?
In fact, the expectation which we have constructed really is a type of integral, the
Lebesgue integral with respect to the measure P. It can be denoted in various ways:
    E(X) ≡ ∫ X(ω) P(dω) ≡ ∫ X dP.
Unlike the Riemann integral we can use the Lebesgue integral to integrate functions
on very strange spaces: for example, as mentioned at the beginning of the chapter,
we can integrate functions on the space of continuous paths—provided that we can
construct a suitable measure P on this space.
When Ω = Rd , F = B(Rd ) and with a suitable choice of measure µ (instead of
P), the Lebesgue integral can actually serve as a generalization of the Riemann inte-
gral (it is a generalization because the Riemann integral can only integrate continuous
functions, whereas the Lebesgue integral can integrate measurable functions). The
Lebesgue measure µ, however, is not a probability measure: it satisfies all the condi-
tions of Definition 1.1.11 except µ(Ω) = 1 (as µ(Rd ) = ∞). This does not change
much, except that we can obviously not interpret µ probabilistically.
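A concrete way to visualize the construction is the standard dyadic approximation (presumably the construction used in lemma 1.3.12): truncate X at level n and round down to a multiple of 2^{−n}. The sketch below (ours, Python with numpy) shows E(X_n), computed here under the empirical measure of a large sample, creeping up to E(X) for an exponential random variable with known mean.

    import numpy as np

    rng = np.random.default_rng(9)
    X = rng.exponential(scale=1.0, size=1_000_000)   # nonnegative, with E(X) = 1

    def dyadic_approx(x, n):
        """Simple random variable X_n = min(floor(2^n X) / 2^n, n); X_n increases to X."""
        return np.minimum(np.floor(x * 2**n) / 2**n, n)

    for n in [1, 2, 4, 8]:
        print(n, dyadic_approx(X, n).mean())  # E(X_n) creeps up towards E(X) = 1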
Lemma 1.3.13. Let X ≥ 0, and let {X_n} and {X̃_n} be two sequences of simple ran-
dom variables s.t. X_n ↗ X and X̃_n ↗ X. Then lim_{n→∞} E(X_n) = lim_{n→∞} E(X̃_n).
Proof. It suffices to prove that E(X̃k ) ≤ limn→∞ E(Xn ) for any k. After all, this implies
that limk→∞ E(X̃k ) ≤ limn→∞ E(Xn ), and inequality in the reverse direction follows by
reversing the roles of Xn and X̃n . To proceed, note that as X̃k is simple, it takes a finite
number of values x_1, . . . , x_ℓ on the sets A_i = X̃_k^{−1}(x_i). Define, for arbitrary ε > 0, the sets
B_i^n = {ω ∈ A_i : X_n(ω) ≥ x_i − ε}; as X_n ↗ X ≥ X̃_k, we have B_i^n ↗ A_i, so P(B_i^n) → P(A_i)
(lemma 1.2.1). Hence

    E(X_n) ≥ ∑_{i=1}^{ℓ} (x_i − ε) P(B_i^n)  =⇒  lim_{n→∞} E(X_n) ≥ E(X̃_k) − ε.

As ε > 0 was arbitrary, the claim follows.
Proof. The main idea is to prove that these results are true for simple random variables (this is
easily verified), then take appropriate limits.
1. First assume X, Y ≥ 0. Apply lemma 1.3.12 to X, Y ; this gives two sequences Xn , Yn
of simple functions with X_n ↗ X, Y_n ↗ Y, and X_n = Y_n a.s. for all n (why?). It is
immediate that E(Xn ) = E(Yn ) for all n, so the result follows by letting n → ∞. Now
drop the assumption X, Y ≥ 0 by considering separately X + , Y + and X − , Y − .
2. Same idea.
3. Same idea.
4. Use −|f | ≤ f ≤ |f | and that X ≤ Y implies E(X) ≤ E(Y ).
5. Suppose X is not finite a.s.; then on some set A ∈ F with P(A) > 0 we have X = ∞
or −∞ (we can not have both, as then E(X) would not be defined). It follows from the
definition of the expectation that E(X + ) = ∞ or E(X − ) = ∞, respectively (why?).
6. Suppose that P(X > 0) > 0. We claim that there is an ε > 0 s.t. P(X > ε) > 0.
Indeed, the sets A_ε = {ω ∈ Ω : X(ω) > ε} increase in size with decreasing ε, so
P(A_ε) ↗ P(A_0) = P(X > 0) > 0 (remember lemma 1.2.1?), and thus there must
exist a positive ε with P(A_ε) > 0. But then E(X) ≥ E(εI_{A_ε}) = εP(A_ε) > 0 (here
IA (ω) = 1 if ω ∈ A, 0 otherwise) which contradicts the assumption.
Next, let us treat two elementary inequalities: Chebyshev’s inequality (often called
Markov’s inequality) and Jensen’s inequality. These inequalities are extremely useful:
do not leave home without them! In the following, we will often use the notation
    I_A(ω) = 1 if ω ∈ A,  and  I_A(ω) = 0 otherwise   (A ∈ F).
Chebyshev's inequality states that, for any random variable X and any α > 0,

    P(|X| ≥ α) ≤ E(|X|)/α.
Jensen's inequality, which states that E(g(X)) ≥ g(E(X)) for any convex function g, is quite
fundamental; it says, for example, something that you know very well: the variance
var(X) = E(X^2) − E(X)^2 is always nonnegative (as x^2 is convex).
Recall that the expectation of X is defined when E(X + ) < ∞ or E(X − ) < ∞.
The most useful case, however, is when both these quantities are finite: i.e., when
E(|X|) = E(X + ) + E(X − ) < ∞. In this case, X is said to be integrable. It is not
necessarily true, e.g., that an integrable random variable has a finite variance; we need
to require a little more for this. In fact, there is a hierarchy of regularity conditions.
Definition 1.4.3. For a random variable X and p ≥ 1, let ‖X‖_p = (E(|X|^p))^{1/p}.
A random variable with ‖X‖_1 < ∞ is called integrable, with ‖X‖_2 < ∞ square
integrable. A random variable is called bounded if there exists K ∈ R such that
|X| ≤ K a.s.; the quantity ‖X‖_∞ is by definition the smallest such K.
Remark 1.4.4. Almost all the material which we have discussed until this point has
had direct intuitive content, and it is important to understand the intuition and ideas be-
hind these concepts. Integrability conditions are a little less easy to visualize; though
they usually have significant implications in the theory (many theorems only hold
when the random variables involved satisfy ‖X‖_p < ∞ for some sufficiently large p),
they certainly belong more to the technical side of things. Such matters are unavoid-
able and, if you are a fan of analysis, can be interesting to deal with in themselves (or
a pain in the butt, if you will). As we progress through this course, try to make a dis-
tinction for yourself between the conceptual challenges and the technical challenges
that we will face (though sometimes these will turn out to be intertwined!)
The spaces L^p = L^p(Ω, F, P) = {X : Ω → R; ‖X‖_p < ∞} play a fundamental
role in functional analysis; on L^p, ‖ · ‖_p is almost a norm (it is not a norm, because
‖X‖_p = 0 implies that X = 0 a.s., not X(ω) = 0 for all ω) and L^p is almost a
Banach space; similarly, L^2 is almost a Hilbert space. Functional analytic arguments
would give an extra dimension to several of the topics in this course, but as they are
not prerequisite for the course we will leave these ideas for you to learn on your own
(if you do not already know them). Excellent references are [RS80] and [LL01]. Here
we will content ourselves by stating the most elementary results.
Proposition 1.4.5. Define L^p as the space of random variables X with ‖X‖_p < ∞.
1. L^p is linear: if α ∈ R and X, Y ∈ L^p, then X + Y ∈ L^p and αX ∈ L^p.
2. If X ∈ L^p and 1 ≤ q ≤ p, then ‖X‖_q ≤ ‖X‖_p (so L^p ⊂ L^q).
3. (Hölder's inequality) Let p^{−1} + q^{−1} = 1. If X ∈ L^p and Y ∈ L^q, then
|E(XY)| ≤ ‖X‖_p ‖Y‖_q (so XY ∈ L^1).
4. (Minkowski's inequality) If X, Y ∈ L^p, then ‖X + Y‖_p ≤ ‖X‖_p + ‖Y‖_p.
Proof.
1. Linearity follows immediately from |x + y|^p ≤ (2 |x| ∨ |y|)^p ≤ 2^p (|x|^p + |y|^p).
2. We would like to prove E(|X|^p) = ‖X‖_p^p ≥ ‖X‖_q^p = E(|X|^q)^{p/q}. But this follows
directly from convexity of x^{p/q} on [0, ∞[ and Jensen's inequality.
3. We can restrict ourselves to the case where X and Y are nonnegative and kXkp > 0
(why?) For any A ∈ F, define Q(A) = E(IA X p )/E(X p ). Then Q is also a probability
measure on F (this idea is fundamental, and we will come back to it later on). Define
Z(ω) = Y (ω)/X(ω)p−1 wherever X(ω) > 0, and Z(ω) = 0 otherwise. But then E(XY ) = E(X p Z) = E(X p ) EQ (Z) ≤ E(X p ) (EQ (Z q ))1/q ≤ E(X p )1/p E(Y q )1/q = kXkp kY kq , where we have used Jensen's inequality under Q and (p − 1)q = p.
Figure 1.1. The sequence {Xn } of example 1.5.3. Shown is the path n 7→ Xn (ω) with ω = a1
(blue) and ω = a2 (red) for n = 1, . . . , 10. Both paths occur with equal probability.
These are the only implications that hold in general. Though this proposition is
very useful in practice, you will perhaps get the most intuition about these modes of
convergence by thinking about the following counterexamples.
Figure 1.2. The random variables X1 , X2 , X4 , X8 of example 1.5.4. The horizontal axis
represents Ω = [0, 1], the vertical axis the value of Xn (ω). Note that the probability that Xn is
nonzero shrinks to zero, but the value of Xn when it is nonzero becomes increasingly large.
Example 1.5.3 (Convergence in law but not in probability). It is easy to find coun-
terexamples for this case; here is one of the simplest. Let Ω = {a1 , a2 }, F is the
power set, and P is the uniform measure (P({a1 }) = 1/2). Define the random vari-
able X(a1 ) = 1, X(a2 ) = −1, and consider the sequence Xn = (−1)n X. Ob-
viously this sequence can never converge in probability, a.s., or in Lp . However,
E(f (Xn )) = E(f (X)) for any f , so Xn → X in law (and also Xn → −X in law!)
Evidently this type of convergence has essentially no implication for the behavior of
the random process Xn ; certainly it does not look anything like convergence if we
look at the paths n 7→ Xn (ω) for fixed ω! (See figure 1.1). On the other hand, this is
precisely the notion of convergence used in the central limit theorem.
The following three examples use the following probability space: Ω = [0, 1],
F = B([0, 1]), and P is the uniform measure on [0, 1] (under which P([a, b]) = b − a).
Example 1.5.4 (Convergence a.s. but not in Lp ). Consider the sequence of random
variables Xn (ω) = n I]0,1/n] (ω) (with ω ∈ Ω = [0, 1]). Then Xn → 0 a.s.: for any
ω, it is easy to see that I]0,1/n] (ω) = 0 for n sufficiently large. However, kXn k1 =
n E(I]0,1/n] ) = 1 for every n, so Xn ↛ 0 in L1 . As convergence in Lp implies
convergence in L1 (for p ≥ 1), we see that Xn does not converge in Lp . What is
going on? Even though P(Xn 6= 0) shrinks to zero as n → ∞, the value of Xn on
those rare occasions that Xn 6= 0 grows so fast with n that we do not have convergence
in Lp ; see figure 1.2 (compare with the intuition: a random variable that is zero with
very high probability can still have a very large mean, if the outliers are sufficiently
large). Note that as Xn converges a.s., it also converges in probability, so this example
also shows that convergence in probability does not imply convergence in Lp .
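A quick numerical illustration of example 1.5.4 (our sketch, not part of the notes; numpy assumed): the probability that Xn is nonzero shrinks like 1/n, while E|Xn | stays equal to one.

    # Example 1.5.4 in simulation: X_n(w) = n * 1_{(0, 1/n]}(w), with w uniform on [0,1].
    # P(X_n != 0) = 1/n -> 0, yet E|X_n| = 1 for every n (no convergence in L^1).
    import numpy as np

    rng = np.random.default_rng(1)
    omega = rng.uniform(0.0, 1.0, size=1_000_000)       # sample points of Omega = [0,1]

    for n in [1, 10, 100, 1000]:
        xn = n * ((omega > 0) & (omega <= 1.0 / n))      # the random variable X_n
        print(n, " P(X_n != 0) ~", (xn != 0).mean(), "  E|X_n| ~", xn.mean())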
Example 1.5.5 (Convergence in Lq but not in Lp ). Let Xn (ω) = n1/p I]0,1/n] (ω).
You can easily verify that Xn → 0 in Lq for all q < p, but not for q ≥ p. Intuitively,
Xn → X in Lq guarantees that the outliers of |Xn − X| do not grow “too fast.”
Example 1.5.6 (Convergence in Lp but not a.s.). This example is illustrated in figure
1.3; you might want to take a look at it first. Define Xn as follows. Write n as a binary
number, i.e., n = Σ_{i≥0} ni 2^i where ni ∈ {0, 1}. Let k be the largest integer such that
nk = 1. Then we set Xn (ω) = I](n−2k )2−k ,(n−2k +1)2−k ] (ω). It is not difficult to see
that Xn → 0 in Lp for every p ≥ 1 (as E(|Xn |p ) = 2^{−k} → 0 when n → ∞), but that
Xn (ω) does not converge for any ω ∈ ]0, 1]: the indicators sweep through the interval
over and over again, so that lim supn Xn (ω) = 1 while lim inf n Xn (ω) = 0.
Figure 1.3. The random variables Xn , n = 1, . . . , 14 of example 1.5.6. The horizontal axis
represents Ω = [0, 1], the vertical axis the value of Xn (ω). Note that the probability that Xn is
nonzero shrinks to zero, but lim supn→∞ Xn (ω) = 1 for any ω > 0.
Before you move on, take some time to make sure you understand the
various notions of convergence, their properties and their relations.
If you take a piece of paper, write on it all the modes of convergence which we
have discussed, and draw arrows in both directions between each pair of convergence
concepts, you will find that every one of these arrows is either implied by proposition
1.5.2 or ruled out, in general, by one of our counterexamples. However, if we impose
some additional conditions then we can often still obtain some of the opposite impli-
cations (needless to say, our examples above will have to violate these conditions).
An important related question is the following: if Xn → X in a certain sense, when
does this imply that E(Xn ) → E(X)? The remainder of this section provides some
answers to these questions. We will not strive for generality, but concentrate on the
most widely used results (which we will have ample occasion to use).
Let us first tackle the question: when does convergence in probability imply a.s.
convergence? To get some intuition, we revisit example 1.5.6.
Example 1.5.7. The following is a generalization of example 1.5.6. We construct a
sequence of {0, 1}-valued random variables Xn such that P(Xn > 0) = ℓ(n), where
ℓ(n) is arbitrary (except that it is [0, 1]-valued). Set X1 (ω) = I]0,ℓ(1)] (ω), then set
X2 (ω) = I]ℓ(1),ℓ(1)+ℓ(2)] mod [0,1] (ω), etc., so that each Xn is the indicator on the
interval of length ℓ(n) immediately adjacent to the right of the interval corresponding
to Xn−1 , and we wrap around from 1 to 0 if necessary. Obviously Xn → 0 in
probability iff ℓ(n) → 0 as n → ∞. We now ask: when does Xn → 0 a.s.? If this is
not the case, then Xn must revisit every section of the interval [0, 1] infinitely many
times; this means that the total “distance” travelled must be infinite: Σn ℓ(n) = ∞.
On the other hand, if Σn ℓ(n) < ∞, then eventually the right endpoint of the interval
corresponding to Xn has to accumulate at some x∗ ∈ [0, 1], so that Xn → 0 a.s.
This example suggests that Xn → X in probability would imply Xn → X a.s.
if only the convergence in probability happens “fast enough”. This is generally true,
and gives a nice application of the Borel-Cantelli lemma.
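The dichotomy in example 1.5.7 is easy to observe numerically. The following plain Python sketch (ours, not part of the notes; the point ω = 0.3 and the horizon are arbitrary choices) counts how often a fixed ω is covered by the sweeping indicators, once for ℓ(n) = 1/n (so that Σn ℓ(n) = ∞) and once for ℓ(n) = 1/n^2 (so that Σn ℓ(n) < ∞).

    # Example 1.5.7 in simulation: X_n is the indicator of the interval of length l(n)
    # immediately to the right of the previous one, wrapping around [0,1).  We count
    # the hits of a fixed omega; with l(n) = 1/n they keep coming, with 1/n^2 they stop.
    # (Half-open intervals [left, left + l(n)) are used for simplicity.)
    def hits(omega, lengths):
        left, count = 0.0, 0
        for ln in lengths:
            if (omega - left) % 1.0 < ln:   # omega lies in the current (wrapped) interval
                count += 1
            left = (left + ln) % 1.0
        return count

    N, omega = 100_000, 0.3
    print("l(n) = 1/n   :", hits(omega, [1.0 / n for n in range(1, N + 1)]), "hits")
    print("l(n) = 1/n^2 :", hits(omega, [1.0 / n ** 2 for n in range(1, N + 1)]), "hits")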
Lemma 1.5.11 (Fatou’s lemma). Let {Xn } be a sequence of a.s. nonnegative ran-
dom variables. Then E(lim inf Xn ) ≤ lim inf E(Xn ). If there exists a Y such that
Xn ≤ Y a.s. for all n and E(Y ) < ∞, then E(lim sup Xn ) ≥ lim sup E(Xn ).
This should really be a theorem, but as it is called “Fatou’s lemma” we will con-
form. The second half of the result is sometimes called the “reverse Fatou’s lemma”.
Proof. Define Zn = inf k≥n Xk , so lim inf n Xn = limn Zn . Note that Zn is nondecreasing,
so by monotone convergence E(lim inf n Xn ) = limn E(Zn ). But E(Zn ) ≤ inf k≥n E(Xk )
(why?), so E(lim inf n Xn ) ≤ lim inf n E(Xn ). The second half of the result follows by apply-
ing the first half to the sequence Xn0 = Y − Xn .
In particular, these expressions make sense (the inner integrals are measurable).
5. (Fubini) The previous statement still holds for random variables f that are not
necessarily nonnegative, provided that E(|f |) < ∞.
The construction extends readily to products of a finite number of spaces. Start-
ing from, e.g., the simple probability space (R, B(R), PN ) where PN is a Gaussian
measure (with mean zero and unit variance, say), which we have already constructed
in example 1.1.16, we can now construct a larger space (Ω, F, P) that carries a finite
number d of independent copies of a Gaussian random variable: set Ω = R × · · · × R
(d times), F = B(R) × · · · × B(R), and P = PN × · · · × PN . The independent
Gaussian random variables are then precisely the projection maps ρ 1 , . . . , ρd .
Remark 1.6.7. The Fubini and Tonelli theorems tell us that taking the expectation
with respect to the product measure can often be done more simply: if we consider
the product space random variable f : Ω1 × Ω2 → R as an ω1 -valued random variable
1.6. Induced measures, independence, and absolute continuity 40
f (ω1 , ·) : Ω2 → R, then we may take the expectation of this random variable with
respect to P2 . The resulting map can subsequently be interpreted as an ω1 -valued
random variable, whose P1 -expectation we can calculate. The Fubini and Tonelli
theorems tell us that (under mild regularity conditions) it does not matter whether we
apply EP1 ×P2 , or EP1 first and then EP2 , or vice versa.
We will more often encounter these results in a slightly different context. Suppose
we have a continuous time stochastic process, i.e., a collection of random variables
{Xt }t∈[0,T ] (see section 2.4 for more on this concept). Such a process is called mea-
surable if the map X· : [0, T ] × Ω → R is B([0, T ]) × F-measurable. In this case,
Y (ω) = ∫_0^T Xt (ω) dt
is a well defined random variable: you can interpret this as T times the expectation of
X· (ω) : [0, T ] → R with respect to the uniform measure on [0, T ] (this is the Lebesgue
measure of remark 1.3.11, restricted to [0, T ]). Suppose that we are interested in the
expectation of Y ; it is then often useful to know whether we can exchange the order
of integration and expectation, i.e., whether
E(Y ) = E( ∫_0^T Xt dt ) = ∫_0^T E(Xt ) dt.
The Fubini and Tonelli theorems give sufficient conditions for this to be the case.
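As a toy numerical check of this exchange (our sketch, not part of the notes; numpy assumed, and the choice Xt (ω) = ω t^2 is ours), take Ω = [0, 1] with the uniform measure on [0, T ], so that both iterated integrals equal T^3/6.

    # Exchanging a time integral and an expectation (Fubini) on a toy example:
    # X_t(w) = w * t^2, w uniform on [0,1]; E(int_0^T X_t dt) = int_0^T E(X_t) dt = T^3/6.
    import numpy as np

    rng = np.random.default_rng(2)
    T = 2.0
    t = np.linspace(0.0, T, 501)                       # time grid for a simple Riemann sum
    omega = rng.uniform(0.0, 1.0, size=10_000)         # samples from Omega

    # E( int_0^T X_t dt ): integrate the path for each omega, then average over omega
    lhs = np.mean([(w * t ** 2).mean() * T for w in omega])

    # int_0^T E(X_t) dt: here E(X_t) = t^2 / 2 is known explicitly
    rhs = (t ** 2 / 2).mean() * T

    print(lhs, rhs, T ** 3 / 6)                        # all three close to 1.3333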
As you can tell from the preceding remark, product spaces play an important role
even in cases that have nothing to do with independence. Let us get back to the theme
of this section, however, which was to build more complicated probability spaces from
simpler ones. We have seen how to build a product probability space with a finite
number of independent random variables. In our construction of the Wiener process,
however, we will need an entire sequence of independent random variables. The con-
struction of the product space and σ-algebra extends trivially to this case (how?), but
the construction of an infinite product measure brings with it some additional difficul-
ties. Nonetheless this can always be done [Kal97, corollary 5.18]. For the purposes of
this course, however, the following theorem is all we will need.
Theorem 1.6.8. Let {Pn } be a sequence of probability measures on (R, B(R)). Then
there exists a probability measure P on (R×R×· · · , B(R)×B(R)×· · · ) such that the
projection maps ρ1 , ρ2 , . . . are independent and have the law P1 , P2 , . . ., respectively.
The proof of this result is so much fun that it would be a shame not to include it.
However, the method of proof is quite peculiar and will not really help you in the rest
of this course. Feel free to skip it completely, unless you are curious.
Rather than construct the infinite product measure directly, the idea of the proof is to con-
struct a sequence of independent random variables {Xn } on the (surprisingly simple!) proba-
bility space ([0, 1], B([0, 1]), λ), where λ is the uniform measure, whose laws are P1 , P2 , . . .,
respectively. The theorem then follows trivially, as the law of the R × R × · · · -valued random
variable X = (X1 , X2 , . . .) is then precisely the desired product measure P.
It may seem a little strange that we can construct all these independent random variables on
such a simple space as ([0, 1], B([0, 1]), λ)! After all, this would be the natural space on which
to construct a single random variable uniformly distributed on the interval. Strange things are
possible, however, because [0, 1] is continuous—we can cram a lot of information in there, if
we encode it correctly. The construction below works precisely in this way. We proceed in two
steps. First, we show that we can dissect and then reassemble the interval in such a way that
it gives an entire sequence of random variables uniformly distributed on the interval. It then
remains to find functions of these random variables that have the correct laws P1 , P2 , . . ..
The dissection and reassembly of the interval [0, 1] is based on the following lemma.
Lemma. Let ξ be a random variable that is uniformly distributed on [0, 1], and denote by
ξ = Σ_{n≥1} ξn 2^{−n} its binary expansion (i.e., ξn are {0, 1}-valued random variables). Then all
the ξn are independent and take the values 0 and 1 with equal probability. Conversely, if {ξn }
is any such sequence, then ξ = Σ_{n≥1} ξn 2^{−n} is uniformly distributed on [0, 1].
Proof. Consider the first k binary digits ξn , n ≤ k. If you write down all possible combinations
of k zeros and ones, and partition [0, 1] into sets whose first k digits coincide with each of these
combinations, you will find that [0, 1] is partitioned into 2^k equally sized sets, each of which
has probability 2^{−k}. As for every n ≤ k the set {ξn = 1} is the union of 2^{k−1} sets in our
partition, we find that every such set has probability 2^{−1}. But then clearly the ξn , n ≤ k must
be independent (why?) As this holds for any k < ∞, the first part of the lemma follows. The
second part of the lemma follows directly from the first part, as any random variable constructed
in this way must have the same law as the random variable ξ considered in the first part.
Corollary. There exists a sequence {Yn } of independent random variables on the probability
space ([0, 1], B([0, 1]), λ), each of which is uniformly distributed on the unit interval [0, 1].
Proof. Define ξ : [0, 1] → R, ξ(x) = x. Then ξ is uniformly distributed on [0, 1], so by the
previous lemma its sequence of binary digits {ξn } are independent and take the values {0, 1}
with equal probability. Let us now reorder {ξn }n∈N into a two-dimensional array {ξ̃mn }m,n∈N
(i.e., each ξ̃mn coincides with precisely one ξn ). This is easily done, for example, as in the
usual proof that the rational numbers are countable. Define Yn = Σ_{m≥1} ξ̃mn 2^{−m}. By the
previous lemma, the sequence {Yn } has the desired properties.
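The following sketch (ours, not part of the notes; numpy assumed) imitates this construction numerically: it extracts binary digits from samples of a single uniform random variable and redistributes them, round-robin rather than by the exact diagonal reordering of the proof, over three new variables, which then look uniform and uncorrelated.

    # Reassembling one uniform random variable into several: extract binary digits,
    # hand them out round-robin to d new variables, and check uniformity/correlation.
    import numpy as np

    rng = np.random.default_rng(3)
    xi = rng.uniform(0.0, 1.0, size=100_000)              # samples of the original uniform
    d, depth = 3, 45                                       # 3 new uniforms, 45 digits per sample

    digits = np.zeros((len(xi), depth), dtype=np.int64)    # binary digits xi_1, xi_2, ...
    frac = xi.copy()
    for n in range(depth):
        frac = frac * 2.0
        digits[:, n] = np.floor(frac)
        frac -= digits[:, n]

    Y = np.zeros((len(xi), d))
    for k in range(d):                                     # Y_k uses digits k, k+d, k+2d, ...
        own = digits[:, k::d]
        Y[:, k] = own @ (2.0 ** -np.arange(1, own.shape[1] + 1))

    print("means:", Y.mean(axis=0))                        # each close to 1/2
    print("correlations:\n", np.corrcoef(Y.T))             # close to the identity matrix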
Proof of theorem 1.6.8. We have constructed a sequence {Yn } of independent uniformly dis-
tributed random variables on [0, 1]. The last step of the proof consists of finding a sequence
of measurable functions fn : [0, 1] → R such that the law of Xn = fn (Yn ) is Pn . Then we
are done, as {Xn } is a sequence of independent random variables with law P1 , P2 , . . ., and the
product measure P is then simply the law of (X1 , X2 , . . .) as discussed above.
To construct the functions fn , let Fn (x) = Pn (] − ∞, x]) be the CDF of the measure Pn
(see theorem 1.1.15). Note that Fn takes values in the interval [0, 1]. Now define fn (u) =
inf{x ∈ R : u ≤ Fn (x)}. Then λ(fn (Yn ) ≤ y) = λ(Yn ≤ Fn (y)) = Fn (y), as Yn is
uniformly distributed. Hence fn (Yn ) has the law Pn , and we are done.
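The map fn (u) = inf{x : u ≤ Fn (x)} is exactly the inverse transform sampling method used in practice. Here is a small sketch (ours, not part of the notes; numpy assumed) for the exponential law with F (x) = 1 − e^{−x}, whose generalized inverse is f (u) = − log(1 − u).

    # Inverse transform sampling: push a uniform random variable through the
    # generalized inverse of a distribution function to obtain that distribution.
    import numpy as np

    rng = np.random.default_rng(4)
    u = rng.uniform(0.0, 1.0, size=500_000)      # the uniform Y_n of the corollary

    x = -np.log(1.0 - u)                          # f(Y_n); should be exponential with rate 1
    print("mean     ~", x.mean(), "  (exact: 1)")
    print("P(X > 2) ~", (x > 2).mean(), "  (exact:", np.exp(-2.0), ")")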
often become very simple if we change to a suitably modified measure (for example, if
{Xn } is a collection of random variables with some complicated dependencies under
P, it may be advantageous to compute using a modified measure Q under which the
Xn are independent). Later on, the change of measure concept will form the basis for
one of the most basic tools in our stochastic toolbox, the Girsanov theorem.
The basic idea is as follows. Let f be a nonnegative random variable with unit
expectation E(f ) = 1. For any set A ∈ F, define the quantity
Q(A) = EP (IA f ) ≡ ∫_A f (ω) P(dω).
You can check that Q is again a probability measure, and that EQ (g) = EP (gf ) for any random variable g for which either side is well defined (why?).
Definition 1.6.9. A probability measure Q is said to have a density with respect to
a probability measure P if there exists a nonnegative random variable f such that
Q(A) = EP (IA f ) for every measurable set A. The density f is denoted as dQ/dP.
Remark 1.6.10. In your introductory probability course, you likely encountered this
idea very frequently with a minor difference: the concept still works if P is not a
probability measure but, e.g., the Lebesgue measure of remark 1.3.11. We then define
Z Z b
Q(A) = f (x) dx, e.g., Q([a, b]) = f (x) dx,
A a
where f is now the density of Q with respect to the Lebesgue measure. Not all
probability measures on R admit such a representation (consider example 1.1.13),
but many interesting examples can be constructed, including the Gaussian measure
(where f ∝ exp(−(x − µ)2 /2σ 2 )). Such expressions allow nice explicit computations
in the case where the underlying probability space is Rd , which is the reason that
most introductory courses are centered around such objects (rather than introducing
measure theory, which is needed for more complicated probability spaces).
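A numerical illustration of the change of measure formula EQ (g) = EP (gf ) (our sketch, not part of the notes; numpy assumed): take P to be the standard Gaussian measure and Q the Gaussian measure with mean one, so that the density is f (x) = exp(x − 1/2).

    # Change of measure: P = N(0,1), Q = N(1,1), density f(x) = dQ/dP(x) = exp(x - 1/2).
    # Then E_Q(g) can be computed as E_P(g * f), i.e. by reweighting P-samples.
    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.standard_normal(1_000_000)                    # samples under P

    f = np.exp(x - 0.5)                                    # dQ/dP on the P-samples
    g = np.cos                                             # an arbitrary test function

    lhs = (g(x) * f).mean()                                # E_P(g f)
    rhs = g(rng.standard_normal(1_000_000) + 1.0).mean()   # E_Q(g), sampling Q directly
    print(lhs, rhs)                                        # the two estimates agree
    print("E_P(f) ~", f.mean())                            # close to 1, as it must be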
Suppose that Q has a density f with respect to P. Then these measures must
satisfy an important consistency condition: if P(A) = 0 for some event A, then Q(A)
must also be zero. To see this, note that IA (ω)f (ω) = 0 for ω ∈ Ac and P(Ac ) = 1,
so IA f = 0 P-a.s. In other words, if Q has a density with respect to P, then any
event that never occurs under P certainly never occurs under Q. Similarly, any event
that happens with probability one under P must happen with probability one under
Q (why?). Evidently, the use of a density to transform a probability measure P into
another probability measure Q “respects” those events that happen for sure or never
happen at all. This intuitive notion is formalized by the following concept.
Definition 1.6.11. A measure Q is said to be absolutely continuous with respect to a
measure P, denoted as Q ≪ P, if Q(A) = 0 for all events A such that P(A) = 0.
We have seen that if Q has a density with respect to P, then Q ≪ P. It turns out
that the converse is also true: if Q ≪ P, then we can always find some density f
such that Q(A) = EP (IA f ). Hence the existence of a density is completely equivalent
to absolute continuity of the measures. This is a deep result, known as the Radon-
Nikodym theorem. It also sheds considerable light on the concept of a density: the
intuitive meaning of the existence of a density is not immediately obvious, but the
conceptual idea behind absolute continuity is clear.
Theorem 1.6.12 (Radon-Nikodym). Suppose that Q ≪ P are two probability mea-
sures on the space (Ω, F). Then there exists a nonnegative F-measurable function
f with EP (f ) = 1, such that Q(A) = EP (IA f ) for every A ∈ F. Moreover, f is
unique in the sense that if f 0 is another F-measurable function with this property,
then f 0 = f P-a.s. Hence it makes sense to speak of ‘the’ density, or Radon-Nikodym
derivative, of Q with respect to P, and this density is denoted as dQ/dP.
In the case that Ω is a finite set, this result is trivial to prove. You should do this
now: convince yourself that the equivalence between absolute continuity and the ex-
istence of a density is to be expected. The uniqueness part of the theorem also follows
easily (why?), but the existence part is not so trivial in the general case. The theorem
is often proved using functional analytic tools, notably the Riesz representation theo-
rem; see e.g. [GS96]. A more measure-theoretic proof can be found in [Bil86]. Most
beautiful is the probabilistic proof using martingale theory, see [Wil91, sec. 14.13].
We will follow this approach to prove the Radon-Nikodym theorem in section 2.2,
after we have developed some more of the necessary tools.
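For a finite Ω the density is simply f (ω) = Q({ω})/P({ω}) whenever P charges every point; the following tiny sketch (ours, not part of the notes; the three point space is an arbitrary choice) checks Q(A) = EP (IA f ) on every event.

    # Radon-Nikodym on a finite space: f(w) = Q({w}) / P({w}) and Q(A) = E_P(1_A f).
    import itertools

    Omega = ["a", "b", "c"]
    P = {"a": 0.5, "b": 0.3, "c": 0.2}
    Q = {"a": 0.2, "b": 0.2, "c": 0.6}           # Q << P, as P charges every point

    f = {w: Q[w] / P[w] for w in Omega}           # the density dQ/dP

    for r in range(len(Omega) + 1):               # run through every event A
        for A in itertools.combinations(Omega, r):
            assert abs(sum(Q[w] for w in A) - sum(P[w] * f[w] for w in A)) < 1e-12
    print("Q(A) = E_P(1_A f) verified on all", 2 ** len(Omega), "events")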
Lemma 1.7.3 (Dynkin). Let (Ω, F) be a measurable space, and let C be a π-system
such that F = σ{C}. If two probability measures P and Q agree on C, i.e., if we have
P(A) = Q(A) for all A ∈ C, then P and Q are equal (P(A) = Q(A) for all A ∈ F).
Proof. Define D = {A ∈ F : P(A) = Q(A)}. You can verify directly that D is a λ-system,
and by assumption C ⊂ D. But then D = F by the previous lemma.
For sake of example, let us prove the uniqueness part of theorem 1.6.6. Recall
that ρi : Ω1 × Ω2 → Ωi are the projection maps, and F1 × F2 = σ{ρ1 , ρ2 }. Hence
F1 × F2 = σ{C} with C = {A × B : A ∈ F1 , B ∈ F2 }, which is clearly a π-
system. Now any measure P under which ρi has law Pi and under which ρ1 and ρ2
are independent must satisfy P(A × B) = P1 (A) P2 (B): this follows immediately
from the definition of independence. By the π-system lemma, any two such measures
must be equal. Hence the product measure P1 × P2 is uniquely defined.
The notion of conditional expectation is, in some sense, where probability theory gets
interesting (and goes beyond pure measure theory). It allows us to introduce interest-
ing classes of stochastic processes—Markov processes and martingales—which play
a fundamental role in much of probability theory. Martingales in particular are ubiq-
uitous throughout almost every topic in probability, even though this might be hard to
imagine when you first encounter this topic.
We will take a slightly unusual route. Rather than introduce immediately the ab-
stract definition of conditional expectations, we will start with the familiar discrete
definition and build some of the key elements of the full theory in that context (par-
ticularly martingale convergence). This will be sufficient both to prove the Radon-
Nikodym theorem, and to define the general notion of conditional expectation in a
natural way. The usual abstract definition will follow from this approach, while you
will get a nice demonstration of the power of martingale theory along the way.
Figure 2.1. Illustration of the discrete conditional expectation on ([0, 1], B([0, 1]), P), where P
is the uniform measure. A random variable X is conditioned with respect to a discrete random
variable Y (left) and with respect to two discrete random variables Y, Z (right). This amounts
to averaging X (w.r.t. P) over each bin in the partition generated by Y and Y, Z, respectively.
information that can be extracted by measuring these random variables: any event
B ∈ σ{Y 1 , . . . , Y n } can be written as a union of the sets Ay1 ,...,yn (why?) Hence
in reality we are not really conditioning on the random variables Y 1 , . . . , Y n them-
selves, but on the information contained in these random variables (which is intuitively
precisely as it should be!) It should thus be sufficient, when we are calculating condi-
tional expectations, to specify the σ-algebra generated by our observations rather than
the observations themselves. This is what we will do from now on.
Definition 2.1.1. A σ-algebra F is said to be finite if it is generated by a finite number
of sets F = σ{A1 , . . . , An }. Equivalently, F is finite if there is a finite partition of Ω
such that every event in F is a union of sets in the partition (why are these equivalent?).
Definition 2.1.2. Let X ∈ L1 be a random variable on (Ω, F, P), and let G ⊂ F be a
finite σ-algebra generated by the partition {Ak }k=1,...,n . Then
E(X|G) ≡ Σ_{k=1}^n E(X|Ak ) IAk ,
where E(X|A) = E(XIA )/P(A) if P(A) > 0 and we may define E(X|A) arbitrarily
if P(A) = 0. The conditional expectation thus defined is unique up to a.s. equivalence,
i.e., any two random variables Y, Ỹ that satisfy the definition obey Y = Ỹ a.s. (why?).
Remark 2.1.3. X ∈ L1 ensures that E(X|Ak ) is well defined and finite.
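Definition 2.1.2 is easy to reproduce numerically. The sketch below (ours, not part of the notes; numpy assumed, and the choice of X and of a four-bin partition is arbitrary) partitions Ω = [0, 1] into four equal bins and computes E(X|G) by averaging X over each bin, exactly the averaging depicted in figure 2.1; as a check, E(E(X|G)) agrees with E(X).

    # Discrete conditional expectation: average X over each cell of a finite partition.
    import numpy as np

    rng = np.random.default_rng(6)
    omega = rng.uniform(0.0, 1.0, size=100_000)               # uniform measure on [0,1]
    X = np.sin(2 * np.pi * omega) + omega ** 2                 # some random variable X(omega)

    k = np.floor(omega * 4).astype(int)                        # index of the cell A_k containing omega
    cell_means = np.array([X[k == j].mean() for j in range(4)])   # E(X | A_j)

    E_X_given_G = cell_means[k]                                # sum_j E(X|A_j) 1_{A_j}(omega)
    print("E(E(X|G)) ~", E_X_given_G.mean(), "   E(X) ~", X.mean())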
The conditional expectation defined here should be a completely intuitive concept.
Unfortunately, extending it to σ-algebras which are not finite is not so straightforward.
For example, suppose we would like to condition X on a random variable Y that is
uniformly distributed on the unit interval. The quantity E(X|Y = y) is not well
defined, however: P(Y = y) = 0 for every y ∈ [0, 1]! In the finite case this was not
a problem; if some sets in the partition have zero probability we just ignore them, and
the resulting conditional expectation is still uniquely defined with probability one. In
the continuous case, however, our definition fails with probability one (the problem
being, of course, that there is an uncountable amount of trouble).
A look ahead
In laying the foundations of modern probability theory, one of the most important
insights of Kolmogorov (the father of modern probability) was that the conditional
expectation can be defined unambiguously even in the continuous case. Kolmogorov
noticed that the discrete definition could be rephrased abstractly without mention of
the finiteness of the σ-algebra, and that this abstract definition can serve as a mean-
ingful definition of the conditional expectation in the continuous case. We could in-
troduce this definition at this point and show that it reduces to the definition above
for finite σ-algebras. To prove that the general definition is well posed, however, we
need1 the Radon-Nikodym theorem which we have not yet proved. It may also not
1 This is not the only way to prove well posedness but, as we will see, there is a natural connec-
tion between the Radon-Nikodym theorem and the conditional expectation that makes this point of view
worthwhile. We will comment on the other method of proving well posedness later on.
Figure 2.2. We would like to calculate E(X|Y ), where P is the uniform measure on the interval,
X : [0, 1] → R is some random variable and Y : [0, 1] → R is given by Y (x) = x for x ≤ 1/2,
Y (x) = 1 − x for x ≥ 1/2. Intuitively we should have E(X|Y )(x) = (1/2) X(x) + (1/2) X(1 − x),
but definition 2.1.2 does not allow us to conclude this. However, a sequence of approximations
covered by definition 2.1.2 appears to give rise to the expected result. But how to prove it?
3. E(E(X|G)) = E(X).
4. Tower property: if H ⊂ G, then E(E(X|G)|H) = E(X|H) a.s.
5. If X is G-measurable, then E(X|G) = X a.s.
6. If X is G-measurable and XY ∈ L1 , then E(XY |G) = X E(Y |G) a.s.
7. If H and σ{X, G} are independent, then E(X|σ{G, H}) = E(X|G) a.s.
8. If H and X are independent, then E(X|H) = E(X) a.s.
9. Monotone and dominated convergence, Fatou’s lemma, and Jensen’s inequality
all hold for conditional expectations also; e.g., if 0 ≤ X1 ≤ X2 ≤ · · · a.s.,
then E(Xn |G) ↗ E(limn Xn |G) a.s. (monotone convergence).
Proof.
1. Trivial.
2. Trivial.
3. E(E(X|G)) = Σk E(X|Ak )P(Ak ) = Σk E(XIAk ) = E(XI∪k Ak ) = E(X).
4. Let {Ak } be the partition for G and {Bk } be the partition for H. Then
E(E(X|G)|H) = Σj Σk E(X|Ak )P(Ak |Bj )IBj .
But H ⊂ G implies that every set Bj is the disjoint union of sets Ak , so P(Ak |Bj ) =
E(IAk IBj )/P(Bj ) = P(Ak )/P(Bj ) if Ak ⊂ Bj , and zero otherwise. Hence
E(E(X|G)|H) = Σj Σ_{k:Ak ⊂Bj } (E(XIAk )/P(Bj )) IBj = Σj (E(XIBj )/P(Bj )) IBj = E(X|H).
where we have used (twice) that IAk IAi = IAk if k = i, and zero otherwise.
7. Let {Ak } be a partition for G and {Bk } for H. Then the sets {Ai ∩ Bj } are disjoint,
and hence form a partition for σ{G, H}. We can thus write
E(X|G, H) = Σ_{i,j} (E(XIAi ∩Bj )/P(Ai ∩ Bj )) IAi ∩Bj = Σ_{i,j} (E(XIAi ) P(Bj )/(P(Ai ) P(Bj ))) IAi IBj = E(X|G)
(with the convention 0/0 = 0), where we have used the independence of XIAi and IBj .
8. Use the previous result with G = {∅, Ω}.
9. Trivial.
this sequence as an adapted stochastic process. On the other hand, many stochastic
processes used to model random signals or natural phenomena do have an associated
physical notion of time, which is faithfully encoded using the concept of a filtration.
Martingales
A martingale is a very special type of stochastic process.
Definition 2.1.8. A stochastic process {Xn } is said to be an Fn -martingale if it is
Fn -adapted and satisfies E(Xn |Fm ) = Xm a.s. for every m ≤ n. (If the filtration is
obvious, e.g., on a filtered probability space, we will just say that X n is a martingale).
Remark 2.1.9. We have not yet defined the conditional expectation for anything but
finite σ-algebras. Thus until further notice, we assume that Fn is finite for every n.
In particular, this means that if Xn is Fn -adapted, then every Xn is a finite-valued
random variable. This will be sufficient machinery to develop the general theory.
How should you interpret a martingale? The basic idea (and the pretty name)
comes from gambling theory. Suppose we play a sequence of games at a casino, in
each of which we can win or lose a certain amount of money. Let us denote by X n our
total winnings after the nth game: i.e., X0 is our starting capital, X1 is our starting
capital plus our winnings in the first game, etc. We do not assume that the games
are independent. For example, we could construct some crazy scheme where we play
poker in the nth game if we have won an even number of times in the past, and we
play blackjack if we have won an odd number of times in the past. As poker and
blackjack give us differently distributed winnings, our winnings X n − Xn−1 in the
nth game will then depend on all of the past winnings X0 , . . . , Xn−1 .
If the game is fair, however, then we should make no money on average in any of
the games, regardless of what the rules are. After all, if we make money on average
then the game is unfair to the casino, but if we lose money on average the game is
unfair towards us (most casinos operate in the latter mode). As such, suppose we
have made Xm dollars by time m. If the game is fair, then our expected winnings
at any time in the future, given the history of the games to date, should equal our
current capital: i.e., E(Xn |σ{X0 , . . . , Xm }) = Xm for any n ≥ m. This is precisely
the definition of an (FnX -) martingale. Hence we can interpret a martingale as the
winnings in a sequence of fair games (which may have arbitrarily complicated rules).
You might be surprised that such a concept has many far-reaching consequences.
Indeed, martingale techniques extend far beyond gambling, and pervade almost all as-
pects of modern probability theory. It was the incredible insight of J. L. Doob [Doo53]
that martingales play such a fundamental role. There are many reasons for this. First,
martingales have many special properties, some of which we will discuss in this chap-
ter. Second, martingales show up naturally in many situations which initially appear
to have little to do with martingale theory. The following simple result (which we will
not need during the rest of this course) gives a hint as to why this could be the case.
Lemma 2.1.10 (Doob decomposition). Let (Ω, F, P) be a probability space, let Fn
be a (finite) filtration and let {Xn } be Fn -adapted with Xn ∈ L1 for every n. Then
there exist a martingale {Mn } with M0 = 0 and a predictable process {An } (i.e., An
is Fn−1 -measurable for every n) such that Xn = X0 + An + Mn a.s. for every n;
moreover, this decomposition is unique.
Proof. Define An = Σ_{k=1}^n E(Xk − Xk−1 |Fk−1 ) and Mn = Xn − X0 − An ; then An is
predictable by construction, and Mn is a martingale using lemma 2.1.4 (why?) To prove uniqueness, suppose that M̃n and Ãn were another mar-
tingale (with M̃0 = 0) and predictable process, respectively, such that Xn = X0 + Ãn + M̃n
a.s. Then evidently Ãn − An = Mn − M̃n a.s. But the left hand side is Fn−1 -measurable, so
using the martingale property Ãn − An = E(Ãn − An |Fn−1 ) = E(Mn − M̃n |Fn−1 ) = 0
a.s. Hence An = Ãn a.s., and consequently Mn = M̃n a.s. as well.
Remark 2.1.11. Note that any discrete time stochastic process can be decomposed
into a martingale part and a predictable part (provided that Xn ∈ L1 for all n). This re-
sult still holds when the Fn are not finite, but is not true in continuous time. Nonethe-
less almost all processes of interest have such a decomposition. For example, we
will see in later chapters that the solution of a stochastic differential equation can be
written as the sum of a martingale and a process which is differentiable in time. For
similar reasons, martingales play an important role in the general theory of Markov
processes. As this is an introductory course, we will not attempt to lay down these
theories in such generality; the purpose of this interlude was to give you an idea of
how martingales can emerge in seemingly unrelated problems.
Many results about martingales are proved using the following device.
Definition 2.1.12. Let {Mn } be a martingale and {An } be a predictable process.
Then (A · M )n = Σ_{k=1}^n Ak (Mk − Mk−1 ), the martingale transform of M by A, is
again a martingale, provided that An and (A · M )n are in L1 for all n (why?).
Let us once again give a gambling interpretation. We play a sequence of games;
before every game, we may stake a certain amount of money. We now interpret the
martingale Mn not as our total winnings at time n, but as our total winnings if we
were to stake one dollar on each game. For example, if we stake A 1 dollars on the
first game, then we actually win A1 (M1 − M0 ) dollars. Consequently, if we stake
Ak dollars on the kth game, then our total winnings after the nth game are given by
Xn = X0 + (A · M )n . Note that it is important for An to be predictable: we have to
place our bet before the game is played, so our decision on how much money to stake
can only depend on the past (obviously we could always make money, if we knew the
outcome of the game in advance!) Other than that, we are free to choose an arbitrarily
complicated gambling strategy An (our decision on how much to stake on the nth
game can depend arbitrarily on what happened in previous games). The fact that X n
is again a martingale says something we know intuitively—there is no “reasonable”
gambling strategy that allows us to make money, on average, on a fair game.
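This is easy to see in simulation. The sketch below (ours, not part of the notes; numpy assumed) takes M to be a simple random walk and uses the predictable strategy that stakes a dollar only after a losing step; the transform (A · M )n still has mean zero.

    # Martingale transform: M is a simple random walk, A a predictable strategy
    # (stake 1 only if the previous step was a loss).  E((A.M)_n) remains zero.
    import numpy as np

    rng = np.random.default_rng(7)
    paths, N = 200_000, 50
    steps = rng.choice([-1, 1], size=(paths, N))          # the increments M_k - M_{k-1}

    A = np.ones((paths, N))
    A[:, 1:] = (steps[:, :-1] == -1)                      # A_k depends only on the past

    X = (A * steps).cumsum(axis=1)                        # X_n = (A.M)_n
    print("E((A.M)_N) ~", X[:, -1].mean())                # close to 0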
We are now ready to prove one of the key results on martingales—the martingale
convergence theorem. With that bit of machinery in hand, we will be able to prove the
Radon-Nikodym theorem and to extend our definition of the conditional expectation.
Figure 2.3. We stake bets on a martingale M as follows: we start betting (with a fixed stake
A = 1) when M is first below a, we stop betting (A = 0) when M next exceeds b, and repeat.
The periods when we are betting are shown in red and when we are not in blue. Our total
winnings X are in green. At the final time T , our winnings are evidently at least b − a
times the number of upcrossings (two) minus (a − MT )+ . (Figure adapted from [Wil91].)
in debt—then the latter cannot happen. Hence there is only one logical conclusion in
this case: evidently our friend’s winnings can cross a and b only a finite number of
times; otherwise we could make money on average by playing a predictable gambling
strategy in a fair game. But this must be true for every value of a and b (these were
only used in our gambling strategy, they do not determine our friend’s winnings!),
so we come to a very interesting conclusion: if a martingale Mn is bounded, then it
cannot fluctuate forever; in other words, it must converge to some random variable
M∞ . We have basically proved the martingale convergence theorem.
Let us now make these ideas precise.
Lemma 2.1.13 (Doob’s upcrossing lemma). Let {Mn } be a martingale, and denote
by Un (a, b) the number of upcrossings of a < b up to time n: that is, Un (a, b) is the
number of times that Mk crosses from below a to above b before time n. Then we have
E(Un (a, b)) ≤ E((a − Mn )+ )/(b − a).
Proof. Define the following gambling strategy (see figure 2.3). Let A0 = 0, and set Ak = 1 if
either Ak−1 = 1 and Mk−1 < b, or if Ak−1 = 0 and Mk−1 ≤ a, and set Ak = 0 otherwise.
Clearly Ak is bounded and predictable, so Xn = (A · M )n is a martingale (this is the winnings
process of figure 2.3). We can evidently estimate Xn ≥ (b − a) Un (a, b) − (a − Mn )+ ; the
first term is the number of upcrossings times the winnings per upcrossing, while the second
term is a lower bound on the loss incurred after the last upcrossing before time n. As Xn is a
martingale, however, E(Xn ) = X0 = 0, and the result follows directly.
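The strategy in this proof is completely algorithmic, so it can be implemented directly. The sketch below (ours, not part of the notes; numpy assumed, with a, b and the horizon chosen arbitrarily) counts the upcrossings of [a, b] by a simple random walk and compares the average with the bound of lemma 2.1.13.

    # Counting upcrossings with the betting strategy of figure 2.3, and checking
    # E(U_n(a,b)) <= E((a - M_n)^+) / (b - a) for a simple random walk martingale.
    import numpy as np

    rng = np.random.default_rng(8)
    paths, N, a, b = 20_000, 200, -2.0, 2.0
    M = np.concatenate([np.zeros((paths, 1)),
                        rng.choice([-1, 1], size=(paths, N)).cumsum(axis=1)], axis=1)

    up = np.zeros(paths)
    betting = np.zeros(paths, dtype=bool)
    for k in range(N + 1):
        finished = betting & (M[:, k] >= b)               # an upcrossing is completed
        up += finished
        betting = np.where(finished, False, betting)      # stop betting after an upcrossing
        betting = np.where(~betting & (M[:, k] <= a), True, betting)   # start betting below a

    bound = np.maximum(a - M[:, -1], 0.0).mean() / (b - a)
    print("E(U_n(a,b)) ~", up.mean(), "   bound:", bound)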
Hence we have established that Mn converges a.s. to some M∞ . But by Fatou’s lemma
E(|M∞ |) ≤ lim inf n E(|Mn |) < ∞, so M∞ must be a.s. finite and even in L1 .
Remark 2.1.15. Note that we are still in the setting where Fn is finite for all n.
However, nothing in the proofs of these results uses (or is hindered by) this fact, and
the proofs will carry over immediately to the general case.
Example 2.2.3. Let Y be a random variable; then σ{Y } is separable: set F = σ{F n :
n = 1, 2, . . .}, where Fn = σ{Y 1 , . . . , Y n } with Y k as in lemma 1.3.12.
Example 2.2.4. If Xn is a sequence of random variables, then F = σ{Xn } is sepa-
rable: approximating every Xn by Xn^k , choose Fn = σ{Xm^k : m, k = 1, . . . , n}.
Example 2.2.5. Let {Xt }t∈[0,∞[ be a continuous time stochastic process such that
t 7→ Xt (ω) is continuous for every ω. Then F = σ{Xt : t ∈ [0, ∞[} is separable.
To see this, note that by continuity t 7→ Xt is completely known if we know it for a
dense set of times (e.g., the dyadic rationals). Hence approximating X t by a sequence
Xtk for every t, we can use Fn = σ{Xtk : k = 1, . . . , n, t = ℓ2^{−n} , ℓ = 0, . . . , n2^n }.
A reminder:
Theorem 1.6.12 (Radon-Nikodym). Suppose Q ≪ P are two probability measures
on the space (Ω, F). Then there exists a nonnegative F-measurable function f with
EP (f ) = 1, such that Q(A) = EP (IA f ) for every A ∈ F, and f is unique in the sense
that if f 0 is another F-measurable function with this property, then f 0 = f P-a.s.
Assume F = σ{Fn : n = 1, 2, . . .}. Applying lemma 2.2.6 to the Fn , we obtain
a sequence fn of finite Radon-Nikodym derivatives. We now proceed in three steps.
3. We must show uniqueness. It then follows that the limit limn fn is independent
of the choice of discretization {Fn } (which is not obvious from the outset!)
E(fn |Bj ) = (1/P(Bj )) Σ_{k:Ak ⊂Bj } E(fn IAk ) = (1/P(Bj )) Σ_{k:Ak ⊂Bj } Q(Ak ) = Q(Bj )/P(Bj ).
Hence evidently E(fn |Fm ) = Σj E(fn |Bj )IBj = fm . But note that fn is clearly
nonnegative for all n, so the boundedness condition of the martingale convergence theo-
rem holds trivially. Hence fn converges P-a.s., and we can define f = limn fn . But as
Q ≪ P, we find that fn → f Q-a.s. as well. This will be crucial below.
2. The hardest part here is to show that E(f ) = 1. Let us complete the argument assuming
that this is the case. Note that G = ∪n Fn is a π-system. We would like to show
that Q(A) = EP (IA f ) for all A ∈ F; but as E(f ) = 1, both sides are valid probability
measures and it suffices to check this for A ∈ G (by the π-system lemma 1.7.3). Now for
any A ∈ G, there exists by definition an m such that A ∈ Fm . Hence EP (IA fn ) = Q(A)
for n ≥ m by lemma 2.2.6. Using Fatou’s lemma, EP (IA f ) ≤ lim inf n EP (IA fn ) = Q(A).
But we obtain the inequality in the reverse direction by applying this expression to Ac
and using E(f ) = 1. Hence indeed f = dQ/dP, provided we can show that E(f ) = 1.
To show E(f ) = 1, we would usually employ the dominated convergence theorem (as
E(fn ) = 1 for all n). Unfortunately, it is not obvious how to dominate {fn }. To
circumvent this, we rely on another useful trick: a truncation argument. Define
ϕn (x) = 1 for x ≤ n, ϕn (x) = n + 1 − x for n ≤ x ≤ n + 1, and ϕn (x) = 0 for x ≥ n + 1.
where we write Q|Fn to signify that we have restricted the measure Q to Fn (i.e.,
apply lemma 2.2.6 with F = Fn ), and similarly for P|Fn . In particular, if we let
n → ∞, then our proof of the Radon-Nikodym theorem shows that
E(X|F) = limn dQ|Fn /dP|Fn = dQ|F /dP|F .
Let us briefly remark on the various definitions and constructions of the condi-
tional expectation. We then move on to martingales.
Remark 2.3.4. There are three approaches to defining the conditional expectation.
The first method is Kolmogorov’s abstract definition. It is the most difficult to
interpret directly, but is the cleanest and usually the easiest to use. Proving uniqueness
of the conditional expectation directly using Kolmogorov’s definition is easy (do it!),
but proving existence is hard—it requires the Radon-Nikodym theorem.
The second method is to define the conditional expectation as the least mean
square estimator. This is quite intuitive (certainly from a statistical point of view),
and proving existence and uniqueness of E(X|F) is not difficult provided one first
investigates in more detail the geometric properties of the space L2 . However, this
definition only works (and is natural) for X ∈ L2 , so that the conditional expectation
has to be extended to L1 at the end of the day (by approximation).
The third method is the one we used previously, i.e., defining the conditional ex-
pectation as a limit of discrete conditional expectations. This is perhaps most intuitive
from a probabilistic point of view, but it only seems natural for separable σ-algebras
(the extension to the non-separable case being somewhat abstract). Contrary to the
previous methods, it is existence that is easy to prove here (using the martingale con-
vergence theorem), but uniqueness is the difficult part.
Kolmogorov’s definition of conditional expectations is now universally accepted
in probability theory. However, all the techniques used above (including geometric
and martingale techniques) are very important and are used throughout the subject.
Example 2.3.7. The price of a stock in simple models of financial markets is often a
submartingale (we win on average—otherwise it would not be prudent to invest).
Remark 2.3.8. To prove that a process Xn is a supermartingale, it suffices to check
that E(Xn |Fn−1 ) ≤ Xn−1 a.s. for all n (why?); similarly, it suffices to check that
E(Xn |Fn−1 ) ≥ Xn−1 or E(Xn |Fn−1 ) = Xn−1 to demonstrate the submartingale
and the martingale property, respectively.
Here are some simple results about supermartingales. You should easily be able to
prove these yourself. Do this now. Note that it is often straightforward to extend such
results to submartingales by noting that if Xn is an Fn -submartingale, then K − Xn
is a supermartingale for any F0 -measurable K.
Lemma 2.3.9. Let Xn be a supermartingale. Then it can be written uniquely as Xn =
X0 + An + Mn , where Mn is a martingale and An is a nonincreasing predictable
process (i.e., An ≤ An−1 a.s. for all n).
Lemma 2.3.10. Let Xn be a supermartingale such that supn E(|Xn |) < ∞. Then
there exists a random variable X∞ such that Xn → X∞ a.s.
Lemma 2.3.11. Let Xn be a supermartingale and let An ∈ L1 be a nonnegative
predictable process. Then (A · X)n is a supermartingale, provided it is in L1 .
Example 2.3.14. Here is an example of a random time that is not a stopping time.
Recall that a bounded martingale Mn has finitely many upcrossings of the interval
[a, b]. It would be interesting to study the random time τ at which Mn finishes its final
upcrossing (i.e., the last time that Mn exceeds b after having previously dipped below
a). However, τ is not a stopping time: to know that Mn has up-crossed for the last
time, we need to look into the future to determine that it will never up-cross again.
Note that this also proves that Xn∧τ is again Fn -measurable! (This would not be true
if τ were any old random time, rather than a stopping time.)
Speaking of measurability, you might wonder what σ-algebra Xτ is naturally mea-
surable with respect to. The following definition clarifies this point.
Definition 2.3.16. Let Fn be a filtration and let τ be a stopping time. By definition,
Fτ = {A ∈ F∞ : A ∩ {τ ≤ n} ∈ Fn for all n} is the σ-algebra of events that occur
before time τ (recall that F∞ = σ{Fn : n = 1, 2, . . .}). If τ < ∞ a.s., then Xτ is
well defined and Fτ -measurable (why?).
Now suppose that Xn is a martingale (or a super- or submartingale). By the above
representation for the stopped process, it is immediately evident that even the stopped
process is a martingale (or super- or submartingale, respectively), confirming our in-
tuition that we can not make money on average using a predictable strategy.
Lemma 2.3.17. If Mn is a martingale (or super-, submartingale) and τ is a stopping
time, then Mn∧τ is again a martingale (or super-, submartingale, respectively).
In particular, it follows directly that E(Mn∧τ ) = E(M0 ) for any n. However, this
does not necessarily imply that E(Mτ ) = E(M0 ), even if τ < ∞ a.s.! To conclude the
latter, we need some additional constraints. We will see an example below; if this
seems abstract to you, skip ahead to the example.
Proof. We prove the martingale case; the supermartingale result follows identically. To prove
(a), it suffices to note that E(Mτ ) = E(MK∧τ ) = E(M0 ). For (b), note that Mn∧τ → Mτ
a.s. as n → ∞ (by τ < ∞), so the result follows by dominated convergence. For (c), note that
|Mn∧τ | ≤ |M0 | + Σ_{k=1}^n Ik≤τ |Mk − Mk−1 | ≤ |M0 | + K(n ∧ τ ) ≤ |M0 | + Kτ ,
where the right hand side is integrable by assumption. Now apply dominated convergence.
We now give an illuminating example. Make sure you understand this example,
and reevaluate what you know about martingales, gambling strategies and fair games.
Example 2.3.20. Let ξ1 , ξ2 , . . . be independent random variables which take the val-
ues ±1 with equal probability. Define Mn = M0 + Σ_{k=1}^n ξk . Mn are our winnings
in the following fair game: we repeatedly flip a coin; if it comes up heads we gain a
dollar, else we lose one. Proving that Mn is a martingale is a piece of cake (do it!)
First, we should note that Mn a.s. does not converge as n → ∞. This is practically
a trivial observation. If Mn (ω) were to converge for some path ω, then for sufficiently
large N we should have |Mn (ω) − M∞ (ω)| < 1/2 for all n ≥ N . But Mn takes only
integer values, so this would imply that Mn (ω) = K(ω) for some K(ω) ∈ Z and for
all n ≥ N . Such paths clearly have measure zero (as Mn always changes between
two time steps: |Mn − Mn−1 | = 1 a.s.) Of course Mn does not satisfy the conditions
of the martingale convergence theorem, so we are not surprised.
Now introduce the following stopping time: τ = inf{n : Mn ≥ 2M0 }. That is,
τ is the first time we have doubled our initial capital. Our strategy is to wait until this
happens, then to stop playing, and the question is: do we ever reach this point, i.e.,
is τ < ∞? Surprisingly, the answer is yes! Note that Mn∧τ is again a martingale,
but Mn∧τ ≤ 2M0 a.s. for all n. Hence this martingale satisfies the condition of the
martingale convergence theorem, and so Mn∧τ converges as n → ∞. But repeating
the argument above, the only way this can happen is if Mn∧τ “gets stuck” at 2M0 —
i.e., if τ < ∞ a.s. Apparently we always double our capital with this strategy!
We are now in a paradox, and there are several ways out, all of which you should
make sure you understand. First, note that Mτ = 2M0 by construction. Hence
E(Mτ ) ≠ E(M0 ), as you would expect. Let us use the optional stopping theorem
in reverse. Clearly Mn is a martingale, τ < ∞, and |Mn − Mn−1 | ≤ 1 for all n.
Nonetheless E(Mτ ) ≠ E(M0 ), so evidently E(τ ) = ∞—though we will eventually
double our profits, this will take infinitely long on average. Evidently you can make
money on average in a fair game—sometimes—but certainly not on any finite time
interval! But we already know this, because E(Mn∧τ ) = E(M0 ) for any finite n.
Second, note that Mn∧τ → Mτ a.s., but it is not the case that Mn∧τ → Mτ in
L1 ; after all, the latter would imply that E(Mn∧τ ) → E(Mτ ), which we have seen is
untrue. But recall when a process does not converge in L1 , our intuition was that the
“outliers” of the process somehow grow very rapidly in time. In particular, we have
seen that we eventually double our profit, but in the intermediate period we may have
to incur huge losses in order to keep the game fair.
With the help of the optional sampling theorem, we can actually quantify this idea!
What we will do is impose also a lower bound on our winnings: once we sink below
2.3. Conditional expectations and martingales for real 64
a certain value −R (we are R dollars in debt), we go bankrupt and can not continue
playing. Our new stopping time is κ = inf{n : Mn ≥ 2M0 or Mn ≤ −R} (κ is the
time at which we either reach our target profit, or go bankrupt). Now note that M κ
does satisfy the conditions of the optional stopping theorem (as |M n∧κ | ≤ R ∨ 2M0 ),
so E(Mκ ) = E(M0 ). But Mκ can only take one of two values −R and 2M0 , so we
can explicitly calculate the probability of going bankrupt. For example, if R = 0, then
we go bankrupt and double our capital with equal probability.
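It is instructive to watch this happen in simulation. The sketch below (ours, not part of the notes; numpy assumed, with M0 = 5 and R = 10 chosen arbitrarily) estimates the bankruptcy probability and compares it with the value M0 /(2M0 + R) that follows from E(Mκ ) = E(M0 ).

    # Example 2.3.20 with a bankruptcy level: stop at kappa = inf{n : M_n >= 2*M0 or M_n <= -R}.
    # Optional stopping gives P(M_kappa = -R) = M0 / (2*M0 + R)  (= 1/2 when R = 0).
    import numpy as np

    rng = np.random.default_rng(9)
    M0, R, paths = 5, 10, 100_000

    m = np.full(paths, M0, dtype=np.int64)                # current capital on each path
    active = np.ones(paths, dtype=bool)                    # paths that are still playing
    while active.any():
        m[active] += rng.choice([-1, 1], size=int(active.sum()))
        active = (m > -R) & (m < 2 * M0)                   # keep playing strictly inside the band

    print("P(bankrupt) ~", (m <= -R).mean(), "   M0/(2*M0+R) =", M0 / (2 * M0 + R))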
Evidently we can not circumvent our previous conclusion—that no money can be
made, on average, in a fair game—unless we allow ourselves to wait an arbitrarily
long time and to go arbitrarily far into debt. This is closely related to the gambling
origin of the word martingale. In 19th century France, various betting strategies were
directly based on the idea that if you play long enough, you will make a profit for sure.
Such strategies were called martingales, and were firmly believed in by some—until
they went bankrupt. In the contemporary words of W. M. Thackeray,
“You have not played as yet? Do not do so; above all avoid a martingale,
if you do. [. . .] I have calculated infallibly, and what has been the effect?
Gousset empty, tiroirs empty, necessaire parted for Strasbourg!”
— W. M. Thackeray, The Newcomes (1854).
Following the work of Doob we will not follow his advice, but this does not take away
from the fact that the original martingale is not recommended as a gambling strategy.
There is much more theory on exactly when martingales converge, and what con-
sequences this has, particularly surrounding the important notion of uniform integra-
bility. We will not cover this here (we want to actually make it to stochastic calculus
before the end of term!), but if you wish to do anything probabilistic a good under-
standing of these topics is indispensable and well worth the effort.
A supermartingale inequality
Let us discuss one more elementary application of martingales and stopping times,
which is useful in the study of stochastic stability.
Let Mn be a nonnegative supermartingale. By the martingale convergence theo-
rem, Mn → M∞ a.s. as n → ∞. Let us now set some threshold K > 0; for some
K, the limit M∞ will lie below K with nonzero probability. This does not mean,
however, that the sample paths of Mn do not exceed the threshold K before ulti-
mately converging to M∞ , even for those paths where M∞ < K. We could thus ask
the question: what is the probability that the sample paths of Mn will never exceed
some threshold K? Armed with stopping times, martingale theory, and elementary
probability, we can proceed to say something about this question.
Lemma 2.3.21. Let Mn be an a.s. nonnegative supermartingale and K > 0. Then
P( supn Mn ≥ K ) ≤ E(M0 )/K.
In particular, for any threshold K, the probability of ever exceeding K can be made
arbitrarily small by starting the martingale close to zero.
Proof. Let us first consider a finite time interval, i.e., let us calculate P(supn≤N Mn ≥ K) for
N < ∞. The key is to note that {ω : supn≤N Mn (ω) ≥ K} = {ω : Mτ (ω)∧N (ω) ≥ K},
where τ is the stopping time τ = inf{n : Mn ≥ K}. After all, if supn≤N Mn ≥ K then
τ ≤ N , so Mτ ∧N ≥ K. Conversely, if supn≤N Mn < K then τ > N , so Mτ ∧N < K.
Using Chebyshev’s inequality and the supermartingale property, we have
P( supn≤N Mn ≥ K ) = P(Mτ ∧N ≥ K) ≤ E(Mτ ∧N )/K ≤ E(M0 )/K.
Now let N → ∞ using monotone convergence (why monotone?), and we are done.
Once again, this type of result can be generalized in various ways, and one can also
obtain bounds on the moments E(sup n |Mn |p ). You can try to prove these yourself,
or look them up in the literature if you need them.
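As a sanity check of lemma 2.3.21 (our sketch, not part of the notes; numpy assumed), take the nonnegative martingale Mn = M0 U1 · · · Un with Uk independent and uniform on [0, 2], so that E(Uk ) = 1, and estimate the probability that the path ever exceeds a level K over a long but finite horizon.

    # Supermartingale inequality: P(sup_n M_n >= K) <= E(M_0)/K for the nonnegative
    # martingale M_n = M_0 * U_1 * ... * U_n with U_k uniform on [0, 2].
    import numpy as np

    rng = np.random.default_rng(10)
    paths, N, M0, K = 50_000, 200, 1.0, 10.0
    U = rng.uniform(0.0, 2.0, size=(paths, N))
    M = M0 * np.cumprod(U, axis=1)                         # the martingale path

    p_exceed = (M.max(axis=1) >= K).mean()                 # P(sup_{n<=N} M_n >= K)
    print("P(sup M_n >= K) ~", p_exceed, "   bound E(M0)/K =", M0 / K)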
Definition 2.4.1. Let Xt and Yt be two (discrete or continuous time) stochastic pro-
cesses. Then Xt and Yt are said to be indistinguishable if P(Xt = Yt for all t) = 1,
and they are said to be modifications of each other if P(Xt = Yt ) = 1 for all t.
Clearly if Xt and Yt are indistinguishable, then they are modifications (why?). In
discrete time, the converse is also true: after all,
P(Xn = Yn for all n) = P( ∩n {ω : Xn (ω) = Yn (ω)} ) ≥ 1 − Σn P(Xn ≠ Yn ),
and the right hand side equals one whenever Xn and Yn are modifications.
Evidently modification may not preserve the value of the process at a stopping
time. This will be a real problem that we have to deal with in the theory of optimal
stopping with partial observations: more details will follow in chapter 8.
As already mentioned in remark 1.6.7, we often wish to be able to calculate the
time integral of a process Xt , and we still want its expectation to be well defined. To
this end, we need the process to be measurable not only with respect to the probability
space, but also with respect to time, in which case we can apply Fubini’s theorem.
Definition 2.4.5. Let Xt be a stochastic process on some filtered probability space
(Ω, F, {Ft }, P) and time set T ⊂ [0, ∞[ (e.g., T = [0, T ] or [0, ∞[). Then Xt is
called adapted if Xt is Ft -measurable for all t, is called measurable if the random
variable X· : T × Ω → R is B(T) × F-measurable, and is called progressively
measurable if X· : [0, t] ∩ T × Ω → R is B([0, t] ∩ T) × Ft -measurable for all t.
What do these definitions mean? Adapted you know; measurable means that
Yt = ∫_0^t Xs ds is well defined and F-measurable, while progressive measurability
guarantees, in addition, that Yt is Ft -measurable, i.e., that the time integral is again
an adapted process.
Continuous processes
Life becomes much easier if Xt has continuous sample paths: i.e., when the function
t 7→ Xt (ω) is continuous for every ω. In this case most of the major issues are
no longer problematic, and we can basically manipulate such processes in a similar
manner as in the discrete time setting. Here is a typical argument.
Lemma 2.4.6. Let Xt be a stochastic process with continuous paths, and let Yt be
another such process. If Xt and Yt are modifications, then they are indistinguish-
able. If Xt is Ft -adapted and measurable, then it is Ft -progressively measurable.
Proof. As Xt and Yt have continuous paths, it suffices to compare them on a countable dense
set: i.e., P(Xt = Yt for all t) = P(Xt = Yt for all rational t). But the latter is unity whenever
Xt and Yt are modifications, by the same argument as in the discrete time case.
For the second part, construct a sequence of approximate processes X k : [0, t] × Ω → R
such that Xtk (ω) = Xt (ω) for all ω ∈ Ω and t = 0, 2^{−k} , . . . , 2^{−k} ⌊2^k t⌋, and such that the
sample paths of X k are piecewise linear. Then Xsk (ω) → Xs (ω) as k → ∞ for all ω and
s ∈ [0, t]. But it is easy to see that every X k is B([0, t]) × Ft -measurable, and the limit of a
sequence of measurable maps is again measurable. The result follows.
3 For an example where Xt is adapted and measurable, but Yt is not adapted, see [Let88, example 2.2].
A natural question to ask is whether the usual limit theorems hold even in con-
tinuous time. For example, is it true that if Xt ≥ 0 a.s. for all t ∈ [0, ∞[, then
E(lim inf t Xt ) ≤ lim inf t E(Xt ) (Fatou’s lemma)? This could be a potentially tricky
question, as it is not clear that the random variable lim inf t Xt is even measurable!
When we have continuous sample paths, however, we can establish that this is the
case. Once this is done, extending the basic convergence theorems is straightforward.
Lemma 2.4.7. Let the process Xt have continuous sample paths. Then the random
variables inf t Xt , supt Xt , lim inf t Xt and lim supt Xt are measurable.
Proof. As the sample paths of Xt are continuous we have, for example, inf t Xt = inf t∈Q Xt
where Q are the rational numbers. As these are countable, measurability follows from the
countable result (lemma 1.3.3). The same holds for supt Xt , lim inf t Xt , and lim supt Xt .
We can now establish, for example, Fatou’s lemma in continuous time. The con-
tinuous time proofs of the monotone convergence theorem and the dominated conver-
gence theorem follow in the same way.
Lemma 2.4.8. Let Xt be an a.s. nonnegative stochastic process with continuous sam-
ple paths. Then E(lim inf t Xt ) ≤ lim inf t E(Xt ). If there is a Y ∈ L1 such that
Xt ≤ Y a.s. for all t, then E(lim supt Xt ) ≥ lim supt E(Xt ).
Proof. By the previous lemma, lim inf t Xt is measurable so the statement makes sense. Now
suppose that the result does not hold, i.e., E(lim inf t Xt ) > lim inf t E(Xt ). Then there ex-
ists a sequence of times tn ↗ ∞ such that E(lim inf t Xt ) > lim inf n E(Xtn ). But note
that lim inf n Xtn ≥ lim inf t Xt by the definition of the inferior limit, so this would imply
E(lim inf n Xtn ) > lim inf n E(Xtn ). However, Xtn is a discrete time stochastic process, and
hence E(lim inf n Xtn ) ≤ lim inf n E(Xtn ) follows from the discrete time version of Fatou’s
lemma. Thus we have a contradiction. The second part of the result follows similarly.
With a little more effort, we can also extend the martingale convergence theorem.
Theorem 2.4.9. Let Mt be martingale, i.e., E(Mt |Fs ) = Ms a.s. for any s ≤ t, and
assume that Mt has continuous sample paths. If any of the following conditions hold:
(a) supt E(|Mt |) < ∞; or (b) supt E((Mt )+ ) < ∞; or (c) supt E((Mt )− ) < ∞;
then there exists an F∞ -measurable random variable M∞ ∈ L1 s.t. Mt → M∞ a.s.
Proof. We are done if we can extend Doob’s upcrossing lemma to the continuous time case;
the proof of the martingale convergence theorem then follows identically.
Let UT (a, b) denote the number of upcrossings of a < b by Mt in the interval t ∈ [0, T ].
Now consider the sequence of times tkn = n2−k T , and denote by UTk (a, b) the number of
upcrossings of a < b by the discrete time process Mtkn , n = 0, . . . , 2k . Note that Mtkn is a
discrete time martingale, so by the upcrossing lemma E(UTk (a, b)) ≤ E((a − MT )+ )/(b − a).
We now claim that UTk (a, b) % UT (a, b), from which the result follows immediately using
monotone convergence. To prove the claim, note that as Mt has continuous sample paths and
[0, T ] is compact, the sample paths of Mt are uniformly continuous on [0, T ]. Hence we must
have UT (a, b) < ∞, and so UT (a, b)(ω) = UTk (a, b)(ω) for k(ω) sufficiently large.
Lemma 3.1.1 (Finite dimensional distributions). For any finite set of times t1 <
t2 < · · · < tn , n < ∞, the n-dimensional random variable (xt1 (N ), . . . , xtn (N ))
converges in law as N → ∞ to an n-dimensional random variable (xt1 , . . . , xtn )
such that xt1 , xt2 − xt1 , . . . , xtn − xtn−1 are independent Gaussian random variables
with zero mean and variance t1 , t2 − t1 , . . . , tn − tn−1 , respectively.
Proof. The increments xtk (N ) − xtk−1 (N ), k = 1, . . . , n (choose t0 = 0) are independent
for any N , so we may consider the limit in law of each of these increments separately. The
result follows immediately from the central limit theorem.
A different aspect of the Wiener process is the regularity of its sample paths.
For increasingly large N , the random walk xt (N ) has increasingly small increments.
Hence it is intuitively plausible that in the limit as N → ∞, the limiting process xt
will have continuous sample paths. In fact, this almost follows from lemma 3.1.1; to
be more precise, the following result holds.
Proposition 3.1.2. Suppose that we have constructed some stochastic process xt whose finite dimensional distributions are those of lemma 3.1.1. Then there exists a modification x̃t of xt such that t ↦ x̃t is continuous. [Recall that the process x̃t is a modification of the process xt whenever xt = x̃t a.s. for all t.]
The simplest proof of this result is an almost identical copy of the construction we
will use to prove existence of the Wiener process; let us thus postpone the proof of
proposition 3.1.2 until the next section.
We have now determined all the finite-dimensional distributions of the Wiener pro-
cess, and we have established that we may choose its sample paths to be continuous.
These are precisely the defining properties of the Wiener process.
Definition 3.1.3. A stochastic process Wt is called a Wiener process if
1. the finite dimensional distributions of Wt are those of lemma 3.1.1; and
2. the sample paths of Wt are continuous.
An Rn -valued process Wt = (Wt1 , . . . , Wtn ) is called an n-dimensional Wiener pro-
cess if Wt1 , . . . , Wtn are independent Wiener processes.
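To make the definition tangible, here is a minimal numerical sketch (an added illustration, not part of the notes' development): it samples a discretized Wiener path on a uniform grid by accumulating independent Gaussian increments whose variances equal the grid spacing, exactly as prescribed by the finite dimensional distributions of lemma 3.1.1. The function name and grid parameters are arbitrary choices.

import numpy as np

def sample_wiener_path(T=1.0, n_steps=1000, rng=None):
    # Sample W on the grid t_k = k T / n_steps: by lemma 3.1.1, the increments
    # over disjoint intervals are independent N(0, interval length) variables.
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    increments = rng.normal(0.0, np.sqrt(dt), size=n_steps)
    W = np.concatenate(([0.0], np.cumsum(increments)))   # W_0 = 0
    t = np.linspace(0.0, T, n_steps + 1)
    return t, W

t, W = sample_wiener_path()
# W[750] - W[250] is, in law, an N(0, 0.5) random variable independent of W[250].

Of course, such a grid sample only reproduces the finite dimensional distributions; continuity of the full sample paths is precisely the content of proposition 3.1.2 and of the construction in the next section.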
In order for this to make sense as a definition, we have to establish at least some
form of uniqueness—two processes Wt and Wt0 which both satisfy the definition
should have the same properties! Of course we could never require Wt = Wt0 a.s.,
for the same reason that X and X 0 being (zero mean, unit variance) Gaussian random
variables does not mean X = X 0 a.s. The appropriate sense of uniqueness is that if
Wt and Wt0 both satisfy the definition above, then they have the same law.
Proposition 3.1.4 (Uniqueness). If Wt and Wt0 are two Wiener processes, then the
C([0, ∞[)-valued random variables W· , W·0 : Ω → C([0, ∞[) have the same law.
Remark 3.1.5 (C([0, ∞[)-valued random variables). A C([0, ∞[)-valued random
variable on some probability space (Ω, F, P) is, by definition, a measurable map from
Ω to C([0, ∞[). But in order to speak of a measurable map, we have to specify a σ-
algebra C on C([0, ∞[). There are two natural possibilities in this case:
1. Any x ∈ C([0, ∞[) represents an entire continuous path x = (xt : t ∈ [0, ∞[).
Define for every time t the evaluation map πt : C([0, ∞[) → R, πt (x) = xt . It
is then natural to set C = σ{πt : t ∈ [0, ∞[}.
2. As you might know from a course on functional analysis, the natural topology
on C([0, ∞[) is the topology of uniform convergence on compact intervals. We
could take C to be the Borel σ-algebra with respect to this topology.
It turns out that these two definitions for C coincide: see [RW00a, lemma II.82.3]. So
fortunately, what we mean by a C([0, ∞[)-valued random variable is unambiguous.
Proof of proposition 3.1.4. Let us first establish that W· is in fact measurable, and hence a
random variable (this follows identically for W·0 ). By assumption Wt is measurable for every
t (as {Wt } is assumed to be a stochastic process), so Wt−1 (A) ∈ F for every A ∈ B(R).
But Wt = πt (W· ), so W·−1 (B) ∈ F for every set B ∈ C of the form B = πt−1 (A). It
remains to note that C = σ{πt−1 (A) : A ∈ B(R), t ∈ [0, ∞[} by construction, so we find that
W·−1 (C) = σ{Wt−1 (A) : A ∈ B(R), t ∈ [0, ∞[} ⊂ F. Hence W· is indeed a C([0, ∞[)-
valued random variable, and the same holds for W·0 .
It remains to show that W· and W·0 have the same law, i.e., that they induce the same
probability measure on (C([0, ∞[), C). The usual way to show that two measures coincide is
using Dynkin’s π-system lemma 1.7.3, and this is indeed what we will do! A cylinder set is a
set C ∈ C of the form C = π_{t_1}^{-1}(A_1) ∩ π_{t_2}^{-1}(A_2) ∩ · · · ∩ π_{t_n}^{-1}(A_n) for an arbitrary finite number
of times t1 , . . . , tn ∈ [0, ∞[ and Borel sets A1 , . . . , An ∈ B(R). Denote by Ccyl the collection
of all cylinder sets, and note that Ccyl is a π-system and σ{Ccyl } = C. But the definition of the
Wiener process specifies completely all finite dimensional distributions, so the laws of any two
Wiener processes must coincide on Ccyl . Dynkin’s π-system lemma does the rest.
Given a Wiener process Wt , we can introduce its natural filtration FtW = σ{Ws :
s ≤ t}. More generally, it is sometimes convenient to speak of an F t -Wiener process.
Definition 3.1.6. Let Ft be a filtration. Then a stochastic process Wt is called an Ft -
Wiener process if Wt is a Wiener process, is Ft -adapted, and Wt − Ws is independent
of Fs for any t > s. [Note that any Wiener process Wt is an FtW -Wiener process.]
Lemma 3.1.7. An Ft -Wiener process Wt is an Ft -martingale.
Proof. We need to prove that E(Wt |Fs ) = Ws for any t > s. But as Ws is Fs -measurable (by
adaptedness) this is equivalent to E(Wt − Ws |Fs ) = 0, and this is clearly true by the definition
of the Wiener process (as Wt − Ws has zero mean and is independent of Fs ).
The Wiener process is also a Markov process. You know what this means in
discrete time from previous courses, but we have not yet introduced the continuous
time definition. Let us do this now.
Definition 3.1.8. An Ft -adapted process Xt is called an Ft -Markov process if we
have E(f (Xt )|Fs ) = E(f (Xt )|Xs ) for all t ≥ s and all bounded measurable func-
tions f . When the filtration is not specified, the natural filtration FtX is implied.
Lemma 3.1.9. An Ft -Wiener process Wt is an Ft -Markov process.
Proof. We have to prove that E(f (Wt )|Fs ) = E(f (Wt )|Ws ). Note that we can trivially write
f (Wt ) = f ((Wt −Ws )+Ws ), where Wt −Ws is independent of Fs and Ws is Fs -measurable.
We claim that E(f (Wt )|Fs ) = g(Ws ) with g(x) = E(f (Wt −Ws +x)). As g(Ws ) is σ(Ws )-
measurable, we can then write E(f (Wt )|Ws ) = E(E(f (Wt )|Fs )|Ws ) = E(g(Ws )|Ws ) =
g(Ws ), where we have used σ(Ws ) ⊂ Fs . The result now follows.
It remains to prove E(f (Wt )|Fs ) = g(Ws ), or equivalently E(g(Ws )IA ) = E(f (Wt )IA )
for all A ∈ Fs (by Kolmogorov’s definition of the conditional expectation). Consider the pair
of random variables X = Wt −Ws and Y = (Ws , IA ), and note that X and Y are independent.
Hence by theorem 1.6.6, the law of (X, Y ) is a product measure µX × µY , and so
E(f(Wt) I_A) = ∫ f(x + w) a µ_X(dx) × µ_Y(dw, da)
= ∫ [ ∫ f(x + w) µ_X(dx) ] a µ_Y(dw, da) = ∫ g(w) a µ_Y(dw, da) = E(g(Ws) I_A).
To complete our discussion of the elementary properties of the Wiener process, let
us exhibit some odd properties of its sample paths. The sample paths of the Wiener
process are extremely irregular, and the study of their properties remains an active
topic to this day (see, e.g., [MP06]). We will only consider those properties which we
will need to make sense of later developments.
Lemma 3.1.10. With unit probability, the sample paths of a Wiener process W t are
non-differentiable at any rational time t.
Proof. Suppose that Wt is differentiable at some point t. Then limh&0 (Wt+h − Wt )/h exists
and is finite, and in particular, there exists a constant M < ∞ (depending on ω) such that
|Wt+h − Wt |/h < M for sufficiently small h > 0. We will show that with unit probability this
cannot be true. Set h = n−1 where n is integer; then |Wt+h − Wt |/h < M for sufficiently
small h > 0 implies that sup_{n≥1} n|W_{t+n^{-1}}(ω) − Wt(ω)| < ∞. But we can write (why?)

{ω : sup_{n≥1} n|W_{t+n^{-1}}(ω) − Wt(ω)| < ∞} = ⋃_{M≥1} ⋂_{n≥1} {ω : n|W_{t+n^{-1}}(ω) − Wt(ω)| < M}.

But W_{t+n^{-1}} − Wt is a Gaussian random variable with zero mean and variance n^{-1}, so

P(n|W_{t+n^{-1}} − Wt| < M) = P(|ξ| < M n^{-1/2}) → 0 as n → ∞,

where ξ is a canonical Gaussian random variable with zero mean and unit variance. Hence every intersection in the union above has zero probability, and so does the countable union over M. We thus find that P(lim_{n^{-1}↘0} (W_{t+n^{-1}} − Wt)/n^{-1} is finite) = 0, so Wt is a.s. not differentiable at t
Apparently the sample paths of Brownian motion are very rough; certainly the
derivative of the Wiener process cannot be a sensible stochastic process, once again
confirming the fact that white noise is not a stochastic process (compare with the
discussion in the Introduction). With a little more work one can show that with unit
probability, the sample paths of the Wiener process are not differentiable at any time
t (this does not follow trivially from the previous result, as the set of all times t is not
countable); see [KS91, theorem 2.9.18] for a proof.
Another measure of the irregularity of the sample paths of the Wiener process is
their total variation. For any real-valued function f (t), the total variation of f on the
interval t ∈ [a, b] is defined as
TV(f, a, b) = sup_{k≥0} sup_{(t_i)∈P(k,a,b)} Σ_{i=0}^{k} |f(t_{i+1}) − f(t_i)|,
where P (k, a, b) denotes the set of all partitions a = t0 < t1 < · · · < tk < tk+1 = b.
You can think of the total variation as follows: suppose that we are driving around in
a car, and f (t) denotes our position at time t. Then TV(f, a, b) is the total distance
which we have travelled in the time interval [a, b] (i.e., if our car were to go for a fixed
number of miles per gallon, then TV(f, a, b) would be the amount of fuel which we
used up between time a and time b). Note that even when supa≤s≤t≤b |f (t) − f (s)|
is small, we could still travel a significant total distance if we oscillate very rapidly in
the interval [a, b]. But the Wiener process, whose time derivative is infinite at every
(rational) time, must oscillate very rapidly indeed!
Lemma 3.1.11. With unit probability, TV(W· , a, b) = ∞ for any a < b. In other
words, the sample paths of the Wiener process are a.s. of infinite variation.
Proof. Denote by P(a, b) = ⋃_{k≥0} P(k, a, b) the set of all finite partitions of [a, b]. We are done if we can find a sequence of partitions πn ∈ P(a, b) such that

Σ_{t_i∈π_n} |W_{t_{i+1}} − W_{t_i}| → ∞ as n → ∞, a.s.
To this end, let us concentrate on a slightly different object. For any π ∈ P (a, b), we have
E[ Σ_{t_i∈π} (W_{t_{i+1}} − W_{t_i})^2 ] = Σ_{t_i∈π} (t_{i+1} − t_i) = b − a.
Call Z_i = (W_{t_{i+1}} − W_{t_i})^2 − (t_{i+1} − t_i), and note that for different i, the random variables Z_i are independent and have the same law as (ξ^2 − 1)(t_{i+1} − t_i), where ξ is a Gaussian random variable with zero mean and unit variance. Hence

E[ ( Σ_{t_i∈π} (W_{t_{i+1}} − W_{t_i})^2 − (b − a) )^2 ] = E[ ( Σ_{t_i∈π} Z_i )^2 ] = E((ξ^2 − 1)^2) Σ_{t_i∈π} (t_{i+1} − t_i)^2.
Let us now choose a sequence πn such that sup_{t_i∈π_n} |t_{i+1} − t_i| → 0. Then

E[ ( Σ_{t_i∈π_n} (W_{t_{i+1}} − W_{t_i})^2 − (b − a) )^2 ] ≤ (b − a) E((ξ^2 − 1)^2) sup_{t_i∈π_n} |t_{i+1} − t_i| → 0 as n → ∞.

In other words, Qn = Σ_{t_i∈π_n} (W_{t_{i+1}} − W_{t_i})^2 converges to b − a in L^2, so Qn → b − a in probability also, and hence we can find a subsequence m(n) % ∞ such that
Qm(n) → b − a a.s. But then TV(W· , a, b) < ∞ with nonzero probability would imply
b − a ≤ lim_{n→∞} [ sup_{t_i∈π_{m(n)}} |W_{t_{i+1}} − W_{t_i}| ] [ Σ_{t_i∈π_{m(n)}} |W_{t_{i+1}} − W_{t_i}| ] = 0
with nonzero probability (as supti ∈πm(n) |Wti+1 − Wti | → 0 by continuity of the sample
paths), which contradicts a < b. Hence TV(W· , a, b) = ∞ a.s. for fixed a < b. It remains
to note that it suffices to consider rational a < b; after all, if TV(W· , a, b) is finite for some
a < b, it must be finite for all subintervals of [a, b] with rational endpoints. As we have shown
that with unit probability this cannot happen, the proof is complete.
You might argue that the Wiener process is a very bad model for Brownian mo-
tion, as clearly no physical particle can travel an infinite distance in a finite time! But
as usual, the Wiener process should be interpreted as an extremely convenient math-
ematical idealization. Indeed, any physical particle in a fluid will have travelled a
humongous total distance, due to the constant bombardment by the fluid molecules,
in its diffusion between two (not necessarily distant) points. We have idealized mat-
ters by making the total distance truly infinite, but look what we have gained: the
martingale property, the Markov property, etc., etc., etc.
It is also the infinite variation property, however, that will get us in trouble when
we define stochastic integrals. Recall from theRIntroduction that we ultimately wish to
t
give meaning to formal integrals of the form 0 fs ξs ds, where ξs is white noise, by
Rt
defining a suitable stochastic integral of the form 0 fs dWs .
The usual way to define such objects is through the Stieltjes integral. Forgetting
about probability theory for the moment, recall that we write by definition
Z t X
f (s) dg(s) = lim f (si ) (g(ti+1 ) − g(ti )),
0 π
ti ∈π
where the limit is taken over a sequence of refining partitions π ∈ P (0, t) such that
max |ti+1 − ti | → 0, and si is an arbitrary point between ti+1 and ti . When does the
limit exist? Well, note that for πm ⊂ πn (πn is a finer partition than πm ),
X X
f (si ) (g(ti+1 ) − g(ti )) − f (s0i ) (g(t0i+1 ) − g(t0i ))
ti ∈πn t0i ∈πm
X
= (f (si ) − f (s00i )) (g(ti+1 ) − g(ti )),
ti ∈πn
Hence if f is continuous and g is of finite total variation, then the sequence of sums
obtained from a refining sequence of partitions πn is a Cauchy sequence and thus
converges to a unique limit (which we call the Stieltjes integral). More disturbingly,
however, one can also prove the converse: if g is of infinite variation, then there
exists a continuous function f such that the Stieltjes integral does not exist. The proof
requires a little functional analysis, so we will not do it here; see [Pro04, section I.8].
The unfortunate conclusion of the story is that the integral ∫_0^t f_s dW_s cannot be
defined in the usual way—at least not if we insist that we can integrate at least contin-
uous processes ft , which is surely desirable. With a little insight and some amount of
work, we will succeed in circumventing this problem. In fact, you can get a hint on
how to proceed from the proof of lemma 3.1.11: even though the total variation of the
Wiener process is a.s. infinite, the quadratic variation
lim_{n→∞} Σ_{t_i∈π_{m(n)}} (W_{t_{i+1}} − W_{t_i})^2 = b − a a.s.
is finite. Maybe if we square things, things will get better? Indeed they will, though
we still need to introduce a crucial insight in order not to violate the impossibility of
defining the Stieltjes integral. But we will go into this extensively in the next chapter.
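The contrast between infinite total variation and finite quadratic variation is easy to see numerically. The following sketch (an added illustration; all parameters are arbitrary) evaluates both sums along dyadic partitions of [0, 1] for a single simulated path: the first order sums keep growing as the partition is refined, while the second order sums settle down near b − a = 1.

import numpy as np

rng = np.random.default_rng(0)
n_max = 16
dt = 2.0 ** (-n_max)
# One simulated path on the finest dyadic grid of [0, 1].
W = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), 2 ** n_max))))

for n in (4, 8, 12, 16):
    step = 2 ** (n_max - n)                  # coarsen to the partition t_i = i 2^{-n}
    incr = np.diff(W[::step])
    total_variation = np.abs(incr).sum()     # grows roughly like 2^{n/2}
    quadratic_variation = (incr ** 2).sum()  # approaches b - a = 1
    print(n, total_variation, quadratic_variation)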
1 We are cheating a little, as we only defined the countable product space in Theorem 1.6.8 for Ω = R.
In fact, such a space always exists, see [Kal97, corollary 5.18], but for our purposes Theorem 1.6.8 will turn
out to be sufficient: every Ω will be itself a space that carries a sequence of independent random variables,
and, as in the proof of Theorem 1.6.8, we can always construct such a sequence on Ω = R.
The second simplification is to make our random walks have continuous sample
paths, unlike xt (N ) which has jumps. The reason for this is that there is a very
simple way to prove that a sequence of continuous functions converges to a continuous
function: this is always the case when the sequence converges uniformly. This is an
elementary result from calculus, but let us recall it to refresh your memory.
Lemma 3.2.2. Let fn (t), n = 1, 2, . . . be a sequence of continuous functions on t ∈
[0, 1] that converge uniformly to some function f (t), i.e., supt∈[0,1] |fn (t)−f (t)| → 0
as n → ∞. Then f (t) must be a continuous function.
Proof. Clearly |f (x)−f (y)| ≤ |f (x)−fn (x)|+|fn (x)−fn (y)|+|fn (y)−f (y)| for any n and
x, y ∈ [0, 1]. Let ε > 0. Then we can choose n sufficiently large so that |fn (x) − f (x)| < ε/3
for any x, and then |f (x) − f (y)| ≤ 2ε/3 + |fn (x) − fn (y)| for all x, y ∈ [0, 1]. But as fn is
uniformly continuous (as [0, 1] is compact), there is a δ > 0 such that |fn (x) − fn (y)| < ε/3
for all |x − y| < δ. Hence f satisfies the (ε-δ) definition of continuity.
Here is another useful trick from the same chapter of your calculus textbook.
Lemma 3.2.3. Let fn(t), n = 1, 2, . . . be a sequence of continuous functions on t ∈ [0, 1], such that Σ_n sup_{t∈[0,1]} |f_{n+1}(t) − f_n(t)| < ∞. Then fn(t) converge uniformly to some continuous function f(t).

Proof. Note that fn(t) = f1(t) + Σ_{k=1}^{n−1} (f_{k+1}(t) − f_k(t)). By our assumption, the sum is absolutely convergent so fn(t) → f(t) as n → ∞ for every t ∈ [0, 1]. It remains to show that the convergence is uniform. But this follows from sup_{t∈[0,1]} |f_m(t) − f(t)| = sup_{t∈[0,1]} |Σ_{k=m}^{∞} (f_{k+1}(t) − f_k(t))| ≤ Σ_{k=m}^{∞} sup_{t∈[0,1]} |f_{k+1}(t) − f_k(t)| → 0.
You are probably starting to get a picture of the strategy which we will follow:
we will define a sequence W_t^n of random walks with continuous sample paths, and attempt to prove that Σ_n sup_{t∈[0,1]} |W_t^n − W_t^{n+1}| < ∞ a.s. We are then guaranteed that W_t^n converges a.s. to some stochastic process Wt with continuous sample paths, and all that remains is to verify the finite dimensional distributions of Wt. But the finite dimensional distributions are the easy part—see, e.g., lemma 3.1.1!
It is at this point, however, that we need a little bit of real insight. Following our
intended strategy, it may seem initially that we could define Wtn just like xt (n), except
that we make the sample paths piecewise linear rather than piecewise constant (e.g., set W_{k/2^n}^n = Σ_{ℓ=1}^{k} ξ_ℓ / 2^{n/2} for k = 0, . . . , 2^n, and interpolate linearly in the time intervals k/2^n < t < (k + 1)/2^n). However, this way W_t^n can never converge a.s. as n → ∞. In going from W_t^n to W_t^{n+1}, the process W_t^n gets compressed to the interval [0, 1/2], while the increments of W_t^{n+1} on ]1/2, 1] are defined using a set of independent random variables ξ_{2^n+1}, . . . , ξ_{2^{n+1}}. This is illustrated in figure 3.1.
Remark 3.2.4. Of course, we do not necessarily expect our random walks to converge
almost surely; for example, lemma 3.1.1 was based on the central limit theorem, which
only gives convergence in law. The theory of weak convergence, which defines the
appropriate notion of convergence in law for stochastic processes, can indeed be used
to prove existence of the Wiener process; see [Bil99] or [KS91, section 2.4]. The
technicalities involved are highly nontrivial, however, and we would have to spend
Figure 3.1. Sample paths, given a single realization of {ξn }, of Wtn for n = 5, 6, 7. When n
is increased by one, the previous sample path is compressed to the interval [0, 1/2], scaled by
2−1/2 , and the increments in ]1/2, 1] are generated from the next batch of independent ξn s.
an entire chapter just introducing all the necessary machinery! Instead, we will use
a clever trick (due to P. Lévy and Z. Ciesielski) to define a very special sequence
of random walks Wtn which actually converges almost surely. Once we have a.s.
convergence, the proofs become much more elementary and intuitive.
The idea is illustrated in figure 3.2. The random walk W_t^n consists of 2^n points
connected by straight lines. In going from Wtn to Wtn+1 , our previous strategy was
to concatenate another 2n points at the end of the path, and then to compress this new
path to fit in the interval [0, 1]. Rather than add points at the end of the path, however,
we will now add our new 2n points in between the existing nodes of the sample path.
This way the shape of the path remains fixed, and we just keep adding detail at finer
and finer scales. Then we would certainly expect the sample paths to converge almost
surely; the question is whether we can add points between the existing nodes in such
a way that the random walks Wtn have the desired statistics.
Let us work out how to do this. The random walk Wtn has nodes at t = k2−n ,
k = 0, . . . , 2n , connected by straight lines. For any further Wtm with m > n, we
want to only add nodes between the times k2−n , i.e., Wtm = Wtn for t = k2−n ,
k = 0, . . . , 2^n. Hence the points W_{k2^{−n}}^n must already be distributed according to the corresponding finite dimensional distribution of lemma 3.1.1: W_{k2^{−n}}^n must be a Gaussian random variable with zero mean and variance k2^{−n}, and W_{(k+1)2^{−n}}^n − W_{k2^{−n}}^n must be independent of W_{k2^{−n}}^n for any k. Suppose that we have constructed such a
Wtn . We need to show how to generate Wtn+1 from it, by adding points between the
existing nodes only, so that Wtn+1 has the correct distribution.
Fix n and k, and assume we are given Wtn−1 . Let us write
Y_0 = W_{k2^{−(n−1)}}^n = W_{k2^{−(n−1)}}^{n−1},  Y_1 = W_{(k+1)2^{−(n−1)}}^n = W_{(k+1)2^{−(n−1)}}^{n−1}.
Figure 3.2. Rather than concatenate additional ξn s at the end of the sample paths, we define
Wtn for increasing n by adding new ξn in between the existing nodes of the sample path. This
way detail is added at increasingly fine scales, and the sample path will in fact converge a.s. The
procedure is illustrated on the right for Wt0 and Wt1 ; the first path is a line between W00 = 0
and W10 = ξ1 , while for Wt1 another point is added at t = 1/2 (see text).
We wish to choose a new point X = W_{(2k+1)2^{−n}}^n between Y_0 and Y_1, such that
mean is isotropic—it is invariant under orthogonal transformations, i.e., x has the same law as Ax for any
orthogonal matrix A. If you did not know this, now is a good time to prove it!
Theorem 3.2.5. There exists a Wiener process Wt on some probability space (Ω, F, P).
Proof. Let us begin by introducing the Schauder (tent-shaped) functions. Note that the deriva-
tive of a Schauder function should be piecewise constant: it has a positive value on the increas-
ing slope, a negative value on the decreasing slope, and is zero elsewhere. Such functions are
called the Haar wavelets Hn,k (t) with n = 0, 1, . . . and k = 1, 3, 5, . . . , 2n − 1, defined as
H_{0,1}(t) = 1, and for n ≥ 1

H_{n,k}(t) = +2^{(n−1)/2} if (k − 1)2^{−n} < t ≤ k2^{−n},  −2^{(n−1)/2} if k2^{−n} < t ≤ (k + 1)2^{−n},  0 otherwise.
The Haar wavelets are localized on increasingly fine length scales for increasing n, while the
index k shifts the wavelet across the interval [0, 1]. We now define the Schauder functions
Sn,k (t) simply as indefinite integrals of the Haar wavelets:
S_{n,k}(t) = ∫_0^t H_{n,k}(s) ds,  n = 0, 1, . . . ,  k = 1, 3, 5, . . . , 2^n − 1.
You can easily convince yourself that these are precisely the desired tent-shaped functions.
Let us now construct our random walks on [0, 1]. Let (Ω0 , F 0 , P0 ) be a probability space
that carries a double sequence {ξn,k : n = 0, 1, . . . , k = 1, 3, . . . , 2n − 1} of i.i.d. Gaussian
random variables with zero mean and unit variance (which exists by theorem 1.6.8). Figure 3.2
shows how to proceed: clearly Wt0 = ξ0,1 S0,1 (t), while Wt1 = ξ0,1 S0,1 (t) + ξ1,1 S1,1 (t) (note
the convenient normalization—the tent function Sn,k has height 2−(n+1)/2 ). Continuing in the
same manner, convince yourself that the N th random walk can be written as
W_t^N = Σ_{n=0}^{N} Σ_{k=1,3,...,2^n−1} ξ_{n,k} S_{n,k}(t).
We now arrive in the second step of our program: we would like to show that the sequence
of processes WtN converges uniformly with unit probability, in which case we can define a
stochastic process Wt = limN →∞ WtN with a.s. continuous sample paths.
We need some simple estimates. First, note that
P( sup_{t∈[0,1]} |W_t^n − W_t^{n−1}| > ε_n ) = P( sup_{k=1,3,...,2^n−1} |ξ_{n,k}| > 2^{(n+1)/2} ε_n ),

as you can check directly. But note that we can estimate (why?)

P( sup_{k=1,3,...,2^n−1} |ξ_{n,k}| > 2^{(n+1)/2} ε_n ) ≤ Σ_{k=1,3,...,2^n−1} P( |ξ_{n,k}| > 2^{(n+1)/2} ε_n ).
We need to estimate the latter term, but (as will become evident shortly) a direct application of Chebyshev's inequality is too crude. Instead, let us apply the following trick, which is often useful. Note that ξ0,1 is symmetrically distributed around zero, so we have P(|ξ0,1| > α) = P(ξ0,1 > α) + P(ξ0,1 < −α) = 2 P(ξ0,1 > α). Now use Chebyshev's inequality as follows: P(ξ0,1 > α) = P(e^{ξ0,1} ≥ e^{α}) ≤ e^{−α} E(e^{ξ0,1}) = e^{1/2 − α}. We thus obtain

P( sup_{t∈[0,1]} |W_t^n − W_t^{n−1}| > ε_n ) ≤ exp(n log 2 + 1/2 − 2^{(n+1)/2} ε_n).
Now choose ε_n = n^{−2}: then the right-hand side above is summable over n, as 2^{(n+1)/2} n^{−2} grows much faster than n log 2 (this is why direct application of the Chebyshev inequality would have been too crude—we would not have been able to obtain this conclusion!) But by the Borel-Cantelli lemma, we find that this implies P(sup_{t∈[0,1]} |W_t^n − W_t^{n−1}| > n^{−2} i.o.) = 0, so we have

sup_{t∈[0,1]} |W_t^n − W_t^{n−1}| ≤ 1/n^2 for all n sufficiently large a.s.
But then we have a.s. uniform convergence by lemma 3.2.3, and so Wtn → Wt as n → ∞
a.s., where the process Wt has a.s. continuous sample paths. Nothing changes if we set the
sample paths to zero in a null set which contains all discontinuous paths of Wt ; this is an
indistinguishable change, and now Wt has continuous sample paths everywhere. It remains to
show that Wt has the correct finite dimensional distributions, and to extend to t ∈ [0, ∞[.
To verify the finite dimensional distributions, it suffices to show that for any t > s > r, the
increment Wt − Ws is independent of Wr , and that these are Gaussian random variables with
mean zero and variance t − s and r, respectively (why do we not need to check explicitly higher
dimensional distributions?) The simplest way to check this is using characteristic functions: a
well known result, e.g., [Wil91, section 16.6,7], states that it is sufficient to show that
E( e^{iαWr + iβ(Wt − Ws)} ) = e^{−α^2 r/2 − β^2 (t−s)/2}.
But note that by construction, this holds for any t, s, r which are dyadic rationals (i.e., of the
form k2−n for some k and n). For arbitrary t, s, r, choose sequences rn % r, sn & s, and
tn & t of dyadic rationals. Then Wtn −Wsn is independent of Wrn for any n, and in particular
we can calculate explicitly, using dominated convergence and continuity of Wt,

E( e^{iαWr + iβ(Wt − Ws)} ) = lim_{n→∞} E( e^{iαW_{r_n} + iβ(W_{t_n} − W_{s_n})} ) = lim_{n→∞} e^{−α^2 r_n/2 − β^2 (t_n − s_n)/2} = e^{−α^2 r/2 − β^2 (t−s)/2}.

Hence Wt has the correct finite dimensional distributions for all t ∈ [0, 1]. The extension to
t ∈ [0, ∞[ was already done in lemma 3.2.1, and we finally have our Wiener process.
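For readers who like to see the construction run, here is a small numerical sketch of the Schauder expansion used in the proof (the function names and grid sizes are my own choices): it evaluates the partial sums W_t^N = Σ_{n≤N} Σ_k ξ_{n,k} S_{n,k}(t) on a fine grid, and one can check, for instance, that the empirical variance of W_1^N over many realizations is close to one.

import numpy as np

def schauder(n, k, t):
    # S_{n,k} is the integral of the Haar wavelet H_{n,k}: for n >= 1 it is a tent
    # supported on [(k-1)2^{-n}, (k+1)2^{-n}] with peak height 2^{-(n+1)/2}.
    if n == 0:
        return np.asarray(t, dtype=float)            # S_{0,1}(t) = t on [0, 1]
    left, mid, right = (k - 1) * 2.0 ** -n, k * 2.0 ** -n, (k + 1) * 2.0 ** -n
    up = 2.0 ** ((n - 1) / 2) * (t - left)
    down = 2.0 ** ((n - 1) / 2) * (right - t)
    return np.where((t > left) & (t <= mid), up,
                    np.where((t > mid) & (t <= right), down, 0.0))

def levy_ciesielski(N=10, n_grid=1024, rng=None):
    # Partial sum W^N_t = sum_{n=0}^N sum_{k odd} xi_{n,k} S_{n,k}(t).
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, n_grid + 1)
    W = rng.standard_normal() * schauder(0, 1, t)
    for n in range(1, N + 1):
        for k in range(1, 2 ** n, 2):
            W = W + rng.standard_normal() * schauder(n, k, t)
    return t, W

t, W = levy_ciesielski()

Increasing N only adds detail at finer dyadic scales; the values already assigned at coarser dyadic nodes do not change, which is exactly the mechanism exploited in the proof.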
Now that we have done the hard work of constructing a Wiener process, it is not
difficult to prove proposition 3.1.2. Recall the statement of this result:
Proposition 3.1.2. Suppose that we have constructed some stochastic process xt whose finite dimensional distributions are those of lemma 3.1.1. Then there exists a modification x̃t of xt such that t ↦ x̃t is continuous.
How would we go about proving this? Suppose that we have constructed a Wiener
process Wt through theorem 3.2.5. It is an easy exercise to show that we can reproduce
the random variables ξn,k from Wt as follows: ξ_{0,1} = W_1 and, for n ≥ 1 and k = 1, 3, . . . , 2^n − 1,

ξ_{n,k} = 2^{(n+1)/2} ( W_{k2^{−n}} − ½ (W_{(k−1)2^{−n}} + W_{(k+1)2^{−n}}) ).
But the law of any finite number of ξn,k is only determined by the finite dimensional
distributions of Wt , so for any process xt with the same finite dimensional distribu-
tions it must also be the case that the analogously defined random variables

χ_{0,1} = x_1,  χ_{n,k} = 2^{(n+1)/2} ( x_{k2^{−n}} − ½ (x_{(k−1)2^{−n}} + x_{(k+1)2^{−n}}) ),  n ≥ 1,  k = 1, 3, . . . , 2^n − 1,
are i.i.d. Gaussian random variables with mean zero and unit variance (recall that a
sequence of random variables is independent if any finite subcollection is independent,
so this notion only depends on the finite dimensional distributions). But then
∞
X X
x̃t = χn,k Sn,k (t)
n=0 k=1,3,...,2n −1
has a.s. continuous paths—this follows from the proof of theorem 3.2.5—and xt = x̃t
for all dyadic rational times t by construction. We again set the discontinuous paths
to zero, and the only thing that remains to be shown is that x̃t is a modification of xt .
Proof of proposition 3.1.2. We need to show that x̃t = xt a.s. for fixed t ∈ [0, 1] (it suffices
to restrict to [0, 1], as we can repeat the procedure for every interval [n, n + 1] separately). As
with unit probability x̃t = xt for all dyadic rational t and x̃t has continuous sample paths, we
find x̃t = limn x̃tn = limn xtn a.s. for any sequence of dyadic rational times tn % t. But
P(|xt − x̃t | > ε) ≤ ε−2 E((xt − x̃t )2 ) = ε−2 E(lim inf(xt − xtn )2 )
≤ ε−2 lim inf E((xt − xtn )2 ) = ε−2 lim inf(t − tn ) = 0 for any ε > 0,
where we have used Chebyshev’s inequality and Fatou’s lemma. Thus xt = x̃t a.s.
where f is an element in a suitable space of test functions. The simplest space of test
functions is the space C0∞ of smooth functions of compact support. The mathematical
object δ(·) should then be seen not as a function, but as a linear functional on the
space C0∞ : it is a linear map which associates to every test function a number, in this
case δ : f 7→ f (0). The integral expression above is just suggestive notation 4 for δ
evaluated at f . The philosophy behind such a concept is that no physical measurement
can ever be infinitely sharp, even if the object which we are measuring is (which is
itself an idealization); hence we only need to make sense of measurements that are
smeared out in time by a suitable test function, and a generalized function is simply
an object that associates to every such measurement the corresponding outcome.
Let us return to white noise. Clearly ξt is not a stochastic process, as its covariance
is not a function. However, we could think of ξt as an object whose sample paths
are themselves generalized functions. To make sense of this, we have to define the
properties of white noise when integrated against a test function. So let us integrate
the defining properties of white noise against test functions: E(ξ(f )) = 0 and
E(ξ(f)ξ(g)) ≡ E( ∫_{R+} f(s) ξ_s ds · ∫_{R+} g(t) ξ_t dt )
= ∫_{R+×R+} f(s) g(t) δ(t − s) ds dt = ∫_{R+} f(t) g(t) dt ≡ ⟨f, g⟩.
Moreover, the fact that ξt is a Gaussian “process” implies that ξ(f ) should be a Gaus-
sian random variable for any test function f . So we can now define white noise as
a generalized stochastic process: it is a random linear functional ξ on C_0^∞ such that ξ(f) is Gaussian, E(ξ(f)) = 0 and E(ξ(f)ξ(g)) = ⟨f, g⟩ for every f, g ∈ C_0^∞.
What is the relation to the Wiener process? The point of this section is to show
that given a Wiener process Wt , the stochastic integral
ξ(f) = ∫_0^∞ f(t) dW_t,  f ∈ C_0^∞,
satisfies the definition of white noise as a generalized stochastic process. This justifies
to a large extent the intuition that stochastic integrals can be interpreted as integrals
over white noise. It also justifies the idea of using the integrated observations
Y_t = ∫_0^t a_s ds + W_t,
3 We will prefer the name generalized function. The word distribution is often used in probability theory
to denote the law of a random variable, not in the generalized function sense of L. Schwartz!
4 The notation suggests that we can approximate δ(·) by a sequence of actual functions d_n(·), such that the true (Riemann) integral ∫ f(s) d_n(s) ds converges to f(0) as n → ∞ for every test function f ∈ C_0^∞. This is indeed the case (think of a sequence of increasingly narrow normalized Gaussians).
The nice thing about stochastic integrals, however, is that they completely dispose of
the need to work with generalized functions; the former live entirely within the do-
main of ordinary stochastic processes. As long as we are willing to accept that we
sometimes have to work with integrated observations, rather than using white noise
directly, what we gain is an extremely rich theory with very well developed analytical
techniques (stochastic calculus). At the end of the day, you can still interpret these
processes in white noise style (by smearing against a test function), without being
constrained along the way by the many restrictions of the theory of generalized func-
tions. Though white noise theory has its advocates—it is a matter of taste—it is fair to
say that stochastic integrals have turned out to be by far the most fruitful and widely
applicable. As such, you will not see another generalized function in this course.
To wrap up this section, it remains to show that the stochastic integral satisfies
the properties of white noise. We have not yet introduced the stochastic integral,
however; we previously broke off in desperation when we concluded that the infinite
variation property of the Wiener process precludes the use of the Stieltjes integral
for this purpose. Nonetheless we can rescue the Stieltjes integral for the purpose of
this section, so that we can postpone the definition of a real stochastic integral until
the next chapter. The reason that we do not get into trouble is that we only wish to
integrate test functions in C0∞ —as these functions are necessarily of finite variation
(why?), we can define the stochastic integral through integration by parts. How does
this work? Note that we can write for any partition π of [0, T ]
Σ_{t_i∈π} f(t_i) (W_{t_{i+1}} − W_{t_i}) = f(T) W_T − Σ_{t_i∈π} W_{t_{i+1}} (f(t_{i+1}) − f(t_i)),
which is simply a rearrangement of the terms in the summation. But the sum on the
right hand side limits to a Stieltjes integral: after all, f is a test function of finite
variation, while Wt has continuous sample paths. So we can simply define
∫_0^T f(s) dW_s = f(T) W_T − ∫_0^T W_s df(s),
where the integral on the right should be interpreted as a Stieltjes integral. In fact, as
f is smooth, we can even write
∫_0^T f(s) dW_s = f(T) W_T − ∫_0^T W_s (df(s)/ds) ds,
where we have used a well known property of the Stieltjes integral with respect to a
continuously differentiable function. Our goal is to show that this functional has the
properties of a white noise functional. Note that as f has compact support, we can
simply define the integral over [0, ∞[ by choosing T to be sufficiently large.
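To see the integration-by-parts definition at work, here is a small Monte Carlo sketch (an added illustration; the bump function and all parameters are my own choices). It approximates ξ(f) = f(T)W_T − ∫_0^T W_s f′(s) ds on a grid for a smooth compactly supported f with f(T) = 0, and compares the sample variance of ξ(f) with ⟨f, f⟩, as lemma 3.3.1 below predicts.

import numpy as np

rng = np.random.default_rng(1)
T, n = 2.0, 1000
t = np.linspace(0.0, T, n + 1)
dt = T / n

def bump(t):
    # A smooth function supported in (0.5, 1.5): a stand-in for f in C_0^infty.
    u = 2.0 * (t - 1.0)
    out = np.zeros_like(t)
    inside = np.abs(u) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

f = bump(t)
fprime = np.gradient(f, dt)

samples = []
for _ in range(4000):
    W = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n))))
    # xi(f) = f(T) W_T - int_0^T W_s f'(s) ds, and f(T) = 0 for this bump.
    samples.append(-np.sum(W * fprime) * dt)

print(np.var(samples), np.sum(f ** 2) * dt)   # the two numbers should be close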
Lemma 3.3.1. The stochastic integral of f ∈ C0∞ with respect to the Wiener process
Wt (as defined through integration by parts) is a white noise functional.
Proof. The integral is an a.s. limit (and hence a limit in distribution) of Gaussian random vari-
ables, so it must be itself a Gaussian random variable. It also has zero expectation: the only
difficulty here is the exchange of the expectation and the Stieltjes integral, which is however
immediately justified by Fubini’s theorem. It remains to demonstrate the covariance identity.
To this end, choose T to be sufficiently large so that the supports of both f and g are contained
in [0, T ]. Hence we can write (using f (T ) = g(T ) = 0)
E(ξ(f)ξ(g)) = E( ∫_0^T W_s (df(s)/ds) ds · ∫_0^T W_t (dg(t)/dt) dt ).
Unfortunately, this is about as far as the integration by parts trick will take us.
In principle we could extend from test functions in C0∞ to test functions of finite
variation, and we can even allow for random finite variation integrands. However, one
of the main purposes of developing stochastic integrals is to have a stochastic calculus.
Even if we naively try to apply the chain rule to calculate something like, e.g., W_t^2, we would still get integrals of the form ∫_0^T W_t dW_t which can never be given meaning
through integration by parts. Hence we are really not going to be able to circumvent
the limitations of the Stieltjes integral; ultimately, a different idea is called for.
The stage is finally set for introducing the solution to our stochastic integration prob-
lem: the Itô integral, and, of equal importance, the associated stochastic calculus
which allows us to manipulate such integrals.
tending the definition by taking limits. To this end, let fn be a simple function, i.e.,
one that is piecewise constant and jumps at a finite number of times t_i^n. Define

I(f_n) = ∫_0^1 f_n(s) dg(s) = Σ_i f_n(t_i^n) (g(t_{i+1}^n) − g(t_i^n)),
where we have evaluated the integrand at t_i^n for concreteness (any point in the interval [t_i^n, t_{i+1}^n] should work). Now choose the sequence of simple functions {f_n} so that it
converges uniformly to f , i.e., supt∈[0,1] |fn (t) − f (t)| → 0 as n → ∞. Then
I(f) = ∫_0^1 f(s) dg(s) = lim_{n→∞} I(f_n) = lim_{n→∞} ∫_0^1 f_n(s) dg(s).
Does this definition make sense? We should verify two things. First, we need to show
that the limit exists. Second, we need to show that the limit is independent of how we
choose our simple approximations fn .
Lemma 4.1.2. Suppose that g has finite variation TV(g, 0, 1) < ∞ and that the
sequence of simple functions fn converges to f uniformly. Then the sequence I(fn )
converges, and its limit does not depend on the choice of the approximations f_n.
Proof. First, choose some fixed m, n, and let ti be the sequence of times that includes the jump
times t_i^m and t_i^n of both f_m and f_n, respectively. Then clearly

I(f_n) − I(f_m) = Σ_i (f_n(t_i) − f_m(t_i)) (g(t_{i+1}) − g(t_i)),
so that in particular
|I(f_n) − I(f_m)| ≤ Σ_i |f_n(t_i) − f_m(t_i)| |g(t_{i+1}) − g(t_i)| ≤ sup_t |f_n(t) − f_m(t)| TV(g, 0, 1).
Remark 4.1.3. The Riemann-type definition of section 3.1 and the current definition
coincide: the former corresponds to a particular choice of simple approximations.
We have seen nothing that we do not already know. The question is, what happens
when TV(g, 0, 1) = ∞, e.g., if g is a typical sample path of the Wiener process? The
main point is that in this case, the previous lemma fails miserably. Let us show this.
Lemma 4.1.4. Suppose that g has infinite variation TV(g, 0, 1) = ∞. Then there
exist simple functions fn which converge to f uniformly, such that I(fn ) diverges.
Proof. It suffices to consider f = 0. After all, suppose that fn converges uniformly to f such that
I(fn ) converges; if hn is a sequence of simple functions that converges to zero uniformly, such
that I(hn ) diverges, then fn + hn also converges uniformly to f but I(fn + hn ) diverges.
As g has infinite variation, there exists a sequence of partitions πn of [0, 1] such that
Σ_{t_i∈π_n} |g(t_{i+1}) − g(t_i)| → ∞ as n → ∞.
Define hn to be a simple function such that hn (ti ) = sign(g(ti+1 ) − g(ti )) for all ti ∈ πn .
Evidently I(hn ) → ∞, but certainly hn does not converge uniformly. However, define the
sequence of simple functions fn = (I(hn ))−1/2 hn . Then I(fn ) = (I(hn ))1/2 → ∞ as well,
but clearly supt |fn (t)| = (I(hn ))−1/2 → 0, so fn converges uniformly to zero.
where #∆hn is the number of jumps of hn (this follows from the fact that Wt has
independent increments). In fact, it follows from the Borel-Cantelli lemma in this
case that P(hn (ti ) = sign(Wti+1 − Wti ) ∀ i i.o.) = 0, regardless of how we choose
the sequence hn (provided #∆hn increases when n increases). Though this does
not prove anything in itself, it suggests that things might not be as bad as they seem:
lemma 4.1.4 shows that there are certain sample paths of Wt for which the integral of
fn diverges, but it seems quite likely that the set of all such sample paths is always of
probability zero! In that case (and this is indeed the case1 ) we are just fine—we only
care about defining stochastic integrals with probability one.
Unfortunately, we are not so much interested in integrating deterministic inte-
grands against a Wiener process: we would like to be able to integrate random pro-
cesses. In this case we are in trouble again, as we can apply lemma 4.1.4 for every
sample path separately2 to obtain a uniformly convergent sequence of stochastic pro-
cesses fn whose integral with respect to Wt diverges.
1 From the discussion below on the Wiener integral, it follows that the integral of f_n converges to zero in L^2. Hence the integral can certainly not diverge with nonzero probability.
2 For every ω ∈ Ω separately, set g(t) = W_t(ω) and apply lemma 4.1.4 to obtain f_n(t, ω). There is a technical issue here (is the stochastic process f_n measurable?), but this can be resolved with some care.
A way out
The proof of lemma 4.1.4 suggests a way out. The key to the proof of lemma 4.1.4
was that we could construct an offensive sequence fn by “looking into the future”: fn
is constructed so that its sign matches the sign of the future increment of g. By doing
this, we can express the total variation of g as a limit of simple integrals, so that the
integral diverges whenever g has infinite variation.
This cunning trick is foiled, however, if we make g a Wiener process but keep
fn non-random: in that case we can never look into the future, because f_n(t_i), being non-random, cannot contain any information on the sign of W_{t_{i+1}} − W_{t_i}. Even if f_n were allowed to be random, however, this would still be the case if we require f_n(t_i) to be independent of W_{t_{i+1}} − W_{t_i}! Fortunately enough, there is a rich and important
class of stochastic processes with precisely this property.
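The "looking into the future" mechanism of lemma 4.1.4, and the way adaptedness forestalls it, can be seen directly in simulation. In the sketch below (an added illustration; the normalization mimics the proof of lemma 4.1.4), the anticipating integrand uses the sign of the future increment and its simple integrals blow up as the partition is refined, while a non-anticipating choice such as sign(W_{t_i}) stays tame.

import numpy as np

rng = np.random.default_rng(2)
n_max = 18
dt = 2.0 ** (-n_max)
W = np.concatenate(([0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), 2 ** n_max))))

for n in (8, 12, 16):
    step = 2 ** (n_max - n)
    Wn = W[::step]                      # path sampled on the partition t_i = i 2^{-n}
    dW = np.diff(Wn)
    h = np.sign(dW)                     # anticipating: h_n(t_i) peeks at the next increment
    I_h = np.sum(h * dW)                # equals the first order variation, which diverges
    f = h / np.sqrt(I_h)                # f_n = I(h_n)^{-1/2} h_n converges uniformly to zero
    adapted = np.sign(Wn[:-1])          # depends only on the path up to time t_i
    print(n, np.sum(f * dW), np.sum(adapted * dW))
    # first column grows like I(h_n)^{1/2}; second column stays of order one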
Key idea 1. Let Wt be an Ft -Wiener process. Then we will only define stochastic
integrals with respect to Wt of stochastic processes which are Ft -adapted.
This key idea puts an end to the threat posed by lemma 4.1.4. But it is still not clear
how we should proceed to actually define the stochastic integral: after all, lemma 4.1.2
does not (and can not, by lemma 4.1.4) hold water in the infinite variation setting.
This suggests that we might be able to repeat the proof of lemma 4.1.2 using con-
vergence in L2 rather than a.s. convergence, exploiting the finiteness of the quadratic
variation rather than the total variation. In particular, we can try to prove that the
sequence of simple integrals I(fn ) is a Cauchy sequence in L2 , rather than proving
that I(fn )(ω) is a Cauchy sequence in R for almost every ω. This indeed turns out to
work, provided that we stick to adapted integrands as above.
Key idea 2. As the quadratic variation of the Wiener process is finite, we should
define the stochastic integrals as limits in L2 .
Let us show that this actually works in the special case of non-random integrands.
If fn is a non-random simple function on [0, 1], then, as usual,
I(f_n) = ∫_0^1 f_n(s) dW_s = Σ_i f_n(t_i^n) (W_{t_{i+1}^n} − W_{t_i^n}),
where tni are the jump times of fn . Using the independence of the increments of the
Wiener process, we obtain immediately
E((I(f_n))^2) = Σ_i (f_n(t_i^n))^2 (t_{i+1}^n − t_i^n) = ∫_0^1 (f_n(s))^2 ds,
Now suppose that the functions fn converge to some function f in L2 ([0, 1]), i.e.,
∫_0^1 (f_n(s) − f(s))^2 ds → 0 as n → ∞.
i.e., that it converges to fn fast enough. (The arguments that lead to a.s. convergence
should be very familiar to you by now!) This way we really obtain the integral for
every sample path of the noise separately, and the conceptual issues are resolved.
Exactly the same holds for the Itô integral, to be defined in the next section. Note that
this does not change the nature of the stochastic integrals—they are still fundamentally
limits in L2 , as can be seen by the requirement that fn → f in L2 ([0, 1]). The point
is merely that this does not preclude the pathwise computation of the integral (as is
most natural from a conceptual point of view). In the following we will thus not worry
about this issue, and define integrals as limits in L2 without further comment.
What can happen if we do not take fn → f fast enough? You can indeed construct ex-
amples where fn → 0 in L2 ([0, 1]), but the Wiener integral I(fn ) does not converge a.s.
(of course, it does converge to zero in L2 ). Consider, for example, the simple functions
fn = Hn,1 /αn , where Hn,1 is the Haar wavelet constructed in theorem 3.2.5, and αn > 0
is chosen such that P(|ξ| > αn ) = n−1 where ξ is a Gaussian random variable with zero
mean and unit variance. Then αn → ∞ as n → ∞, so fn → 0 in L2 ([0, 1]). On the other
hand, I(Hn,k ) are i.i.d. Gaussian random variables with zero mean and unit variance (see the
discussion after the proof of theorem 3.2.5), so some set manipulation gives (how?)
P(|I(f_n)| > 1 i.o.) = 1 − lim_{m→∞} Π_{n≥m} (1 − P(|I(H_{n,1})| > α_n)) ≥ 1 − lim_{m→∞} e^{−Σ_{n≥m} n^{−1}} = 1,
where we have used the estimate 1 − x ≤ e−x . Hence I(fn ) certainly cannot converge a.s.
The behavior of the integral in this case is equivalent to that of example 1.5.6, i.e., there is
an occasional excursion of I(fn ) away from zero which becomes increasingly rare as n gets
large (otherwise I(fn ) would not converge in L2 ). This need not pose any conceptual problem;
you may simply consider it an artefact of a poor choice of approximating sequence.
A bare-bones construction
Let {Xtn }t∈[0,T ] be a simple, square-integrable, Ft -adapted stochastic process. What
does this mean? In principle we could allow every sample path X_t^n(ω) to have its own
jump times ti (ω), but for our purposes it will suffice to assume that the jump times
and our goal is to extend this definition to a more general class of integrands by taking
limits. To this end, we will need the following Itô isometry:
E[ ( ∫_0^T X_t^n dW_t )^2 ] = Σ_{i=0}^{N} E((X_{t_i}^n)^2) (t_{i+1} − t_i) = E[ ∫_0^T (X_t^n)^2 dt ].
Note that it is crucial that Xtn is Ft -adapted: because of this Xtni is Fti -measurable,
so is independent of Wti+1 − Wti , and this is absolutely necessary for the Itô isometry
to hold! The requirement that Xtn ∈ L2 is also necessary at this point, as otherwise
E((Xtn )2 ) would not be finite and we would run into trouble.
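The isometry, and the role played by adaptedness, can be checked by simulation. The sketch below (an added illustration with arbitrary parameters) takes the simple adapted integrand X_t^n = W_{t_i} for t ∈ [t_i, t_{i+1}) and compares the Monte Carlo mean of the squared simple integral with E ∫_0^T (X_t^n)^2 dt = Σ_i t_i (t_{i+1} − t_i) ≈ T^2/2; evaluating the integrand at the right endpoint instead (so that it is no longer independent of the coming increment) visibly breaks the identity.

import numpy as np

rng = np.random.default_rng(3)
T, N = 1.0, 200
dt = T / N
t = np.arange(N) * dt                           # left endpoints t_0, ..., t_{N-1}

left_sq, right_sq = [], []
for _ in range(20000):
    dW = rng.normal(0.0, np.sqrt(dt), N)
    W = np.concatenate(([0.0], np.cumsum(dW)))
    left_sq.append(np.sum(W[:-1] * dW) ** 2)    # adapted: X evaluated at t_i
    right_sq.append(np.sum(W[1:] * dW) ** 2)    # anticipating: X evaluated at t_{i+1}

print(np.mean(left_sq), np.sum(t * dt))   # both close to T^2/2 = 0.5: the Ito isometry
print(np.mean(right_sq))                  # noticeably larger: the isometry fails here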
Let us look a little more closely at the various objects in the expressions above.
The Itô integral I(X·n ) is a random variable in L2 (P). It is fruitful to think of the
stochastic process Xtn as a measurable map X·n : [0, T ] × Ω → R. If we consider
the product measure µT × P on [0, T ] × Ω, where µT is the Lebesgue measure on
[0, T ] (i.e., T times the uniform probability measure), then the right-hand side of the
Itô isometry is precisely E_{µT×P}((X_·^n)^2). In particular, the Itô isometry reads

‖I(X_·^n)‖_{2,P} = ‖X_·^n‖_{2,µT×P},

where ‖ · ‖_{2,P} is the L^2-norm on Ω and ‖ · ‖_{2,µT×P} is the L^2-norm on [0, T] × Ω. This
is precisely the reason for the name isometry—the mapping I : L2 (µT × P) → L2 (P)
preserves the L2 -distance (i.e., kI(X·n ) − I(Y·n )k2,P = kX·n − Y·n k2,µT ×P ), at least
when applied to Ft -adapted simple integrands. This fact can now be used to extend
the definition of the Itô integral to a larger class of integrands in L2 (µT × P).
Lemma 4.2.1. Let X· ∈ L2 (µT × P), and suppose there exists a sequence of Ft -
adapted simple processes X·n ∈ L2 (µT × P) such that
"Z #
T
n→∞
kX·n − X· k22,µT ×P = E (Xtn − Xt )2 dt −−−−→ 0.
0
Then I(X· ) can be defined as the limit in L2 (P) of the simple integrals I(X·n ), and
the definition does not depend on the choice of simple approximations X ·n .
3 In a more general theory, where we can integrate against arbitrary martingales instead of the Wiener
process, the sample paths of the integrator could have jumps. In that case, it can become necessary to make
the jump times ti of the simple integrands random, and we have to be more careful about whether the
integrand and integrator are left- or right-continuous at the jumps (see [Pro04]). This is not an issue for us.
kI(Y· ) − I(X· )k2,P ≤ kI(Y· ) − I(Y·n )k2,P + kI(Y·n ) − I(X·n )k2,P + kI(X·n ) − I(X· )k2,P ,
where the first and last terms on the right converge to zero by definition, while the fact that the
second term converges to zero follows easily from the Itô isometry. Hence I(Y· ) = I(X· ) a.s.,
so the integral does not depend on the approximating sequence.
converge to Xt for every sample path separately; after all, as [0, T ] is compact, the sample paths
are uniformly continuous, so supt∈[0,T ] |Xtn − Xt | ≤ supt sups∈[0,2−n T ] |Xt − Xt+s | → 0
as n → ∞. But as the sample paths are uniformly bounded, it follows that X·n → X· in
L2 (µT × P) by the dominated convergence theorem.
Now suppose that X· is just bounded and progressively measurable. Define the process
X_t^ε = (1/ε) ∫_{t−ε}^{t} X_{s∨0} ds.
Then X·ε has bounded, continuous sample paths, is Ft -adapted (by the progressive measurabil-
ity of X· ), and Xtε → Xt as ε → 0 for every sample path separately. For any ε > 0, we can
thus approximate Xtε by simple adapted processes Xtn,ε , so by dominated convergence
lim_{ε→0} lim_{n→∞} E[ ∫_0^T (X_t^{n,ε} − X_t)^2 dt ] = 0.
As such, we can find a subsequence εn & 0 such that X·n,εn → X· in L2 (µT × P).
Next, suppose that X· is just progressively measurable. Then the process Xt I|Xt |≤M is
progressively measurable and bounded, and, moreover,
lim_{M→∞} E[ ∫_0^T (X_t I_{|X_t|≤M} − X_t)^2 dt ] = lim_{M→∞} E[ ∫_0^T (X_t)^2 I_{|X_t|>M} dt ] = 0
by the dominated convergence theorem. As before, we can find a sequence of simple adapted
processes Xtn,M that approximate Xt I|Xt |≤M as n → ∞, and hence there is a subsequence
Mn % ∞ such that X·n,Mn → X· in L2 (µT × P), as desired.
The result can be extended further even to the case that X· is not progressively measurable.
Such generality is not of much interest to us, however; see [KS91, lemma 3.2.4].
We have now constructed the Itô integral in its most basic form. To be completely
explicit, let us state what we have learned as a definition.
Definition 4.2.3 (Elementary Itô integral). Let Xt be any Ft -adapted process in
L2 (µT × P). Then the Itô integral I(X· ), defined as the limit in L2 (P) of simple
integrals I(X_·^n), exists and is unique (i.e., is independent of the choice of X_t^n).
Before we move on, let us calculate a simple example “by hand.” (This example
will become completely trivial once we have the Itô calculus!)
Example 4.2.4. We would like to calculate the integral of Wt with respect to itself.
As Wt has continuous sample paths, we find that
∫_0^T W_t dW_t = L^2-lim_{n→∞} Σ_{k=0}^{2^n−1} W_{k2^{−n}T} (W_{(k+1)2^{−n}T} − W_{k2^{−n}T}).
But note that W_a(W_b − W_a) = ½(W_b^2 − W_a^2) − ½(W_b − W_a)^2, so that the sum above equals

½ ( W_T^2 − Σ_{k=0}^{2^n−1} (W_{(k+1)2^{−n}T} − W_{k2^{−n}T})^2 ),

and the second term on the right converges in L^2 to T (recall that this is precisely the quadratic variation of Wt). So we find that
W_T^2 = 2 ∫_0^T W_t dW_t + T.
Certainly this would not be the case if the Itô integral were a Stieltjes integral; the
ordinary chain rule suggests that d(Wt )2 /dt = 2Wt dWt /dt!
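A quick simulation makes the extra +T visible path by path. The sketch below (an added illustration) computes the left point sums Σ_k W_{t_k}(W_{t_{k+1}} − W_{t_k}) on a fine grid and compares them with (W_T^2 − T)/2; the naive chain rule answer W_T^2/2 would be off by T/2 on average.

import numpy as np

rng = np.random.default_rng(4)
T, N = 1.0, 2 ** 14
dt = T / N

errors = []
for _ in range(50):
    dW = rng.normal(0.0, np.sqrt(dt), N)
    W = np.concatenate(([0.0], np.cumsum(dW)))
    ito_sum = np.sum(W[:-1] * dW)               # left point sums defining the Ito integral
    errors.append(ito_sum - 0.5 * (W[-1] ** 2 - T))
print(max(abs(e) for e in errors))              # small: the sums approach (W_T^2 - T)/2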
We will show in this section that we can choose the Itô integral so that it has contin-
uous sample paths, a task which is reminiscent of our efforts in defining the Wiener
process. Continuity is usually included, implicitly or explicitly, in the definition of the
Itô integral—we will always assume it throughout this course.
The second issue is that the class of processes which we can integrate using the el-
ementary Itô integral—the Ft -adapted processes in L2 (µT ×P)—is actually a little too
restrictive. This may seem on the whiny side; surely this is a huge class of stochastic
processes? But here is the problem. We will shortly be setting up a stochastic calcu-
lus, which will allow us to express functions of Itô integrals as new Itô integrals. For
example, given two Itô integrals I(X· ) and I(Y· ), we will obtain a rule which allows
us to express the product I(X· )I(Y· ) as the sum of a single Itô integral I(Z· ) of some
process Z· and a time integral. Even if X· and Y· are in L2 (µT × P), however, this
does not guarantee that the appropriate process Z· will be in L2 (µT × P). By extend-
ing the class of integrable processes, we will make sure that the product of two Itô
integrals can always be expressed as another Itô integral. This ultimately makes the
theory easier to use, as you do not have to think, every time you wish to manipulate
an Itô integral, whether that particular manipulation is actually allowed.
We proceed as follows. We first prove that for adapted integrands in L 2 (µT × P),
we can define the Itô integral as a stochastic process on [0, T ] with continuous sample
paths. Next, we define the Itô integral as a stochastic process on [0, ∞[ with contin-
uous paths, by extending our previous construction through a simple process called
localization. By modifying the localization trick just a little bit, we will subsequently
be able to extend the Itô integral to a much larger class of integrands.
Remark 4.2.6. The fact that the simple integral is a martingale should not come as
a surprise—the discrete time process Iti (X·n ) is a martingale transform! We can
interpret the Itô integral as a continuous time martingale transform, albeit for a very
specific martingale: the Wiener process. (The more general theory, where you can
integrate against any martingale, makes this interpretation even more convincing.)
The martingale property is extremely helpful in constructing continuous sample
paths. To accomplish the latter, we will copy almost literally the argument used in
the proof of theorem 3.2.5 to construct the Wiener process with continuous sample
paths. That argument, however, relied on an estimate which, in the current context,
corresponds to a bound on P(supt∈[0,T ] |It (X·n )| > εn ). The martingale property
provides an ideal tool to obtain such bounds: we can simply copy the argument that
led to the proof of the supermartingale inequality, lemma 2.3.21.
Lemma 4.2.7. Let Xt be an Ft -adapted process in L2 (µT × P). Then the Itô integral
It (X· ), t ∈ [0, T ] can be chosen to have continuous sample paths. 4
Proof. As usual, we choose a sequence of simple approximations Xtn . Then It (X·n ) is a
martingale with continuous sample paths, and so is Mtn = It (X·n ) − It (X·n−1 ). By Jensen’s
inequality E((Mtn )2 |Fs ) ≥ (E(Mtn |Fs ))2 = (Msn )2 for s ≤ t, so (Mtn )2 is a submartingale.
Define the stopping time τ = inf{t ∈ [0, T ] : |Mtn | ≥ ε}. Then
P( sup_{t∈[0,T]} |M_t^n| > ε ) ≤ P( (M_τ^n)^2 ≥ ε^2 ) ≤ ε^{−2} E( (M_τ^n)^2 ) ≤ ε^{−2} E( (M_T^n)^2 ),
where we have used continuity of the sample paths, Chebyshev’s inequality, and the submartin-
gale property. In particular, we obtain the estimate
P( sup_{t∈[0,T]} | ∫_0^t (X_s^n − X_s^{n−1}) dW_s | > 1/n^2 ) ≤ n^4 E[ ∫_0^T (X_s^n − X_s^{n−1})^2 ds ].
But we may assume that ‖X_·^n − X_·^{n−1}‖_{2,µT×P} ≤ 2^{−n}; if this is not the case, we can always choose a subsequence m(n) % ∞ such that ‖X_·^{m(n)} − X_·^{m(n−1)}‖_{2,µT×P} ≤ 2^{−n}. Thus we find, proceeding with a suitable subsequence if necessary, that
Σ_{n=2}^{∞} P( sup_{t∈[0,T]} | ∫_0^t (X_s^n − X_s^{n−1}) dW_s | > 1/n^2 ) < ∞.
But then it follows, using the Borel-Cantelli lemma, that
Σ_{n=2}^{∞} sup_{t∈[0,T]} | ∫_0^t (X_s^n − X_s^{n−1}) dW_s | < ∞ a.s.,
and hence It (X·n ) a.s. converges uniformly to some process Ht with continuous sample paths.
As the discontinuous paths live in a null set, we may set them to zero without inflicting any
harm. It remains to show that for every t ∈ [0, T ], the random variable Ht is the limit in L2 (P)
of It (X·n ), i.e., that Ht coincides with the definition of the elementary Itô integral for every
time t. But as It (X·n ) → It (X· ) in L2 (P) and It (X·n ) → Ht a.s., we find that
E((H_t − I_t(X_·))^2) = E( lim inf_{n→∞} (I_t(X_·^n) − I_t(X_·))^2 ) ≤ lim inf_{n→∞} E((I_t(X_·^n) − I_t(X_·))^2) = 0,
where we have used Fatou’s lemma. Hence Ht = It (X· ) a.s., and we are done.
4 An immediate consequence is that any version of the Itô integral has a continuous modification.
Localization
We would like to define the Itô integral as a continuous process on the entire interval
[0, ∞[. Which integrands can we do this for? The most straightforward idea would
be to require the integrands to be Ft -adapted processes in L2 (µ × P), where µ is the
Lebesgue measure on [0, ∞[. This would indeed be necessary if we wish to define
I_∞(X_·) = ∫_0^∞ X_t dW_t,
but this is not our goal: we only wish to define the integral as a stochastic process
It (X· ) for every finite time t ∈ [0, ∞[, and we do not necessarily care whether I ∞ (X· )
actually exists. Hence the condition X· ∈ L2 (µ × P) seems excessively restrictive.
To weaken this condition, we use a trick called localization. This is not a very
deep idea. To define It (X· ) on [0, ∞[, it suffices that we can define it on every interval
[0, T ]: after all, to compute Xt for fixed t, we can simply choose T ≥ t and proceed
with the construction in the previous sections. Hence it shouldTsuffice to require that
X[0,T ] ∈ L2 (µT × P) for every T < ∞, i.e., that X· ∈ T <∞ L2 (µT × P), a
much weaker condition than X· ∈ L2 (µ × P)! This is called localization, because
we have taken a global construction on [0, T ]—in the previous section we defined the
sample paths of It (X· ) on all of [0, T ] simultaneously—and applied it locally to every
subinterval [0, T ] ⊂ [0, ∞[. The advantage is that the integrands need not be square
integrable on [0, ∞[×Ω; they only need to be locally square integrable, i.e., square
integrable when restricted to any bounded set of times [0, T ].
It is not immediately obvious that this procedure is consistent, however. We have
to verify that our definition of It (X· ) does not depend on which T > t we choose for
its construction; if the definition does depend on T , then our localization procedure is
ambiguous! Fortunately, the local property of the Itô integral is easy to verify.
Lemma 4.2.8. For any F_t-adapted process X_· ∈ ∩_{T<∞} L^2(µ_T × P), we can define uniquely the Itô integral I_t(X_·) as an F_t-adapted stochastic process on [0, ∞[ with continuous sample paths.
Proof. For any finite time T, we can construct the Itô integral of X_{[0,T]} as a stochastic process on [0, T] with continuous sample paths (by lemma 4.2.7); let us call the process thus constructed I_t^T(X_·), and note that it is clear from the construction that I_t^T(X_·) is F_t-adapted. We would like to prove that for fixed T, P(I_s^t(X_·) = I_s^T(X_·) for all s ≤ t ≤ T) = 1. But this is immediate from the definition when X_· is a simple integrand, and follows for the general case by choosing the same approximating sequence X_t^n, defined on [0, T], to define both I_s^t(X_·) and I_s^T(X_·). Finally, as this holds for any T ∈ ℕ, we find P(I_s^t(X_·) = I_s^T(X_·) for all s ≤ t ≤ T < ∞) = 1, so that I_t(X_·) is unambiguously defined by setting I_t(X_·) = I_t^T(X_·) for any T ≥ t.
Lemma 4.2.9. Let X_t be an F_t-adapted process in ∩_{T<∞} L^2(µ_T × P), and let τ be an F_t-stopping time. Then I_{t∧τ}(X_·) = I_t(X_· I_{·<τ}).
Proof. As τ is a stopping time, I_{t<τ} is F_t-adapted, and hence X_t I_{t<τ} is F_t-adapted and in ∩_{T<∞} L^2(µ_T × P). Hence the integral in the statement of the lemma exists. To prove the
result, fix some interval [0, T ] and choose a sequence Xtn of simple processes on [0, T ] that
converge to Xt fast enough. Let us suppose additionally that τ ≤ T a.s. Define the random
time τ n to be the value of τ rounded upwards to the earliest jump time of Xtn that is larger or
equal to τ . The times τ n are still stopping times, and thus Xtn It<τ n is a sequence of Ft -adapted
simple approximations that converges to Xt It≤τ . But for the simple integrands, you can verify
immediately that IT (X·n I·<τ n ) = Iτ n (X·n ), and hence we find that IT (X· I·<τ ) = Iτ (X· )
by letting n → ∞ and using continuity of the sample paths. When τ is not bounded by
T , we simply apply the above procedure to the bounded stopping time τ ∧ T . Finally, we
have only proved the statement for every T separately, so you might worry that the processes
I_{T∧τ}(X_·) and I_T(X_· I_{·<τ}) are only modifications of each other. But both these processes have
continuous sample paths, and modifications with continuous paths are indistinguishable.
Recall that in this case, localization consisted of defining I_t(X_·) as I_t^T(X_·) (the integral constructed in L^2(µ_T × P)) for T large enough, and lemma 4.2.8 guarantees that the definition does not depend on the particular choice of T. Now suppose that instead of the above condition, we are in a situation where
\[
E\Bigl(\int_0^{\tau_n} X_t^2\,dt\Bigr) < \infty \quad\text{for all } n\in\mathbb{N},
\]
where τ_n ↗ ∞ is a sequence of F_t-stopping times; such a sequence is called a localizing sequence for X_t. In this case X_t I_{t<τ_n} is in ∩_{T<∞} L^2(µ_T × P) for every n, so we may define I_t(X_·) = I_t(X_· I_{·<τ_n}) for t ≤ τ_n, which does not depend on n by lemma 4.2.9.
It remains to show that the definition thus obtained does not depend on the choice of lo-
calizing sequence. To see this, let τn0 be another localizing sequence for Xt , and denote by
It0 (X· ) the Itô integral constructed using this sequence. Introduce also the stopping times
σn = τn ∧ τn0 , and note that this forms another localizing sequence. Denote by Jt (X· ) the
Itô integral constructed using σn . But then, by the same argument as above, Jt (X· ) = It (X· )
and Jt (X· ) = It0 (X· ) for all t ≤ σn , and hence It0 (X· ) = It (X· ) for all t.
How does this help us? There is a natural class of integrands—much larger than ∩_{T<∞} L^2(µ_T × P)—whose elements admit a localizing sequence. We only need to require that the integrand X_t is F_t-adapted and satisfies
\[
A_T(X_\cdot) = \int_0^T X_t^2\,dt < \infty \quad\text{a.s. for all } T<\infty.
\]
Remark 4.2.13. The generalization to the class of integrands in definition 4.2.11 will
make the stochastic calculus much more transparent. If you care about generality for
its own sake, however, you might be interested to know that the current conditions cannot be reasonably relaxed; see [McK69, section 2.5, problem 1].
Lemma 4.3.2. Let X_t be Itô integrable and let τ be an F_t-stopping time. Then
\[
\int_0^{t\wedge\tau} X_s\,dW_s = \int_0^t X_s I_{s<\tau}\,dW_s.
\]
Proof. For integrands in ∩_{T<∞} L^2(µ_T × P), this follows from lemma 4.2.9. For the general case, let σ_n be a localizing sequence. Then by definition, I_{t∧τ}(X_·) = I_{t∧τ}(X_· I_{·<σ_n}) for t < σ_n, and using lemma 4.2.9 we find I_{t∧τ}(X_· I_{·<σ_n}) = I_t(X_· I_{·<τ} I_{·<σ_n}). But σ_n is clearly also a localizing sequence for X_t I_{t<τ}, so the result follows by localization.
When the integrand is in ∩_{T<∞} L^2(µ_T × P), the Itô integral inherits the elementary properties of the simple integrals. This is very convenient in computations.
Lemma 4.3.3. Let X_· ∈ ∩_{T<∞} L^2(µ_T × P). Then for any T < ∞
\[
E\biggl[\int_0^T X_t\,dW_t\biggr] = 0, \qquad
E\biggl[\Bigl(\int_0^T X_t\,dW_t\Bigr)^{\!2}\biggr] = E\biggl[\int_0^T X_t^2\,dt\biggr],
\]
and the process I_t(X_·), t ∈ [0, T], is an F_t-martingale.
In the general case, i.e., when X_· ∉ ∩_{T<∞} L^2(µ_T × P), the nice properties of lemma 4.3.3 are unfortunately not guaranteed. In fact, in the general case the Itô integral I_t(X_·) may not even be in L^1(P), in which case its expectation need not be defined and the martingale property need not even make sense. However, there is a weakening of the martingale property that is especially suitable for stochastic integration.
Definition 4.3.5. An Ft -measurable process Xt is called an Ft -local martingale if
there exists a sequence of Ft -stopping times τn % ∞ such that Xt∧τn is a martingale
for every n. The sequence τn is called a reducing sequence for Xt .
Lemma 4.3.6. Any Itô integral It (X· ) is a local martingale.
Proof. Any localizing sequence for Xt is a reducing sequence for It (X· ).
Remark 4.3.7. The local martingale property is fundamental in the general theory of stochastic integration, and is intimately related with the notion of localization. However, lemma 4.3.3 shows that the integrands in ∩_{T<∞} L^2(µ_T × P) behave much better in computations, at least where expectations are involved, than their more general localized counterparts. When applying the Itô calculus, to be developed next, localization will allow us to manipulate the stochastic integrals very easily; but at the end of the day we will still need to prove separately that the resulting integrands are in ∩_{T<∞} L^2(µ_T × P) if we wish to calculate the expectation.
where we have written u′(t, x) = ∂u(t, x)/∂t and u_i(t, x) = ∂u(t, x)/∂x^i.
Remark 4.4.3 (Itô differentials). We will often use another notation for the Itô process, particularly when dealing with stochastic differential equations:
\[
dX_t = F_t\,dt + G_t\,dW_t.
\]
In this notation, Itô's rule takes the symbolic form
\[
du(t,X_t) = u'(t,X_t)\,dt + \sum_i u_i(t,X_t)\,dX^i_t + \frac{1}{2}\sum_{i,j} u_{ij}(t,X_t)\,dX^i_t\,dX^j_t,
\]
where products of differentials are evaluated using the multiplication table
\[
dt\cdot dt = 0,\qquad dt\cdot dW^j_t = 0,\qquad dW^i_t\cdot dW^j_t = \delta_{ij}\,dt.
\]
For example, if (in one dimension) dXt = Ft dt+G1t dWt1 +G2t dWt2 , then (dXt )2 =
{(G1t )2 + (G2t )2 } dt. You should inspect the symbolic expression of the Itô rule care-
fully and convince yourself that it does indeed coincide with theorem 4.4.2.
You can now easily see how extraordinarily simple Itô’s rule really is. If we write
our processes in terms of the “differentials” dXt , dWti , etc., then Itô’s rule reduces
to applying some easy to remember multiplication rules to the differentials. This is
highly reminiscent of the chain rule in ordinary calculus—the first two terms of the
Itô rule are the ordinary chain rule, and Itô’s rule does indeed reduce to the chain rule
when Gij t = 0 (as it should!) When stochastic integrals are present, we evidently need
to take a second-order term into account as well. Once you get used to pushing around
the various symbols in the right way, applying Itô’s rule will be no more difficult than
calculating derivatives in your high school calculus class.
Important example 4.4.4. Let Xt1 , Xt2 be two one-dimensional Itô processes, and
consider the function u(t, x1 , x2 ) = x1 x2 . Then u ∈ C 2 , and so by Itô’s rule
\[
X^1_t X^2_t = X^1_0 X^2_0 + \sum_{k=1}^m \int_0^t \{X^1_s G^{2k}_s + X^2_s G^{1k}_s\}\,dW^k_s
 + \int_0^t \Bigl\{ X^1_s F^2_s + X^2_s F^1_s + \sum_{k=1}^m G^{1k}_s G^{2k}_s \Bigr\}\,ds.
\]
In differential form, we find the product rule d(Xt1 Xt2 ) = Xt1 dXt2 + Xt2 dXt1 +
dXt1 dXt2 . In particular, we see that the class of Itô processes is closed under multi-
plication. Moreover, the class of Itô processes is trivially closed under the formation
of linear combinations, so apparently the Itô processes form an algebra. 5
Let us now proceed to the proof of Itô’s rule. The proof goes a little fast at times;
if we would work out every minor point in the goriest detail, the proof would be many
pages longer. You can fill in the details yourself without too much effort, or look them
up in one of the references mentioned in section 4.7.
Proof of theorem 4.4.2. Until further notice, we consider the case where the integrands Fti and
Gij
t are bounded, simple, and Ft -adapted. We also assume that u(t, x) is independent of t and
that all first and second derivatives of u are uniformly bounded. Once we have proved the Itô
rule for this special case, we will generalize to obtain the full Itô rule.
The key, of course, is Taylor’s formula [Apo69, theorem 9.4]:
\[
u(x) = u(x_0) + \partial u(x_0)\,(x-x_0) + \tfrac{1}{2}\,(x-x_0)^*\,\partial^2 u(x_0)\,(x-x_0) + \|x-x_0\|^2\,E(x,x_0),
\]
where E(x, x0 ) is uniformly bounded and E(x, x0 ) → 0 as kx − x0 k → 0. As Fti and Gij t
are simple integrands, we may assume that they all have the same jump times tk , k = 1, . . . , N
(otherwise we can join all the jump times into one sequence, and proceed with that). Write
\[
u(X_t) = u(X_0) + \sum_{k=0}^{N} \bigl(u(X_{t_{k+1}}) - u(X_{t_k})\bigr),
\]
where we use the convention t0 = 0 and tN +1 = t. We are going to deal with every term of
this sum separately. Let us thus fix some k, and define s^p_ℓ = t_k + ℓ 2^{−p}(t_{k+1} − t_k). Then
\[
u(X_{t_{k+1}}) - u(X_{t_k}) = \sum_{\ell=1}^{2^p} \bigl(u(X_{s^p_\ell}) - u(X_{s^p_{\ell-1}})\bigr)
\]
for any p ∈ ℕ. Note that as F_t and G_t are constant on the interval [t_k, t_{k+1}[, we have, writing Δ = t_{k+1} − t_k,
\[
\begin{aligned}
u(X_{t_{k+1}}) - u(X_{t_k})
&= \sum_{\ell=1}^{2^p} \partial u(X_{s^p_{\ell-1}})\,\bigl(F_{t_k}\,2^{-p}\Delta + G_{t_k}(W_{s^p_\ell}-W_{s^p_{\ell-1}})\bigr) && (4.4.1) \\
&\quad + \frac{1}{2}\sum_{\ell=1}^{2^p} \bigl(G_{t_k}(W_{s^p_\ell}-W_{s^p_{\ell-1}})\bigr)^*\,\partial^2 u(X_{s^p_{\ell-1}})\,\bigl(G_{t_k}(W_{s^p_\ell}-W_{s^p_{\ell-1}})\bigr) && (4.4.2) \\
&\quad + \frac{1}{2}\sum_{\ell=1}^{2^p} \bigl(F_{t_k}\,2^{-p}\Delta\bigr)^*\,\partial^2 u(X_{s^p_{\ell-1}})\,\bigl(F_{t_k}\,2^{-p}\Delta\bigr) && (4.4.3) \\
&\quad + \sum_{\ell=1}^{2^p} \bigl(F_{t_k}\,2^{-p}\Delta\bigr)^*\,\partial^2 u(X_{s^p_{\ell-1}})\,\bigl(G_{t_k}(W_{s^p_\ell}-W_{s^p_{\ell-1}})\bigr) && (4.4.4) \\
&\quad + \sum_{\ell=1}^{2^p} \bigl\|F_{t_k}\,2^{-p}\Delta + G_{t_k}(W_{s^p_\ell}-W_{s^p_{\ell-1}})\bigr\|^2\, E(X_{s^p_\ell}, X_{s^p_{\ell-1}}). && (4.4.5)
\end{aligned}
\]
We now let p → ∞ and look whether the various terms on the right converge in L2 (P).
Consider first the terms (4.4.3) and (4.4.4). As ∂ 2 u(Xs ) is bounded and has continuous
sample paths, the sums in (4.4.3) and (4.4.4) converge in L2 (P) to a time integral and an Itô
integral, respectively. But both terms are premultiplied by 2−p , so we conclude that these terms
converge to zero. Next, note that the term (4.4.5) can be estimated as
\[
\sup_{\ell}\,|E(X_{s^p_\ell}, X_{s^p_{\ell-1}})| \left\{ \Delta^2 \|F_{t_k}\|^2 + \|G_{t_k}\|^2 \sum_{\ell=1}^{2^p} \|W_{s^p_\ell} - W_{s^p_{\ell-1}}\|^2 \right\}.
\]
But the sum converges in L2 (P) to the quadratic variation m∆, while the supremum term
is uniformly bounded and converges a.s. to zero. Hence the entire term converges to zero in
L2 (P). Next, note that the term (4.4.1) converges in L2 (P) to
\[
\int_{t_k}^{t_{k+1}} \partial u(X_r)\,F_{t_k}\,dr + \int_{t_k}^{t_{k+1}} \partial u(X_r)\,G_{t_k}\,dW_r.
\]
It remains to investigate the term (4.4.2). We claim that this term converges in L2 (P) to
\[
\frac{1}{2}\int_{t_k}^{t_{k+1}} \mathrm{Tr}\bigl[\partial^2 u(X_r)\,G_{t_k}(G_{t_k})^*\bigr]\,dr.
\]
But this calculation is almost identical to the calculation of the quadratic variation of the Wiener
process in the proof of lemma 3.1.11, so we will leave it as an exercise.
Finally, summing over all k, we find that Itô's rule is indeed satisfied in the case that F_t^i and G_t^{ij} are F_t-adapted bounded simple integrands, u(t, x) is independent of t and has uniformly bounded first and second derivatives. We now need to generalize this statement.
First, suppose that G_t^{ik} ∈ L^2(µ_T × P) for all i, k, and that F_t satisfies the general condition of the theorem. Then we can find a sequence of simple approximations to F_t, G_t which
converges fast enough, so that the simple Itô processes converge a.s. to Xt . Moreover, the Itô
rule holds for each of these simple approximations. But as we have assumed that the derivatives
of u are bounded and continuous, it is easy to see that the integrands obtained by applying the
Itô rule to the simple approximations converge sufficiently fast to the integrands in the Itô rule
applied to Xt . Taking the a.s. limit, we can conclude using corollary 4.3.4 that the Itô rule still
holds when G_t^{ik} ∈ L^2(µ_T × P) and F_t satisfies the general condition of the theorem.
Our next job is to add time to the picture (but u is still bounded). If u(t, x) is required to
be C 2 in all variables, then this is trivial: Xti = t is an Itô process, so we can always extend the
dimension of our Itô process by one to include time in the picture. To allow u to be only C 1
in time, we can always find (e.g., by convolution) a sequence u^n of C^2 approximations to u so that u^n, u^n_i, u^n_{ij} and (u^n)′ converge uniformly on compact sets fast enough. The result follows by taking the limit. (Actually, if it happens to be the case for some i that G^{im} = 0 for all m,
by taking the limit. (Actually, if it happens to be the case for some i that Gim = 0 for all m,
then we could similarly only require u to be C 1 in the variable xi .)
It remains to weaken the requirement that Gt is square-integrable and that u has bounded
derivatives. But we can solve both these problems simultaneously by localization. Indeed,
choose a localizing sequence τn for Xt , and choose another sequence of stopping times σn %
∞ such that ui (t, Xt ), uij (t, Xt ) and u0 (t, Xt ) are bounded by n for all t < σn . Then τn ∧ σn
is another localizing sequence, and we can apply Itô’s rule to Xt∧τn ∧σn . We are done.
Remark 4.4.5. Suppose that for all times t, the Itô process Xt a.s. takes values in
some open set U . Then, using another localization trick, it suffices for Itô’s theorem
that u(t, x) is C 1 in t and C 2 for x ∈ U . Proving this is a good exercise in localization.
For example, we will sometimes encounter an Itô process such that X_t > 0 a.s. for all t. We can then apply Itô's rule with u(t, x) = √x or even u(t, x) = x^{−1}, even though these functions are not C^2 on all of ℝ (but they are C^2 on ]0, ∞[).
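As a quick worked instance (for a scalar Itô process dX_t = F_t dt + G_t dW_t with X_t > 0 a.s., taken here purely for illustration), Itô's rule with u(x) = x^{−1} gives
\[
d(X_t^{-1}) = -X_t^{-2}\,dX_t + X_t^{-3}\,(dX_t)^2 = \bigl(-X_t^{-2}F_t + X_t^{-3}G_t^2\bigr)\,dt - X_t^{-2}G_t\,dW_t.
\]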
\[
E_P(f(\xi'_1,\dots,\xi'_n)) = \int_{\mathbb{R}^n} f(x_1+a_1,\dots,x_n+a_n(x_1,\dots,x_{n-1}))\,\frac{e^{-(x_1^2+\cdots+x_n^2)/2}}{(2\pi)^{n/2}}\,dx_1\cdots dx_n,
\]
where we have explicitly introduced the predictability assumption by setting a_k = a_k(ξ_1, . . . , ξ_{k−1}). Under Q, on the other hand, we would like to have
\[
\begin{aligned}
E_Q(f(\xi'_1,\dots,\xi'_n)) &= \int_{\mathbb{R}^n} f(x'_1,\dots,x'_n)\,\frac{e^{-((x'_1)^2+\cdots+(x'_n)^2)/2}}{(2\pi)^{n/2}}\,dx'_1\cdots dx'_n \\
&= \int_{\mathbb{R}^n} f(x_1+a_1,\dots,x_n+a_n(x_1,\dots,x_{n-1}))\,\frac{e^{-((x_1+a_1)^2+\cdots+(x_n+a_n(x_1,\dots,x_{n-1}))^2)/2}}{(2\pi)^{n/2}}\,dx_1\cdots dx_n,
\end{aligned}
\]
which is almost the same as in the previous example. (You should verify that this does
not give the right answer if ak is not assumed to be predictable!)
Apparently we can “add” to a sequence of i.i.d. Gaussian random variables an ar-
bitrary predictable process simply by changing to an absolutely continuous probability
measure. This can be very convenient. Many problems, random and non-random, can
be simplified by making a suitable change of coordinates (using, e.g., Itô’s rule). But
here we have another tool at our disposal which is purely probabilistic in nature: we
can try to simplify our problem by changing to a more convenient probability measure.
We will see this idea in action, for example, when we discuss filtering.
6 If you are unconvinced, assume first that f and a_1, . . . , a_n are smooth functions of x_1, . . . , x_n, apply the usual change of variables formula, and then extend to arbitrary f and a_1, . . . , a_n by approximation.
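A minimal Monte Carlo sketch of this change of measure (the predictable drift a_k = sin(ξ_{k−1}) and the test function below are arbitrary illustrative choices): dividing the two Gaussian densities above gives the likelihood ratio exp(−Σ_k a_k ξ_k − ½ Σ_k a_k²), and weighting samples drawn under P by this ratio should make the shifted variables ξ'_k behave like i.i.d. standard Gaussians.

import numpy as np

# Under P, xi_1, ..., xi_n are i.i.d. N(0,1) and xi'_k = xi_k + a_k(xi_1,...,xi_{k-1}).
# Weighting by Lambda = exp(-sum_k a_k xi_k - 0.5 sum_k a_k^2), the ratio of the two
# Gaussian densities above, computes expectations under Q, under which the xi'_k are
# again i.i.d. N(0,1). The drift a_k and the test function f are arbitrary choices.
rng = np.random.default_rng(2)
n, samples = 5, 200_000
xi = rng.normal(size=(samples, n))

xi_prime = np.empty_like(xi)
log_weight = np.zeros(samples)
for k in range(n):
    a_k = np.sin(xi[:, k - 1]) if k > 0 else np.zeros(samples)   # predictable drift
    xi_prime[:, k] = xi[:, k] + a_k
    log_weight += -a_k * xi[:, k] - 0.5 * a_k ** 2

weight = np.exp(log_weight)
f = xi_prime.sum(axis=1)                    # f(xi'_1, ..., xi'_n) = xi'_1 + ... + xi'_n
print(np.mean(weight * f))                  # E_Q(f)   -- should be close to 0
print(np.mean(weight * f ** 2))             # E_Q(f^2) -- should be close to n = 5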
With this bit of discrete time intuition under our belt, the Girsanov theorem should
come as no great surprise. Indeed, it is simply the appropriate extension of the previ-
ous example, with a Wiener process replacing the i.i.d. sequence ξ k .
Theorem 4.5.3 (Girsanov). Let Wt be an m-dimensional Ft -Wiener process on the
probability space (Ω, F, {Ft }t∈[0,T ] , P), and let Xt be an Itô process of the form
\[
X_t = \int_0^t F_s\,ds + W_t, \qquad t\in[0,T],
\]
where F_t is Itô integrable and satisfies Novikov's condition E_P[exp(½∫_0^T ‖F_t‖² dt)] < ∞. Define the measure Q by dQ = Λ_T dP, where
\[
\Lambda_t = \exp\Bigl( -\int_0^t (F_s)^*\,dW_s - \frac{1}{2}\int_0^t \|F_s\|^2\,ds \Bigr).
\]
Then X_t, t ∈ [0, T], is an m-dimensional F_t-Wiener process under Q.
Proof. Note that
\[
\Lambda_t = 1 - \int_0^t \Lambda_s\,(F_s)^*\,dW_s,
\]
where we have applied Itô’s rule to obtain the expression on the right. Hence Λt is a local
martingale. In fact, Novikov’s condition implies that Λt is a true martingale: see theorem 4.5.8.
Let us assume until further notice that Ft is bounded; we will generalize later. We wish
to prove that Xt is an m-dimensional Ft -Wiener process under Q. Clearly Xt has continuous
sample paths, so it suffices to verify its finite dimensional distributions. In fact, by a simple
induction argument, it suffices to prove that for any Fs -measurable bounded random variable
Z, the increment Xt − Xs (t > s) is an m-dimensional Gaussian random variable with zero
mean and covariance matrix (t − s)I (I is the identity matrix), independent of Z. We will do
this as in the proof of theorem 3.2.5 by calculating the characteristic function. Given α ∈ Rm
and applying lemma 4.5.6 (use the boundedness of Ft ) we find that the real and imaginary parts
of (iα − Fsi )Msα are all in L2 (µT × P). Hence Mtα is indeed a martingale. But then
\[
\begin{aligned}
E_Q\bigl(e^{i\alpha^*(X_t-X_s)+i\beta Z}\bigr)
&= e^{-\|\alpha\|^2(t-s)/2}\,E_P\Bigl(\Lambda_s\, e^{i\beta Z}\, E_P\bigl(e^{\int_s^t (i\alpha-F_r)^*\,dW_r - \frac{1}{2}\int_s^t \|i\alpha-F_r\|^2\,dr}\,\big|\,\mathcal{F}_s\bigr)\Bigr) \\
&= e^{-\|\alpha\|^2(t-s)/2}\,E_P\bigl(E_P(\Lambda_T e^{i\beta Z}\,|\,\mathcal{F}_s)\bigr)
 = e^{-\|\alpha\|^2(t-s)/2}\,E_Q\bigl(e^{i\beta Z}\bigr).
\end{aligned}
\]
As the characteristic function factorizes into the characteristic function of Z and the character-
istic function of an m-dimensional Gaussian random variable with mean zero and covariance
(t − s)I, we find that Xt − Xs is independent of Z and has the desired distribution. Hence we
have proved the theorem for the case that Ft is bounded.
Let us now tackle the general case where F_t is not necessarily bounded. Define the processes G_t^n = F_t I_{‖F_t‖≤n}; then G_t^n is bounded for any n. Moreover,
\[
\exp\Bigl(E_P\Bigl(\tfrac{1}{2}\int_0^T \|F_t\|^2\,dt\Bigr)\Bigr) \le E_P\Bigl(\exp\Bigl(\tfrac{1}{2}\int_0^T \|F_t\|^2\,dt\Bigr)\Bigr) < \infty
\]
by Jensen's inequality.
7 Abuse of notation alert: here x^* means the transpose of the vector x, not the conjugate transpose; in particular, α^* = α for all α ∈ ℂ. Similarly, ‖x‖² denotes Σ_i (x^i)². We have not defined complex Itô integrals, but you may set ∫_0^t (a_s + i b_s) dW_s ≡ ∫_0^t a_s dW_s + i ∫_0^t b_s dW_s when a_s and b_s are real-valued, i.e., the complex integral is the linear extension of the real integral. As this is the only place, other than in theorem 3.2.5, where we will use complex numbers, we will put up with our lousy notation just this once.
Let us continue with this subsequence, i.e., we define F_t^n = G_t^{m(n)}. For any n, define X_t^n, Λ_t^n, Λ^n = Λ_T^n and Q^n by replacing F_t by F_t^n in their definitions. Then X_t^n is a Wiener process under Q^n by what we have already proved; it therefore suffices to show that Λ^n e^{iα^*(X_t^n − X_s^n)+iβZ} → Λ e^{iα^*(X_t−X_s)+iβZ} and Λ^n e^{iβZ} → Λ e^{iβZ} in L^1(P).
Once this has been established, we are done: taking the limit as n → ∞, we find that E_Q(e^{iα^*(X_t−X_s)+iβZ}) = e^{−‖α‖²(t−s)/2} E_Q(e^{iβZ}), which is precisely what we want to show.
To proceed, let us first show that Λn → Λ in L1 (P). Note that Λn → Λ a.s., and using
theorem 4.5.8 we find that EP (Λn ) = EP (Λ) = 1. As (Λn − Λ)− ≤ Λ and Λn → Λ a.s., we
find that EP ((Λn − Λ)− ) → 0. Similarly, EP ((Λn − Λ)+ ) = EP (Λn − Λ + (Λn − Λ)− ) → 0.
Hence EP (|Λn − Λ|) = EP ((Λn − Λ)+ + (Λn − Λ)− ) → 0, and we have determined that
Λn → Λ in L1 (P). (The previous argument is also known as Scheffé’s lemma.)
The remaining work is easy. Clearly Λ^n e^{iβZ} → Λ e^{iβZ} in L^1(P), as e^{iβZ} is independent of n and has bounded real and imaginary parts. To deal with Λ^n e^{iα^*(X_t^n−X_s^n)+iβZ}, note that its real and imaginary parts are of the form α_n β_n where α_n is bounded and converges a.s. to α, while β_n converges to β in L^1(P). But then the following expression converges to zero:
\[
E_P(|\alpha_n\beta_n - \alpha\beta|) \le E_P(|\alpha_n-\alpha|\,|\beta|) + E_P(|\alpha_n|\,|\beta_n-\beta|) \le E_P(|\alpha_n-\alpha|\,|\beta|) + K\,E_P(|\beta_n-\beta|),
\]
using dominated convergence for the first term on the right and L^1(P) convergence for the second. Hence Λ^n e^{iα^*(X_t^n−X_s^n)+iβZ} → Λ e^{iα^*(X_t−X_s)+iβZ} in L^1(P), and we are done.
Lemma 4.5.6. Let F_t be Itô integrable and let W_t be a Wiener process. Then
\[
E\Bigl(\exp\Bigl(\frac{1}{2}\int_0^t F_s\,dW_s\Bigr)\Bigr) \le \sqrt{E\Bigl(\exp\Bigl(\frac{1}{2}\int_0^t (F_s)^2\,ds\Bigr)\Bigr)}.
\]
8 By convergence in L^1(P), we mean that the real and imaginary parts converge in L^1(P).
9 To be precise, we should first check that M_t ∈ L^1(P), otherwise the conditional expectation is not defined. But an identical application of Fatou's lemma shows that E(M_t) ≤ E(M_0), so M_t ∈ L^1(P).
Proof. As exp(½∫_0^t F_s dW_s) = exp(½∫_0^t F_s dW_s − ¼∫_0^t (F_s)² ds) · exp(¼∫_0^t (F_s)² ds),
\[
E\Bigl(\exp\Bigl(\frac{1}{2}\int_0^t F_s\,dW_s\Bigr)\Bigr) \le \sqrt{E\Bigl(\exp\Bigl(\frac{1}{2}\int_0^t (F_s)^2\,ds\Bigr)\Bigr)}\;\sqrt{E\Bigl(e^{\int_0^t F_s\,dW_s - \frac{1}{2}\int_0^t (F_s)^2\,ds}\Bigr)}
\]
using Hölder's inequality. But using Itô's rule, we find that
\[
e^{\int_0^t F_s\,dW_s - \frac{1}{2}\int_0^t (F_s)^2\,ds} \equiv M_t = 1 + \int_0^t M_s F_s\,dW_s,
\]
so M_t is a nonnegative local martingale, hence a supermartingale, and E(M_t) ≤ M_0 = 1; this yields the result.
It remains to prove Novikov’s theorem, i.e., that the Novikov condition implies
that Λt is a martingale; this was key for the Girsanov theorem to hold. Evidently it
suffices to prove that EP (ΛT ) = 1 (lemma 4.5.5). To show this, let us introduce the
following supporting lemma; the proof of Novikov’s theorem then essentially amounts
to reducing the Novikov condition to this lemma.
Lemma 4.5.7. Let Mt be a nonnegative local martingale and let τn be a reducing
sequence. If supn kMT ∧τn kp < ∞ for some p > 1, then {Mt }t∈[0,T ] is a martingale.
Proof. We begin by writing
E(|MT −MT ∧τn |) ≤ E(MT −r∧MT )+E(|r∧MT −r∧MT ∧τn |)+E(MT ∧τn −r∧MT ∧τn ).
We claim that if we take the limit as n → ∞ and as r → ∞, in that order, then the right-hand
side vanishes. This is clear for the first two terms, using dominated convergence and the fact
that E(MT ) ≤ E(M0 ) < ∞ by the supermartingale property. To tackle the last term, note that
E(MT ∧τn − r ∧ MT ∧τn ) = E(MT ∧τn IMT ∧τn >r ) − r P(MT ∧τn > r).
Using Chebyshev’s inequality,
\[
r\,P(M_{T\wedge\tau_n} > r) \le \frac{r\,E((M_{T\wedge\tau_n})^p)}{r^p} \le \frac{\sup_n E((M_{T\wedge\tau_n})^p)}{r^{p-1}},
\]
so as n → ∞ and r → ∞ this term vanishes. Now let 0 < r ≤ x; then we have the trivial
estimate x ≤ r 1−p xp for p > 1. Hence
\[
E(M_{T\wedge\tau_n} I_{M_{T\wedge\tau_n}>r}) \le r^{1-p}\,E((M_{T\wedge\tau_n})^p I_{M_{T\wedge\tau_n}>r}) \le r^{1-p} \sup_n E((M_{T\wedge\tau_n})^p),
\]
so this term also vanishes in the limit. Hence E(|MT − MT ∧τn |) → 0 as n → ∞. But then
E(MT ) = limn→∞ E(MT ∧τn ) = E(M0 ), so Mt is a martingale by lemma 4.5.5.
Theorem 4.5.8 (Novikov). For Itô integrable F_t, define the local martingale
\[
E_t(F_\cdot) = \exp\Bigl(\int_0^t (F_s)^*\,dW_s - \frac{1}{2}\int_0^t \|F_s\|^2\,ds\Bigr).
\]
Suppose furthermore that the following condition is satisfied:
\[
E\Bigl[\exp\Bigl(\frac{1}{2}\int_0^T \|F_s\|^2\,ds\Bigr)\Bigr] = K < \infty.
\]
Then {E_t(F_·)}_{t∈[0,T]} is an F_t-martingale.
Proof. We are going to show below that E_t(F_· √α) is a martingale for all 0 < α < 1. Let us first argue that the result follows from this. By inspection, we easily find that
\[
E_t(F_\cdot\sqrt{\alpha}) = (E_t(F_\cdot))^{\alpha}\, e^{\sqrt{\alpha}(1-\sqrt{\alpha})\int_0^t (F_s)^*\,dW_s}.
\]
But as x^{(1+√α)/2√α} is a convex function for all 0 < α < 1, we find
\[
1 \le (E(E_T(F_\cdot)))^{\alpha}\,\bigl(E\bigl(e^{\frac{1}{2}\int_0^T (F_s)^*\,dW_s}\bigr)\bigr)^{2\sqrt{\alpha}(1-\sqrt{\alpha})} \le (E(E_T(F_\cdot)))^{\alpha}\,K^{\sqrt{\alpha}(1-\sqrt{\alpha})},
\]
where we have used lemma 4.5.6. Letting α ↗ 1, we find that E(E_T(F_·)) ≥ 1. But we already know that E(E_T(F_·)) ≤ 1 by lemma 4.5.5, so the result follows.
It remains to show that E_t(F_· √α) is a martingale for any 0 < α < 1. Fix α. As E_t(F_· √α) is a local martingale, we can choose a localizing sequence τ_n. By lemma 4.5.7, it suffices to prove that sup_n E((E_{T∧τ_n}(F_· √α))^u) < ∞ for some u > 1. But exactly as above, we find
\[
(E_t(F_\cdot\sqrt{\alpha}))^u = (E_t(F_\cdot))^{u\alpha}\, e^{u\sqrt{\alpha}(1-\sqrt{\alpha})\int_0^t (F_s)^*\,dW_s},
\]
where we have used that Et (F· ) is a supermartingale and that lemma 4.5.6 still holds when t is
replaced by t ∧ τn (just replace Fs by Fs Is<τn , then apply lemma 4.5.6). Taking the supremum
over n, we obtain what we set out to demonstrate. The proof is complete.
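A simple illustration of the condition: if ‖F_t‖ ≤ C a.s. for all t ∈ [0, T] (as for the bounded integrands used in the first part of the proof of Girsanov's theorem), then E[exp(½∫_0^T ‖F_s‖² ds)] ≤ e^{C²T/2} < ∞, so Novikov's condition is satisfied automatically.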
ourselves to a more limited setting than in the rest of the chapter. We will work on a
probability space (Ω, F, P) on which is defined a Wiener process Wt (we will take it
to be one-dimensional, though you can easily extend the result to higher dimensions).
The restriction will come in due to the fact that we only consider the natural filtration
FtW = σ{Ws : s ≤ t}, unlike in the rest of the chapter where we could work with a
larger filtration Ft . The statement of the theorem is then as follows.
Theorem 4.6.1 (Martingale representation). Let Mt be an FtW -martingale such
that MT ∈ L2 (P). Then for a unique FtW -adapted process {Ht }t∈[0,T ] in L2 (µT ×P)
\[
M_t = M_0 + \int_0^t H_s\,dW_s \quad \text{a.s. for all } t \in [0,T],
\]
for a unique FtW -adapted process {Ht }t∈[0,T ] in L2 (µT ×P). It remains to note that E(MT ) =
M0 (as F0W is the trivial filtration, M0 must be non-random) and that the martingale represen-
tation result follows from the Itô representation of MT using Mt = E(MT |FtW ).
Lemma 4.6.3. Introduce the following class of F_T^W-measurable random variables:
\[
S = \{f(W_{t_1},\dots,W_{t_n}) : n\in\mathbb{N},\ 0\le t_1<\dots<t_n\le T,\ f\in C_0^\infty(\mathbb{R}^n)\}
\]
(recall that C_0^∞ is the class of smooth functions with compact support). Then for any ε > 0 and F_T^W-measurable X ∈ L^2(P), there is a Y ∈ S such that ‖X − Y‖_2 < ε.
Proof. First, we claim that the statement holds if f is just assumed to be Borel-measurable
rather than C0∞ . To show this, introduce the filtration Gn = σ{Wk2−n T : k = 0, . . . , 2n },
and note that FTW = σ{Gn : n = 1, 2, . . .}. Fix some FTW -measurable X ∈ L2 (P), and
define the sequence X n = E(X|Gn ). Then X n → X in L2 (P) by lemma 4.6.4 below. But
X n = f (W2−n T , . . . , WT ) for some Borel function f (as it is Gn -measurable), so X can be
approximated arbitrarily closely by Borel functions of the Wiener process at a finite number
of times. Note that we may also restrict ourselves to bounded Borel functions: after all, the
sequence X n ∧ n of bounded random variables converges to X as well.
We now claim that any bounded Borel function f can be approximated arbitrarily well
by functions f n ∈ C ∞ . But this is well known: the approximations f n can be found, e.g.,
by convolving f with a smooth function of compact support, and f n (W2−n T , . . . , WT ) →
f (W2−n T , . . . , WT ) in L2 (P) by dominated convergence. It remains to note that we can re-
strict ourselves to functions in C0∞ , as we can always multiply f n by g n , where g n is a sequence
of [0, 1]-valued functions with compact support such that g n % 1 pointwise, and dominated
convergence still gives the desired result. We are done.
In particular, this implies that any FTW -measurable random variable X ∈ L2 (P) can
be approximated arbitrarily closely in L2 (P) by an Itô integral.
Proof. Let us first consider the simplest case Y = f (Wt ), where f ∈ C0∞ and t ∈ [0, T ]. Note
that for any function g(s, x) which is sufficiently smooth, we have by Itô’s rule
\[
g(t,W_t) = g(0,0) + \int_0^t \Bigl(\frac{\partial g}{\partial s} + \frac{1}{2}\frac{\partial^2 g}{\partial x^2}\Bigr)(s,W_s)\,ds + \int_0^t \frac{\partial g}{\partial x}(s,W_s)\,dW_s.
\]
Hence it suffices to choose a sufficiently smooth g(s, x) such that
\[
\frac{\partial g}{\partial s} + \frac{1}{2}\frac{\partial^2 g}{\partial x^2} = 0, \qquad g(t,x) = f(x).
\]
But such a function exists and is even sufficiently smooth: explicitly,
\[
g(s,x) = \frac{1}{\sqrt{2\pi(t-s)}}\int_{-\infty}^{\infty} f(y)\, e^{-(x-y)^2/2(t-s)}\,dy, \qquad s < t.
\]
Hence the integrand H_s = (∂g/∂x)(s, W_s) gets the job done, and is clearly F_t^W-adapted. It is also in L^2(µ_T × P), as f is smooth with compact support, so that ∂g/∂x is uniformly bounded. Hence in the very simplest case, we are done.
Let us now consider the next most difficult case: Y = f (Wr , Wt ) with r < t. Introduce
g(s, x, z), and apply Itô’s rule to g(s, Ws∧r , Ws ). We find that
\[
g(t,W_r,W_t) = g(r,W_r,W_r) + \int_r^t \Bigl(\frac{\partial g}{\partial s} + \frac{1}{2}\frac{\partial^2 g}{\partial z^2}\Bigr)(s,W_r,W_s)\,ds + \int_r^t \frac{\partial g}{\partial z}(s,W_r,W_s)\,dW_s.
\]
But for some function g′(s, x), applying Itô's rule to g′(s, W_s) gives
\[
g'(r,W_r) = g'(0,0) + \int_0^r \Bigl(\frac{\partial g'}{\partial s} + \frac{1}{2}\frac{\partial^2 g'}{\partial x^2}\Bigr)(s,W_s)\,ds + \int_0^r \frac{\partial g'}{\partial x}(s,W_s)\,dW_s.
\]
Hence it suffices to choose g and g′ such that
\[
\frac{\partial g}{\partial s} + \frac{1}{2}\frac{\partial^2 g}{\partial z^2} = 0,\quad g(t,x,z) = f(x,z), \qquad
\frac{\partial g'}{\partial s} + \frac{1}{2}\frac{\partial^2 g'}{\partial x^2} = 0,\quad g'(r,x) = g(r,x,x).
\]
This can be done exactly as before, and we set
\[
H_s = \frac{\partial g'}{\partial x}(s,W_s) \ \text{ for } 0\le s\le r, \qquad H_s = \frac{\partial g}{\partial z}(s,W_r,W_s) \ \text{ for } r\le s\le t.
\]
Proceeding in the same manner, we find by induction that the result holds for any Y ∈ S.
by the Itô isometry. But this can only be the case if H_· = H_·′ µ_T × P-a.s.
Now that we have Itô integrals, we can introduce stochastic differential equations—
one of the major reasons to set up the Itô theory in the first place. After dealing with
issues of existence and uniqueness, we will exhibit an important property of stochastic
differential equations: the solutions of such equations obey the Markov property. A
particular consequence is the connection with the classic PDE methods for studying
diffusions, the Kolmogorov forward (Fokker-Planck) and backward equations.
We would like to think of stochastic differential equations (SDE) as ordinary dif-
ferential equations (ODE) driven by white noise. Unfortunately, this connection is not
entirely clear; after all, we have only justified the connection between the Itô integral
and white noise in the case of non-random integrands (interpreted as test functions).
We will show that if we take a sequence of ODEs, driven by approximations to white
noise, then these do indeed limit to an SDE—though not entirely the expected one.
This issue is particularly important in the stochastic modelling of physical systems. A
related issue is the simulation of SDE on a computer, which we will briefly discuss.
Finally, the chapter concludes with a brief discussion of some more advanced
topics in stochastic differential equations.
which looks almost like an ordinary differential equation. However, as usual, the “Itô
differentials” are not sensible mathematical objects in themselves; rather, we should
If there exists a stochastic process Xt that satisfies this equation, we say that it solves
the stochastic differential equation. The main goal of this section is to find conditions
on the coefficients b and σ that guarantee the existence and uniqueness of solutions.
Example 5.1.1 (Linear SDE). Let W_t be an m-dimensional Wiener process, let A be an n × n matrix and let B be an n × m matrix. Then the n-dimensional equation
\[
dX_t = A X_t\,dt + B\,dW_t, \qquad X_0 = x,
\]
is called a linear stochastic differential equation. Such equations always have a solution; in fact, the solution can be given explicitly by
\[
X_t = e^{At} x + \int_0^t e^{A(t-s)} B\,dW_s,
\]
as you can verify directly by applying Itô’s rule. The solution is also unique. To see
this, let Yt be another solution with the same initial condition. Then
\[
X_t - Y_t = \int_0^t A(X_s - Y_s)\,ds \;\Longrightarrow\; \frac{d}{dt}(X_t - Y_t) = A(X_t - Y_t), \quad X_0 - Y_0 = 0,
\]
and it is a standard fact that the unique solution of this equation is Xt − Yt = 0.
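A minimal numerical sketch of this example (scalar case n = m = 1, with arbitrary constants a, b and initial condition x): the stochastic integral in the explicit solution can be approximated by a Riemann sum over the Wiener increments, and the sample mean and variance of X_T compared with the exact values E(X_T) = e^{aT}x and Var(X_T) = b²(e^{2aT} − 1)/(2a).

import numpy as np

# Scalar linear SDE dX = a X dt + b dW with explicit solution
#   X_T = e^{aT} x + int_0^T e^{a(T-s)} b dW_s.
# The stochastic integral is approximated on a grid; a, b, x, T are arbitrary choices.
rng = np.random.default_rng(3)
a, b, x = -0.5, 0.7, 1.0
T, p, samples = 1.0, 1000, 20_000
dt = T / p
s = np.arange(1, p + 1) * dt                          # grid points for the integrand
dW = rng.normal(0.0, np.sqrt(dt), size=(samples, p))

X_T = np.exp(a * T) * x + (np.exp(a * (T - s)) * b * dW).sum(axis=1)

print(X_T.mean(), np.exp(a * T) * x)                             # sample vs exact mean
print(X_T.var(), b ** 2 * (np.exp(2 * a * T) - 1) / (2 * a))     # sample vs exact variance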
For nonlinear b and σ one can rarely write down the solution in explicit form, so
we have to resort to a less explicit proof of existence. The same problem appears
in the theory of ordinary differential equations where it is often resolved by imposing
Lipschitz conditions, whereupon existence can be proved by Picard iteration (see, e.g.,
[Apo69, theorem 7.19]). It turns out that this technique works almost identically in
the current setting. Let us work out the details. Recall what it means to be Lipschitz:
Definition 5.1.2. A function f : Rn → Rm is called Lipschitz continuous (or just
Lipschitz) if there exists a constant K < ∞ such that kf (x) − f (y)k ≤ Kkx − yk
for all x, y ∈ Rn . A function g : S × Rn → Rm is Lipschitz uniformly in S if
kg(s, x) − g(s, y)k ≤ Kkx − yk for a constant K < ∞ which does not depend on s.
We work in the following setting, where we restrict ourselves to a finite time hori-
zon [0, T ] for simplicity. Consider a filtered probability space (Ω, F, {F t }t∈[0,T ] , P)
on which is defined an m-dimensional Ft -Wiener process Wt . We now choose X0
to be an F0 -measurable n-dimensional random variable (it is often chosen to be non-
random, but this is not necessary), and we seek a solution to the equation
\[
X_t = X_0 + \int_0^t b(s,X_s)\,ds + \int_0^t \sigma(s,X_s)\,dW_s.
\]
We claim that under the conditions which we have imposed, P(Y· ) is again an Ft -adapted
process in L2 (µT × P) (we will show this shortly). Our goal is to find a fixed point of the
operator P: i.e., we wish to find an Ft -adapted process X· ∈ L2 (µT × P) such that P(X· ) =
X· . Such an X· is then, by definition, a solution of our stochastic differential equation.
We begin by showing that P does indeed map to an Ft -adapted process in L2 (µT × P).
Note that kb(t, x)k ≤ kb(t, x) − b(t, 0)k + kb(t, 0)k ≤ Kkxk + K 0 ≤ C(1 + kxk) where
K, K 0 , C < ∞ are constants that do not depend on t. We say that b satisfies a linear growth
condition. Clearly the same argument holds for σ, and to make our notation lighter we will
choose our constant C such that kσ(t, x)k ≤ C(1 + kxk) as well. We can now estimate
\[
\|P(Y_\cdot)\|_{2,\mu_T\times P} \le \|X_0\|_{2,\mu_T\times P} + \Bigl\|\int_0^\cdot b(s,Y_s)\,ds\Bigr\|_{2,\mu_T\times P} + \Bigl\|\int_0^\cdot \sigma(s,Y_s)\,dW_s\Bigr\|_{2,\mu_T\times P}.
\]
The first term gives ‖X_0‖_{2,µ_T×P} = √T ‖X_0‖_{2,P} < ∞ by assumption. Next,
\[
\Bigl\|\int_0^\cdot b(s,Y_s)\,ds\Bigr\|_{2,\mu_T\times P}^2 \le T^2\,\|b(\cdot,Y_\cdot)\|_{2,\mu_T\times P}^2 \le T^2 C^2\,\bigl\|1+\|Y_\cdot\|\bigr\|_{2,\mu_T\times P}^2 < \infty,
\]
where we have used (t^{−1}∫_0^t a_s ds)² ≤ t^{−1}∫_0^t a_s² ds (Jensen's inequality), the linear growth condition, and Y_· ∈ L^2(µ_T × P). Finally, let us estimate the stochastic integral term:
\[
\Bigl\|\int_0^\cdot \sigma(s,Y_s)\,dW_s\Bigr\|_{2,\mu_T\times P}^2 \le T\,\|\sigma(\cdot,Y_\cdot)\|_{2,\mu_T\times P}^2 \le T C^2\,\bigl\|1+\|Y_\cdot\|\bigr\|_{2,\mu_T\times P}^2 < \infty,
\]
where we have used the Itô isometry. Hence kP(Y· )k2,µT ×P < ∞ for Ft -adapted Y· ∈
L2 (µT × P), and clearly P(Y· ) is Ft -adapted, so the claim is established.
Our next claim is that P is a continuous map; i.e., we claim that if kY·n − Y· k2,µT ×P → 0,
then kP(Y·n ) − P(Y· )k2,µT ×P → 0 as well. But proceeding exactly as before, we find that
\[
\|P(Y^n_\cdot)-P(Y_\cdot)\|_{2,\mu_T\times P} \le T\,\|b(\cdot,Y^n_\cdot)-b(\cdot,Y_\cdot)\|_{2,\mu_T\times P} + \sqrt{T}\,\|\sigma(\cdot,Y^n_\cdot)-\sigma(\cdot,Y_\cdot)\|_{2,\mu_T\times P},
\]
where K is a Lipschitz constant for both b and σ. This establishes the claim.
With these preliminary issues out of the way, we now come to the heart of the proof.
Starting from an arbitrary Ft -adapted process Yt0 in L2 (µT × P), consider the sequence
Y·1 = P(Y·0 ), Y·2 = P(Y·1 ) = P2 (Y·0 ), etc. This is called Picard iteration. We will
show below that Y·n is a Cauchy sequence in L2 (µT × P); hence it converges to some Ft -
adapted process Yt in L2 (µT × P). But then Y· is necessarily a fixed point of P: after all,
P(Y·n ) → P(Y· ) by the continuity of P, whereas P(Y·n ) = Y·n+1 → Y· . Thus P(Y· ) = Y· ,
and we have found a solution of our stochastic differential equation with the desired properties.
It only remains to show that Y·n is a Cauchy sequence in L2 (µT × P). This follows from
a slightly refined version of the argument that we used to prove continuity of P. Note that
\[
\|(P(Z_\cdot))_t - (P(Y_\cdot))_t\|_{2,P} \le \sqrt{t}\,\|b(\cdot,Z_\cdot)-b(\cdot,Y_\cdot)\|_{2,\mu_t\times P} + \|\sigma(\cdot,Z_\cdot)-\sigma(\cdot,Y_\cdot)\|_{2,\mu_t\times P},
\]
which follows exactly as above. In particular, using the Lipschitz property, we find
\[
\|(P(Z_\cdot))_t - (P(Y_\cdot))_t\|_{2,P} \le K(\sqrt{T}+1)\,\|Z_\cdot - Y_\cdot\|_{2,\mu_t\times P}.
\]
Set L = K(√T + 1). Iterating this bound, we obtain
which establishes that Pn (Y·0 ) is a Cauchy sequence in L2 (µT × P). We are done.
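A minimal sketch of the Picard iteration on a fixed discretization grid (the Lipschitz coefficients b(x) = −x and σ(x) = 0.2 cos x, the horizon, and the grid size are arbitrary illustrative choices; one Wiener path is fixed and both integrals in the map P are discretized on that grid):

import numpy as np

# Picard iteration Y^{k+1} = X0 + int b(Y^k) ds + int sigma(Y^k) dW on a fixed grid,
# for one fixed Wiener path. The printed sup-distances between successive iterates
# should decrease towards zero. Coefficients and parameters are arbitrary choices.
rng = np.random.default_rng(4)
T, p, x0 = 1.0, 1000, 1.0
dt = T / p
dW = rng.normal(0.0, np.sqrt(dt), size=p)

b = lambda x: -x
sigma = lambda x: 0.2 * np.cos(x)

Y = np.full(p + 1, x0)                       # initial guess Y^0: the constant path
for k in range(15):
    Y_new = np.empty(p + 1)
    Y_new[0] = x0
    Y_new[1:] = x0 + np.cumsum(b(Y[:-1]) * dt + sigma(Y[:-1]) * dW)
    print(k, np.max(np.abs(Y_new - Y)))      # sup-distance between successive iterates
    Y = Y_new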
Remark 5.1.4. The condition X0 ∈ L2 (P) can be relaxed through localization, see
[GS96, theorem VIII.3.1]; we then have a solution of the SDE for any initial condi-
tion, but we need no longer have Xt , b(t, Xt ), and σ(t, Xt ) in L2 (µT × P). More
interesting, perhaps, is that if X0 ∈ Lp (P) (p ≥ 2), then we can prove with a little
more work that Xt , b(t, Xt ), and σ(t, Xt ) will actually be in Lp (µT ×P) (see [LS01a,
sec. 4.4] or [Arn74, theorem 7.1.2]). Hence the integrability of the initial condition
really determines the integrability of the solution in the Lipschitz setting.
It remains to prove uniqueness of the solution found in theorem 5.1.3.
Theorem 5.1.5 (Uniqueness). The solution of theorem 5.1.3 is unique P-a.s.
Proof. Let X· be the solution of theorem 5.1.3, and let Y· be any other solution. It suffices to
show that X· = Y· µT × P-a.s.; after all, both Xt and Yt must have continuous sample paths,
so X· = Y· µT × P-a.s. implies that they are P-a.s. indistinguishable (lemma 2.4.6).
Let us first suppose that Y· ∈ L2 (µT × P) as well; then Pn (Y· ) = Y· and Pn (X· ) = X· .
Using the estimate in the proof of theorem 5.1.3, we find that
\[
\|Y_\cdot - X_\cdot\|_{2,\mu_T\times P}^2 = \|P^n(Y_\cdot) - P^n(X_\cdot)\|_{2,\mu_T\times P}^2 \le \frac{L^{2n} T^n}{n!}\,\|Y_\cdot - X_\cdot\|_{2,\mu_T\times P}^2.
\]
Now let τn = inf{t : kYt k ≥ n}, and note that this sequence of stopping times is a localizing
sequence for the stochastic integral; in particular,
\[
E(\|Y_{t\wedge\tau_n}\|^2) = E(\|X_0\|^2) + E\Bigl(\int_0^{t\wedge\tau_n} \bigl(2(Y_s)^* b(s,Y_s) + \|\sigma(s,Y_s)\|^2\bigr)\,ds\Bigr).
\]
Using Fatou’s lemma on the left and monotone convergence on the right to let n → ∞, applying
Tonelli’s theorem, and using the simple estimate (a + b)2 ≤ 2(a2 + b2 ), we obtain
\[
E(1+\|Y_t\|^2) \le E(1+\|X_0\|^2) + 2C(2+C)\int_0^t E(1+\|Y_s\|^2)\,ds.
\]
But then we find that E(1 + kYt k2 ) ≤ E(1 + kX0 k2 ) e2C(2+C)t using Gronwall’s lemma, from
which the claim follows easily. Hence the proof is complete.
Theorem 5.2.2 (Markov property). Suppose that the conditions of theorem 5.1.3
hold. Then the unique solution Xt of the corresponding SDE is an Ft -Markov process.
Proof. Let us begin by rewriting the SDE in the following form:
\[
X_t = X_s + \int_s^t b(r,X_r)\,dr + \int_s^t \sigma(r,X_r)\,dW_r,
\]
where Yr = Xr+s , b̃(r, x) = b(r + s, x), σ̃(r, x) = σ(r + s, x), and W̃r = Wr+s − Ws . But
this equation for Yt is again an SDE that satisfies the conditions of theorems 5.1.3 and 5.1.5 in
the interval r ∈ [0, T −s], and in particular, it follows that Yr is σ{Y0 , W̃s : s ≤ r}-measurable.
Identically, we find that Xt is σ{Xs , Wr − Ws : r ∈ [s, t]}-measurable, and can hence be
written as a measurable functional Xt = F (Xs , W·+s − Ws ). Now using Fubini’s theorem
exactly as in lemma 3.1.9, we find that E(g(Xt )|Fs ) = E(g(F (x, W·+s −Ws )))|x=Xs for any
bounded measurable function g, so in particular E(g(Xt )|Fs ) = E(g(Xt )|Xs ) by the tower
property of the conditional expectation. But this is the Markov property, so we are done.
Remark 5.2.3 (Strong Markov property). The solutions of Lipschitz stochastic dif-
ferential equations, and in particular the Wiener process itself, actually satisfy a much
stronger variant of the Markov property. Let τ be an a.s. finite stopping time; then it
turns out that E(g(Xτ +r )|Fτ ) = E(g(Xτ +r )|Xτ ). This is called the strong Markov
property, which extends the Markov property even to random times. This fact is often
very useful, but we will not prove it here; see, e.g., [Fri75, theorem 5.3.4].
The Markov property implies that for any bounded and measurable f , we have
E(f (Xt )|Fs ) = gt,s (Xs ) for some (non-random) measurable function gt,s . For the
rest of this section, let us assume for simplicity that b(t, x) and σ(t, x) are independent
of t (this is not essential, but will make the notation a little lighter); then you can read
off from the previous proof that in fact E(f (Xt )|Fs ) = gt−s (Xs ) for some function
gt−s . We say that the Markov process is time-homogeneous in this case.
Rather than studying the random process Xt , we can now study how the non-
random function gt varies with t. This is a standard idea in the theory of Markov
processes. Note that if E(f(X_t)|F_s) = g_{t−s}(X_s) and E(f′(X_t)|F_s) = g′_{t−s}(X_s), then E(αf(X_t) + βf′(X_t)|F_s) = αg_{t−s}(X_s) + βg′_{t−s}(X_s); i.e., the map P_t : f ↦ g_t
is linear. Moreover, using the tower property of the conditional expectation, you can
easily convince yourself that Pt gs = gt+s . Hence Pt Ps = Pt+s , so the family {Pt }
forms a semigroup. This suggests¹ that we can write something like
\[
\frac{d}{dt}\,P_t f = L\,P_t f, \qquad P_0 f = f,
\]
where L is a suitable linear operator. If such an equation holds for a sufficiently large
class of functions f , then L is called the infinitesimal generator of the semigroup P t .
1 Think of a finite-dimensional semigroup Pt x = eAt x, where A is a square matrix and x is a vector.
Making these ideas mathematically sound is well beyond our scope; but let us show
that under certain conditions, we can indeed obtain an equation of this form for P t f .
Proposition 5.2.4 (Kolmogorov backward equation). For g ∈ C², define
\[
L g(x) = \sum_{i=1}^n b^i(x)\,\frac{\partial g}{\partial x^i}(x) + \frac{1}{2}\sum_{i,j=1}^n \sum_{k=1}^m \sigma^{ik}(x)\sigma^{jk}(x)\,\frac{\partial^2 g}{\partial x^i \partial x^j}(x).
\]
Suppose there is a bounded function u(t, x), C¹ in t and C² in x, that satisfies
\[
\frac{\partial}{\partial t}\,u(t,x) = L\,u(t,x), \qquad u(0,x) = f(x).
\]
Then E(f (Xt )|Fs ) = u(t − s, Xs ) a.s. for all 0 ≤ s ≤ t ≤ T , i.e., u(t, x) = Pt f (x).
Remark 5.2.5. The operator L should look extremely familiar—this is precisely the
expression that shows up in Itô’s rule! Not surprisingly, this is the key to the proof.
L will show up frequently in the rest of the course.
Remark 5.2.6. You might wonder why the above PDE is called the backward equa-
tion. In fact, we can just as easily write the equation backwards in time: setting
v(t, x) = u(T − t, x) and using the chain rule, we obtain
\[
\frac{\partial}{\partial t}\,v(t,x) + L\,v(t,x) = 0, \qquad v(T,x) = f(x),
\]
which has a terminal condition (at t = T ) rather than an initial condition (at t =
0). For time-nonhomogeneous Markov processes, the latter (backward) form is the
appropriate one, so it is in some sense more fundamental. As we have assumed time-
homogeneity, however, the two forms are completely equivalent in our case.
Remark 5.2.7. The choice to present proposition 5.2.4 in this way raises a dilemma.
In principle the result is “the wrong way around”: we would like to use the expression
E(f (Xt )|Fs ) = u(t − s, Xs ) to define u(t, x), and then prove that u(t, x) must con-
sequently satisfy the PDE. This is indeed possible in many cases, see [Fri75, theorem
5.6.1]; it is a more technical exercise, however, as we would have to prove that u(t, x)
is sufficiently smooth rather than postulating it. More generally, one could prove that
this PDE almost always makes sense, in a suitable weak sense, even when u(t, x) is
not sufficiently differentiable. Though this is theoretically interesting, it is not obvious
how to use such a result in applications (are there numerical methods for solving such
equations?). We will face this dilemma again when we study optimal control.
After all the remarks, the proof is a bit of an anti-climax.
Proof. Set v(r, x) = u(t − r, x), and apply Itô’s rule to Yr = v(r, Xr ). Then we obtain
\[
v(t,X_t) = v(0,X_0) + \int_0^t \bigl(v'(r,X_r) + L\,v(r,X_r)\bigr)\,dr + \text{local martingale}.
\]
The time integral vanishes by the Kolmogorov backward equation, so v(t, Xt ) is a local mar-
tingale. Introducing a localizing sequence τn % ∞, we find using the martingale property
But as we have assumed that v is bounded, we obtain using dominated convergence for condi-
tional expectations that E(f (Xt )|Fs ) = E(v(t, Xt )|Fs ) = v(s, Xs ) = u(t − s, Xs ).
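A minimal numerical sanity check of this correspondence (illustrative choice, not from the text: the scalar equation dX_t = −X_t dt + dW_t with f(x) = x, for which u(t, x) = x e^{−t} solves ∂u/∂t = L u with L u = −x ∂u/∂x + ½ ∂²u/∂x²):

import numpy as np

# Compare u(t, x0) = x0 * exp(-t), the solution of the backward equation for the
# Ornstein-Uhlenbeck-type SDE dX = -X dt + dW with f(x) = x, against a Monte Carlo
# estimate of E(f(X_t) | X_0 = x0) obtained from an Euler discretization of the SDE.
rng = np.random.default_rng(5)
t, x0, p, samples = 1.0, 2.0, 500, 50_000
dt = t / p

X = np.full(samples, x0)
for _ in range(p):
    X += -X * dt + rng.normal(0.0, np.sqrt(dt), size=samples)

print(X.mean())           # Monte Carlo estimate of E(f(X_t))
print(x0 * np.exp(-t))    # u(t, x0) from the backward equation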
Let us now investigate the Kolmogorov forward equation, which is in essence the
dual of the backward equation. The idea is as follows. For a fixed time t, the random
variable Xt is just an Rn -valued random variable. If we are in luck, then the law of
this random variable is absolutely continuous with respect to the Lebesgue measure,
and, in particular, we can write undergraduate-style
\[
E(f(X_t)) = \int_{\mathbb{R}^n} f(y)\,p_t(y)\,dy
\]
with some probability density pt (y). More generally, we could try to find a transition
density pt (x, y) that satisfies for all sufficiently nice f
\[
E(f(X_t)\,|\,\mathcal{F}_s) = \int_{\mathbb{R}^n} f(y)\,p_{t-s}(X_s,y)\,dy.
\]
The existence of such densities is a nontrivial matter; in fact, there are many reason-
able models for which they do not exist. On the other hand, if they were to exist, one
can ask whether pt (y) or pt (x, y) can again be obtained as the solution of a PDE.
Let us consider, in particular, the (unconditional) density pt (y). Note that the
tower property of the conditional expectation implies that
\[
\int_{\mathbb{R}^n} f(y)\,p_t(y)\,dy = E(f(X_t)) = E\bigl(E(f(X_t)|X_0)\bigr) = \int_{\mathbb{R}^n} P_t f(y)\,p_0(y)\,dy,
\]
Proof. Fix an f ∈ C02 (in C 2 and with compact support). By Itô’s rule, we obtain
\[
f(X_t) = f(X_0) + \int_0^t L f(X_s)\,ds + \text{martingale}
\]
(the last term is a martingale as f , and hence its derivatives, have compact support, and thus the
integrand is bounded). Taking the expectation and using Fubini’s theorem, we obtain
\[
E(f(X_t)) = E(f(X_0)) + \int_0^t E(L f(X_s))\,ds.
\]
Substituting the definition of pt (y), integrating by parts, and using Fubini’s theorem again,
\[
\int_{\mathbb{R}^n} f(y)\,p_t(y)\,dy = \int_{\mathbb{R}^n} f(y)\,p_0(y)\,dy + \int_{\mathbb{R}^n} f(y)\int_0^t L^* p_s(y)\,ds\,dy.
\]
Now note that this expression holds for any f ∈ C02 , so we can conclude that
\[
\alpha(y) = p_t(y) - p_0(y) - \int_0^t L^* p_s(y)\,ds = 0
\]
for all y, except possibly on some subset with measure zero with respect to the Lebesgue measure. To see this, let κ ∈ C_0^∞ be a nonnegative function such that κ(y) = 1 for ‖y‖ ≤ K. As ∫ α(y)f(y) dy = 0 for any f ∈ C_0^2, we find in particular that
\[
\int_{\|y\|\le K} |\alpha(y)|^2\,dy \le \int_{\mathbb{R}^n} \kappa(y)\,|\alpha(y)|^2\,dy = 0
\]
by setting f (y) = κ(y)α(y). But then evidently the set {y : kyk ≤ K, α(y) 6= 0} has
measure zero, and as this is the case for any K the claim is established. But α(y) must then be
zero everywhere, as it is continuous in y (this follows by dominated convergence, as L^* p_t(y)
is continuous in (t, y), and hence locally bounded). It remains to take the time derivative.
Remark 5.2.10. As stated, these theorems are not too useful; the backward equa-
tion requires us to show the existence of a sufficiently smooth solution to the back-
ward PDE, while for the forward equation we somehow need to establish that the
density of Xt exists and is sufficiently smooth. As a rule of thumb, the backward
equation is very well behaved, and will often have a solution provided only that f
is sufficiently smooth; the forward equation is much less well behaved and requires
stronger conditions on the coefficients b and σ. This is why the backward equa-
tion is often more useful as a mathematical tool. Of course, this is only a rule of
thumb; a good source for actual results is the book by Friedman [Fri75]. A typical
condition for the existence of a smooth density is the uniform ellipticity requirement Σ_{i,j,k} v^i σ^{ik}(x)σ^{jk}(x) v^j ≥ ε‖v‖² for all x, v ∈ ℝ^n and some ε > 0.
by white noise; but does this actually make sense? The resolution of this problem is
important if we want to use SDE to model noise-driven physical systems.
To study this problem, let us start with ordinary differential equations that are
driven by rapidly fluctuating—but not white—noise. In particular, define X tn to be
the solution of the ordinary differential equation
\[
\frac{d}{dt}\,X^n_t = b(X^n_t) + \sigma(X^n_t)\,\xi^n_t, \qquad X^n_0 = X_0,
\]
where b and σ are Lipschitz continuous as usual, X0 is a random variable independent
of ξtn , and ξtn is some “nice” m-dimensional random process which “approximates”
white noise. What does this mean? By “nice”, we mean that it is sufficiently smooth
that the above equation has a unique solution in the usual ODE sense; to be precise, we
will assume that every sample path of ξtn is piecewise continuous. By “approximates
white noise”, we mean that there is an m-dimensional Wiener process W t such that
\[
\sup_{t\in[0,T]} \|W^n_t - W_t\| \xrightarrow{\ n\to\infty\ } 0 \quad\text{a.s.}, \qquad W^n_t = \int_0^t \xi^n_s\,ds.
\]
In other words, the time integral of ξtn approximates (uniformly) the Wiener process
Wt , which conforms to our intuition of white noise as the derivative of a Wiener
process. You can now think of the processes Xtn as being physically realistic models;
on the other hand, these models are almost certainly not Markov, for example. The
question is whether when n is very large, Xtn is well approximated by the solution of
a suitable SDE. That SDE is then the corresponding idealized model, which, formally,
corresponds to replacing ξtn by white noise.
Can we implement these ideas? Let us first consider the simplest case.
Proposition 5.3.1. Suppose that σ(x) = σ does not depend on x, and consider the
SDE dXt = b(Xt ) dt + σ dWt . Then supt∈[0,T ] kXtn − Xt k → 0 a.s.
Proof. Note that in this case, we can write
\[
X^n_t - X_t = \int_0^t \bigl(b(X^n_s) - b(X_s)\bigr)\,ds + \sigma\,(W^n_t - W_t).
\]
Hence we obtain using the triangle inequality and the Lipschitz property
\[
\|X^n_t - X_t\| \le K \int_0^t \|X^n_s - X_s\|\,ds + \|\sigma\| \sup_{t\in[0,T]} \|W^n_t - W_t\|.
\]
Apparently the processes Xtn limit to the solution of an SDE, which is precisely
the equation that we naively expect, when σ is constant. When σ(x) does depend on
x, however, we are in for a surprise. For sake of demonstration we will develop this
case only in the very simplest setting, following in the footsteps of Wong and Zakai
[WZ65]. This is sufficient to see what is going on and avoids excessive pain and
suffering; a more general result is quoted at the end of the section.
Let us make the following assumptions (even simpler than those in [WZ65]).
1. Xtn and ξtn are scalar processes (we work in one dimension);
2. b and σ are Lipschitz continuous and bounded;
3. σ is C 1 and σ(x) dσ(x)/dx is Lipschitz continuous; and
4. σ(x) ≥ β for all x and some β > 0.
The claim is that the solutions of the ODEs
\[
\frac{d}{dt}\,X^n_t = b(X^n_t) + \sigma(X^n_t)\,\xi^n_t, \qquad X^n_0 = X_0,
\]
converge, as n → ∞, to the solution of the following SDE:
\[
dX_t = \Bigl\{ b(X_t) + \frac{1}{2}\,\sigma(X_t)\,\frac{d\sigma}{dx}(X_t) \Bigr\}\,dt + \sigma(X_t)\,dW_t.
\]
By our assumptions, the latter equation still has Lipschitz coefficients and thus has
a unique solution. The question is, of course, why the additional term in the time
integral (the Itô correction term) has suddenly appeared out of nowhere.
Remark 5.3.2. Let us give a heuristic argument for why we expect the Itô correction
to be there. Let f : R → R be a diffeomorphism (a smooth bijection with smooth
inverse). Then setting Ytn = f (Xtn ) and using the chain rule gives another ODE
d n df −1 n df −1 n
Yt = (f (Yt )) b(f −1 (Ytn )) + (f (Yt )) σ(f −1 (Ytn )) ξtn .
dt dx dx
The only thing that has happened here is a (smooth) change of variables. If our limit
as n → ∞ is consistent, then it should commute with such a change of variables,
i.e., it should not matter whether we first perform a change of variables and then take
the white noise limit, or whether we first take the white noise limit and then make
a change of variables (after all, we have not changed anything about our system, we
have only reparametrized it!). Let us verify that this is indeed the case.
We presume that the limit as n → ∞ works the way we claim it does. Then to
obtain the limiting SDE for Ytn , we need to calculate the corresponding Itô correction
term. This is a slightly gory computation, but some calculus gives
\[
\text{Itô corr.} = \frac{1}{2}\,\sigma(f^{-1}(x))\,\frac{df}{dx}(f^{-1}(x))\,\frac{d\sigma}{dx}(f^{-1}(x)) + \frac{1}{2}\,\bigl(\sigma(f^{-1}(x))\bigr)^2\,\frac{d^2 f}{dx^2}(f^{-1}(x)).
\]
In particular, we expect that Y^n_t limits to the solution Y_t of the SDE
\[
\begin{aligned}
dY_t &= \frac{df}{dx}(f^{-1}(Y_t))\, b(f^{-1}(Y_t))\,dt + \frac{df}{dx}(f^{-1}(Y_t))\,\sigma(f^{-1}(Y_t))\,dW_t \\
&\quad + \frac{1}{2}\Bigl\{ \sigma(f^{-1}(Y_t))\,\frac{df}{dx}(f^{-1}(Y_t))\,\frac{d\sigma}{dx}(f^{-1}(Y_t)) + \bigl(\sigma(f^{-1}(Y_t))\bigr)^2\,\frac{d^2 f}{dx^2}(f^{-1}(Y_t)) \Bigr\}\,dt.
\end{aligned}
\]
But this is precisely the same expression as we would obtain by applying Itô’s rule
to Yt = f (Xt )! Hence we do indeed find that our limit is invariant under change of
variables, precisely as it should be. On the other hand, if we were to neglect to add the
Itô correction term, then you can easily verify that this would no longer be the case.
In some sense, the Itô correction term “corrects” for the fact that integrals
\[
\int_0^t \cdots\,dW_s \qquad\text{and}\qquad \int_0^t \cdots\,\xi^n_s\,ds
\]
do not obey the same calculus rules. The additional term in the Itô rule as compared
to the ordinary chain rule is magically cancelled by the Itô correction term, thus pre-
venting us from ending up with an unsettling paradox.
That the Itô correction term should cancel the additional term in the Itô rule does
not only guide our intuition; this idea is implicitly present in the proof. Notice what
happens below when we calculate Φ(Xt ) and Φ(Xtn )!
Theorem 5.3.3 (Wong-Zakai). supt∈[0,T ] |Xtn −Xt | → 0 a.s. (assuming 1–4 above).
Proof. Consider the function Φ(x) = ∫_0^x (σ(y))^{−1} dy, which is well defined and C² by the assumption that σ is C¹ and σ(y) ≥ β > 0. Then we obtain
\[
\frac{d}{dt}\,\Phi(X^n_t) = \frac{b(X^n_t)}{\sigma(X^n_t)} + \xi^n_t, \qquad
d\Phi(X_t) = \frac{b(X_t)}{\sigma(X_t)}\,dt + dW_t,
\]
using the chain rule and the Itô rule, respectively. In particular, we can estimate
\[
|\Phi(X^n_t) - \Phi(X_t)| \le \int_0^t \Bigl| \frac{b(X^n_s)}{\sigma(X^n_s)} - \frac{b(X_s)}{\sigma(X_s)} \Bigr|\,ds + \sup_{t\in[0,T]} |W^n_t - W_t|.
\]
Applying Gronwall’s lemma and taking the limit as n → ∞ completes the proof.
The result that we have just proved is too restrictive to be of much practical use;
however, the lesson learned is an important one. Similar results can be obtained in
higher dimensions, with multiple driving noises, unbounded coefficients, etc., at the
expense of a large number of gory calculations. To provide a result that is sufficiently
general to be of practical interest, let us quote the following theorem from [IW89,
theorem VI.7.2] (modulo some natural, but technical, conditions).
You have all the tools you need to prove this theorem, provided you have a reliable
supply of scratch paper and a pen which is not about to run out of ink. A brief glance
at the proof in [IW89] will convince you why it is omitted here.
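A minimal numerical illustration of the correction term (the choice b = 0, σ(x) = x is made purely because everything is then explicitly solvable; it violates the technical assumption σ ≥ β > 0 above): driving the ODE with the piecewise-constant derivative of a polygonal approximation of W produces exp(W_T), the solution of the corrected Itô equation dX = ½X dt + X dW, and not exp(W_T − T/2), the solution of the naive equation dX = X dW.

import numpy as np

# ODE dX/dt = X * xi^n with xi^n the derivative of a polygonal approximation of W:
# on each linear piece the ODE is solved exactly, multiplying X by exp(dW). The result
# matches exp(W_T) (corrected Ito SDE dX = 0.5 X dt + X dW) and differs from
# exp(W_T - T/2) (naive SDE dX = X dW). The agreement is exact here only because this
# particular example is explicitly solvable.
rng = np.random.default_rng(6)
T, n = 1.0, 1000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W_T = dW.sum()

X = 1.0
for dw in dW:
    X *= np.exp(dw)            # exact solution of the ODE on one linear piece of W^n

print(X)                       # ODE driven by the smoothed noise
print(np.exp(W_T))             # corrected Ito SDE: exp(W_T)
print(np.exp(W_T - T / 2))     # naive SDE without the correction: exp(W_T - T/2)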
Remark 5.3.5 (Fisk-Stratonovich integrals). As mentioned previously, the reason
for the Itô correction term is essentially that the Itô integral does not obey the ordinary
chain rule. This is by no means a conceptual problem; you should simply see the
definition of the Itô integral as a mathematical construction, while the theorems in this
section justify the modelling of physical phenomena within this framework (and tell
us how this should be done properly). However, we have an alternative choice at our
disposal: we can choose a different definition for the stochastic integral which does
obey the chain rule, as was done by Fisk and Stratonovich. When expressed in terms
of the Fisk-Stratonovich (FS-)integral, it is precisely the Itô correction which vanishes
and we are left with an SDE which looks identical to the ODEs we started with.
There are many problems with the FS-integral, however. First of all, the inte-
gral is not a martingale, and its expectation consequently rarely vanishes. This means
that this integral is extremely inconvenient in computations that involve expectations.
Second, the FS-integral is much less general than the Itô integral, in the sense that
the class of stochastic processes which are integrable is significantly smaller than the
Itô integrable processes. In fact, the most mathematically sound way to define the
FS-integral is as the sum of an Itô integral and a correction term (involving quadratic
variations of the integrand and the Wiener process), see [Pro04]. Hence very little is
won by using the FS-integral, except a whole bunch of completely avoidable inconve-
nience. What you win is that the ordinary chain rule holds for the FS-integral, but the
Itô rule is just as easy to remember as the chain rule once you know what it is!
For these reasons, we will avoid discussing FS-integrals any further in this course.
That being said, however, there is one important case where FS-integrals make more
sense than Itô integrals. If we are working on a manifold rather than in R n , the FS-
integral can be given an intrinsic (coordinate-free) meaning, whereas this is not true
for the Itô integral. This makes the FS-integral the tool of choice for studying stochas-
tic calculus in manifolds (see, e.g., [Bis81]). Itô integrals can also be defined in this
setting, but one needs some additional structure: a Riemannian connection.
This expression cannot be used as a numerical method, as X_{t_n} depends not only on X_{t_{n-1}} but on all X_s in the interval s ∈ [t_{n-1}, t_n]. As X_s has continuous sample paths, however, it seems plausible that X_s ≈ X_{t_{n-1}} for s ∈ [t_{n-1}, t_n], provided that p is sufficiently large. Then we can try to approximate
\[
X_{t_n} \approx X_{t_{n-1}} + \int_{t_{n-1}}^{t_n} b(X_{t_{n-1}})\,ds + \int_{t_{n-1}}^{t_n} \sigma(X_{t_{n-1}})\,dW_s,
\]
or, equivalently,
\[
X_{t_n} \approx X_{t_{n-1}} + b(X_{t_{n-1}})\,(t_n - t_{n-1}) + \sigma(X_{t_{n-1}})\,(W_{t_n} - W_{t_{n-1}}).
\]
This simple recursion is easily implemented on a computer, where we can obtain a
suitable sequence Wtn − Wtn−1 by generating i.i.d. m-dimensional Gaussian random
variables with mean zero and covariance (T /p)I using a (pseudo-)random number
generator. The question that we wish to answer is whether this algorithm really does
approximate the solution of the full SDE when p is sufficiently large.
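A minimal sketch of the resulting recursion (scalar case; the coefficients, horizon, and number of steps in the usage example are arbitrary illustrative choices):

import numpy as np

def euler_maruyama(b, sigma, x0, T, p, rng):
    """Euler-Maruyama recursion for the scalar SDE dX = b(X) dt + sigma(X) dW.

    Returns the approximate sample path on the grid t_n = n T / p, n = 0, ..., p,
    using i.i.d. Gaussian Wiener increments of variance T / p.
    """
    dt = T / p
    X = np.empty(p + 1)
    X[0] = x0
    for n in range(1, p + 1):
        dW = rng.normal(0.0, np.sqrt(dt))
        X[n] = X[n - 1] + b(X[n - 1]) * dt + sigma(X[n - 1]) * dW
    return X

# Usage with arbitrary Lipschitz coefficients:
rng = np.random.default_rng(7)
path = euler_maruyama(b=lambda x: -x, sigma=lambda x: 0.3, x0=1.0, T=1.0, p=1000, rng=rng)
print(path[-1])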
The remainder of this section is devoted to proving convergence of the Euler-Maruyama method. Before we proceed to the proof of that result, we need a simple
estimate on the increments of the solution of an SDE.
Lemma 5.4.1. Under the assumptions of theorem 5.1.3, we can estimate
\[
\|X_t - X_s\|_{2,P} \le L\,\sqrt{t-s}, \qquad 0 \le s \le t \le T,
\]
where the constant L depends only on T , X0 , b and σ.
Proof. The arguments are similar to those used in the proof of theorem 5.1.3. Write
\[
E(\|X_t - X_s\|^2) \le 2\,E\biggl(\Bigl\|\int_s^t b(X_r)\,dr\Bigr\|^2\biggr) + 2\,E\biggl(\Bigl\|\int_s^t \sigma(X_r)\,dW_r\Bigr\|^2\biggr),
\]
using the linear growth condition and the same identity for (a + b)2 . Similarly,
Z t
2 ! Z t Z t
E
σ(Xr ) dWr
= E(kσ(X )k 2
) dr ≤ 2C E(1 + kXr k2 ) dr,
r
s s s
using the Itô isometry. But it was established in the proof of theorem 5.1.5 that E(1 + kXr k2 )
is bounded by a constant that only depends on X0 , T and C. Hence the result follows.
where we have used $e^B \ge 1 + B$. But $\alpha_0 \le A = A e^{B\cdot 0}$, so the result follows by induction.
Taking the expectation and proceeding as in the proof of theorem 5.1.3, we can estimate
\[
\|X_s - Y_s\|_{2,P} = \|X_s - Y_{t_{k-1}}\|_{2,P} \le \|X_s - X_{t_{k-1}}\|_{2,P} + \|X_{t_{k-1}} - Y_{t_{k-1}}\|_{2,P},
\]
so that
\[
E(\|X_s - Y_s\|^2) \le \frac{2L^2 T}{p} + 2\, E(\|X_{t_{k-1}} - Y_{t_{k-1}}\|^2).
\]
Thus we can now write, using the definition of C1 ,
\[
E(\|X_{t_n} - Y_{t_n}\|^2) \le \frac{3C_1^2 + 6K^2 L^2 T^2 (T+1)}{p} + \frac{6K^2 T (T+1)}{p} \sum_{k=1}^n E(\|X_{t_{k-1}} - Y_{t_{k-1}}\|^2).
\]
If we set $J_n = \max_{0\le i\le n} E(\|X_{t_i} - Y_{t_i}\|^2)$, then we can evidently write
\[
J_n \le \frac{C_3}{p} + \frac{C_4 T}{p} \sum_{k=1}^n J_{k-1},
\]
where we have replaced some of the unsightly expressions by friendly-looking symbols. But
we can now apply the discrete Gronwall lemma 5.4.2 to obtain $J_n \le (C_3/p)\exp(C_4 T n/p)$.
Hence the result follows if we define $C_2 = \sqrt{C_3}\,\exp(C_4 T /2)$.
2 For the particular case of small noise, there is a powerful theory to study the asymptotic properties of
SDEs as ε → 0: the Freidlin-Wentzell theory of large deviations [FW98]. Unfortunately, we will not have
the time to explore this interesting subject. The theorems in this section are fundamentally different; they
are not asymptotic in nature, and work for any SDE (provided a suitable function V can be found!).
where we have used Fatou’s lemma and monotone convergence to take the limit as n → ∞ on
the left- and right-hand sides, respectively. The conclusion of the result is straightforward.
A very different situation is one where for the SDE with non-random X 0
We now claim that for every ε > 0, there exists an α > 0 such that kx − x∗ k ≥ ε implies
V (x) ≥ α (for x ∈ U ); indeed, just choose α to be the minimum of V (x) over the compact
set {x ∈ U : kx − x∗ k ≥ ε} (which is nonempty for sufficiently small ε), and this minimum is
strictly positive by our assumptions on V . Hence for any ε > 0, there is an α > 0 such that
\[
P\!\left( \sup_{t \ge 0} \|X_{t\wedge\tau} - x^*\| \ge \varepsilon \right) \le \frac{V(X_0)}{\alpha},
\]
and the term on the right can be made arbitrarily small by choosing X0 sufficiently close to x∗ .
Finally, it remains to note that supt≥0 kXt − x∗ k ≥ ε implies supt≥0 kXt∧τ − x∗ k ≥ ε if ε
is sufficiently small that kx − x∗ k ≤ ε implies that x ∈ U .
But the term on the left is nonnegative and nondecreasing by our assumptions, so we obtain
\[
E\left( \int_0^\tau (-L V)(X_s)\, ds \right) \le V(X_0) < \infty.
\]
If τ = ∞ for some sample path ω, then we conclude that at least lim inf s→∞ (−L V )(Xs ) = 0
for that sample path (except possibly in a set of measure zero). But by our assumption that
L V (x) < 0 for x 6= x∗ , an entirely parallel argument to the one used in the previous proof
establishes that τ = ∞ implies lim inf s→∞ kXs −x∗ k = 0 and even lim inf s→∞ V (Xs ) = 0.
Finally, let us obtain a condition for global stability. The strategy should look a
little predictable by now, and indeed there is nothing new here; we only need to assume
that our function V is radially unbounded to be able to conclude that V (X t ) → 0
implies Xt → x∗ (as we are no longer working in a bounded neighborhood).
Proposition 5.5.6. Suppose there exists V : Rn → [0, ∞[ which is C 2 and satisfies
V (x∗ ) = 0, V (x) > 0 and L V (x) < 0 for any x ∈ Rn such that x 6= x∗ . Moreover,
suppose V (x) → ∞ and |L V (x)| → ∞ as kxk → ∞. Then x∗ is globally stable.
Proof. Using Itô’s rule we obtain, by choosing a suitable localizing sequence τn % ∞,
\[
E\left( \int_0^{t\wedge\tau_n} (-L V)(X_s)\, ds \right) = V(X_0) - E(V(X_{t\wedge\tau_n})) \le V(X_0) < \infty,
\]
But using the fact that |L V (x)| → ∞ as kxk → ∞, we find that lim inf s→∞ V (Xs ) = 0
a.s. On the other hand, by Itô’s rule, we find that V (Xt ) is the sum of a nonincreasing process
and a nonnegative local martingale. But then V (Xt ) is a nonnegative supermartingale, and the
martingale convergence theorem applies. Thus V (Xt ) → 0 a.s. It remains to note that we can
conclude that Xt → x∗ a.s. using the fact that V (x) → ∞ as kxk → ∞.
Consider the ODE
\[
\frac{d}{dt} X(t) = 3\,(X(t))^{2/3}, \qquad X(0) = 0.
\]
Then X(t) = (t − a)3 ∨ 0 is a perfectly respectable solution for any a > 0. Evidently,
this equation has many solutions for the same initial condition, so uniqueness fails!
On the other hand, consider the ODE
\[
\frac{d}{dt} X(t) = (X(t))^2, \qquad X(0) = 1.
\]
This equation is satisfied only by X(t) = (1 − t)−1 for t < 1, but the solution blows
up at t = 1. Hence a solution does not exist if we are interested, for example, in the
interval t ∈ [0, 2]. Note that neither of these examples satisfies the Lipschitz condition.
There is a crucial difference between these two examples, however. In the first
example, the Lipschitz property fails at x = 0. On the other hand, in the second
example the Lipschitz property fails as x → ∞, but in any compact set the Lipschitz
property still holds. Such a function is called locally Lipschitz continuous.
Definition 5.6.1. f : Rn → Rm is called locally Lipschitz continuous if for any
r < ∞, there is a Kr < ∞ such that kf (x)−f (y)k ≤ Kr kx−yk for all kxk, kyk ≤ r.
For locally Lipschitz coefficients, we have the following result.
Theorem 5.6.2. Suppose that b and σ are locally Lipschitz continuous. Then the SDE
\[
X_t = X_0 + \int_0^t b(X_s)\, ds + \int_0^t \sigma(X_s)\, dW_s,
\]
has a unique solution in the time interval [0, ζ[, where the stopping time ζ is called
the explosion time (ζ may be ∞ with positive probability).
Remark 5.6.3. A similar result holds with time-dependent coefficients; we restrict
ourselves to the time-homogeneous case for notational simplicity only.
Proof. For any r < ∞, we can find functions br (x) and σr (x) which are (globally) Lipschitz
and such that b(x) = br (x) and σ(x) = σr (x) for all kxk ≤ r. For the SDE with coefficients
br and σr and the initial condition X0 (r) = X0 IkX0 k≤r , we can find a unique solution Xt (r)
for all t ∈ [0, ∞[ using theorem 5.1.3 (and by trivial localization). Now denote by $\tau_r =
I_{\|X_0\|\le r}\, \inf\{t : \|X_t(r)\| \ge r\}$, and note that this is a stopping time. Moreover, the process
Xt∧τr (r) evidently satisfies the SDE in the statement of the theorem for t < τr . Hence we
obtain a unique solution for our SDE in the interval [0, τr ]. But we can do this for any r < ∞,
so letting r → ∞ we obtain a unique solution in the interval [0, ζ[ with ζ = limr→∞ τr .
The proof of this result is rather telling; in going from global Lipschitz coefficients
to local Lipschitz coefficients, we proceed as we have done so often by introducing a
localizing sequence of stopping times and constructing the solution up to every stop-
ping time. Unlike in the case of the Itô integral, however, these stopping times may
accumulate—and we end up with an explosion at the accumulation point.
All is not lost, however: there are many SDEs whose coefficients are only locally
Lipschitz, but which nonetheless do not explode! Here is one possible condition.
Proposition 5.6.4. If kX0 k2,P < ∞, b and σ are locally Lipschitz continuous and
satisfy a linear growth condition, then the explosion time ζ = ∞ a.s.
Recall that for Lipschitz coefficients, the linear growth condition follows (see the
proof of theorem 5.1.3). In the locally Lipschitz setting this is not the case, however,
and we must impose it as an additional condition (evidently with desirable results!).
Proof. Proceeding as in the proof of theorem 5.1.5, we find that E(kXt∧ζ k2 ) < ∞ for all
t < ∞. But then $\|X_{t\wedge\zeta}\| < \infty$ a.s. for all t < ∞, so ζ = ∞ a.s. (as $\|X_\zeta\| = \infty$ by definition!).
Remark 5.6.5. All of the conditions which we have discussed for the existence and
uniqueness of solutions are only sufficient, but not necessary. Even an SDE with very
strange coefficients may have a unique, non-exploding solution; but if it does not fall
under any of the standard categories, it might take some specialized work to prove
that this is indeed the case. An example of a useful SDE that is not covered by our
theorems is the Cox-Ingersoll-Ross equation for the modelling of interest rates:
\[
dX_t = (a - b X_t)\, dt + \sigma \sqrt{|X_t|}\, dW_t, \qquad X_0 > 0,
\]
with a, b, σ > 0. Fortunately, however, many (if not most) SDEs which are encoun-
tered in applications have at least locally Lipschitz coefficients.
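As an aside, one can still apply the Euler-Maruyama recursion of the previous sections to such an equation in practice; the Python sketch below does this for the Cox-Ingersoll-Ross equation with purely illustrative parameter values. Note that the Lipschitz-based convergence theory developed above does not cover this case, so the sketch is only a heuristic.
\begin{verbatim}
import numpy as np

def cir_euler(a, b, sigma, x0, T, p, rng=None):
    """Heuristic Euler-Maruyama sketch for
    dX_t = (a - b X_t) dt + sigma * sqrt(|X_t|) dW_t."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / p
    dW = rng.normal(0.0, np.sqrt(dt), size=p)
    x = np.empty(p + 1)
    x[0] = x0
    for n in range(p):
        x[n + 1] = x[n] + (a - b * x[n]) * dt + sigma * np.sqrt(abs(x[n])) * dW[n]
    return x

print(cir_euler(a=0.5, b=1.0, sigma=0.3, x0=1.0, T=1.0, p=1000)[-1])
\end{verbatim}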
There is an entirely different concept of what it means to obtain a solution of
a stochastic differential equation, which we will now discuss very briefly. Let us
consider the simplest example: we wish to find a solution of the SDE
\[
X_t = \int_0^t b(X_s)\, ds + W_t,
\]
where b is some bounded measurable function. Previously, we considered W t as being
a given Wiener process, and we sought to find the solution Xt with respect to this
particular Wiener process. This is called a strong solution. We can, however, ask a
different question: if we do not start from a fixed Wiener process, can we construct (on
some probability space) both a Wiener process Wt and a process Xt simultaneously
such that the above equation holds? If we can do this, then the solution is called a weak
solution. Surprisingly, we can always find a weak solution of the above equation—
despite the fact that we have imposed almost no structure on b!
Let us perform this miracle. We start with some probability space (Ω, F, P), on
which is defined a Wiener process Xt . Note that Xt is now the Wiener process! Next,
we perform a cunning trick. We introduce a new measure Q as follows:
\[
\left.\frac{dQ}{dP}\right|_{\mathcal{F}_t} = \exp\left( \int_0^t b(X_s)\, dX_s - \frac{1}{2} \int_0^t (b(X_s))^2\, ds \right).
\]
Then the process $W_t = X_t - \int_0^t b(X_s)\,ds$ is a Wiener process under Q. But now we are done: we have constructed a process Xt
and a Wiener process Wt on the space (Ω, F, Q), so that the desired SDE is satisfied!
You might well be regarding this story with some amount of suspicion—where is
the catch? If we fix in hindsight the Wiener process which we have just constructed,
and ask for a solution with respect to that Wiener process, then can we not regard
Xt as a strong solution with respect to Wt ? There is a subtle but very important
reason why this is not the case. When we constructed strong solutions, we found
that the solution Xt was a functional of the driving noise: a strong solution Xt is
FtW = σ{Ws : s ≤ t} measurable. This is precisely what you would expect from
the point of view of causality: the noise drives a physical system, and thus the state of
the physical system is a functional of the realization of the noise. On the other hand,
if you look carefully at the construction of our weak solution, you will find precisely
the opposite conclusion: that the noise Wt is FtX = σ{Xs : s ≤ t} measurable.
Evidently, for a weak solution the noise is a functional of the solution of the SDE.
Thus it appears that causality is reversed in the weak solution case.
For this reason, you might want to think twice before using weak solutions in
modelling applications; the concept of a weak solution is much more probabilistic in
nature, while strong solutions are much closer to the classical notion of a differential
equation (as our existence and uniqueness proofs, the Wong-Zakai theorem, and the
Euler-Maruyama method abundantly demonstrate). Nonetheless weak solutions are
an extremely valuable technical tool, both for mathematical purposes and in appli-
cations where the existence of solutions in a strong sense may be too restrictive or
difficult to verify. Of course, many weak solutions are also strong solutions, so the
dilemma only appears if it turns out that a strong solution does not exist.
Stochastic optimal control is a highly technical subject, much of which centers around
mathematical issues of existence and regularity and is not directly relevant from an
engineering perspective. Nonetheless the theory has a large number of applications,
many (but not all) of which revolve around the important linear case. In this course
we will avoid almost all of the technicalities by focusing on the so-called “verification
theorems”, which we will encounter shortly, instead of on the more mathematical
aspects of the theory. Hopefully this will make the theory both accessible and useful;
in any case, it should give you enough ideas to get started.
where the superscript u denotes that we are considering the system state with the
control strategy u in operation. Here b and σ are functions b : [0, ∞[ × R n × U → Rn
and σ : [0, ∞[ × Rn × U → Rn×m , where U is the control set (the set of values that
the control input can take). Often we will choose U = Rq , but this is not necessary.
Definition 6.1.1. The control strategy u = {ut } is called an admissible strategy if
1. ut is an Ft -adapted stochastic process; and
where w : Rn × U → R is measurable.
The various types of cost functionals are not so dissimilar; once we figure out how to
solve one of them, we can develop the other ones without too much trouble.
Remark 6.1.4. These are the most common types of cost functionals found in ap-
plications; we have seen some examples in the Introduction, and we will encounter
more examples throughout this chapter. Other control costs have been considered as
well, however (particularly the risk-sensitive cost criteria); see, e.g., [Bor05] for an
overview of the various cost structures considered in the literature.
To motivate the development in the following sections, let us perform an illumi-
nating but heuristic calculation; in particular, we will introduce nontrivial assumptions
left and right and throw caution to the wind for the time being. What we will gain from
this is a good intuition on the structure of the problem, armed with which we can pro-
ceed to obtain some genuinely useful results in the following sections.
for t ∈ [0, T ], where we have used the Markov property of Xtu (as u is a Markov
strategy). The measurable function Jtu (x) is called the cost-to-go of the strategy u.
You can interpret Jtu (x) as the portion of the total cost of the strategy u incurred in
the time interval [t, T ], given that the control strategy in operation on the time interval
[0, t] has left us in the state Xtu = x. In particular, J0u (x) is the total cost of the
strategy u if we start our system in the non-random state X0 = x.
Remark 6.1.5. As we have defined a Markov process as a process X t that satisfies
E(f (Xt )|Fs ) = E(f (Xt )|Xs ) for bounded measurable f , the equality above may not
be entirely obvious. The expression does follow from the following fundamental fact.
Lemma 6.1.6. If Xt is an Ft -Markov process, then E(F |Fs ) = E(F |Xs ) for any
σ{Xt : t ∈ [s, ∞[}-measurable random variable F .
Proof. First, note that for t > r ≥ s and bounded measurable f, g, E(f (Xt )g(Xr )|Fs ) =
E(E(f (Xt )|Fr )g(Xr )|Fs ) = E(E(f (Xt )|Xr )g(Xr )|Fs ) = E(E(f (Xt )|Xr )g(Xr )|Xs ),
using the Markov property and the fact that E(f (Xt )|Xr )g(Xr ) is a bounded measurable func-
tion of Xr . By induction, E(f1 (Xt1 ) · · · fn (Xtn )|Fs ) = E(f1 (Xt1 ) · · · fn (Xtn )|Xs ) for any
n < ∞, bounded measurable f1 , . . . , fn , and times t1 , . . . , tn ≥ s.
Next, using the classical Stone-Weierstrass theorem, we find that any continuous func-
tion of n variables with compact support can be approximated uniformly by linear combina-
tions of functions of the form f1 (x1 ) · · · fn (xn ), where fi are continuous functions with com-
pact support. Hence using dominated convergence, we find that E(f (Xt1 , . . . , Xtn )|Fs ) =
E(f (Xt1 , . . . , Xtn )|Xs ) for any continuous f with compact support.
Finally, successive approximation establishes the claim for every σ{Xt : t ∈ [s, ∞[}-
measurable random variable F . This follows exactly as in the proof of lemma 4.6.3.
where the minimum is taken over all admissible Markov strategies u0 that coincide
with u on the interval [0, r[, and that this minimum is attained by the strategy which
coincides with the optimal strategy u∗ on the interval [r, t]. Before we establish this
claim, let us see why this is useful. Split the interval [0, T ] up into chunks [0, t 1 ],
[t1 , t2 ], . . . , [tn , T ]. Clearly VT (x) = z(x). We can now obtain Vtn (x) by computing
the minimum above with r = tn and t = T , and this immediately gives us the
optimal strategy on the interval [tn , T ]. Next, we can compute the optimal strategy on
the previous interval [tn−1 , tn ] by minimizing the above expression with r = tn−1 ,
t = tn (as we now know Vtn (x) from the previous minimization), and iterating this
procedure gives the optimal strategy u∗ on the entire interval [0, T ]. We will see
below that this idea becomes particularly powerful if we let the partition size go to
zero: the calculation of the optimal control then becomes a pointwise minimization
(i.e., separately for every time t), which is particularly straightforward to compute!
Let us now justify the dynamic programming principle. We begin by establishing
a recursion for the cost-to-go: for any admissible Markov strategy u, we have
\[
J_r^u(X_r^u) = E\left( \int_r^t w(s, X_s^u, u_s)\, ds + J_t^u(X_t^u) \,\Big|\, X_r^u \right), \qquad 0 \le r \le t \le T.
\]
This follows immediately from the definition of Jru (x), using the Markov property
and the tower property of the conditional expectation. Now choose u 0 to be a strategy
that coincides with u on the interval [0, t[, and with u∗ on the interval [t, T ]. Then
\[
V_r(X_r^{u'}) \le J_r^{u'}(X_r^{u'}) = E\left( \int_r^t w(s, X_s^{u'}, u_s')\, ds + V_t(X_t^{u'}) \,\Big|\, X_r^{u'} \right),
\]
where we have used that Vr (x) ≤ Jru (x) for any admissible Markov strategy u (by
assumption), that Xsu only depends on the strategy u in the time interval [0, s[, and
that Jsu (x) only depends on u in the interval [s, T ] (use the Markov property). On the
other hand, if we choose u0 such that it coincides with u∗ in the interval [r, T ], then
we obtain this expression with equality rather than inequality using precisely the same
reasoning. The dynamic programming recursion follows directly.
Remark 6.1.7 (Martingale dynamic programming principle). There is an equiv-
alent, but more probabilistic, point of view on the dynamic programming principle
which is worth mentioning (it will not be used in the following). Define the process
\[
M_t^u = \int_0^t w(s, X_s^u, u_s)\, ds + V_t(X_t^u)
\]
for every admissible Markov strategy u. You can easily establish (using the Markov
property) that the dynamic programming principle is equivalent to the following state-
ment: Mtu is always a submartingale, while it is a martingale for u = u∗ .
This is called the Bellman equation, and is “merely” an (extremely) nonlinear PDE.
Remark 6.1.8. To write the equation in more conventional PDE notation, note that
we can write Lsα Vs (x) + w(s, x, α) as a function H 0 of α, s, x and the first and
second derivatives of Vs (x). Hence the minimum of H 0 over α is simply some (highly
nonlinear) function H(s, x, ∂Vs (x), ∂ 2 Vs (x)), and the Bellman equation reads
\[
\frac{\partial V_s(x)}{\partial s} + H(s, x, \partial V_s(x), \partial^2 V_s(x)) = 0.
\]
We will encounter specific examples later on where this PDE can be solved explicitly.
If we can find a solution to the Bellman equation (with the terminal condition
VT (x) = z(x)) then we should be done: after all, the minimum over α (which depends
both on s and x) must coincide with the optimal Markov control $u_t^* = \alpha^*(t, X_t^{u^*})$.
Note that what we have done here is precisely the limit of the recursive procedure
described above when the partition size goes to zero: we have reduced the computation
to a pointwise optimization for every time s separately; indeed, the minimum above is
merely over the set U, not over the set of U-valued control strategies on [0, T ]. This
makes finding optimal control strategies, if not easy, at least computationally feasible.
How to proceed?
The previous discussion is only intended as motivation. We have made various en-
tirely unfounded assumptions, which you should immediately discard from this point
onward. Let us take a moment for orientation; where can one proceed from here?
One direction in which we could go is the development of the story we have just
told “for real”, replacing all our assumptions by actual mathematical arguments. The
assumption that an optimal control strategy exists and the obsession with Markov
strategies can be dropped: in fact, one can show that the dynamic programming
principle always holds (under suitable technical conditions, of course), regardless of
whether an optimal strategy exists, provided we replace all the minima by infima! In
other words, the infimum of the cost-to-go always satisfies a recursion in the form en-
countered above. Moreover, we can drop the assumption that the value function is suf-
ficiently smooth, and the Bellman equation will still hold under surprisingly general
conditions—provided that we introduce an appropriate theory of weak solutions. The
highly fine-tuned theory of viscosity solutions is designed especially for this purpose,
and provides “just the right stuff” to build the foundations of a complete mathemati-
cal theory of optimal stochastic control. This direction is highly technical, however,
while the practical payoff is not great: though there are applications of this theory, in
particular in the analysis of numerical algorithms and in the search for near-optimal
controls (which might be the only recourse if optimal controls do not exist), the main
results of this theory are much more fundamental than practical in nature.
We will take the perpendicular direction by turning the story above upside down.
Rather than starting with the optimal control problem, and showing that the Bellman
equation follows, we will start with the Bellman equation (regarded simply as a non-
linear PDE) and suppose that we have found a solution. We will then show that this
solution does indeed coincide with the value function of an optimal control problem,
and that the control strategy obtained from the minimum in the Bellman equation is in-
deed optimal. This procedure is called verification, and is extremely practical: it says
that if we can actually find a nice solution to the Bellman equation, then that solution
gives an optimal control, which is what we care about in practice. This will allow us
to solve a variety of control problems, while avoiding almost all technicalities.
Note that we previously encountered a similar tradeoff between the direct ap-
proach and verification: our discussion of the Kolmogorov backward equation is of
the verification type. See remark 5.2.7 for further discussion on this matter.
Remark 6.1.9. It should be noted that stochastic optimal control problems are much
better behaved, in general, than their deterministic counterparts. In particular, hardly
any deterministic optimal control problem admits a “nice” solution to the Bellman
equation, so that the approach of this chapter would be very restrictive in the deter-
ministic case; however, the noise in our equations actually regularizes the Bellman
equation somewhat, so that sufficiently smooth solutions are not uncommon (results
in this direction usually follow from the theory of parabolic PDEs, and need not have
much probabilistic content). Such regularity issues are beyond our scope, but see
[FR75, section VI.6] and [FS06, section IV.4] for some details and further references.
Before we move on, let us give a simple example where the optimal control does
not exist. This is very common, particularly if one is not careful in selecting a suitable
cost functional, and it is important to realize the cause of such a problem.
Example 6.1.10. Consider the one-dimensional control system dX tu = ut dt + dWt ,
where our goal is to bring Xtu as close as possible to zero by some terminal time T .
It seems reasonable, then, to use a cost functional which only has a terminal cost: for
example, consider the functional J[u] = E((XTu )2 ). Using the Itô rule, we obtain
\[
E((X_T^u)^2) = E((X_0)^2) + \int_0^T E(2 u_s X_s^u + 1)\, ds.
\]
Now consider admissible Markov strategies of the form ut = −cXtu , where c > 0 is
some gain constant. Substituting into the previous expression, we find explicitly
\[
E((X_T^u)^2) = \frac{1}{2c} - \frac{1 - 2c\, E((X_0)^2)}{2c}\, e^{-2cT}.
\]
Evidently we can make the cost J[u] arbitrarily close to zero by choosing a sufficiently
large gain c. But ut = −∞Xtu is obviously not an admissible control strategy, and
you can easily convince yourself that no admissible control strategy can achieve zero
cost (as this would require the control to instantaneously set Xtu to zero and keep it
there). Hence an optimal control does not exist in this case. Similarly, the Bellman
equation also fails to work here: we would like to write
\[
\frac{\partial V_s(x)}{\partial s} + \min_{\alpha \in \mathbb{R}} \left\{ \frac{1}{2}\, \frac{\partial^2 V_s(x)}{\partial x^2} + \alpha\, \frac{\partial V_s(x)}{\partial x} \right\} = 0, \qquad V_T(x) = x^2,
\]
but a minimum is clearly not attained (set α = −c ∂Vs (x)/∂x with c arbitrarily large).
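To see the degeneracy numerically, one can simply evaluate the explicit expression for E((X_T^u)²) above for increasing gains c; the values T = 1 and E((X_0)²) = 1 in the sketch below are illustrative.
\begin{verbatim}
import numpy as np

# E((X_T^u)^2) = 1/(2c) - (1 - 2c E((X_0)^2))/(2c) * exp(-2cT)  for  u_t = -c X_t^u
T, EX0sq = 1.0, 1.0
for c in [1.0, 10.0, 100.0, 1000.0]:
    cost = 1.0 / (2 * c) - (1 - 2 * c * EX0sq) / (2 * c) * np.exp(-2 * c * T)
    print(f"c = {c:7.1f}   E((X_T)^2) = {cost:.6f}")
# The cost tends to zero as c grows, but no admissible strategy attains zero cost.
\end{verbatim}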
The problem is that we have not included a control-dependent term in the cost
functional; the control is “free”, and so we can apply an arbitrarily large gain without
any negative consequences. In order to obtain a control problem which does have an
optimal solution, we need to attach a large cost to control strategies that take large
values. The easiest way to do this is to introduce a cost of the form
\[
J[u] = E\left[ C \int_0^T (u_s)^2\, ds + (X_T^u)^2 \right],
\]
where the constant C > 0 adjusts the tradeoff between the magnitude of the control
and the distance of the terminal state XTu from the origin. In this case, the Bellman
equation does make sense: we obtain the Hamilton-Jacobi PDE
\[
\frac{\partial V_t(x)}{\partial t} + \min_{\alpha \in U} \left\{ L_t^\alpha V_t(x) + w(t, x, \alpha) \right\} = 0, \qquad V_T(x) = z(x),
\]
and |E(V0 (X0 ))| < ∞, and choose a minimum (which we implicitly assume to exist)
is a martingale (rather than a local martingale), and suppose that the control $u_t^* =
\alpha^*(t, X_t^{u^*})$ defines an admissible Markov strategy which is in K. Then J[u∗ ] ≤ J[u]
for any u ∈ K, and $V_t(x) = J_t^{u^*}(x)$ is the value function for the control problem.
Remark 6.2.2. Note that J[u∗ ] ≤ J[u] for any u ∈ K, i.e., u is not necessarily
Markov (though the optimal strategy is always necessarily Markov if it is obtained from
a Bellman equation). On the other hand, we are restricted to admissible strategies
which are sufficiently integrable to be in K; this is inevitable without some further
hypotheses. It should be noted that such an integrability condition is often added to
the definition of an admissible control strategy, i.e., we could interpret K as the class
of ‘truly’ admissible strategies. In applications, this is rarely restrictive.
Proof. For any u ∈ K, we obtain using Itô’s rule and the martingale assumption
\[
E(V_0(X_0)) = E\left[ \int_0^T \left\{ -\frac{\partial V_s}{\partial s}(X_s^u) - L_s^{u_s} V_s(X_s^u) \right\} ds + V_T(X_T^u) \right].
\]
But using VT (x) = z(x) and the Bellman equation, we find that
\[
E(V_0(X_0)) \le E\left[ \int_0^T w(s, X_s^u, u_s)\, ds + z(X_T^u) \right] = J[u].
\]
On the other hand, if we set u = u∗ , then we obtain E(V0 (X0 )) = J[u∗ ] following exactly the
same steps. Hence J[u∗ ] ≤ J[u] for all u ∈ K. The fact that $V_t(x) = J_t^{u^*}(x)$ follows easily in
a similar manner (use Itô's rule and the martingale assumption), and the proof is complete.
\[
\frac{dz_t}{dt} = \beta u_t, \qquad x_t = x_0 + \sigma W_t,
\]
where zt is the position of the slide relative to the focus of the microscope, x t is the
position of the particle we wish to view under the microscope relative to the center of
the slide, β ∈ R is the gain in our servo loop and σ > 0 is the diffusion constant of
the particle. We would like to keep the particle in focus, i.e., we would like to keep
xt + zt as close to zero as possible. However, we have to introduce a power constraint
on the control as well, as we cannot drive the servo motor with arbitrarily large input
powers. We thus introduce the control cost (see the Introduction)
\[
J[u] = E\left[ \frac{p}{T} \int_0^T (x_t + z_t)^2\, dt + \frac{q}{T} \int_0^T (u_t)^2\, dt \right],
\]
where p, q > 0 allow us to select the tradeoff between good tracking and low feedback
power. To get rid of the pesky T −1 terms, let us define P = p/T and Q = q/T .
As the control cost only depends on xt + zt , it is more convenient to proceed
directly with this quantity. That is, define et = xt + zt , and note that
\[
de_t = \beta u_t\, dt + \sigma\, dW_t, \qquad J[u] = E\left[ P \int_0^T (e_t)^2\, dt + Q \int_0^T (u_t)^2\, dt \right].
\]
We need to solve the Bellman equation. To this end, plug the following ansatz into
the equation: Vt (x) = at x2 + bt . This gives, using VT (x) = 0,
\[
\frac{da_t}{dt} + P - \frac{\beta^2}{Q}\, a_t^2 = 0, \quad a_T = 0, \qquad \frac{db_t}{dt} + \sigma^2 a_t = 0, \quad b_T = 0.
\]
Now note that Vt (x) is smooth in x and t and that α∗ (t, x) is uniformly Lipschitz on
[0, T ]. Hence if we assume that E((x0 + z0 )2 ) < ∞ (surely a reasonable requirement
in this application!), then by theorem 5.1.3 we find that the feedback control
\[
u_t^* = \alpha^*(t, e_t) = -\sqrt{\frac{P}{Q}}\, \tanh\!\left( \beta \sqrt{\frac{P}{Q}}\, (T - t) \right) (x_t + z_t)
\]
defines an admissible Markov strategy, and is indeed an optimal control for this problem.
We now assume that we can modify our investment at any point in time. However, we
only consider self-financing investment strategies: i.e., we begin with some starting
capital X0 > 0 (to be divided between the bank account and the stock), and we
subsequently only transfer money between the bank account and the stock (without
adding in any new money from the outside). Denote by Xt our total wealth at time t,
and by ut the fraction of our wealth that is invested in stock at time t (the remaining
fraction 1 − ut being in the bank). Then the self-financing condition implies that
This can be justified as a limit of discrete time self-financing strategies; you have seen
how this works in one of the homeworks, so we will not elaborate further.
Our goal is (obviously) to make money. Let us thus fix a terminal time T , and try to
choose a strategy ut that maximizes a suitable functional U of our total wealth at time
T ; in other words, we choose the cost functional J[u] = E(−U (XTu )) (the minus sign
appears as we have chosen, as a convention, to minimize our cost functionals). How
to choose the utility function U is a bit of an art; the obvious choice U (x) = x turns
out not to admit an optimal control if we set U = R, while if we set U = [0, 1] (we do
not allow borrowing money or selling short) then we get a rather boring answer: we
should always put all our money in stock if µ > r, while if µ ≤ r we should put all
our money in the bank (verify this using proposition 6.2.1!)
Other utility functions, however, can be used to encode our risk preferences. For
example, suppose that U is nondecreasing and concave, e.g., U (x) = log(x) (the
Kelly criterion). Then the relative penalty for ending up with a low total wealth is
much heavier than for U (x) = x, so that the resulting strategy will be less risky
(concave utility functions lead to risk-averse strategies, while the utility U (x) = x
is called risk-neutral). As such, we would expect the Kelly criterion to tell us to put
some money in the bank to reduce our risk! Let us see whether this is the case. 1
The Bellman equation for the Kelly criterion reads (with U = R)
\begin{align*}
0 &= \frac{\partial V_t(x)}{\partial t} + \min_{\alpha \in \mathbb{R}} \left\{ \frac{\sigma^2 \alpha^2 x^2}{2}\, \frac{\partial^2 V_t(x)}{\partial x^2} + (\mu\alpha + r(1-\alpha))\, x\, \frac{\partial V_t(x)}{\partial x} \right\} \\
  &= \frac{\partial V_t(x)}{\partial t} + r x\, \frac{\partial V_t(x)}{\partial x} - \frac{(\mu - r)^2}{2\sigma^2}\, \frac{(\partial V_t(x)/\partial x)^2}{\partial^2 V_t(x)/\partial x^2},
\end{align*}
where the minimum is attained at
\[
\alpha^*(t, x) = -\frac{\mu - r}{\sigma^2 x}\, \frac{\partial V_t(x)/\partial x}{\partial^2 V_t(x)/\partial x^2},
\]
provided that ∂ 2 Vt (x)/∂x2 > 0 for all x > 0 (otherwise a minimum does not exist!).
Once we have solved for Vt (x), we must remember to check this assumption.
These unsightly expressions seem more hopeless than they actually are. Fill in the
ansatz Vt (x) = − log(x) + bt ; then we obtain the simple ODE
\[
\frac{db_t}{dt} - C = 0, \qquad b_T = 0, \qquad C = r + \frac{(\mu - r)^2}{2\sigma^2}.
\]
Thus evidently Vt (x) = − log(x) − C(T − t) solves the Bellman equation, and more-
over this function is smooth on x > 0 and ∂ 2 Vt (x)/∂x2 > 0 as required. Furthermore,
the corresponding control is α∗ (t, x) = (µ − r)/σ 2 , which is as regular as it gets. By
theorem 5.1.3 (and by the fact that our starting capital X0 > 0 is non-random), the
conditions of proposition 6.2.1 are met and we find that ut = (µ − r)/σ 2 is indeed
the optimal control. Evidently the Kelly criterion tells us to put money in the bank,
provided that µ − r < σ 2 . On the other hand, if µ − r is large, it is advantageous to
borrow money from the bank to invest in stock (this is possible in the current setting
as we have chosen U = R, rather than restricting to U = [0, 1]).
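A quick way to visualize this conclusion is to evaluate the expected log-wealth of constant-fraction strategies: applying Itô's rule to log X_t^u in the wealth equation gives E(log X_T^u) = log X_0 + (r + u(µ − r) − σ²u²/2)T for a constant fraction u, which is maximized precisely at the Kelly fraction (µ − r)/σ². The parameter values in the Python sketch below are illustrative.
\begin{verbatim}
import numpy as np

# Expected log-utility of terminal wealth for a constant fraction u held in stock:
# E[log X_T] = log X_0 + (r + u*(mu - r) - 0.5*sigma^2*u^2) * T   (via Ito's rule).
mu, r, sigma, T, X0 = 0.08, 0.03, 0.2, 1.0, 1.0
u = np.linspace(-1.0, 3.0, 2001)
growth = np.log(X0) + (r + u * (mu - r) - 0.5 * sigma**2 * u**2) * T
print("maximizer on the grid:", u[np.argmax(growth)])
print("Kelly fraction       :", (mu - r) / sigma**2)
\end{verbatim}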
everything goes through as usual through localization (see the remark after the proof of Itô’s rule).
Denote by K the class of admissible strategies u such that τ u < ∞ a.s. and
\[
E\left[ \sum_{i=1}^n \sum_{k=1}^m \int_0^{\tau^u} \frac{\partial V}{\partial x_i}(X_s^u)\, \sigma^{ik}(X_s^u, u_s)\, dW_s^k \right] = 0.
\]
If $u_t^* = \alpha^*(X_t^{u^*})$ defines an admissible Markov strategy in K, then J[u∗ ] ≤ J[u] for
any u ∈ K, and the optimal cost can be expressed as E(V (X0 )) = J[u∗ ].
Proof. Using a simple localization argument and the assumption on u ∈ K, Itô's rule gives
\[
E(V(X_{\tau^u}^u)) = E(V(X_0)) + E\left[ \int_0^{\tau^u} L^{u_s} V(X_s^u)\, ds \right].
\]
Example 6.3.2 (Tracking under a microscope II). We consider again the problem
of tracking a particle under a microscope, but with a slightly different premise. Most
microscopes have a field of view whose shape is a disc of some radius r around the
focal point of the microscope. In other words, we will see the particle if it is within a
distance r of the focus of the microscope, but we will have no idea where the particle
is if it is outside the field of view. Given that we begin with the particle inside the
field of view, our goal should thus be to keep the particle in the field of view as long
as possible by moving around the slide; once we lose the particle, we might as well
give up. On the other hand, as before, we do not allow arbitrary controls: we have to
impose some sort of power constraint to keep the feedback signal sane.
Let us study the following cost. Set S = {x : |x| < r}, let τ u = inf{t : eut 6∈ S}
(recall that et = xt + zt is the position of the particle relative to the focus), and define
\[
J[u] = E\left[ p \int_0^{\tau^u} (u_s)^2\, ds - q\, \tau^u \right] = E\left[ \int_0^{\tau^u} \{ p\, (u_s)^2 - q \}\, ds \right],
\]
where p > 0 and q > 0 are constants. We assume that e0 ∈ S a.s. A control strategy
that minimizes J[u] then attempts to make τ u large (i.e., the time until we lose the
particle is large), while keeping the total feedback power relatively small; the tradeoff
between these conflicting goals can be selected by playing around with p and q.
To find the optimal strategy, we try to solve the Bellman equation as usual:
\begin{align*}
0 &= \min_{\alpha \in \mathbb{R}} \left\{ \frac{\sigma^2}{2}\, \frac{\partial^2 V(x)}{\partial x^2} + \beta\alpha\, \frac{\partial V(x)}{\partial x} + p\,\alpha^2 - q \right\} \\
  &= \frac{\sigma^2}{2}\, \frac{\partial^2 V(x)}{\partial x^2} - \frac{\beta^2}{4p} \left( \frac{\partial V(x)}{\partial x} \right)^2 - q,
\end{align*}
with the boundary conditions V (r) = V (−r) = 0, and a minimum is attained at
\[
\alpha^*(x) = -\frac{\beta}{2p}\, \frac{\partial V(x)}{\partial x}.
\]
But we can now solve the Bellman equation explicitly: it evidently reduces to a one-
dimensional ODE for ∂V (x)/∂x. Some work gives the solution
√ √
2pσ 2 rβ q xβ q
V (x) = log cos √ − log cos √ ,
β2 σ2 p σ2 p
while the minimum is attained at
√
xβ q
q
r
α∗ (x) = − tan √ ,
p σ2 p
√ √
provided that rβ q/σ 2 p is sufficiently small; in fact, we clearly need to require
√ √
2rβ q < πσ 2 p, as only in this case are V (x) and α∗ (x) in C 2 on [−r, r]. Ap-
parently this magic inequality, which balances the various parameters in our control
problem, determines whether an optimal control exists; you would have probably had
a difficult time guessing this fact without performing the calculation!
It remains to verify the technical conditions of proposition 6.3.1, i.e., that the
control strategy $u_t^* = \alpha^*(e_t)$ satisfies $\tau^{u^*} < \infty$ a.s. and the condition on the stochastic
integral (clearly u∗t is admissible, as α∗ (x) is Lipschitz continuous on [−r, r]). The
finiteness of $E(\tau^{u^*})$ follows from lemma 6.3.3 below, while the stochastic integral
condition follows from lemma 6.3.4 below. Hence u∗t is indeed an optimal strategy.
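For the sceptical reader, it is easy to check numerically that the expression for V above does solve the Bellman equation; the following Python sketch evaluates the finite-difference residual of the ODE on a grid, using the parameter values quoted in figure 6.1 (which satisfy the inequality 2rβ√q < πσ²√p).
\begin{verbatim}
import numpy as np

# Residual of  (sigma^2/2) V'' - (beta^2/(4p)) (V')^2 - q = 0  for the closed-form V.
r, beta, p, q, sigma = 0.5, 0.7, 1.0, 1.0, 0.5       # parameters quoted in figure 6.1
k = beta * np.sqrt(q) / (sigma**2 * np.sqrt(p))
V = lambda x: (2 * p * sigma**2 / beta**2) * (np.log(np.cos(k * r)) - np.log(np.cos(k * x)))
x = np.linspace(-0.9 * r, 0.9 * r, 501)
h = 1e-4
V1 = (V(x + h) - V(x - h)) / (2 * h)                  # central difference for V'
V2 = (V(x + h) - 2 * V(x) + V(x - h)) / h**2          # central difference for V''
residual = 0.5 * sigma**2 * V2 - beta**2 / (4 * p) * V1**2 - q
print(np.max(np.abs(residual)))                       # should be close to zero
\end{verbatim}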
The technical conditions of proposition 6.3.1 are not entirely trivial to check; the
following two lemmas are often helpful in this regard, and can save a lot of effort.
Lemma 6.3.3. Let Xt be the solution of the SDE dXt = b(Xt ) dt + σ(Xt ) dWt ,
where b and σ are assumed to be Lipschitz as usual, and suppose that X 0 ∈ S a.s. for
some bounded domain S ⊂ Rn . If σ satisfies the nondegeneracy condition on S
\[
\sum_{i,j=1}^n \sum_{k=1}^m v^i\, \sigma^{ik}(x)\, \sigma^{jk}(x)\, v^j \ge \gamma\, \|v\|^2 \qquad \forall\, v \in \mathbb{R}^n,\ x \in S,
\]
for some constant γ > 0, then E(τS ) < ∞, where $\tau_S = \inf\{t : X_t \notin S\}$.
Here k, β and n are suitable constants which we will presently choose. As S is bounded, we
can choose β ∈ R such that 0 < c1 < |x1 + β| < c2 < ∞ for all x ∈ S. Next, note that as b
is continuous on Rn it must be bounded on S; in particular, |b1 (x)| < b0 for some b0 ∈ R and
all x ∈ S. Hence we can estimate, using the nondegeneracy condition,
L W (x) < {2nb0 c2 − n(2n − 1)γ/2}(x1 + β)2n−2 ∀ x ∈ S.
Clearly we can choose n sufficiently large so that the prefactor is bounded from above by −c3
for some c3 > 0; then we obtain $L W(x) < -c_3\, c_1^{2n-2} < 0$ for all x ∈ S. Finally, we can
choose k sufficiently large so that W (x) is nonnegative.
It remains to show that the existence of W implies E(τS ) < ∞. To this end, write
\[
W(X_{t\wedge\tau_S}) = W(X_0) + \int_0^{t\wedge\tau_S} L W(X_r)\, dr + \text{martingale},
\]
where the stochastic integral is a martingale (rather than a local martingale) as the integrand is
bounded on S. Taking the expectation and using L W (x) ≤ −c4 (c4 > 0) for x ∈ S, we find
E(W (Xt∧τS )) ≤ E(W (X0 )) − c4 E(t ∧ τS ).
But W is bounded on S, so we have established that E(t ∧ τS ) ≤ K for some K < ∞ and for
all t. Letting t → ∞ and using monotone convergence establishes the result.
Lemma 6.3.4. Let τ be a stopping time such that E(τ ) < ∞, and let ut be an adapted
process that satisfies |ut | ≤ K for all t ≤ τ and some K < ∞. Then $E[\int_0^\tau u_s\, dW_s] = 0$.
Proof. Define the stochastic process
\[
M_t = \int_0^{t\wedge\tau} u_s\, dW_s.
\]
Then, for n ≥ m, the Itô isometry gives
\[
E((M_n - M_m)^2) = E\left[ \int_{m\wedge\tau}^{n\wedge\tau} (u_s)^2\, ds \right] \le K^2\, E(n\wedge\tau - m\wedge\tau),
\]
which converges to zero as m, n → ∞ by dominated convergence (use that E(τ ) < ∞). Hence
Mn is a Cauchy sequence in L2 (P), and thus converges in L2 (P). We are done.
Here λ > 0 is the discounting factor. Such a cost often makes sense in economic
applications, where discounting is a natural thing to do (inflation will make one dollar
at time t be worth much less than one dollar at time zero). Now if w is bounded, or if
Xsu does not grow too fast, then this cost is guaranteed to be finite and we can attempt
to find optimal controls as usual. Alternatively, we can average over time by setting
\[
J[u] = \limsup_{T\to\infty} \frac{1}{T}\, E\left[ \int_0^T w(X_s^u, u_s)\, ds \right],
\]
which might make more sense in applications which ought to perform well uniformly
in time. Once again, if w does not grow too fast, this cost will be bounded.
Remark 6.4.1. It should be emphasized that these cost functionals, as well as those
discussed in the previous sections, certainly do not exhaust the possibilities! There are
many variations on this theme, and with your current intuition you should not have too
much trouble obtaining related verification theorems. For example, try to work out a
verification theorem for a discounted version of the indefinite time interval problem.
Let us now develop appropriate verification theorems for the costs J λ [u] and J[u].
Proposition 6.4.2 (Discounted case). Assume that w(x, α) is either bounded from
below or from above. Suppose there is a V (x) in C 2 such that |E(V (X0 ))| < ∞ and
Denote by K the admissible strategies u such that $e^{-\lambda t}\, E(V(X_t^u)) \to 0$ as $t \to \infty$ and
\[
\sum_{i=1}^n \sum_{k=1}^m \int_0^t e^{-\lambda s}\, \frac{\partial V}{\partial x_i}(X_s^u)\, \sigma^{ik}(X_s^u, u_s)\, dW_s^k
\]
is a martingale (rather than a local martingale), and suppose that the control $u_t^* =
\alpha^*(X_t^{u^*})$ defines an admissible Markov strategy which is in K. Then Jλ [u∗ ] ≤ Jλ [u]
for any u ∈ K, and the optimal cost can be written as E(V (X0 )) = Jλ [u∗ ].
Proof. Applying Itô’s rule to V (Xtu ) e−λt and using the assumptions on u ∈ K,
\[
E(V(X_0)) - e^{-\lambda t}\, E(V(X_t^u)) = E\left[ \int_0^t e^{-\lambda s} \left\{ -L^{u_s} V(X_s^u) + \lambda\, V(X_s^u) \right\} ds \right].
\]
We may assume without loss of generality that w is either nonnegative or nonpositive; otherwise
this is easily arranged by shifting the cost. Letting t → ∞ using monotone convergence,
\[
E(V(X_0)) \le E\left[ \int_0^\infty e^{-\lambda s}\, w(X_s^u, u_s)\, ds \right] = J_\lambda[u].
\]
But we obtain equality if we use u = u∗ , so we are done.
The time-average problem has a new ingredient: the function V (x) no longer
determines the optimal cost (note that on the infinite time horizon, the optimal cost is
independent of X0 ; on the other hand, the control must depend on x!). We need to
introduce another free parameter for the Bellman equation to admit a solution.
Proposition 6.4.3 (Time-average case). Suppose that V (x) in C 2 and η ∈ R satisfy
\[
\min_{\alpha \in U} \left\{ L^\alpha V(x) + w(x, \alpha) - \eta \right\} = 0,
\]
is a martingale (rather than a local martingale), and suppose that the control $u_t^* =
\alpha^*(X_t^{u^*})$ defines an admissible Markov strategy which is in K. Then J[u∗ ] ≤ J[u] for
any u ∈ K, and the optimal cost is given by η = J[u∗ ].
Proof. Applying Itô’s rule to V (Xtu ) and using the assumptions on u ∈ K, we obtain
\[
\frac{E(V(X_0) - V(X_T^u))}{T} + \eta = \frac{1}{T}\, E\left[ \int_0^T \left\{ \eta - L^{u_s} V(X_s^u) \right\} ds \right] \le \frac{1}{T}\, E\left[ \int_0^T w(X_s^u, u_s)\, ds \right],
\]
where we have already used the Bellman equation. Taking the limit gives
\[
\eta \le \limsup_{T\to\infty} \frac{1}{T}\, E\left[ \int_0^T w(X_s^u, u_s)\, ds \right] = J[u].
\]
But we obtain equality if we use u = u∗ , so we are done.
Let us begin by investigating the discounted cost. The Bellman equation becomes
\begin{align*}
0 &= \min_{\alpha \in \mathbb{R}} \left\{ \frac{\sigma^2}{2}\, \frac{\partial^2 V(x)}{\partial x^2} + \beta\alpha\, \frac{\partial V(x)}{\partial x} - \lambda\, V(x) + p x^2 + q\alpha^2 \right\} \\
  &= \frac{\sigma^2}{2}\, \frac{\partial^2 V(x)}{\partial x^2} - \frac{\beta^2}{4q} \left( \frac{\partial V(x)}{\partial x} \right)^2 - \lambda\, V(x) + p x^2,
\end{align*}
To solve the Bellman equation, substitute the ansatz V (x) = ax2 + b. We obtain
\[
b = \frac{\sigma^2 a}{\lambda}, \qquad p - \lambda a - \frac{\beta^2 a^2}{q} = 0 \;\Longrightarrow\; a = -\frac{q\lambda \pm \sqrt{q^2\lambda^2 + 4pq\beta^2}}{2\beta^2}.
\]
There are multiple solutions! Now what? The key is that every solution to the Bell-
man equation yields a candidate control α∗ (x), but only one of these will satisfy the
technical conditions in the verification. Let us check this. The candidate strategies are
\[
\alpha_1^*(x) = \frac{\lambda + \sqrt{\lambda^2 + 4p\beta^2/q}}{2\beta}\, x, \qquad \alpha_2^*(x) = \frac{\lambda - \sqrt{\lambda^2 + 4p\beta^2/q}}{2\beta}\, x.
\]
Note that α∗1 (x) = c1 x with βc1 > λ, while α∗2 (x) = −c2 x with βc2 > 0 (assuming
p > 0; the case p = 0 is trivial, as then the optimal control is clearly ut = 0). But
\[
de_t = \beta c\, e_t\, dt + \sigma\, dW_t \;\Longrightarrow\; \frac{d}{dt} E(V(e_t)) = 2\beta c\, E(V(e_t)) - 2\beta b c + a\sigma^2.
\]
Hence provided that E((e0 )2 ) < ∞, the quantity E(V (et )) grows exponentially at a
rate faster than λ for the control α∗1 , whereas E(V (et )) is bounded for the control α∗2 .
Hence α∗2 is the only remaining candidate control. It remains to check the martingale
condition, but this follows immediately from theorem 5.1.3. Hence we conclude that
u∗t = α∗2 (et ) is an optimal control for the discounted problem.
then $C(U) = \min_{U' \le U} K(U')$. Hence it suffices to compute the function K(U ).
How does one solve such a problem? The trick is to use the constant q in our
previous cost functional as a Lagrange multiplier (we can set p = 1), i.e., we consider
\[
J_{q,U}[u] = \limsup_{T\to\infty}\, E\left[ \frac{1}{T} \int_0^T (e_t^u)^2\, dt + \frac{q}{T} \int_0^T (u_t)^2\, dt - qU \right].
\]
Then we have minu Jq,U [u] ≤ K(U ) for all q > 0 (why?). Hence if we can find a
q > 0 such that this inequality becomes an equality, then we have determined K(U ).
Let us work out the details. We already established above that
\[
\min_u J_{q,U}[u] = \frac{\sigma^2\sqrt{q}}{\beta} - qU, \qquad \operatorname*{argmin}_u J_{q,U}[u] = u^* \ \text{with}\ u_t^* = -\frac{e_t}{\sqrt{q}}.
\]
In particular, we can calculate explicitly (how?)
\[
\limsup_{T\to\infty} \frac{1}{T}\, E\left[ \int_0^T (e_t^{u^*})^2\, dt \right] = q\, \limsup_{T\to\infty} \frac{1}{T}\, E\left[ \int_0^T (u_t^*)^2\, dt \right] = \frac{\sigma^2\sqrt{q}}{2\beta}.
\]
where T < ∞ is the terminal time, P (t) and Q(t) are time-dependent (but non-
random) n × n and k × k matrices that determine the state and control running cost,
respectively, and R is a fixed (non-random) n × n matrix which determines the termi-
nal cost. Let us make the following additional assumptions.
1. E(kX0 k2 ) < ∞;
2. A(t), B(t), C(t), P (t), Q(t) are continuous on t ∈ [0, T ];
3. P (t), Q(t) and R are symmetric matrices (they can always be symmetrized);
4. P (t) and R are positive semidefinite for all t ∈ [0, T ];
5. Q(t) is positive definite on t ∈ [0, T ].
Our goal is to find a control strategy that minimizes J[u].
Theorem 6.5.1 (Linear regulator, finite time). Denote by {F (t)}t∈[0,T ] the unique
solution, with terminal condition F (T ) = R, of the matrix Riccati equation
\[
\frac{d}{dt} F(t) + A(t)^* F(t) + F(t) A(t) - F(t) B(t) Q(t)^{-1} B(t)^* F(t) + P(t) = 0.
\]
Then $u_t^* = -Q(t)^{-1} B(t)^* F(t)\, X_t^{u^*}$ is an optimal control for the cost J[u].
Proof. We need to solve the Bellman equation. In the current setting, this is
\[
0 = \frac{\partial V_t(x)}{\partial t} + \min_{\alpha \in \mathbb{R}^k} \left\{ (A(t)x + B(t)\alpha)^* \nabla V_t(x) + \frac{1}{2} \nabla^* C(t) C(t)^* \nabla V_t(x) + x^* P(t) x + \alpha^* Q(t) \alpha \right\},
\]
where we set VT (x) = x∗ Rx. As Q(t) is positive definite, the minimum is attained at
\[
\alpha^*(t, x) = -\frac{1}{2}\, Q(t)^{-1} B(t)^* \nabla V_t(x),
\]
so the Bellman equation can be written as
\[
0 = \frac{\partial V_t(x)}{\partial t} + \frac{1}{2} \nabla^* C(t) C(t)^* \nabla V_t(x) + x^* A(t)^* \nabla V_t(x) - \frac{1}{4} \big\| Q(t)^{-1/2} B(t)^* \nabla V_t(x) \big\|^2 + x^* P(t) x.
\]
Let us try a value function of the form Vt (x) = x∗ F (t)x+g(t), where F (t) is a time-dependent
n × n symmetric matrix and g(t) is a scalar function. Straightforward computation gives
\[
\frac{d}{dt} F(t) + A(t)^* F(t) + F(t) A(t) - F(t) B(t) Q(t)^{-1} B(t)^* F(t) + P(t) = 0,
\]
\[
\frac{d}{dt} g(t) + \operatorname{Tr}[C(t)^* F(t) C(t)] = 0,
\]
with the terminal conditions F (T ) = R and g(T ) = 0, and the associated candidate policy
becomes $\alpha^*(t, x) = -Q(t)^{-1} B(t)^* F(t)\, x$.
By well known properties of the matrix Riccati equation, see [Won68a, theorem 2.1], and our
assumptions on the various matrices that appear in the problem, the equation for F (t) has
a unique C 1 solution on [0, T ]. Hence the coefficients of the controlled equation for $X_t^{u^*}$,
with $u_t^* = \alpha^*(t, X_t^{u^*})$, are uniformly Lipschitz continuous, and thus by proposition 6.2.1 and
theorem 5.1.3 all the requirements for verification are satisfied. Thus we are done.
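The proof suggests an obvious numerical recipe: integrate the matrix Riccati equation backwards from F(T) = R and form the time-varying gain Q(t)⁻¹B(t)*F(t). The Python sketch below does this for a time-homogeneous two-dimensional system; the matrices are illustrative and not taken from the text.
\begin{verbatim}
import numpy as np

# Backward Euler for  dF/dt + A'F + FA - F B Q^{-1} B' F + P = 0,  F(T) = R,
# and the associated feedback gain K(t) = Q^{-1} B' F(t), so u*_t = -K(t) x.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
P = np.eye(2)
Q = np.array([[1.0]])
R = np.eye(2)
T, N = 5.0, 5000
dt = T / N
Qinv = np.linalg.inv(Q)
F = R.copy()
for n in range(N):                       # step backwards from t = T to t = 0
    F = F + dt * (A.T @ F + F @ A - F @ B @ Qinv @ B.T @ F + P)
K0 = Qinv @ B.T @ F                      # gain at time t = 0
print(F)
print(K0)
\end{verbatim}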
We can also investigate the linear regulator on the infinite time horizon. Let us
investigate the time-average cost (the discounted problem can also be solved, but this
is less common in applications). To this end, we consider the time-homogeneous case,
Let us try a function of the form V (x) = x∗ F x, where F is an n × n symmetric matrix. Then
\[
A^* F + F A - F B Q^{-1} B^* F + P = 0, \qquad \eta = \operatorname{Tr}[C^* F C],
\]
and the associated candidate policy becomes α∗ (x) = −Q−1 B ∗ F x. We now invoke the
properties of the algebraic Riccati equation, see [Won68a, theorem 4.1]. By the stabilizability
assumption, there is at least one positive semidefinite solution F , such that A − BQ−1 B ∗ F
is a stable matrix. Using the latter and the controlled equation for $X_t^{u^*}$ with $u_t^* = \alpha^*(X_t^{u^*})$,
you can verify by explicit computation that $E(V(X_t^{u^*}))$ is bounded in time. Thus the asymp-
totic condition for verification is satisfied, while the martingale condition is clearly satisfied by
theorem 5.1.3. Thus we find that u∗t is indeed an optimal control.
for x ∈ Sδ0 = Sδ \{−r, r} (the interior of Sδ ). This particular choice for the dis-
cretization of the differential operators is not arbitrary: we will shortly see that the
careful choice of discretization results in a particularly sensible approximation.
Let us call the approximate value function Vδ (x). Then
\[
0 = \min_{\alpha \in U_\delta} \left\{ \frac{\sigma^2}{2}\, \frac{V_\delta(x+\delta) - 2 V_\delta(x) + V_\delta(x-\delta)}{\delta^2} + \beta\alpha\, \frac{V_\delta(x+\delta) - V_\delta(x-\delta)}{2\delta} + p\alpha^2 - q \right\}, \qquad x \in S_\delta^0,
\]
which becomes after a little rearranging
\[
V_\delta(x) = \min_{\alpha \in U_\delta} \left\{ \frac{V_\delta(x+\delta) + V_\delta(x-\delta)}{2} + \frac{\beta\alpha\delta}{2\sigma^2} \big( V_\delta(x+\delta) - V_\delta(x-\delta) \big) + \frac{p\alpha^2\delta^2}{\sigma^2} - \frac{q\delta^2}{\sigma^2} \right\}, \qquad x \in S_\delta^0,
\]
where for x 6∈ Sδ0 we obviously choose the boundary conditions Vδ (r) = Vδ (−r) = 0.
Let us now define the (2N − 1) × (2N − 1)-dimensional matrix P α with entries
\[
P_{i,i+1}^\alpha = \frac{1}{2} + \frac{\beta\alpha\delta}{2\sigma^2}, \qquad P_{i,i-1}^\alpha = \frac{1}{2} - \frac{\beta\alpha\delta}{2\sigma^2}, \qquad P_{i,j}^\alpha = 0 \ \text{for}\ j \ne i+1, i-1.
\]
Provided that we choose our approximate control set $U_\delta \subset [-\sigma^2/\beta\delta,\, \sigma^2/\beta\delta]$, we see
that the entries of P α are nonnegative and $\sum_j P_{i,j}^\alpha = 1$ for i 6= 1, 2N − 1. Evidently,
P α is the transition probability matrix for a discrete time Markov chain xα n with val-
ues in Sδ and with absorbing boundaries. But there is more: as we show next, our
approximation to the Bellman equation is itself the dynamic programming equation
for an optimal control problem for the Markov chain xα n ! Hence our finite-difference
approximation is much more than an approximation to a PDE: it approximates our
entire control problem by a new (discretized) optimal control problem.
Proposition 6.6.2. Denote by $x_n^u$ the controlled Markov chain on Sδ with
\[
P(x_n^u = (k\pm 1)\delta \,|\, x_{n-1}^u = k\delta) = \frac{1}{2} \pm \frac{\beta\delta}{2\sigma^2}\, \alpha(n, k\delta), \qquad k = -N+1, \ldots, N-1,
\]
\[
P(x_n^u = \pm r \,|\, x_{n-1}^u = \pm r) = 1, \qquad P(x_0 \in S_\delta^0) = 1,
\]
let $\sigma^u = \inf\{n : x_n^u \notin S_\delta^0\}$, and define the cost $K[u] = E[\sum_{n=1}^{\sigma^u} (p\,(u_n)^2 - q)\,\delta^2/\sigma^2]$.
Denote by Vδ (x) the solution to the equation above, by α∗ (x) the associated minimum,
and $u_n^* = \alpha^*(x_{n-1}^{u^*})$. If $E(\sigma^{u^*}) < \infty$, then K[u∗ ] ≤ K[u] for any Markov control u
with values in Uδ such that E(σ u ) < ∞. Moreover, we can write K[u∗ ] = E(Vδ (x0 )).
\[
E(V_\delta(x_{n-1}^u) - V_\delta(x_n^u) \,|\, x_{n-1}^u = x) \le (p\, (\alpha(n,x))^2 - q)\, \frac{\delta^2}{\sigma^2}.
\]
Multiplying both sides by $I_{x \in S_\delta^0}$, setting $x = x_{n-1}^u$ and taking the expectation, we find that
\[
E\left( (V_\delta(x_{n-1}^u) - V_\delta(x_n^u))\, I_{x_{n-1}^u \in S_\delta^0} \right) \le E\left( I_{x_{n-1}^u \in S_\delta^0}\, (p\, (u_n)^2 - q)\, \frac{\delta^2}{\sigma^2} \right).
\]
Summing over n up to some T ∈ N, we find that
\[
E\left( \sum_{n=1}^T (V_\delta(x_{n-1}^u) - V_\delta(x_n^u))\, I_{x_{n-1}^u \in S_\delta^0} \right) \le E\left[ \sum_{n=1}^T I_{x_{n-1}^u \in S_\delta^0}\, (p\, (u_n)^2 - q)\, \frac{\delta^2}{\sigma^2} \right],
\]
Now that we have discretized our control problem, how do we solve the discrete
problem? There are various ways of doing this, many of which are detailed in the
books [Kus71, KD01]. One of the simplest is the Jacobi method, which works as
follows. Start with an arbitrary choice for Vδ0 (x). Now define, for any n ∈ N,
\[
V_\delta^n(x) = \min_{\alpha \in U_\delta} \left\{ \frac{V_\delta^{n-1}(x+\delta) + V_\delta^{n-1}(x-\delta)}{2} + \frac{\beta\alpha\delta}{2\sigma^2} \big( V_\delta^{n-1}(x+\delta) - V_\delta^{n-1}(x-\delta) \big) + \frac{p\alpha^2\delta^2}{\sigma^2} - \frac{q\delta^2}{\sigma^2} \right\}, \qquad x \in S_\delta^0,
\]
where we impose the boundary conditions $V_\delta^{n-1}(\pm r) = 0$ for every n. The minimum
in this iteration is easily seen to be attained at (setting $U_\delta = [-\sigma^2/\beta\delta,\, \sigma^2/\beta\delta]$)
\[
\alpha_{\delta,n}^*(x) = \left( -\frac{\beta}{4p\delta} \big( V_\delta^{n-1}(x+\delta) - V_\delta^{n-1}(x-\delta) \big) \vee -\frac{\sigma^2}{\beta\delta} \right) \wedge \frac{\sigma^2}{\beta\delta}.
\]
It is not difficult to prove that the iteration for Vδn (x) will converge to some function
Vδ (x) as n → ∞, see [Kus71, theorem 4.4], and this limit is indeed the value function
for our approximate optimal control problem, while the limit of α∗δ,n (x) as n → ∞
is the optimal control for our approximate optimal control problem.

Figure 6.1. Numerical solution of example 6.3.2 with r = .5, β = .7, p = q = 1, and σ = .5.
From left to right, the interval was discretized into 6, 11, 21 and 51 points. The top plots show
Vδ (x) (dashed line) and the analytic solution for V (x) (solid line). The bottom plots show the
discrete optimal strategy α∗δ (x) (dashed line) and the analytic solution for α∗ (x) (solid line).
The dotted horizontal lines are the upper and lower bounds on the discrete control set Uδ .

Other methods often converge faster (e.g., the Gauss-Seidel method [Kus71, theorem 4.6]) and are
not much more difficult to implement, but the Jacobi method will do for our purposes.
The result of implementing this procedure on a computer is shown in figure 6.1, to-
gether with the analytical solution obtained in example 6.3.2, for a particular choice of
parameters. For all but the coarsest discretization, both the discretized value function
and the optimal control are quite close to their analytic solutions; in fact, it appears
that not too fine a grid already gives excellent performance!
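For concreteness, here is a compact Python sketch of the Jacobi iteration above for the tracking problem, using the parameter values quoted in figure 6.1; the grid size and the number of iterations below are illustrative, and one should iterate until the values stop changing.
\begin{verbatim}
import numpy as np

# Jacobi iteration for the discretized Bellman equation of example 6.3.2.
r, beta, p, q, sigma = 0.5, 0.7, 1.0, 1.0, 0.5   # parameters quoted in figure 6.1
N = 25                                            # 2N + 1 grid points on [-r, r]
delta = r / N
amax = sigma**2 / (beta * delta)                  # U_delta = [-amax, amax]
V = np.zeros(2 * N + 1)                           # V_delta^0; boundary values stay 0
for _ in range(20000):
    Vp, Vm = V[2:], V[:-2]                        # V(x + delta), V(x - delta), interior x
    alpha = np.clip(-beta * (Vp - Vm) / (4 * p * delta), -amax, amax)
    Vnew = V.copy()
    Vnew[1:-1] = (0.5 * (Vp + Vm)
                  + beta * alpha * delta / (2 * sigma**2) * (Vp - Vm)
                  + (p * alpha**2 - q) * delta**2 / sigma**2)
    V = Vnew
print(V[N])          # approximate value function at x = 0
print(alpha[N - 1])  # approximate optimal control at x = 0 (about zero by symmetry)
\end{verbatim}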
Remark 6.6.3. We will not give a proof of convergence here, but you can imagine
why some form of Markov chain approximation would be a good thing to do. The
convergence proof for this procedure does not rely at all on showing that the solution
Vδ (x) converges to the solution V (x) of the continuous Bellman equation. Instead,
one proceeds by showing that the Markov chain xun converges as δ → 0, in a suitable
sense, to the controlled diffusion Xtu . One can then show that the optimal control
policy for xun also converges, in a suitable sense, to an optimal control policy for the
diffusion Xtu , without invoking directly the continuous time verification theorems.
The fact that all the objects in these approximations are probabilistic—and that every
discretized problem is itself an optimal control problem—is thus a key (and quite
nontrivial) idea. For detailed references on this topic, see section 6.7.
Remark 6.6.4. The finite-difference method is only a tool to obtain a suitable Markov
chain approximation for the original problem. The fact that this approximation has its
origins in a finite-difference method is not used in the convergence proofs; in fact, any
Markov chain that satisfies a set of “local consistency” conditions suffices, though
some approximations will converge faster (as δ → 0) than others. Even the finite-
difference scheme is not unique: there are many such schemes that give rise to Markov
chain approximations (though there are also many which do not!). In particular, one
can obtain an approximation without the constraint on Uδ by using one-sided differ-
ences for the derivatives, provided their sign is chosen correctly; on the other hand,
the central difference approximation that we have used is known to converge faster in
most cases (see [KD01, chapter 5] for details).
To demonstrate the method further, let us discuss another very simple example.
Example 6.6.5 (Inverted pendulum). Consider a simple pendulum in one dimen-
sion, which experiences random forcing and is heavily damped. We use the model
\[
d\theta_t^u = \big( c_1 \sin(\theta_t^u) - c_2 \cos(\theta_t^u)\, u_t \big)\, dt + \sigma\, dW_t,
\]
where θt is the angle relative to the up position (θ = 0), c1 , c2 , σ > 0 are constants,
and we have allowed for a control input ut of the “pendulum on a cart” type (the
control is ineffective when the pendulum is horizontal). Starting in the down posi-
tion (θ = π), we would like to flip the pendulum to the up position as quickly as
possible—with an angular precision ε > 0, say—while minimizing the total control
power necessary to achieve this task. We thus introduce the stopping time and cost
\[
\tau^u = \inf\{ t : \theta_t^u \le \varepsilon \ \text{or}\ \theta_t^u \ge 2\pi - \varepsilon \}, \qquad J[u] = E\left[ \int_0^{\tau^u} \{ p\, (u_s)^2 + q \}\, ds \right],
\]
where p, q > 0 are constants that determine the tradeoff between minimizing the
inversion time and minimizing the necessary power. The Bellman equation becomes
\[
0 = \min_{\alpha \in U} \left\{ \frac{\sigma^2}{2}\, \frac{\partial^2 V(x)}{\partial x^2} + (c_1 \sin(x) - c_2 \cos(x)\, \alpha)\, \frac{\partial V(x)}{\partial x} + p\alpha^2 + q \right\}
\]
for x ∈ ]ε, 2π − ε[, with the boundary conditions V (ε) = V (2π − ε) = 0.
We proceed to approximate the Bellman equation using the same finite-difference
approximation used in the previous example. This gives, after some manipulation,
\[
V_\delta(x) = \min_{\alpha \in U_\delta} \left\{ \frac{V_\delta(x+\delta) + V_\delta(x-\delta)}{2} + \frac{p\alpha^2\delta^2}{\sigma^2} + \frac{q\delta^2}{\sigma^2} + \frac{\delta}{2\sigma^2}\, (c_1 \sin(x) - c_2 \cos(x)\, \alpha) \big( V_\delta(x+\delta) - V_\delta(x-\delta) \big) \right\}, \qquad x \in S_\delta^0,
\]
where we have set δ = (π − ε)/N , Sδ = {π + k(π − ε)/N : k = −N, . . . , N }, Sδ0 =
Sδ \{ε, 2π −ε}, and we impose the boundary conditions Vδ (ε) = Vδ (2π −ε) = 0. We
now need to choose the control interval Uδ so that the coefficients in the approximate
Bellman equation are transition probabilities, i.e., we need to make sure that
\[
\frac{\delta}{2\sigma^2}\, \big| c_1 \sin(x) - c_2 \cos(x)\, \alpha \big| \le \frac{1}{2} \qquad \forall\, \alpha \in U_\delta.
\]
For example, we can set Uδ = [−G, G] with G = σ 2 /c2 δ − c1 /c2 , provided that we
require δ to be sufficiently small that the constant G is positive.
Figure 6.2. Numerical solution of example 6.6.5 with ε = .25, c1 = c2 = .5, p = q = 1, and
σ = .5. The interval was discretized into 201 points (N = 100). The left plot shows the value
function V (x); the right plot shows the optimal control α∗ (x).
Next, we define the Jacobi iterations, starting from any Vδ0 (x), by
\[
V_\delta^n(x) = \min_{\alpha \in U_\delta} \left\{ \frac{V_\delta^{n-1}(x+\delta) + V_\delta^{n-1}(x-\delta)}{2} + \frac{p\alpha^2\delta^2}{\sigma^2} + \frac{q\delta^2}{\sigma^2} + \frac{\delta}{2\sigma^2}\, (c_1 \sin(x) - c_2 \cos(x)\, \alpha) \big( V_\delta^{n-1}(x+\delta) - V_\delta^{n-1}(x-\delta) \big) \right\},
\]
Little remains but to implement the method, the result of which is shown in figure 6.2.
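The corresponding Python sketch is a small modification of the previous one; it uses the parameter values quoted in figure 6.2, and again the grid size and iteration count are only illustrative.
\begin{verbatim}
import numpy as np

# Jacobi iteration for the discretized Bellman equation of the pendulum problem.
eps, c1, c2, p, q, sigma = 0.25, 0.5, 0.5, 1.0, 1.0, 0.5  # parameters from figure 6.2
N = 100
delta = (np.pi - eps) / N
x = np.pi + delta * np.arange(-N, N + 1)          # grid on [eps, 2*pi - eps]
G = sigma**2 / (c2 * delta) - c1 / c2             # U_delta = [-G, G]
V = np.zeros(2 * N + 1)                           # boundary values stay 0
for _ in range(50000):
    Vp, Vm = V[2:], V[:-2]
    xi = x[1:-1]
    alpha = np.clip(c2 * np.cos(xi) * (Vp - Vm) / (4 * p * delta), -G, G)
    drift = c1 * np.sin(xi) - c2 * np.cos(xi) * alpha
    Vnew = V.copy()
    Vnew[1:-1] = (0.5 * (Vp + Vm)
                  + delta / (2 * sigma**2) * drift * (Vp - Vm)
                  + (p * alpha**2 + q) * delta**2 / sigma**2)
    V = Vnew
print(V[N])   # approximate value function at theta = pi (the down position)
\end{verbatim}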
Remark 6.6.6. We have only discussed approximation of the indefinite time control
problems in one dimension. The method extends readily to multiple dimensions, pro-
vided (as always) that sufficient care is taken to choose appropriate finite differences,
and that sufficient computational power is available. This type of method is also ex-
tremely flexible in that it extends to a wide variety of control problems, and is certainly
not restricted to the indefinite time problem. However, the latter has the nice property
that it is naturally restricted to a bounded domain. For other costs this need not be the
case, so that one has to take care to truncate the state space appropriately. Of course,
any grid-based numerical method will suffer from the same problem.
[FR75], Fleming and Soner [FS06], Yong and Zhou [YZ99] or Krylov [Kry80]. A
recent review article with many further references is Borkar [Bor05]. A nice recent
book with a strong emphasis on verification theorems is Øksendal and Sulem [ØS05].
Finally, Hanson [Han07] gives a non-mathematical introduction to the subject with an
emphasis on applications and computational methods.
Readers familiar with optimal control in the deterministic setting would likely be
quick to point out that dynamic programming is not the only way to go; in fact, meth-
ods based on Pontryagin’s maximum principle are often preferable in the deterministic
setting. Such methods also exist in the stochastic case; see Yong and Zhou [YZ99] for
an extensive discussion and for further references. To date, the dynamic programming
approach has been more successful in the stochastic case than the maximum principle
approach, if only for technical reasons. The maximum principle requires the solution
of an SDE with a terminal condition rather than an initial condition, but whose solu-
tion is nonetheless adapted—a feat that our SDE theory certainly cannot accomplish!
On the other hand, the dynamic programming approach requires little more than the
basic tools of the trade, at least in the simplest setting (as we have seen).
The martingale dynamic programming principle gives a rather attractive proba-
bilistic spin to the dynamic programming method; it is also useful in cases where
there is insufficient regularity (the value function is not “nice enough”) for the usual
approach to work. A lucid discussion of the martingale approach can be found in the
book by Elliott [Ell82]; an overview is given by Davis in [Dav79].
Lemma 6.3.3, which guarantees that the exit time from a bounded set has finite
expectation (under a nondegeneracy condition), is taken from [Has80, section III.7].
Robin [Rob83] gives a nice overview of stochastic optimal control problems with
time-average cost. The book by Davis [Dav77] gives an excellent introduction to
linear stochastic control theory (i.e., the linear regulator and its relatives).
Markov chain approximations in stochastic control are developed extensively in
the books by Kushner [Kus77] and by Kushner and Dupuis [KD01]. A nice overview
can be found in Kushner [Kus90]. In this context, it is important to understand the
optimal control of discrete time, discrete state space Markov chains; this is treated in
detail in Kushner [Kus71] and in Kumar and Varaiya [KV86]. Kushner and Dupuis
[KD01] and Kushner [Kus71] detail various algorithms for the solution of discrete
stochastic optimal control problems, including the simple but effective Jacobi and Gauss-Seidel methods. The convergence proofs for the Markov chain approximation itself rely heavily on the theory of weak convergence; see the classic book by Billingsley [Bil99], Ethier and Kurtz [EK86], and yet another book by Kushner [Kus84].
A different approach to numerical methods for stochastic optimal control prob-
lems in continuous time is direct approximation of the Bellman PDE. Once a suitable
numerical method has been obtained, one can then attempt to prove that its solution
converges in some sense to a solution (in the viscosity sense) of the continuous Bell-
man equation. See, for example, the last chapter of Fleming and Soner [FS06].
One of the most intriguing aspects of optimal stochastic control theory is that it
can sometimes be applied to obtain results in other, seemingly unrelated, areas of
mathematics. Some selected applications can be found in Borell [Bor00] (geometric
analysis), Sheu [She91] (heat kernel estimates), Dupuis and Ellis [DE97] and Boué
and Dupuis [BD98] (large deviations), Fleming and Soner [FS06] (singular perturba-
tion methods), and in Fleming and Mitter [FM83] and Mitter and Newton [MN03]
(nonlinear filtering). Dupuis and Oliensis [DO94] discuss an interesting application
to three-dimensional surface reconstruction from a two-dimensional image.
To date, the most important areas of application for optimal stochastic control
are mathematical finance and engineering. An excellent reference for financial appli-
cations is the well-known book by Karatzas and Shreve [KS98]. In engineering, the
most important part of the theory remains (due to the fact that it is tractable) stochastic
control of linear systems. However, this theory goes far beyond the linear regulator;
for one example of this, see the recent article by Petersen [Pet06].
CHAPTER
7
Filtering Theory
Filtering theory is concerned with the following problem. Suppose we have some
signal process—a stochastic process Xt —which we cannot observe directly. Instead,
we are given an observation process Yt which is correlated with Xt ; we will restrict
ourselves to the important special case of “signal plus white noise” type observations
dYt = h(Xt ) dt + σ dWt , where Wt is a Wiener process. Given that by time t
we can only observe {Ys : s ≤ t}, it becomes necessary to estimate Xt from the
observations Ys≤t . For any function f , we have already seen that the best estimate, in
the mean square sense, of f (Xt ) given Ys≤t , is given by the conditional expectation
πt (f ) ≡ E(f (Xt )|FtY ), where FtY = σ{Ys : s ≤ t} (see proposition 2.3.3).
The goal of the filtering problem is to find an explicit expression for π t (f ) in terms
of Ys≤t ; in particular, we will seek to express πt (f ) as the solution of a stochastic dif-
ferential equation driven by Yt . This is interesting in itself: it leads to algorithms that
allow us to optimally estimate a signal in white noise, which is important in many ap-
plications. In addition, we will see that filtering also forms an integral part of stochas-
tic optimal control in the case where the feedback signal is only allowed to depend on
noisy observations (which is the case in many applications), rather than assuming that
we can precisely observe the state of the system (which we have done throughout the
previous chapter). Before we can tackle any of these problems, however, we need to
take a closer look at some of the properties of the conditional expectation.
7.1 The Bayes formula

You have most likely already encountered conditional expectations in an elementary setting. The most familiar approach is for continuous random variables with probability densities; as we have not
yet discussed conditional expectations in this setting, let us take a moment to show
how this idea relates to the general definition of the conditional expectation.
Example 7.1.1 (Conditioning with densities). Consider two random variables X, Y
on some probability space (Ω, F, P), such that X and Y both take values in the inter-
val [0, 1]. We will assume that X and Y possess a joint density p(x, y); by this we
mean that for any bounded measurable function f : [0, 1] × [0, 1] → R, we can write
E(f(X, Y)) = ∫_0^1 ∫_0^1 f(x, y) p(x, y) dx dy.
In your undergraduate course, you likely learned that the “conditional expectation” of
f (X, Y ), given Y = y (y ∈ [0, 1]), is given by
E(f(X, Y)|Y = y) = ∫_0^1 f(x, y) p(x, y) dx / ∫_0^1 p(x, y) dx.
This is often justified by analogy with the discrete case: if X, Y take discrete values
x1 , . . . , xm and y1 , . . . , yn , rather than continuous values x, y ∈ [0, 1], then (why?)
E(f(X, Y)|Y = y_j) = Σ_{i=1}^m f(x_i, y_j) p_{ij} / Σ_{i=1}^m p_{ij},    p_{ij} = P(X = x_i and Y = y_j).
On the other hand, it is not at all clear that the quantity E(f (X, Y )|Y = y) is even
meaningful in the continuous case—after all, P(Y = y) = 0 for any y ∈ [0, 1]! (This
is necessarily true as, by our assumption that p(x, y) exists, the law of Y is absolutely
continuous with respect to the uniform measure on [0, 1].)
To make mathematical sense of this construction, consider the more meaningful
expression (which is similarly the natural analog of the discrete case)
E(f(X, Y)|Y) = ∫_0^1 f(x, Y) p(x, Y) dx / ∫_0^1 p(x, Y) dx ≡ M_f(Y).
We claim that the random variable Mf (Y ), defined in this way, does indeed satisfy
the Kolmogorov definition of the conditional expectation. To verify this, it suffices to
show that E(E(f (X, Y )|Y ) u(Y )) = E(f (X, Y ) u(Y )) for any bounded measurable
u (after all, the indicator function IA is of this form for any A ∈ σ{Y }). But
E(M_f(Y) u(Y)) = ∫_0^1 ∫_0^1 M_f(y) u(y) p(x, y) dx dy
= ∫_0^1 [ ∫_0^1 f(x, y) p(x, y) dx / ∫_0^1 p(x, y) dx ] u(y) [ ∫_0^1 p(x, y) dx ] dy = E(f(X, Y) u(Y)),
as required.
A trivial but particularly interesting case of this construction occurs for the density
p(x, y) = p(x)q(y), i.e., when X and Y are independent, X has density p(x) and Y
has density q(y). In this case, we find that the conditional expectation is given by
E(f(X, Y)|Y) = ∫_0^1 f(x, Y) p(x) dx.
Evidently, when two random variables are independent, conditioning on one of the
random variables simply corresponds to averaging over the other. You should con-
vince yourself that intuitively, this makes perfect sense! In fact, we have already used
this idea in disguise: have another look at the proof of lemma 3.1.9.
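As a quick sanity check of the density formula, the following Python sketch compares it against a crude Monte Carlo estimate. The joint density p(x, y) = (6/5)(x + y²), the test function, and all variable names are arbitrary illustrative choices, not taken from the text.

import numpy as np

rng = np.random.default_rng(0)
p = lambda x, y: 6.0 / 5.0 * (x + y**2)       # a joint density on [0,1]^2 (integrates to 1)
f = lambda x, y: np.sin(x) * y

# conditional expectation E(f(X,Y)|Y = y0) via the density formula (Riemann sums;
# the grid spacing cancels in the ratio)
y0 = 0.7
xs = np.linspace(0.0, 1.0, 2001)
cond = (f(xs, y0) * p(xs, y0)).mean() / p(xs, y0).mean()

# crude Monte Carlo check: rejection-sample (X, Y), then average f over samples with Y near y0
xc, yc, u = rng.random((3, 400000))
acc = u * 2.4 < p(xc, yc)                     # 2.4 bounds the density on [0,1]^2
X, Y = xc[acc], yc[acc]
near = np.abs(Y - y0) < 0.01
print(cond, f(X[near], Y[near]).mean())       # the two numbers should be close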
The conclusion of the previous example provides a good excuse for introductory
courses not to introduce measure theory as the cornerstone of probability. However,
the fundamental idea that underlies this example becomes much more powerful (and
conceptually clear!) when interpreted in a measure-theoretic framework. Let us thus
repeat the previous example, but from a measure-theoretic point of view.
Example 7.1.2 (Conditioning with densities II). Consider the space Ω = [0, 1] ×
[0, 1], endowed with its Borel σ-algebra F = B([0, 1]) × B([0, 1]) and some probabil-
ity measure P. Denote by Y : Ω → [0, 1] the canonical random variable Y (x, y) = y,
and let Z be any integrable random variable on Ω. Besides P, we also introduce the
product measure Q = µ0 × µ0 , where µ0 is the uniform measure on [0, 1].
Now suppose that the measure P is absolutely continuous with respect to Q. Then
E_P(Z) = E_Q(Z dP/dQ) = ∫_0^1 ∫_0^1 Z(x, y) (dP/dQ)(x, y) dx dy,
where we have expressed the uniform measure in the usual calculus notation. Clearly
dP/dQ is the density p(x, y) of the previous example, and (by the Radon-Nikodym
theorem) the existence of p(x, y) is precisely the requirement that P ≪ Q.
We now have two probability measures P and Q. Ultimately, we are interested in
computing the conditional expectation EP (Z|Y ). It is not immediately obvious how
to do this! On the other hand, under the measure Q, the computation of the conditional
expectation EQ (Z|Y ) is particularly simple. Let us consider this problem first. We
claim that for any integrable random variable Z (i.e., EQ (|Z|) < ∞), we can write
E_Q(Z|Y)(x, y) = ∫_{[0,1]} Z(x, y) µ_0(dx) = ∫_0^1 Z(x, y) dx.
To be precise, we should first verify that this random variable is in fact measurable—
this is indeed the case by Fubini’s theorem. Let us now check Kolmogorov’s definition
of the conditional expectation. First, note that σ{Y } = {Y −1 (A) : A ∈ B([0, 1])} =
{[0, 1] × A : A ∈ B([0, 1])}. Hence for any S ∈ σ{Y }, the indicator function
IS (x, y) = IS (y) is only a function of y. Therefore, we find that for any S ∈ σ{Y },
E_Q(I_S E_Q(Z|Y)) = ∫_Ω I_S(y) { ∫_{[0,1]} Z(x, y) µ_0(dx) } Q(dx, dy)
= ∫_{[0,1]} I_S(y) { ∫_{[0,1]} Z(x, y) µ_0(dx) } µ_0(dy) = E_Q(I_S Z),
where we have used Fubini’s theorem to write the repeated integral as a product inte-
gral. But this is precisely the Kolmogorov definition—so the claim is established.
Apparently the computation of EQ (Z|Y ) is more or less trivial. If only we were
interested in the measure Q! Unfortunately, in real life we are interested in the mea-
sure P, which could be much more complicated than Q (and is most likely not a
product measure). Now, however, we have a cunning idea. As P ≪ Q, we know
that one can express expectations under P as expectations under Q by inserting the
Radon-Nikodym derivative: EP (Z) = EQ (Z dP/dQ). Perhaps we can do the same
with conditional expectations? In other words, we can try to express conditional ex-
pectations under P in terms of conditional expectations under Q, which, one would
think, should come down to dropping in Radon-Nikodym derivatives in the appropri-
ate places. If we can make this work, then we can enjoy all the benefits of Q: in
particular, the simple formula for EQ (Z|Y ) would apply, which is precisely the idea.
The question is thus: how do conditional expectations transform under a change
of measure? Let us briefly interrupt our example to develop the relevant result.
Lemma 7.1.3 (Bayes formula). Let (Ω, F, P) be a probability space, and P ≪ Q for some probability measure Q. Then for any σ-algebra G ⊂ F and for any integrable random variable X (i.e., we require E_P(|X|) < ∞), the Bayes formula holds:
E_P(X|G) = E_Q(X dP/dQ | G) / E_Q(dP/dQ | G)    P-a.s.
The rest is essentially an exercise in using the elementary properties of the conditional
expectation: you should verify that you understand all the steps! Let S ∈ G be arbitrary, and
note that EQ (IS EQ (X dP/dQ | G)) = EQ (IS X dP/dQ) = EP (IS X) = EP (IS EP (X|G)) =
EQ (IS dP/dQ EP (X|G)) = EQ (IS EQ (dP/dQ | G) EP (X|G)). But this holds for any S ∈ G,
so we find that EQ (X dP/dQ | G) = EQ (dP/dQ | G) EP (X|G) Q-a.s. (why?).
We would like to divide both sides by EQ (dP/dQ | G), so we must verify that this quantity
is nonzero (it is clearly nonnegative). Define the set S = {ω : EQ (dP/dQ | G)(ω) = 0},
and note that S ∈ G as the conditional expectation is G-measurable by definition. Then 0 =
EQ (IS EQ (dP/dQ | G)) = EQ (IS dP/dQ) = P(S). Hence we can go ahead with our division
on the set S c , which has unit probability under P. The result follows directly.
We can now complete our example. Using the Bayes formula, we find that P-a.s.
E_P(Z|Y)(x, y) = E_Q(Z dP/dQ | Y)(x, y) / E_Q(dP/dQ | Y)(x, y)
= ∫_{[0,1]} Z(x, y) (dP/dQ)(x, y) µ_0(dx) / ∫_{[0,1]} (dP/dQ)(x, y) µ_0(dx),
where we have substituted in our simple expression for EQ (Z|Y ). But this is pre-
cisely the density expression for the conditional expectation! When viewed in this
light, there is nothing particularly fundamental about the textbook example 7.1.1: it
is simply a particular example of the behavior of the conditional expectation under
an absolutely continuous change of measure. The new measure µ_0 × µ_0, called the
reference measure, is chosen for convenience; under the latter, the computation of the
conditional expectations reduces to straightforward integration.
The generalization of example 7.1.1 to the measure-theoretic setting will pay off
handsomely in the solution of the filtering problem. What is the benefit of abstraction?
In general, we wish to calculate EP (X|G), where G need not be generated by a simple
random variable—for example, in the filtering problem G = σ{Ys : s ≤ t} is gener-
ated by an entire continuous path. On the space of continuous paths, the concept of a
“density” in the sense of example 7.1.1 does not make sense; there is no such thing as
the uniform measure (or even a Lebesgue measure) on the space of continuous paths!
However, in our more abstract setup, we are free to choose any reference measure we
wish; the important insight is that what really simplified example 7.1.1 was not the
representation of the densities with respect to the uniform measure per se, but that
under the uniform measure X and Y were independent (which allowed us to reduce
conditioning under the reference measure to simple integration). In the general case,
we can still seek a reference measure under which X and G are independent—we even
already have the perfect tool for this purpose, the Girsanov theorem, in our toolbox
waiting to be used! The abstract theory then allows us to proceed just like in example
7.1.1, even though we are no longer operating within its (restrictive) setting.
Example 7.1.5. Before developing a more general theory, let us demonstrate these
ideas in the simplest filtering example: the estimation of a constant in white noise.
We work on the space (Ω, F, {Ft }, P), on which are defined a (one-dimensional)
Ft -Wiener process Wt and an F0 -measurable random variable X0 (which is thus by
definition independent of Wt ). We consider a situation in which we cannot observe
X_0 directly: we only have access to noisy observations of the form y_t = X_0 + κ ξ_t,
where ξt is white noise. As usual, we will work in practice with the integrated form
of the observations to obtain a sensible mathematical model (see the Introduction and
section 3.3 for discussion); i.e., we set Yt = X0 t + κWt . The goal of the filtering
problem is to compute πt (f ) ≡ EP (f (X0 )|FtY ), where FtY = σ{Ys : s ≤ t} is the
observation filtration, for a sufficiently large class of functions f ; then π t (f ) is the
optimal (least mean square) estimate of f (X0 ), given the observations up to time t.
To tackle this problem, we will essentially repeat example 7.1.1 in this setting. As
we are conditioning on entire observation paths, we do not have a uniform measure
available to us; nonetheless, we can find a reference measure Q under which X 0 and
Yt are independent, at least on finite time intervals! This is just Girsanov’s theorem.
Lemma 7.1.6. Suppose Λ_T^{−1} below satisfies E_P(Λ_T^{−1}) = 1, and define Q_T ≪ P by
dQ_T/dP = exp( −κ^{−1} X_0 W_T − (1/2) κ^{−2} (X_0)² T ) ≡ Λ_T^{−1}.
Then under Q_T the random variable X_0 has the same law as under P, and the process {κ^{−1} Y_t}_{t∈[0,T]} is an F_t-Wiener process independent of X_0. Moreover, P ≪ Q_T with
dP/dQ_T = exp( κ^{−2} X_0 Y_T − (1/2) κ^{−2} (X_0)² T ) = Λ_T.
Proof. By Girsanov’s theorem (see also remark 4.5.4), {κ−1 Yt }t∈[0,T ] is an Ft -Wiener process
under QT . But X0 is F0 -measurable, so {κ−1 Yt }t∈[0,T ] must be independent of X0 under
Q_T. To show that X_0 has the same law under Q_T, note that Λ_t^{−1} is a martingale under P (as it is a nonnegative local martingale, and thus a supermartingale, with constant expectation); hence E_{Q_T}(f(X_0)) = E_P(f(X_0) Λ_T^{−1}) = E_P(f(X_0) E_P(Λ_T^{−1}|F_0)) = E_P(f(X_0)) for every bounded measurable f, which establishes the claim. Finally, to show that P ≪ Q_T, note that
ΛT is (trivially) the corresponding Radon-Nikodym derivative. We are done.
The Bayes formula now allows us to compute π_t(f) explicitly: for every bounded measurable f, P-a.s.
σ_t(f) ≡ E_{Q_t}(f(X_0) Λ_t | F_t^Y) = ∫ f(x) exp( κ^{−2} x Y_t − (1/2) κ^{−2} x² t ) µ_{X_0}(dx),
where µ_{X_0} is the law of the random variable X_0, and π_t(f) = σ_t(f)/σ_t(1).
Proof. It suffices to check that EQt (IA EQt (f (X0 )Λt |FtY )) = EQt (IA f (X0 )Λt ) for every
A ∈ FtY . But this follows by an argument identical to the one employed in lemma 3.1.9.
Example 7.1.10 (Gaussian case). For Gaussian X0 with mean µ and variance σ 2 ,
σ_t(f) = (1/(σ√(2π))) ∫_{−∞}^{∞} f(x) e^{κ^{−2} x Y_t − κ^{−2} x² t/2} e^{−(x−µ)²/2σ²} dx.
This expression can be evaluated explicitly for f (x) = x and f (x) = x 2 , for example.
The calculation is a little tedious, but gives the following answer:
E_P(X_0|F_t^Y) = (κ² µ + σ² Y_t)/(κ² + σ² t),    E_P((X_0)²|F_t^Y) − (E_P(X_0|F_t^Y))² = κ² σ²/(κ² + σ² t).
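These explicit formulas are easily checked by simulation. The following Python sketch (with arbitrarily chosen parameters; all variable names are my own) generates an observation path Y_t = X_0 t + κ W_t and evaluates the conditional mean and variance of example 7.1.10 along the path.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, kappa, T, n = 0.0, 1.0, 0.5, 5.0, 5000
dt = T / n
t = dt * np.arange(1, n + 1)

X0 = rng.normal(mu, sigma)                            # the unobserved random constant
dY = X0 * dt + kappa * np.sqrt(dt) * rng.standard_normal(n)
Y = np.cumsum(dY)                                     # observation path Y_t = X0*t + kappa*W_t

mean = (kappa**2 * mu + sigma**2 * Y) / (kappa**2 + sigma**2 * t)   # E(X0 | F_t^Y)
var = kappa**2 * sigma**2 / (kappa**2 + sigma**2 * t)               # conditional variance
print(X0, mean[-1], var[-1])    # the estimate approaches X0 and the variance decays like 1/t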
Remark 7.1.11. Evidently, in the current setting (regardless of the law of X 0 ), the
optimal estimate πt (f ) depends on the observation history only through the random
variable Yt and in an explicitly computable fashion. This is an artefact, however, of
this particularly simple model; in most cases the optimal estimate has a complicated
dependence on the observation history, so that working directly with the Bayes for-
mula, as we have done here, is not as fruitful (in general the Bayes formula does not
lend itself to explicit computation). Instead, we will use the Bayes formula to obtain
a stochastic differential equation for πt (f ), which can subsequently be implemented
recursively using, e.g., an Euler-Maruyama type method (at least in theory).
7.2 Nonlinear filtering for stochastic differential equations

We now turn to the general filtering problem. The signal process is modeled by the stochastic differential equation
dX_t = b(t, X_t, u_t) dt + σ(t, X_t, u_t) dW_t,
i.e., the signal which we would like to observe is the solution of a (possibly con-
trolled) n-dimensional stochastic differential equation driven by the m-dimensional
Ft -Wiener process Wt . However, we do not have direct access to this signal; instead,
we can only see the measurements taken by a noisy sensor, whose output is given by
Y_t = ∫_0^t h(s, X_s, u_s) ds + ∫_0^t K(s) dB_s,
which corresponds formally to a sensor output of the form h(t, X_t, u_t) + K(t) ξ_t, where ξ_t is white noise; as usual, we work with the integrated form to obtain a sensible mathematical model (see the Introduction and section 3.3 for further discussion). In
the equations above b : [0, ∞[ × Rn × U → Rn , σ : [0, ∞[ × Rn × U → Rn×m ,
h : [0, ∞[ × Rn × U → Rp , and K : [0, ∞[ → Rp×p are measurable maps, U ⊂ Rq ,
and ut is presumed to be adapted to the observation filtration FtY = σ{Ys : s ≤ t}.
Remark 7.2.1. This is a rather general model; one often does not need this full-blown
scenario! On the other hand, we are anticipating applications in control, so we have
already included a control input. Note, however, that the control may only depend
(causally) on the observation process; we can no longer use the state of the system X t
to determine ut ! This rules out the control strategies developed in the previous chapter,
and we must reconsider our control problems in this more complicated setting.
The goal of the filtering problem is to compute, on the basis of the observations
{Ys : s ≤ t}, the (least mean square) optimal estimates πt (f ) ≡ EP (f (Xt )|FtY ), at
least for a sufficiently large class of functions f . To keep matters as simple as possible
we will be content to operate under rather stringent technical conditions:
1. The equation for (Xt , Yt ) has a unique Ft -adapted solution;
2. K(t) is invertible for all t and K(t), K(t)−1 are locally bounded;
3. b, σ, h are bounded functions.
The last condition is particularly restrictive, and can be weakened significantly (see
section 7.7 for detailed references). This tends to become a rather technical exercise.
By restricting ourselves to bounded coefficients, we will be able to concentrate on the
essential ideas without being bogged down by a large number of technicalities.
Unfortunately, one of the most important examples of the theory, the Kalman-
Bucy filter, does not satisfy condition 3. We will circumvent the technicalities by
treating this case separately, using a different approach, in the next section.
Moreover, P ≪ Q_T with
dP/dQ_T = exp( ∫_0^T (K(t)^{−1} h(t, X_t, u_t))^* dȲ_t − (1/2) ∫_0^T ||K(t)^{−1} h(t, X_t, u_t)||² dt ).
as one would naively think, by virtue of the fact that QT and P are mutually absolutely
continuous; this is easily verified when Ft is simple and bounded, and can be extended
to any integrable Ft through the usual process of taking limits and localization. Al-
ternatively, one can set up a more general integration theory that is not restricted to
Wiener process integrands, so that these integrals can be constructed under both mea-
sures. As this is an introductory course, we will not dwell on these technical issues;
we will be content to accept the fact that stochastic integrals are well-behaved under
absolutely continuous changes of measure (see, e.g., [Pro04] or [RY99]).
Proof. By conditions 2 and 3, Novikov's condition is satisfied. We can thus apply Girsanov's theorem to the (m+p)-dimensional process (W_t, Ȳ_t), which satisfies d(W_t, Ȳ_t) = H_t dt + d(W_t, B_t) with H_t = (0, K(t)^{−1} h(t, X_t, u_t)). We find that under Q_T, the process {(W_t, Ȳ_t)}_{t∈[0,T]} is an F_t-Wiener process; in particular, W_t and Ȳ_t are independent Wiener processes and both are independent of X_0 (as X_0 is F_0-measurable, and these are F_t-Wiener processes).
Now note that P ≪ Q_T follows immediately from the fact that dQ_T/dP is strictly positive,
where dP/dQT = (dQT /dP)−1 (which is precisely the expression in the statement of the
lemma). After all, EQT (Z (dQT /dP)−1 ) = EP (Z dQT /dP (dQT /dP)−1 ) = EP (Z).
which is a martingale under QT for t ≤ T (Novikov). The Bayes formula now gives:
Corollary 7.2.4. If EP (|f (Xt )|) < ∞, the filtered estimate πt (f ) is given by
π_t(f) = E_P(f(X_t)|F_t^Y) = E_{Q_t}(f(X_t) Λ_t | F_t^Y) / E_{Q_t}(Λ_t | F_t^Y) = σ_t(f)/σ_t(1),
where we have defined the unnormalized estimate σt (f ) = EQt (f (Xt )Λt |FtY ).
This expression is called the Kallianpur-Striebel formula. In fact, Kallianpur and
Striebel went a little further: they actually expressed the unnormalized conditional
expectation σt (f ) as an integral over a part of the probability space, just like we did
in the previous section, thus making the analogy complete (see, e.g., [LS01a, section
7.9], for the relevant argument). However, we will find it is just as easy to work directly
with the conditional expectations, so we will not bother to make this extra step.
Remark 7.2.5. Note that Girsanov’s theorem implies Qt (A) = QT (A) for any A ∈
Ft ; in particular, EQt (f (Xt )Λt |FtY ) = EQT (f (Xt )Λt |FtY ) = EQT (f (Xt )ΛT |FtY )
(why?). We will occasionally use this, e.g., in the proof of proposition 7.2.6 below.
What progress have we made? Quite a lot, as a matter of fact, though it is not
immediately visible. What we have gained by representing πt (f ) in this way is that
the filtering problem is now expressed in terms of a particularly convenient measure.
To proceed, we can turn the crank on our standard machinery: the Itô rule et al.
Proposition 7.2.6 (Zakai equation). For sufficiently smooth and bounded f, the unnormalized estimate σ_t(f) satisfies
σ_t(f) = E_P(f(X_0)) + ∫_0^t σ_s(L_s^u f) ds + ∫_0^t σ_s((K(s)^{−1} h(s, ·, u_s)) f)^* dȲ_s.
Proof. By our boundedness assumptions, all the integrands are in L²(µ_t × Q_t). Hence we can compute
E_{Q_t}(f(X_t) Λ_t | F_t^Y) = E_{Q_t}(f(X_0) | F_t^Y) + ∫_0^t E_{Q_t}(Λ_s L_s^u f(X_s) | F_s^Y) ds
    + ∫_0^t E_{Q_t}(Λ_s K(s)^{−1} h(s, X_s, u_s) f(X_s) | F_s^Y)^* dȲ_s,
where we have used lemma 7.2.7 below. It remains to note that X0 and FtY are independent
under Qt , so EQt (f (X0 )|FtY ) = EQt (f (X0 )) = EP (f (X0 )).
Proof. Choose any A ∈ FtW , and note that by the Itô representation theorem
I_A = P(A) + ∫_0^t H_s dW_s
for some FtW -adapted process H· ∈ L2 (µt × P). Let us now apply the polarization identity
2 E(It (H)It(F )) = E((It (H) + It (F ))2 ) − E(It (H)2 ) − E(It (F )2 ) and the Itô isometry:
E( I_A ∫_0^t F_s dW_s ) = E( ∫_0^t F_s H_s ds ) = E( ∫_0^t E(F_s|F_s^W) H_s ds ),
where we have used Fubini's theorem and the tower property of the conditional expectation in
the last step. But by applying the same steps with Fs replaced by E(Fs |FsW ), we find that
E( I_A ∫_0^t F_s dW_s ) = E( I_A ∫_0^t E(F_s|F_s^W) dW_s )    for all A ∈ F_t^W.
Hence the first statement follows by the Kolmogorov definition of the conditional expectation.
The second statement follows in the same way. To establish the last statement, note that
E( I_A ∫_0^t F_s ds ) = E( I_A ∫_0^t E(F_s|F_t^W) ds ) = E( I_A ∫_0^t E(F_s|F_s^W) ds )
for A ∈ FtW , where the first equality follows by using Fubini’s theorem and the tower property
of the conditional expectation, and the second equality follows as F_s, being F_s-measurable, is independent of F_{t,s}^W = σ{W_r − W_s : s ≤ r ≤ t}, and F_t^W = σ{F_s^W, F_{t,s}^W}.
just popped up while applying Itô’s rule; B̄t is called the innovations process. It has
an important property that can be extremely useful in control applications.
Proposition 7.2.9. Under P, the innovation B̄t is an FtY -Wiener process, so we have
π_t(f) = π_0(f) + ∫_0^t π_s(L_s^u f) ds + ∫_0^t {K(s)^{−1}(π_s(h_s^u f) − π_s(f) π_s(h_s^u))}^* dB̄_s.
We now condition on FsY . We claim that the stochastic integral vanishes; indeed, it is an
Ft -martingale, so vanishes when conditioned on Fs , and FsY ⊂ Fs establishes the claim.
Moreover, note that E_P(e^{iα^* B̄_r} E_P(h(r, X_r, u_r)|F_r^Y)|F_s^Y) = E_P(e^{iα^* B̄_r} h(r, X_r, u_r)|F_s^Y), as e^{iα^* B̄_r} is F_r^Y-measurable and F_s^Y ⊂ F_r^Y. Hence as in the proof of lemma 7.2.7, we find
E_P(e^{iα^* B̄_t}|F_s^Y) = e^{iα^* B̄_s} − (||α||²/2) ∫_s^t E_P(e^{iα^* B̄_r}|F_s^Y) dr.
But this equation has the unique solution E_P(e^{iα^* B̄_t}|F_s^Y) = e^{iα^* B̄_s − ||α||²(t−s)/2}, so B̄_t has independent Gaussian increments with respect to {F_t^Y} and is thus an F_t^Y-Wiener process.
One might hope that the filtering equation closes after finitely many steps, so that π_t(f) can be expressed in terms of a finite number of statistics satisfying a closed system of equations; the filtering problem would then reduce to an SDE which can be computed, e.g., using the Euler-Maruyama method. Unfortunately, it turns out that this is almost never the case—in most cases no finite-dimensional realization of the filtering equation exists. There is one extremely important exception to this rule: when b, h are linear, σ is constant and X_0 is a Gaus-
sian random variable, we obtain the finite-dimensional Kalman-Bucy filter which is
very widely used in applications; this is the topic of the next section. However, a
systematic search for other finite-dimensional filters has unearthed few examples of
practical relevance (see, e.g., [HW81, Mit82, Par91, HC99]).1
Of course, it would be rather naive to expect that the conditional expectations
EP (f (Xt )|FtY ) can be computed in a finite-dimensional fashion. After all, in most
cases, even the unconditional expectation EP (f (Xt )) can not be computed by solving
a finite-dimensional equation! Indeed, the Itô rule gives
(d/dt) E_P(f(X_t)) = E_P(L_t^u f(X_t)),
which depends on E_P(L_t^u f(X_t)); and the equation for E_P(L_t^u f(X_t)) will depend on E_P(L_t^u L_t^u f(X_t)) (if b, σ, f are sufficiently smooth), etc., so that we will almost certainly not obtain a closed set of equations for any collection of functions f_1, ..., f_n.
(Convince yourself that the case where b, f are linear and σ is constant is an excep-
tion!) To actually compute EP (f (Xt )) (in the absence of control, for example), we
have two options: either we proceed in Monte Carlo fashion by averaging a large num-
ber of simulated (random) sample paths of Xt , or we solve one of the PDEs associated
with the SDE for Xt : the Kolmogorov forward or backward equations. The latter are
clearly infinite-dimensional, while in the former case we would need to average an
infinite number of random samples to obtain an exact answer for E P (f (Xt )).
We are faced with a similar choice in the filtering problem. If we are not in the
Kalman-Bucy setting, or one which is sufficiently close that we are willing to lin-
earize our filtering model (the latter gives rise to the so-called extended Kalman filter
[Par91]), we will have to find some numerically tractable approximation. One popular
approach is of the Monte Carlo type; the so-called particle filtering methods, roughly
speaking, propagate a collection of random samples in such a way that the probability
of observing these “particles” in a certain set A is an approximation of π t (IA ) (or of a
related object). Particle methods are quite effective and are often used, e.g., in track-
ing, navigation, and robotics applications. Unfortunately the details of such methods
are beyond our scope, but see [Del04, CL97] for discussion and further references.
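To give a flavor of such methods, here is a minimal bootstrap particle filter in Python for a scalar illustration (signal dX_t = −X_t dt + dW_t, observations dY_t = X_t dt + κ dB_t). The model, the weight expression, and all names are illustrative assumptions rather than an algorithm taken from the references above.

import numpy as np

rng = np.random.default_rng(2)
kappa, dt, n, M = 0.2, 0.01, 2000, 500            # observation noise, time step, steps, particles

# simulate a signal path and the corresponding observation increments
X, dY = np.zeros(n), np.zeros(n)
x = 1.0
for k in range(n):
    x += -x * dt + np.sqrt(dt) * rng.standard_normal()
    dY[k] = x * dt + kappa * np.sqrt(dt) * rng.standard_normal()
    X[k] = x

# bootstrap filter: propagate particles, reweight by the observation likelihood, resample
particles = rng.standard_normal(M)
est = np.zeros(n)
for k in range(n):
    particles = particles - particles * dt + np.sqrt(dt) * rng.standard_normal(M)
    logw = (particles * dY[k] - 0.5 * particles**2 * dt) / kappa**2   # Girsanov-type weight
    w = np.exp(logw - logw.max())
    w /= w.sum()
    est[k] = np.dot(w, particles)                  # approximation of pi_t(x) = E(X_t | F_t^Y)
    particles = particles[rng.choice(M, size=M, p=w)]                 # resample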
Another approach is through PDEs. For simplicity, let us consider (on a formal
level) the filtering counterpart of the Kolmogorov forward equation. We will assume
that the filtering problem possesses a conditional density, i.e., that there is a random
density pt (x), which is only a functional of the observations FtY , such that
π_t(f) = E_P(f(X_t)|F_t^Y) = ∫ f(x) p_t(x) dx.
1 An important class of finite-dimensional nonlinear filters with applications, e.g., in speech recogni-
tion, are those for which the signal Xt is not the solution of a stochastic differential equation, as in this
section, but a finite-state Markov process [Won65, LS01a]. We will discuss a special case in section 7.4.
Rather than working with p_t(x) directly, it is often more convenient to introduce an unnormalized conditional density q_t(x), defined through
σ_t(f) = ∫ f(x) q_t(x) dx,    p_t(x) = q_t(x) / ∫ q_t(x) dx,
which (formally) satisfies the Zakai equation dq_t(x) = (L_t^u)^* q_t(x) dt + q_t(x) (K(t)^{−1} h(t, x, u_t))^* dȲ_t.
At least this equation is a linear stochastic partial differential equation, a much more
well-posed object. It is still much too difficult for us, but the corresponding theory can
be found, e.g., in [Par82, Par91, Ben92, Kun90]. The Zakai PDE can now be the start-
ing point for further approximations, e.g., Galerkin-type methods [GP84], spectral
methods [LMR97], or projection onto a finite-dimensional manifold [BHL99].
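As an illustration of the PDE route, the following Python sketch propagates a crude explicit finite-difference discretization of the Zakai equation for a scalar model (signal dX_t = −X_t dt + s dW_t, observations dY_t = X_t dt + κ dB_t). The discretization, the model, and all names are my own choices, and no claim of numerical robustness is made.

import numpy as np

rng = np.random.default_rng(3)
s, kappa, dt, n = 0.5, 0.3, 0.001, 5000
xg = np.linspace(-4.0, 4.0, 401)                 # spatial grid
dx = xg[1] - xg[0]
b = lambda x: -x                                 # signal drift
h = lambda x: x                                  # observation function

# simulate a signal path and observation increments
x, dY = 1.0, np.zeros(n)
for k in range(n):
    x += b(x) * dt + s * np.sqrt(dt) * rng.standard_normal()
    dY[k] = h(x) * dt + kappa * np.sqrt(dt) * rng.standard_normal()

q = np.exp(-xg**2 / 2)                           # unnormalized initial density
for k in range(n):
    flux = b(xg) * q
    Lstar_q = np.zeros_like(q)                   # L* q = -(b q)' + (s^2/2) q''
    Lstar_q[1:-1] = (-(flux[2:] - flux[:-2]) / (2 * dx)
                     + 0.5 * s**2 * (q[2:] - 2 * q[1:-1] + q[:-2]) / dx**2)
    q = q + Lstar_q * dt + q * h(xg) * dY[k] / kappa**2   # explicit Zakai step
    q = np.clip(q, 0.0, None)                    # crude fix to keep the explicit scheme positive

p = q / (q.sum() * dx)                           # approximate conditional density p_t(x)
print(np.dot(xg, p) * dx, x)                     # conditional mean estimate vs. the true signal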
Finally, there is a third approach which is similar to the method that we have
already encountered in the control setting. We can approximate our signal process by a
discrete time finite-state Markov process, and introduce an appropriate approximation
to the observation process; this can be done, e.g., by introducing a suitable finite-
difference approximation, as we did in the last chapter. The optimal filter for the
approximate signal and observations is a finite-dimensional recursion, which can be
shown to converge to the solution of the optimal filtering problem [Kus77, KD01].
For a recent review on numerical methods in nonlinear filtering, see [Cri02].
Example 7.2.10. For the sake of example, and as we already have some experience with
such approximations, let us discuss an extremely simple Markov chain approximation
for a nonlinear filtering problem. This is not necessarily the method of choice for such
a problem, but will serve as a simple demonstration.
Consider a signal process θ_t on the circle which satisfies
dθ_t = ω dt + ν dW_t (mod 2π),
and suppose we have observations of the form dY_t = sin(θ_t) dt + κ dB_t; our goal is to estimate θ_t given F_t^Y. Such a model can be used in phase tracking
problems (e.g., in a phase lock loop), where the goal is to estimate the drifting phase
of an oscillating signal (with carrier frequency ω) from noisy observations [Wil74].
where Q is the measure under which xn and FtY are independent. A weak con-
vergence argument [KD01, Kus77] guarantees that this approximate expression does
indeed converge, as ∆, δ → 0, to the exact unnormalized estimate σ n∆ (f ).
Remark 7.2.11. Consider the following discrete time filtering problem: the signal is
our Markov chain xn , while at time n we observe yn = sin(xn ) ∆ + κ ξn , where ξn is
a Gaussian random variable with mean zero and variance ∆, independent of the signal
process. Using the Bayes formula, you can easily verify that
E(f(x_n)|F_n^y) = E_Q(f(x_n) Λ_n | F_n^y) / E_Q(Λ_n | F_n^y),    Λ_n = exp( κ^{−2} Σ_{m=0}^{n−1} { sin(x_m) y_m − (1/2) sin²(x_m) ∆ } ),
where Fny = σ{ym : m ≤ n} and Q is the measure under which {xn } and {yn }
are independent. Evidently our approximate filter is again a filter for an approximate
problem, just like the Markov chain approximations in the stochastic control setting.
Figure 7.1. Numerical solution of example 7.2.10 with ω = ν = .5, κ = .1, ∆ = .05 and
N = 25, on the interval t ∈ [0, 20]. The top plot shows θt (blue line) and the approximate
conditional distribution π̃n (shaded background), while the bottom plot shows the observation
increments Y(m+1)∆ − Ym∆ used by the approximate filter (red). The blue and red plots were
computed by the Euler-Maruyama method with a time step much smaller than ∆.
σ̃_n(f) = E_Q[ E_Q(f(x_n)|x_{n−1}) exp( κ^{−2} Σ_{m=0}^{n−1} { sin(x_m)(Y_{(m+1)∆} − Y_{m∆}) − (1/2) sin²(x_m) ∆ } ) | F_{n∆}^Y ],
where we have written σ̃n (f ) for the approximate expression for σn∆ (f ). To see this,
it suffices to note that as Y_t is independent of x_n under Q, and as x_n has the same law under P and Q, we can write E_Q(f(x_n)|σ{F_{n∆}^Y, F_{n−1}^x}) = E_P(f(x_n)|x_{n−1}) using the Markov property (where F_n^x = σ{x_m : m ≤ n}); the claim follows directly using
the tower property of the conditional expectation. But then evidently
σ̃_n(f) = σ̃_{n−1}( E_P(f(x_n)|x_{n−1} = ·) exp( κ^{−2} { sin(·)(Y_{n∆} − Y_{(n−1)∆}) − (1/2) sin²(·) ∆ } ) ),
which is the discrete time analog of the Zakai equation. We can now turn this into a
closed-form recursion as follows. Define σ̃_n^k = σ̃_n(I_{{kδ}}), denote by P the matrix with elements P_{kℓ} = P(x_n = ℓδ | x_{n−1} = kδ), and by Λ(y) the diagonal matrix with (Λ(y))_{kk} = exp( κ^{−2} { sin(kδ) y − sin²(kδ) ∆/2 } ). Then you can easily verify that
σ̃_n = Λ(Y_{n∆} − Y_{(n−1)∆}) P^* σ̃_{n−1},    π̃_n = Λ(Y_{n∆} − Y_{(n−1)∆}) P^* π̃_{n−1} / Σ_k ( Λ(Y_{n∆} − Y_{(n−1)∆}) P^* π̃_{n−1} )^k,
where π̃nk = σ̃n (I{kδ} )/σ̃n (1) is the approximate conditional probability of finding θt
in the kth discretization interval at time n∆, given the observations up to that time.
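For completeness, here is a Python sketch of this recursion. The transition matrix P below is a simple random-walk approximation of the signal dynamics chosen purely for illustration (the text's exact chain construction is not reproduced), and all variable names are my own.

import numpy as np

rng = np.random.default_rng(4)
omega, nu, kappa, Dt, N = 0.5, 0.5, 0.1, 0.05, 25
K = 2 * N + 1                                   # number of grid points on the circle
delta = 2 * np.pi / K
grid = delta * np.arange(K)

# a simple transition matrix for the approximating chain: drift omega, diffusion nu
pr = 0.5 * (nu**2 * Dt / delta**2 + omega * Dt / delta)   # jump right
pl = 0.5 * (nu**2 * Dt / delta**2 - omega * Dt / delta)   # jump left
P = np.zeros((K, K))
for k in range(K):
    P[k, (k + 1) % K] = pr
    P[k, (k - 1) % K] = pl
    P[k, k] = 1 - pr - pl

def Lam(y):
    return np.diag(np.exp((np.sin(grid) * y - 0.5 * np.sin(grid)**2 * Dt) / kappa**2))

# simulate signal and observations, and run the approximate filter
theta, pi = np.pi, np.ones(K) / K
for n in range(400):
    theta = (theta + omega * Dt + nu * np.sqrt(Dt) * rng.standard_normal()) % (2 * np.pi)
    dY = np.sin(theta) * Dt + kappa * np.sqrt(Dt) * rng.standard_normal()
    pi = Lam(dY) @ P.T @ pi
    pi /= pi.sum()                              # normalized approximate conditional distribution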
A numerical simulation is shown in figure 7.1; that the approximate filter provides
good estimates of the signal location is evident. A curious effect should be pointed
out: note that whenever the signal crosses either π/2 or 3π/2, the conditional distribu-
tion briefly becomes bimodal. This is to be expected, as these are precisely the peaks
of the observation function sin(x); convince yourself that around these peaks, the fil-
ter cannot distinguish purely from the observations in which direction the signal is
moving! This causes the conditional distribution to have “ghosts” which move in the
opposite direction. However, the “ghosts” quickly dissipate away, as prolonged mo-
tion in the opposite direction is incompatible with the signal dynamics (of course, the
effect is more pronounced if ω is close to zero). Thus the filter does its job in utilizing
both the information gained from the observations and the known signal dynamics.
7.3 The Kalman-Bucy filter

In this section we specialize to the linear system-observation model
dX_t = A(t) X_t dt + B(t) u_t dt + C(t) dW_t,    dY_t = H(t) X_t dt + K(t) dB_t.
Here A(t), B(t), C(t), H(t), and K(t) are non-random matrices of dimensions n×n,
n × k, n × m, p × n, and p × p, respectively, and ut is a k-dimensional control input
which is presumed to be FtY -adapted. We will make the following assumptions:
1. X0 is a Gaussian random variable with mean X̂0 and covariance P̂0 ;
2. The equation for (Xt , Yt ) has a unique Ft -adapted solution;
3. K(t) is invertible for all t;
4. A(t), B(t), C(t), H(t), K(t), K(t)−1 are continuous.
The goal of the linear filtering problem is to compute the conditional mean X̂t =
E(Xt |FtY ) and error covariance P̂t = E((Xt − X̂t )(Xt − X̂t )∗ ). We will prove:
Theorem 7.3.1 (Kalman-Bucy). Under suitable conditions on the control u_t, the conditional mean and error covariance satisfy
dX̂_t = A(t) X̂_t dt + B(t) u_t dt + P̂_t H(t)^* (K(t) K(t)^*)^{−1} (dY_t − H(t) X̂_t dt),
(d/dt) P̂_t = A(t) P̂_t + P̂_t A(t)^* − P̂_t H(t)^* (K(t) K(t)^*)^{−1} H(t) P̂_t + C(t) C(t)^*,
with the initial conditions X̂_0 and P̂_0 of assumption 1; moreover, the innovations process B̄_t, defined by dB̄_t = K(t)^{−1}(dY_t − H(t) X̂_t dt), is an F_t^Y-Wiener process.
What conditions must be imposed on the controls will be clarified in due course;
however, the theorem holds at least for non-random ut which is locally bounded (open
loop controls), and for sufficiently many feedback controls that we will be able to solve
the partial observations counterpart of the linear regulator problem (section 7.5).
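In anticipation of what follows, the Python sketch below integrates the filtering equations of theorem 7.3.1 by a simple Euler scheme for a scalar, time-homogeneous model with u_t = 0; the coefficients and all variable names are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(5)
A, C, H, K = -1.0, 0.5, 1.0, 0.2                 # scalar, time-independent coefficients
dt, n = 0.001, 5000

X = rng.normal(0.0, 1.0)                         # X_0 ~ N(0, 1)
Xhat, Phat = 0.0, 1.0                            # filter initialized at the mean/covariance of X_0
for k in range(n):
    dW, dB = np.sqrt(dt) * rng.standard_normal(2)
    dY = H * X * dt + K * dB                     # observation increment
    X += A * X * dt + C * dW                     # signal step
    # Kalman-Bucy filter and Riccati equation, Euler-discretized
    Xhat += A * Xhat * dt + Phat * H / K**2 * (dY - H * Xhat * dt)
    Phat += (2 * A * Phat + C**2 - Phat**2 * H**2 / K**2) * dt

print(X, Xhat, Phat)    # Phat approaches the steady-state error variance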
In principle it is possible to proceed as we did in the previous section, i.e., by
obtaining the Zakai equation through the Kallianpur-Striebel formula. The problem,
however, is that the change of measure Λt is generally not square-integrable, so we
will almost certainly have trouble applying lemma 7.2.7. This can be taken care of by
clever localization [Par91] or truncation [Ben92] arguments. We will take an entirely
different route, however, which has an elegance of its own: we will exploit the fact that
the conditional expectation is the least squares estimate to turn the filtering problem
into an optimal control problem (which aims to find an estimator which minimizes
the mean square error). The fundamental connection between filtering and control
runs very deep (see section 7.7 for references), but is particularly convenient in the
Kalman-Bucy case due to the special structure of the linear filtering problem.
Before we embark on this route, let us show that theorem 7.3.1 does indeed follow
from the previous section, provided that we are willing to forgo technical precision.
The easiest way to do this is to consider the density form of the Zakai equation,
σ_t(f) ≡ ∫ f(x) q_t(x) dx,    dq_t(x) = (L_t^u)^* q_t(x) dt + q_t(x) (K(t)^{−1} h(t, x))^* dȲ_t.
Lemma 7.3.2. Denote by Φs,t the unique (non-random) matrix that solves
(d/dt) Φ_{s,t} = A(t) Φ_{s,t}  (t > s),    Φ_{s,s} = I,
where I is the identity matrix as usual. Then we can write
X_t = Φ_{0,t} X_0 + ∫_0^t Φ_{s,t} B(s) u_s ds + ∫_0^t Φ_{s,t} C(s) dW_s.
Proof. It is elementary that Φ_{s,t} = Φ_{0,t} (Φ_{0,s})^{−1}. Hence the claim is that we can write
X_t = Φ_{0,t} [ X_0 + ∫_0^t (Φ_{0,s})^{−1} B(s) u_s ds + ∫_0^t (Φ_{0,s})^{−1} C(s) dW_s ],
which is easily verified using Itô's rule.
Why does this help? Recall that we want to compute E(Xt |FtY ); in general, this
could be an arbitrarily complicated measurable functional of the observation sample
paths {Ys : s ≤ t}. However, it is a very special consequence of the Gaussian property
of (Xt , Yt ) that E(Xt |FtY ) must be a linear functional of {Ys : s ≤ t} (in a sense to
be made precise). This will make our life much simpler, as we can easily parametrize
all linear functionals of {Ys : s ≤ t}; it then suffices, by the least squares property
of the conditional expectation (proposition 2.3.3), to search for the linear functional L
that minimizes the mean square error X̂_t^i = argmin_L E((X_t^i − L(Y_{[0,t]}))²).
Lemma 7.3.5. There exists a non-random Rn×p -valued function G(t, s) such that
E(X_t|F_t^Y) = E(X_t) + ∫_0^t G(t, s) H(s)(X_s − E(X_s)) ds + ∫_0^t G(t, s) K(s) dB_s,
where ∫_0^t ||G(t, s)||² ds < ∞. Thus E(X_t|F_t^Y) is a linear functional of {Y_s : s ≤ t}.
Proof. Define the centered processes X̃_t = X_t − E(X_t) and Ỹ_t = Y_t − ∫_0^t H(s) E(X_s) ds, so that dỸ_t = H(t) X̃_t dt + K(t) dB_t. Clearly F_t^Y = F_t^Ỹ, (X̃_t, Ỹ_t) is a Gaussian process, and we wish to prove that
E(X̃_t|F_t^Ỹ) = ∫_0^t G(t, s) dỸ_s = ∫_0^t G(t, s) H(s) X̃_s ds + ∫_0^t G(t, s) K(s) dB_s.
Let us first consider a simpler problem which only depends on a finite number of random variables. To this end, introduce the σ-algebra G_ℓ = σ{Y_{k2^{−ℓ}t} − Y_{(k−1)2^{−ℓ}t} : k = 1, ..., 2^ℓ}, and note that F_t^Y = σ{G_ℓ : ℓ = 1, 2, ...} (as Y_t has continuous sample paths, so only depends on its values in a dense set of times). Define also the p2^ℓ-dimensional random variable Ỹ^ℓ whose kth block is the increment Ỹ_{k2^{−ℓ}t} − Ỹ_{(k−1)2^{−ℓ}t}, so that E(X̃_t|G_ℓ) = E(X̃_t|Ỹ^ℓ). But (X̃_t, Ỹ^ℓ) is a (p2^ℓ + n)-dimensional Gaussian random
variable, and in particular possesses a joint (Gaussian) density with respect to the Lebesgue
measure. It is well known how to condition multivariate Gaussians, so we will not repeat the
computation (it is simply a matter of applying example 7.1.1 to the Gaussian density, and per-
forming explicit integrations); the result is as follows: if we denote by ΣXX , ΣY Y the covari-
ance matrices of X̃t and Ỹ ` , and by ΣXY the covariance between X̃t and Ỹ ` , then E(X̃t |Ỹ ` ) =
E(X̃t ) + ΣXY (ΣY Y )−1 (Ỹ ` − E(Ỹ ` )) (if ΣY Y is singular, take the pseudoinverse instead).
But for us E(X̃t ) = E(Ỹ ` ) = 0, so we conclude that E(X̃t |Ỹ ` ) = ΣXY (ΣY Y )−1 Ỹ ` .
Evidently E(X̃t |Ỹ ` ) can be written as a linear combination of the increments of Ỹ ` with
deterministic coefficients. In particular, we can thus write
E(X̃_t|G_ℓ) = ∫_0^t G^ℓ(t, s) dỸ_s = ∫_0^t G^ℓ(t, s) H(s) X̃_s ds + ∫_0^t G^ℓ(t, s) K(s) dB_s,
where s ↦ G^ℓ(t, s) is a non-random simple function which is constant on the intervals s ∈ [(k − 1)2^{−ℓ}t, k2^{−ℓ}t[. To proceed, we would like to take the limit as ℓ → ∞. But note that E(X̃_t|G_ℓ) → E(X̃_t|F_t^Y) in L² by Lévy's upward theorem (lemma 4.6.4). Hence the remainder is essentially obvious (see [LS01a, lemma 10.1] for more elaborate reasoning).
We can now proceed to solve the filtering problem. Our task is clear: out of all
linear functionals of the form defined in the previous lemma, we seek the one that
minimizes the mean square error. We will turn this problem into an optimal control
problem, for which G(t, s) in lemma 7.3.5 is precisely the optimal control.
Theorem 7.3.6 (Kalman-Bucy, no control). Theorem 7.3.1 holds for ut = 0.
Proof. Let us fix the terminal time T . For any (non-random) function G : [0, T ] → Rn×p with
∫_0^T ||G(t)||² dt < ∞, we define the F_t^Y-adapted process
L_t^G = E(X_t) + ∫_0^t G(s) H(s)(X_s − E(X_s)) ds + ∫_0^t G(s) K(s) dB_s.
We would like to find such a G that minimizes the cost J_v[G] = E((v^*(X_T − L_T^G))²) for every vector v ∈ R^n. By proposition 2.3.3 and lemma 7.3.5, v^* L_T^{G^*} = E(v^* X_T|F_T^Y) = v^* X̂_T and J_v[G^*] = v^* P̂_T v for every v ∈ R^n, where G^* is the function that minimizes J_v[G].
Now define α(t) = G(T − t)^* v, and denote by ξ_t^α the solution of
(d/dt) ξ_t^α = A(T − t)^* ξ_t^α − H(T − t)^* α(t),    ξ_0^α = v.
A straightforward computation using lemma 7.3.2 and the Itô isometry then shows that we obtain the cost J_v[G] = J[α] with
J[α] = ∫_0^T {(ξ_s^α)^* C(T − s) C(T − s)^* ξ_s^α + α(s)^* K(T − s) K(T − s)^* α(s)} ds + (ξ_T^α)^* P̂_0 ξ_T^α.
But this is precisely a linear regulator problem for the controlled (non-random) differential
equation ξtα with the cost J[α]. The conclusion of the theorem follows easily (fill in the re-
maining steps!) by invoking the solution of the linear regulator problem (theorem 6.5.1).
It remains to verify that the innovations process B̄t is a Wiener process, as claimed; this
follows immediately, however, from proposition 7.2.9, without any changes in the proof.
Let us now consider what happens when a control is present. By lemma 7.3.2, we can write
X_t^u = Φ_{0,t} X_0 + ∫_0^t Φ_{s,t} B(s) u_s ds + ∫_0^t Φ_{s,t} C(s) dW_s,
where we have attached the label u to the signal process to denote its solution with the control strategy u in operation. But then evidently
X_t^u = X_t^0 + ∫_0^t Φ_{s,t} B(s) u_s ds,
and in particular the second term is FtY,u -adapted (as us was assumed to depend only
on the observations), where we have denoted the observations under the strategy u by
Y_t^u = ∫_0^t H(s) X_s^u ds + ∫_0^t K(s) dB_s,    F_t^{Y,u} = σ{Y_s^u : s ≤ t}.
Hence we obtain
X̂_t^u = E(X_t^u|F_t^{Y,u}) = E(X_t^0|F_t^{Y,u}) + ∫_0^t Φ_{s,t} B(s) u_s ds.
If only FtY,u = FtY,0 , we would easily obtain an equation for X̂tu : after all, then
X̂_t^u = X̂_t^0 + ∫_0^t Φ_{s,t} B(s) u_s ds,
and as we have already found the equation for X̂t0 we immediately obtain the appropri-
ate equation for X̂tu using Itô’s rule. The statement FtY,u = FtY,0 is not at all obvious,
however. The approach which we will take is simply to restrict consideration only
to those control strategies for which FtY,u = FtY,0 is satisfied; we will subsequently
show that this is indeed the case for a large class of interesting controls.
Remark 7.3.7. We had no such requirement in the bounded case, where we used
the Kallianpur-Striebel formula to obtain the filter. Indeed this requirement is also
superfluous here: it can be shown that under a straightforward integrability condition
(of purely technical nature), E(Xt0 |FtY,u ) = E(Xt0 |FtY,0 ) always holds regardless of
whether FtY,u = FtY,0 [Ben92, section 2.4]. It is perhaps not surprising that the proof
of this fact hinges crucially on the Kallianpur-Striebel formula! We will not need this
level of generality, however, as it turns out that FtY,u = FtY,0 for a sufficiently large
class of controls; we will thus be content to stick to this simpler approach.
The following result is now basically trivial.
Theorem 7.3.8 (Kalman-Bucy with control). Suppose that u_· ∈ ⋂_{T<∞} L¹(µ_T × P) and that F_t^{Y,u} = F_t^{Y,0} for all t < ∞. Then theorem 7.3.1 holds.
Proof. The integrability condition ensures that Xtu − Xt0 is in L1 (so that the conditional
expectation is well defined). The discussion above gives immediately the equation for X̂tu ,
which depends on P̂t . We claim that P̂t = E((Xtu − X̂tu )(Xtu − X̂tu )∗ ); but this follows
immediately from Xtu − X̂tu = Xt0 − X̂t0 . It remains to show that the innovation B̄tu is still a
Wiener process. But as Xtu − X̂tu = Xt0 − X̂t0 , we find that B̄tu = B̄t0 , which we have already
established to be a Wiener process. Hence the proof is complete.
This result is not very useful by itself; it is not clear that there even exist controls
that satisfy the conditions! (Of course, non-random controls are easily seen to work,
but these are not very interesting.) We will conclude this section by exhibiting two
classes of controls that satisfy the conditions of theorem 7.3.8.
The first, and the most important class in the following, consists of those controls
whose value at time t is a Lipschitz function of the Kalman-Bucy filter at time t.
This class of separated controls is particularly simple to implement: we can update
the Kalman-Bucy filter numerically, e.g., using the Euler-Maruyama method, and at
each time we simply feed back a function of the latest estimate. In particular, the
complexity of feedback strategies that depend on the entire observation path in an
arbitrary way is avoided. As it turns out, controls of this form are also optimal for the
type of optimal control problems in which we are interested (see section 7.5).
Proposition 7.3.9. Let ut = α(t, X̃t ), where α : [0, ∞[ × Rn → Rk is Lipschitz and
dX̃t = A(t)X̃t dt + B(t)α(t, X̃t ) dt + P̂t (K(t)−1 H(t))∗ K(t)−1 (dYtu − H(t)X̃t dt)
with the initial condition X̃0 = X̂0 . Then (Xtu , Ytu , X̃t ) has a unique solution, ut
satisfies the conditions of theorem 7.3.8, and X̃t = X̂tu .
Proof. Set F(t) = K(t)^{−1} H(t). To see that (X_t^u, Y_t^u, X̃_t) has a unique solution, write
dX_t^0 = (A(t) X_t^0 + B(t) α(t, X_t^0)) dt + P̂_t F(t)^* dB̄_t^0,    X_0^0 = X̂_0,
which is F_t^{Y,0}-adapted as B̄_t^0 is an F_t^{Y,0}-Wiener process and the coefficients are Lipschitz.
Consider the F_t^{Y,0}-adapted control u_t^0 = α(t, X_t^0). It is easily seen that
E(X_t^{u^0}|F_t^{Y,0}) = X̂_t^0 + ∫_0^t Φ_{s,t} B(s) u_s^0 ds.
Using Itô's rule, we find that E(X_t^{u^0}|F_t^{Y,0}) satisfies the same equation as X_t^0, so apparently X_t^0 = E(X_t^{u^0}|F_t^{Y,0}) by the uniqueness of the solution. On the other hand, note that
dB̄_t^0 = K(t)^{−1} dY_t^0 − F(t) X̂_t^0 dt = K(t)^{−1} dY_t^{u^0} − F(t) X_t^0 dt,
so we can write
dX_t^0 = (A(t) X_t^0 + B(t) α(t, X_t^0)) dt + P̂_t F(t)^* (K(t)^{−1} dY_t^{u^0} − F(t) X_t^0 dt).
Thus X_t^0 is F_t^{Y,u^0}-adapted (e.g., note that X_t^0 − ∫_0^t P̂_s F(s)^* K(s)^{−1} dY_s^{u^0} satisfies an ODE which has a unique solution), so u_t^0 = α(t, X_t^0) is both F_t^{Y,0}- and F_t^{Y,u^0}-adapted. But note that
Y_t^{u^0} = Y_t^0 + ∫_0^t H(s) ∫_0^s Φ_{r,s} B(r) u_r^0 dr ds.
Thus F_t^{Y,0} ⊂ F_t^{Y,u^0}, as Y_t^0 is a functional of Y_t^{u^0} and u_t^0, both of which are F_t^{Y,u^0}-adapted. Conversely F_t^{Y,u^0} ⊂ F_t^{Y,0}, as Y_t^{u^0} is a functional of Y_t^0 and u_t^0, both of which are F_t^{Y,0}-adapted. It remains to note that (X_t^{u^0}, Y_t^{u^0}, X_t^0) satisfies the same SDE as (X_t^u, Y_t^u, X̃_t), so u_t = u_t^0, etc., by uniqueness, and in particular X̃_t = X_t^0 = E(X_t^{u^0}|F_t^{Y,0}) = X̂_t^{u^0} = X̂_t^u.
The second class of controls that are guaranteed to satisfy the conditions of the-
orem 7.3.8 consists of those strategies which are “nice” but otherwise arbitrary func-
tionals of the observation history. The following result is not difficult to obtain, but as
we will not need it we refer to [FR75, lemma VI.11.3] for the proof. Let us restrict to
a finite interval [0, T ] for notational simplicity (the extension to [0, ∞[ is trivial).
Proposition 7.3.10. Let α : [0, T] × C([0, T]; R^p) → R^k be a (Borel-)measurable function which satisfies the following conditions:
1. If y_s = y'_s for all s ≤ t, then α(t, y) = α(t, y'); in other words, α(t, y) only depends on {y_s : s ≤ t} (for fixed t).
2. ||α(t, y) − α(t, y')|| ≤ K max_{s∈[0,T]} ||y_s − y'_s|| for some K < ∞ and all t ∈ [0, T] and y, y' ∈ C([0, T]; R^p); i.e., the function α is uniformly Lipschitz.
3. ||α(t, 0)|| is bounded on t ∈ [0, T].
Define the control ut = α(t, Y· ). Then the equation for (Xt , Yt ) admits a unique
solution which satisfies the conditions of theorem 7.3.8.
7.4 The Shiryaev-Wonham filter

In this section we consider a simple change detection problem. We observe
dY_t = γ I_{τ≤t} dt + σ dB_t,    Y_0 = 0,
where B_t is a Wiener process, γ, σ > 0 are constants, and the change point τ is a random time, independent of B_t, with P(τ = 0) = p_0, P(τ = ∞) = p_∞, and P(τ ∈ A) = (1 − p_0 − p_∞) ∫_A p_τ(s) ds for A ⊂ ]0, ∞[. The goal of the filtering problem is to estimate whether the change has occurred by the current time, given the observations up to the current time; in other words, we seek to compute π_t = P(τ ≤ t|F_t^Y) = E(I_{τ≤t}|F_t^Y), where F_t^Y = σ{Y_s : s ≤ t}.
We have all the tools to solve this problem; in fact, compared to some of the more
technical problems which we have encountered in the previous sections, this is a piece
of cake! All that is needed is the Bayes formula and some simple manipulations.
Proposition 7.4.1 (Shiryaev-Wonham filter). πt = P(τ ≤ t|FtY ) satisfies
dπ_t = (γ/σ) π_t(1 − π_t) dB̄_t + [(1 − p_0 − p_∞) p_τ(t) / ((1 − p_0 − p_∞) ∫_t^∞ p_τ(s) ds + p_∞)] (1 − π_t) dt,    π_0 = p_0,
where the innovations process dB̄t = σ −1 (dYt − γπt dt) is an FtY -Wiener process.
Proof. Consider the following change of measure:
dQ_T/dP = exp( −(γ/σ) ∫_0^T I_{τ≤s} dB_s − (γ²/2σ²) ∫_0^T I_{τ≤s} ds ).
where µτ is the law of τ . Let us now evaluate the numerator and denominator explicitly. For
the numerator, we find the explicit expression
Σ_t = ∫_{[0,∞]} I_{s≤t} exp( (γ/σ²)(Y_t − Y_{t∧s}) − (γ²/2σ²)(t − s)^+ ) µ_τ(ds)
   = p_0 e^{(γ/σ²)Y_t − (γ²/2σ²)t} + (1 − p_0 − p_∞) ∫_0^t p_τ(s) e^{(γ/σ²)(Y_t − Y_s) − (γ²/2σ²)(t − s)} ds.
It remains to apply Itô’s rule. First, applying Itô’s rule to Σt , we obtain the counterpart of the
Zakai equation in the current context:
dΣ_t = (γ/σ²) Σ_t dY_t + (1 − p_0 − p_∞) p_τ(t) dt,    Σ_0 = p_0.
Another application of Itô’s rule gives
dπ_t = (γ/σ) π_t(1 − π_t) σ^{−1}(dY_t − γ π_t dt) + [(1 − p_0 − p_∞) p_τ(t) / ((1 − p_0 − p_∞) ∫_t^∞ p_τ(s) ds + p_∞)] (1 − π_t) dt.
It remains to note that dB̄t = σ −1 (dYt − γπt dt) is a Wiener process, which follows exactly
as in proposition 7.2.9 without any change to the proof.
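A simple simulation illustrates the filter. The Python sketch below assumes an exponential prior density p_τ(t) = λ e^{−λt} with p_∞ = 0 (so that the dt-coefficient in the filter reduces to λ(1 − π_t)), an atom p_0 at zero, and arbitrary parameter values; these choices and all variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
gamma, sigma, lam, p0 = 1.0, 0.5, 0.5, 0.1
dt, n = 0.001, 10000

# sample the change point tau: atom p0 at zero, otherwise exponential(lam)
tau = 0.0 if rng.random() < p0 else rng.exponential(1.0 / lam)

pi = p0
pi_path = np.zeros(n)
for k in range(n):
    t = k * dt
    dY = gamma * (tau <= t) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    dBbar = (dY - gamma * pi * dt) / sigma            # innovations increment
    # with the exponential prior the dt-coefficient reduces to lam*(1 - pi)
    pi += lam * (1 - pi) * dt + (gamma / sigma) * pi * (1 - pi) * dBbar
    pi = min(max(pi, 0.0), 1.0)                        # guard against discretization overshoot
    pi_path[k] = pi                                    # pi_t = P(tau <= t | F_t^Y)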
7.5 The separation principle and LQG control

Let us return to the controlled signal-observation model of section 7.2. We would like to design a strategy u_t to achieve a certain purpose; consider, for exam-
ple, a cost functional that is similar to the finite horizon cost in the previous chapter:
J[u] = E[ ∫_0^T {v(X_t^u) + w(u_t)} dt + z(X_T^u) ].
(The specific type of running cost is considered for simplicity only, and is certainly
not essential.) Our goal is, as usual, to find a control strategy u∗ that minimizes the
cost J[u]. However, as opposed to the previous chapter, there is now a new ingredient
in the problem: we can only observe the noisy sensor data Ytu , so that the control
signal ut can only be FtY,u -adapted (where FtY,u = σ{Ysu : s ≤ t}). The theory of
the previous chapter cannot account for this; the only constraint which we are able to
impose on the control signal within that framework is the specification of the control
set U, and the constraint that ut is FtY,u -adapted is certainly not of this form. Indeed,
if we apply the Bellman equation, we always automatically obtain a Markov control
which is a function of Xtu and is thus not adapted to the observations.
The trick to circumvent this problem is to express the cost in terms of quantities
that depend only on the observations; if we can then find a feedback control which is
a function of those quantities, then that control is automatically F tY,u -adapted! It is
not difficult to express the cost in terms of observation-dependent quantities; indeed,
using lemma 7.2.7 and the tower property of the conditional expectation,
J[u] = E[ ∫_0^T {π_t^u(v) + w(u_t)} dt + π_T^u(z) ]
(provided we assume that ut is FtY,u -adapted and that we have sufficient integrability
to apply lemma 7.2.7), where πtu (f ) = E(f (Xtu )|FtY,u ). But we can now interpret
this cost as defining a new control problem, where the system Xtu is replaced by the
filter πtu (·), and, from the point of view of the filter, we end up with a completely
observed optimal control problem. If we can solve a Bellman equation for such a
problem, then the optimal control at time t will simply be some function of the filter
at time t. Note that this is not a Markov control from the point of view of the physical
system Xtu , but this is a Markov control from the point of view of the filter. The idea to
express the control problem in terms of the filter is often referred to as the separation
principle, and a strategy which is a function of the filter is called a separated control.
Remark 7.5.1. You might worry that we cannot consider the filter by itself as an
autonomous system to be controlled, as the filter is driven by the observations Y tu
obtained from the physical system rather than by a Wiener process as in our usual
control system models. But recall that the filter can also be expressed in terms of the
innovations process: from this point of view, the filter looks just like an autonomous
equation, and can be considered as a stochastic differential equation quite separately
from the underlying model from which it was obtained.
Unfortunately, the separation principle does not in general lead to results that are
useful in practice. We have already seen that in most cases, the filter cannot be com-
puted in a finite-dimensional fashion. At the very best, then, the separation principle
leads to an optimal control problem for a stochastic PDE. Even if the formidable tech-
nical problems along the way can all be resolved (to see that tremendous difficulties
will be encountered requires little imagination), this is still of essentially no practical
use; after all, an implementation of the controller would require us both to propa-
gate a stochastic PDE in real time, and to evaluate a highly complicated function (the
control function) on an infinite-dimensional space! The former is routinely done in a
variety of applications, but the latter effectively deals the death blow to applications
of optimal control theory in the partially observed setting.3
On the other hand, in those cases where the filtering problem admits a finite-
dimensional solution, the separation principle becomes a powerful tool for control
design. In the remainder of this section we will develop one of the most important
examples: the partially observed counterpart of the linear regulator problem. We will
encounter more applications of the separation principle in the next chapter.
We consider the system-observation model
dX_t^u = A(t) X_t^u dt + B(t) u_t dt + C(t) dW_t,    dY_t^u = H(t) X_t^u dt + K(t) dB_t,
where the various objects in this expression satisfy the same conditions as in section 7.3. Our goal is to find a control strategy u which minimizes the cost functional
J[u] = E[ ∫_0^T {(X_t^u)^* P(t) X_t^u + (u_t)^* Q(t) u_t} dt + (X_T^u)^* R X_T^u ],
where P (t), Q(t) and R satisfy the same conditions as in section 6.5. We will insist,
however, that the control strategy ut is FtY,u -adapted, and we seek a strategy that is
optimal within the class of such controls that satisfy the conditions of theorem 7.3.8.
This is the LQG (Linear, Quadratic cost, Gaussian) control problem.
Theorem 7.5.2 (LQG control). Denote by Nt the solution of the Riccati equation
(d/dt) N_t = A(t) N_t + N_t A(t)^* − N_t H(t)^* (K(t) K(t)^*)^{−1} H(t) N_t + C(t) C(t)^*,
3 That is not to say that this setting has not been studied; many questions of academic interest, e.g., on
the existence of optimal controls, have been investigated extensively. However, I do not know of a single
practical application where the separation principle has actually been applied in the infinite-dimensional
setting; the optimal control problem is simply too difficult in such cases, and other solutions must be found.
Figure 7.2. Figure 0.3 revisited. The schematic on the left depicts the structure of a completely
observed optimal controller, as in the linear regulator problem. The schematic on the right
depicts the structure of a separated controller, as in the LQG problem.
where the initial condition N0 is taken to be the covariance matrix of X0 , and denote
by Mt the solution of the time-reversed Riccati equation
\[
\frac{d}{dt}M_t + A(t)^* M_t + M_t A(t) - M_t B(t)Q(t)^{-1}B(t)^* M_t + P(t) = 0,
\]
with the terminal condition MT = R. Then an optimal feedback control strategy for
the LQG control problem is given by $u_t^* = -Q(t)^{-1}B(t)^* M_t \hat X_t$, where $\hat X_t$ is the Kalman-Bucy filter for the controlled system,
\[
d\hat X_t = A(t)\hat X_t\,dt + B(t)\,u_t^*\,dt + N_t H(t)^*(K(t)K(t)^*)^{-1}\,(dY_t^u - H(t)\hat X_t\,dt),
\qquad \hat X_0 = E(X_0).
\]
Proof. As we assume that our controls satisfy the conditions of theorem 7.3.8, the Kalman-
Bucy filtering equations are valid. We would thus like to express the cost J[u] in terms of the
Kalman-Bucy filter. To this end, note that for any (non-random) matrix G
E((Xtu )∗ GXtu ) − E((X̂tu )∗ GX̂tu ) = E((Xtu − X̂tu )∗ G(Xtu − X̂tu )) = Tr[GP̂t ].
Thus evidently, the following cost differs from J[u] only by terms that depend on Tr[GP̂t ]
(provided we assume that ut is a functional of the observations):
\[
J'[u] = E\!\left[\int_0^T \{(\hat X_t^u)^* P(t)\,\hat X_t^u + (u_t)^* Q(t)\,u_t\}\,dt + (\hat X_T^u)^* R\,\hat X_T^u\right].
\]
But Tr[GP̂t ] is non-random and does not depend on the control u, so that clearly a strategy u∗
which minimizes J 0 [u] will also minimize J[u]. Now note that X̂tu satisfies the equation
dX̂tu = A(t)X̂tu dt + B(t)ut dt + P̂t (K(t)−1 H(t))∗ dB̄t ,
where $\bar B_t$ is a Wiener process. Hence the equation for $\hat X_t^u$, together with the cost $J'[u]$, defines
a linear regulator problem. By theorem 6.5.1, we find that an optimal control is given by the
strategy $u_t^* = -Q(t)^{-1}B(t)^* M_t \hat X_t^{u^*}$, and this control is admissible in the current setting by
proposition 7.3.9. Moreover, the controlled Kalman-Bucy filter is given precisely by $\hat X_t$ in the
statement of the theorem. Hence we are done.
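For readers who want to experiment with theorem 7.5.2 numerically, the following sketch (in Python, not part of the original notes) integrates the two Riccati equations by a crude Euler scheme for time-invariant coefficient matrices and returns the corresponding feedback gains $-Q^{-1}B^*M_t$; the function name and the discretization are arbitrary choices.

```python
import numpy as np

def lqg_riccati(A, B, C, H, K, P, Q, R, N0, T, steps):
    """Euler integration of the two Riccati equations of theorem 7.5.2
    (time-invariant coefficient matrices assumed, purely for simplicity)."""
    dt = T / steps
    KKinv = np.linalg.inv(K @ K.T)
    Qinv = np.linalg.inv(Q)

    # Filter Riccati equation: dN/dt = AN + NA* - NH*(KK*)^{-1}HN + CC*, N_0 = cov(X_0)
    N = [np.asarray(N0, dtype=float)]
    for _ in range(steps):
        Nt = N[-1]
        N.append(Nt + dt * (A @ Nt + Nt @ A.T - Nt @ H.T @ KKinv @ H @ Nt + C @ C.T))

    # Control Riccati equation: dM/dt + A*M + MA - MBQ^{-1}B*M + P = 0, M_T = R,
    # integrated backwards in time from the terminal condition
    M = [np.asarray(R, dtype=float)]
    for _ in range(steps):
        Mt = M[0]
        M.insert(0, Mt + dt * (A.T @ Mt + Mt @ A - Mt @ B @ Qinv @ B.T @ Mt + P))

    # The optimal control is u*_t = -Q^{-1} B* M_t \hat X_t
    gains = [-Qinv @ B.T @ Mt for Mt in M]
    return np.array(N), np.array(M), np.array(gains)
```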
Figure 7.3. Simulation of the model of example 7.5.4 with the optimal control strategy in
operation. Shown are the position of the particle xt (blue), the best estimate of the particle
position x̂t (green), and the position of the microscope focus −zt (red). For this simulation
T = 1, β = 100, σ = 10, γ = 5, κ = .5, z0 = 0, E(x0 ) = 0, var(x0 ) = 2, P = 1, Q = .5.
Remark 7.5.3. It is worth pointing out once again the structure of the controls ob-
tained through the separation principle (figure 7.2). In the completely observed case
(e.g., the linear regulator), the controller has access to the state of the system, and
computes the feedback signal as a memoryless function of the system state. In the
partially observed case (e.g., the LQG problem), the noisy observations are first used
to compute the best estimate of the system state; the controller then feeds back a mem-
oryless function of this estimate. Evidently the optimal control strategy separates into
a filtering step and a memoryless controller, hence the name “separation principle”.
Example 7.5.4 (Tracking under a microscope IV). Let us return for the last time to
our tracking example. In addition to the previous model, we now have observations:
\[
\frac{dz_t}{dt} = \beta u_t, \qquad x_t = x_0 + \sigma W_t, \qquad
dy_t = \gamma\,(x_t + z_t)\,dt + \kappa\,dB_t.
\]
We would like to find a strategy $u^*$ which minimizes the cost
\[
J[u] = E\!\left[P\int_0^T (x_t+z_t)^2\,dt + Q\int_0^T (u_t)^2\,dt\right],
\]
but this time we are only allowing our controls to depend on the noisy observations.
As before we will define et = xt + zt , so that we can work with the system equation
det = βut dt+σ dWt (as the observations and cost functional depend only on et ). But
we can now directly apply theorem 7.5.2. We find that u∗t = −Q−1 β mt êt , where
\[
d\hat e_t = -\frac{\beta^2 m_t}{Q}\,\hat e_t\,dt + \frac{\gamma\, n_t}{\kappa^2}\,(dy_t - \gamma\,\hat e_t\,dt),
\qquad \hat e_0 = E(e_0),
\]
and mt , nt are the solutions of the equations
\[
\frac{dm_t}{dt} - \frac{\beta^2}{Q}\,m_t^2 + P = 0, \qquad
\frac{dn_t}{dt} = \sigma^2 - \frac{\gamma^2}{\kappa^2}\,n_t^2,
\]
with mT = 0 and n0 = var(e0 ). A numerical simulation is shown in figure 7.3.
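The following minimal sketch (Python, not part of the notes) reproduces the structure of the simulation in figure 7.3: it integrates the two Riccati equations above, then runs an Euler–Maruyama simulation of the particle, the microscope focus, the filter and the separated control, using the parameter values quoted in the figure caption. The time step and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameter values as quoted in the caption of figure 7.3
T, beta, sigma, gamma, kappa = 1.0, 100.0, 10.0, 5.0, 0.5
P, Q, var_e0 = 1.0, 0.5, 2.0
dt = 1e-4
N = int(T / dt)

# Backward Riccati equation for m_t: dm/dt = beta^2 m^2 / Q - P, m_T = 0
m = np.zeros(N + 1)
for k in range(N, 0, -1):
    m[k - 1] = m[k] - dt * (beta**2 * m[k]**2 / Q - P)

# Forward Riccati equation for n_t: dn/dt = sigma^2 - gamma^2 n^2 / kappa^2, n_0 = var(e_0)
n = np.zeros(N + 1)
n[0] = var_e0
for k in range(N):
    n[k + 1] = n[k] + dt * (sigma**2 - gamma**2 * n[k]**2 / kappa**2)

# Euler-Maruyama simulation of the closed loop
x = rng.normal(0.0, np.sqrt(var_e0))   # particle position, x_0 ~ N(0, var(x_0))
z, e_hat = 0.0, 0.0                    # microscope focus z_0 = 0, filter estimate ê_0 = E(e_0)
for k in range(N):
    u = -beta * m[k] * e_hat / Q                      # separated control u*_t = -Q^{-1} beta m_t ê_t
    dW, dB = rng.normal(0.0, np.sqrt(dt), 2)
    dy = gamma * (x + z) * dt + kappa * dB            # noisy observation increment
    e_hat += beta * u * dt + gamma * n[k] / kappa**2 * (dy - gamma * e_hat * dt)
    x += sigma * dW                                   # particle: dx_t = sigma dW_t
    z += beta * u * dt                                # focus: dz_t/dt = beta u_t

print("final tracking error x_T + z_T =", x + z)
```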
Remark 7.5.5. Though we have only considered the finite time horizon cost, it is
not difficult to develop also the time-average and the discounted versions of the LQG
problem; see, e.g., [Dav77]. In fact, the usefulness of the separation principle in
the linear setting is not restricted to quadratic costs; we may choose the running and
terminal costs essentially arbitrarily, and the optimal control will still be expressible
as a function of the Kalman-Bucy filter [FR75] (though, like in the completely ob-
served case, there is no analytic solution for the feedback function when the cost is
not quadratic). The quadratic cost has a very special property, however, that is not
shared by other cost functions. Note that for the quadratic cost, the optimal feedback
function for the partially observed case u∗t = α(t, X̂t ) is the same function as in the
completely observed case u∗t = α(t, Xt ), where we have merely replaced the system
state by its estimate! This is called certainty equivalence. Though the optimal control
problem still separates for linear systems with non-quadratic cost, certainty equiva-
lence no longer holds in that case. In other words, in the latter case we still have
u∗t = α(t, X̂t ), but for the completely observed problem u∗t = α0 (t, Xt ) with α0 6= α.
7.6. Transmitting a message over a noisy channel

We now consider the problem of transmitting a message over a noisy channel: a transmitter
encodes the message in a signal $u_t$, which is corrupted by noise as it passes through the
channel. The transmitted signal must satisfy the power constraint
\[
E\!\left[\int_0^t (u_s)^2\,ds\right] \le P\,t \qquad\text{for all } t\in[0,T],
\]
where P bounds the signal power per unit time. On the other hand, we will presume
that the receiver may send a response to the transmitter in a noiseless manner, i.e., that
there is a noiseless feedback channel. This setup is illustrated in figure 7.4.
Let us now turn to the message. We will investigate the simplest type of mes-
sage: the transmitter has obtained a single Gaussian random variable θ, which is F 0 -
measurable and thus independent of Bt , to transmit to the receiver. We are thus faced
with the following problem: we would like to optimize our usage of the communica-
tion channel by choosing wisely the encoding strategy employed by the transmitter,
the decoding strategy employed by the receiver, and the way in which the receiver and
transmitter make use of the feedback channel, so that the receiver can form the best
possible estimate of θ at the end of the day given the time and power constraints.
Figure 7.4. Figure 0.4 revisited: setup for the transmission of a message over a noisy channel
with noiseless feedback. Further details can be found in the text.
The signal received through the noisy channel is given by
\[
dY_t^u = u_t\,dt + dB_t,
\]
where $B_t$ is an $\mathcal F_t$-Wiener process on the probability space $(\Omega, \mathcal F, \{\mathcal F_t\}, \mathbf P)$. As the
transmitter has access to Ytu (by virtue of the noiseless feedback, the receiver can
always forward the received signal Ytu back to the transmitter), we may choose ut to
be FtT,u -adapted. On the other hand, the receiver will only have access to the noisy
observations FtR,u , and at the end of the communication period the receiver will form
an FTR,u -measurable estimate θ̂u of the message θ. We now have two problems:
• The transmitter must choose the optimal encoding strategy u.
• The receiver must choose the optimal decoding strategy θ̂u .
By “optimal”, we mean that we wish to minimize the mean square error J 0 [u, θ̂u ] =
E((θ − θ̂u )2 ) over all encoding/feedback strategies u and all decoding strategies θ̂u .
The second problem—the choice of an optimal decoding strategy—is easy to
solve. After all, we know that at time t, the best estimate of θ based on the obser-
vation history is given by $\hat\theta_t^u = E(\theta|\mathcal F_t^{R,u})$, regardless of the encoding strategy $u$
employed by the transmitter. In principle, this solves one half of the problem.
In practice, if we choose a complicated encoding strategy, the resulting decoder
(filter) may be very complicated; in particular, it will be difficult to find the optimal
encoding strategy when we have so much freedom. We will therefore restrict ourselves
to a particularly simple class of encoders, which can be written as u t = at +bt θ where
at is FtR,u -adapted and bt is non-random. We will seek an optimal encoding strategy
within this class of linear encoding strategies. The advantage of the linear strategies
is that the resulting filters θ̂tu are easily computable; let us begin by doing this.
Lemma 7.6.1. Consider the linear encoding strategy $u_t = a_t + b_t\theta$, and define the
simplified strategy $u'_t = b_t\theta$. If $\mathcal F_t^{R,u} = \mathcal F_t^{R,u'}$ for all $t \in [0,T]$, then
\[
d\hat\theta_t^u = \hat P_t^u\, b_t\,(dY_t^u - a_t\,dt - b_t\hat\theta_t^u\,dt), \qquad
\frac{d\hat P_t^u}{dt} = -(b_t \hat P_t^u)^2,
\]
where θ̂0u = E(θ) and P̂0u = var(θ).
Proof. For $u'$, we simply obtain a Kalman-Bucy filter with $X_0 = \theta$, $A(t) = B(t) = C(t) = 0$,
$H(t) = b_t$, and $K(t) = 1$. But $\hat\theta_t^u = \hat\theta_t^{u'}$ follows trivially from the assumption on the equality
of the $\sigma$-algebras, so we obtain the result.
Proposition 7.6.2. Within the class of linear encoding strategies with feedback which satisfy
the power constraint, an optimal strategy $u^*$ is given by
\[
u_t^* = \sqrt{\frac{P}{\mathrm{var}(\theta)}}\; e^{Pt/2}\,(\theta - \hat\theta_t^{u^*}).
\]
The ultimate mean square error for this strategy is $E((\theta - \hat\theta_T^{u^*})^2) = \mathrm{var}(\theta)\,e^{-PT}$.

Proof. Let $u_t = a_t + b_t\theta$ be any linear encoding strategy that satisfies the power constraint and the condition of lemma 7.6.1. Note that
\[
E((u_t)^2) = E((a_t + b_t\hat\theta_t^u + b_t(\theta - \hat\theta_t^u))^2) = E((a_t + b_t\hat\theta_t^u)^2) + b_t^2\hat P_t^u \ge b_t^2\hat P_t^u,
\]
where we have used the properties of the conditional expectation to conclude that we may set
E((at + bt θ̂tu )(θ − θ̂tu )) = 0. But then our power constraint requires that
\[
\int_0^t b_s^2\,\hat P_s^u\,ds \;\le\; E\!\left[\int_0^t (u_s)^2\,ds\right] \;\le\; P\,t,
\]
so we conclude that P̂tu ≥ var(θ) e−P t . To find a strategy that attains this bound, note that
\[
\frac{d}{dt}\,\mathrm{var}(\theta)\,e^{-Pt} = -P\,\mathrm{var}(\theta)\,e^{-Pt}
= -\left(\sqrt{\frac{P\,e^{Pt}}{\mathrm{var}(\theta)}}\;\mathrm{var}(\theta)\,e^{-Pt}\right)^{\!2},
\]
so $b_t = e^{Pt/2}\sqrt{P/\mathrm{var}(\theta)}$ gives $\hat P_t^u = \mathrm{var}(\theta)\,e^{-Pt}$. Thus we must choose this $b_t$ to obtain
any optimal strategy, provided we can find an at such that the resulting strategy satisfies the
power constraint. But for this choice of bt , we find that
\[
E\!\left[\int_0^t (u_s)^2\,ds\right]
= E\!\left[\int_0^t (a_s + b_s\hat\theta_s^u)^2\,ds\right] + \int_0^t b_s^2\,\hat P_s^u\,ds
= E\!\left[\int_0^t (a_s + b_s\hat\theta_s^u)^2\,ds\right] + P\,t,
\]
so the power constraint is satisfied if and only if at + bt θ̂tu = 0 for all t. This yields the strategy
u∗t . It remains to check that the strategy u∗t satisfies the condition of lemma 7.6.1; but this is
easily done following the same logic as in the proof of proposition 7.3.9.
Have we gained anything by using the feedback channel? Let us see what hap-
pens if we disable the feedback channel; in this case, at can no longer depend on the
observations and is thus also non-random. We now obtain the following result.
7.6. Transmitting a message over a noisy channel 204
Proposition 7.6.3. Within the class of linear encoding strategies without feedback
which satisfy the power constraint, an optimal strategy u∗ is given by
\[
u_t^* = \sqrt{\frac{P}{\mathrm{var}(\theta)}}\,(\theta - E(\theta)).
\]
The ultimate mean square error for this strategy is $E((\theta - \hat\theta_T^{u^*})^2) = \mathrm{var}(\theta)/(1+PT)$.
Proof. This is the same idea as in the previous proof, only we now require that at is non-random
(note that in this case the condition of lemma 7.6.1 is automatically satisfied). The equation for
$\hat P_t^u$ can be solved explicitly: it is easily verified that
\[
\hat P_t^u = \frac{\mathrm{var}(\theta)}{1 + \mathrm{var}(\theta)\int_0^t (b_s)^2\,ds}.
\]
As $E((u_t)^2) = (a_t + b_t E(\theta))^2 + b_t^2\,\mathrm{var}(\theta)$, the power constraint allows at most
$\int_0^t (b_s)^2\,ds \le Pt/\mathrm{var}(\theta)$, with equality if and only if $a_t = -b_t E(\theta)$ and
$b_t^2\,\mathrm{var}(\theta) = P$. This choice maximizes $\int_0^T (b_s)^2\,ds$ and hence minimizes $\hat P_T^u$,
which yields the strategy $u_t^*$ above and $\hat P_T^{u^*} = \mathrm{var}(\theta)/(1+PT)$.
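As an illustration of the two propositions (not taken from the notes), the following Python sketch simulates both encoding strategies on a crude time grid and compares the empirical mean square decoding errors with the theoretical values $\mathrm{var}(\theta)\,e^{-PT}$ and $\mathrm{var}(\theta)/(1+PT)$. The horizon, power level and message variance are arbitrary, and $E(\theta)=0$ is assumed so that the no-feedback strategy is simply $u_t = b\,\theta$.

```python
import numpy as np

rng = np.random.default_rng(1)
P, T, var_theta = 1.0, 2.0, 1.0            # power constraint, horizon, message variance (arbitrary)
n_steps, n_runs = 1000, 500
dt = T / n_steps

errors_fb, errors_nofb = [], []
for _ in range(n_runs):
    theta = rng.normal(0.0, np.sqrt(var_theta))
    dB = rng.normal(0.0, np.sqrt(dt), n_steps)

    # Feedback strategy of proposition 7.6.2: u_t = b_t (theta - theta_hat),
    # with b_t = e^{Pt/2} sqrt(P/var(theta)); here the filter reduces to d(theta_hat) = Phat b dY
    theta_hat, Phat = 0.0, var_theta
    for k in range(n_steps):
        b = np.exp(P * k * dt / 2) * np.sqrt(P / var_theta)
        dY = b * (theta - theta_hat) * dt + dB[k]
        theta_hat += Phat * b * dY
        Phat -= (b * Phat) ** 2 * dt
    errors_fb.append((theta - theta_hat) ** 2)

    # No-feedback strategy of proposition 7.6.3: u_t = sqrt(P/var(theta)) * theta (mean-zero message)
    b = np.sqrt(P / var_theta)
    theta_hat, Phat = 0.0, var_theta
    for k in range(n_steps):
        dY = b * theta * dt + dB[k]
        theta_hat += Phat * b * (dY - b * theta_hat * dt)
        Phat -= (b * Phat) ** 2 * dt
    errors_nofb.append((theta - theta_hat) ** 2)

print("feedback:    empirical", np.mean(errors_fb), " theory", var_theta * np.exp(-P * T))
print("no feedback: empirical", np.mean(errors_nofb), " theory", var_theta / (1 + P * T))
```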
Remark 7.6.4. Evidently the strategy that uses the feedback channel performs much
better than the strategy without feedback. It is illuminating in this regard to inves-
tigate the particular form of the optimal strategies. Note that in the absence of the
power constraint, we would have no problem sending the message across the noisy
channel; we could just transmit θ directly over the channel with some large gain fac-
tor, and by cranking up the gain we can make the signal to noise ratio arbitrarily large
(and thus the estimation error arbitrarily small). However, with the power constraint
in place, we have to choose wisely which information we wish to spend our power al-
lowance on. Clearly it is not advantageous to waste power in transmitting something
that the receiver already knows; hence the optimal strategies, rather than transmitting
the message itself, try to transmit the discrepancy between the message and the part
of the message that is known to the receiver. Here feedback is of great help: as the
transmitter knows what portion of the message was received on the other end, it can
spend its remaining power purely on transmitting the parts of the message that were
corrupted (it does this by only transmitting the discrepancy between the message and
the receiver’s estimate of the message). On the other hand, the feedbackless transmit-
ter has no idea what the receiver knows, so the best it can do is subtract from θ its
mean (which is assumed to be known both to the transmitter and to the receiver).
Surprisingly, perhaps, these results are not restricted to the linear case; in fact, it
turns out that the encoding strategy of proposition 7.6.2 is optimal even in the class
of all nonlinear encoding strategies. It would be difficult to prove this directly, how-
ever, as this would require quantifying the mean-square error for a complicated set of
7.7. Further reading
Rozovskii [Roz90]. The issue of efficient numerical algorithms is another story; many
references were already mentioned in the chapter, but see in particular the recent re-
view [Cri02]. There are various interesting issues concerning the finite-dimensional
realization of filtering problems; the volumes [HW81, Mit82] contain some interesting
articles on this topic. Another very useful topic which we have overlooked, the robust
or pathwise definition of the filter, is discussed in an article by Davis in [HW81].
The Kalman-Bucy filter is treated in detail in many places; see, e.g., the books
by Davis [Dav77] and by Liptser and Shiryaev [LS01a, LS01b]. Our treatment, through
stochastic control, was heavily inspired by the treatment in Fleming and Rishel [FR75]
and in Bensoussan [Ben92]. The relations between filtering and stochastic control go
very deep indeed, and are certainly not restricted to the linear setting; on this topic,
consult the beautiful article by Mitter and Newton [MN03]. The Kalman-Bucy filter
can be extended also to the general conditionally Gaussian case where A(t), B(t),
C(t), H(t) and K(t) are all adapted to the observations; see Liptser and Shiryaev
[LS01b], as well as to the case where X0 has an arbitrary distribution (i.e., it is non-
Gaussian); for the latter, see the elegant approach by Makowski [Mak86].
The Shiryaev-Wonham filter is due to Shiryaev [Shi63, Shi73] and, in a more gen-
eral setting which allows the signal to be an arbitrary finite state Markov process, due
to Wonham [Won65]. Our treatment was inspired by Rogers and Williams [RW00b].
On the topic of partially observed control, Bensoussan [Ben92] is a good source of
information and further references. Our treatment was inspired by Fleming and Rishel
[FR75], which follows closely the original article by Wonham [Won68b] (for results
in the finite state setting see Segall [Seg77] and Helmes and Rishel [HR92]). Finally,
the transmission of a message through a noisy channel, and many other applications,
are treated in the second volume of Liptser and Shiryaev [LS01b].
CHAPTER 8
Optimal Stopping and Impulse Control
In the previous chapters, we have discussed several control problems where the goal
was to optimize a certain performance criterion by selecting an appropriate feedback
control policy. In this chapter, we will treat a somewhat different set of control prob-
lems; rather than selecting a continuous control to be applied to an auxiliary input in
the system equations, our goal will be to select an optimal stopping time to achieve a
certain purpose. Such problems show up naturally in many situations where a timing
decision needs to be made, e.g., when is the best time to sell a stock? When should
we decide to bring an apparatus, which may or may not be faulty, in for repair (and
pay the repair fee)? How long do we need to observe an unknown system to be able
to select one of several hypotheses with sufficient confidence? Such problems are
called optimal stopping problems, and we will develop machinery to find the optimal
stopping times. These ideas can also be extended to find optimal control strategies
in which feedback is applied to the system at a discrete sequence of times; we will
briefly discuss such impulse control problems at the end of this chapter.
8.1. Optimal stopping and variational inequalities

Throughout this section, $X_t$ denotes the solution of the time-homogeneous stochastic
differential equation $dX_t = b(X_t)\,dt + \sigma(X_t)\,dW_t$, with values in $\mathbb R^n$ and driven by the
$m$-dimensional Wiener process $W_t$.
Remark 8.1.1. As with the infinite time costs in chapter 6, we will find it convenient
to choose b and σ to be time-independent. However, time can always be added simply
by considering the (n + 1)-dimensional system Xt0 = (Xt , t).
For an Ft -stopping time τ , define the cost functional
\[
J[\tau] = E\!\left[\int_0^\tau e^{-\lambda s}\, w(X_s)\,ds + e^{-\lambda\tau}\, z(X_\tau)\right],
\]
where $w$ is a running cost, $z$ is a terminal cost, and $\lambda \ge 0$ is a discount factor; our goal is to
choose the stopping time $\tau$ so as to minimize $J[\tau]$.
A heuristic calculation
As in chapter 6, we will mostly concentrate on obtaining a useful verification theorem.
However, to clarify where the equations in the verification theorem come from, it is
helpful to first obtain the appropriate equations in a heuristic manner. Let us do this
now. We will disregard any form of technical precision until further notice.
Let τ be an admissible stopping rule. We define the cost-to-go J τ (x) as
\[
J^\tau(X_0) = E\!\left[\int_0^\tau e^{-\lambda s}\, w(X_s)\,ds + e^{-\lambda\tau}\, z(X_\tau)\;\Big|\;X_0\right].
\]
Note that J τ (x) is the cost of the stopping rule τ when X0 = x is non-random.
Now suppose, as we did in the corresponding discussion in chapter 6, that there
is a stopping rule $\tau^*$, with continuation region $D$, which minimizes $J^\tau(x)$ for every
$x$, and define the value function as the optimal cost-to-go $V(x) = J^{\tau^*}(x)$. We will
try to find an equation for V (x). To this end, let τ be any admissible stopping rule,
and define the new stopping rule τ 0 = inf{t ≥ τ : Xt 6∈ D}. Then τ 0 is the rule
under which we do not stop until the time τ , and continue optimally afterwards, and
by assumption J[τ ∗ ] ≤ J[τ 0 ]. But then it is not difficult to see that
\[
V(X_0) \le J^{\tau'}(X_0) = E\!\left[\int_0^\tau e^{-\lambda s}\, w(X_s)\,ds + e^{-\lambda\tau}\, V(X_\tau)\;\Big|\;X_0\right],
\]
where we have used the strong Markov property of Xt (see remark 5.2.3) and the tower
property of the conditional expectation. On the other hand, we obtain an equality in
this expression, rather than an inequality, if we choose τ ≤ τ ∗ (why?).
Now suppose that V (x) is sufficiently smooth to apply Itô’s rule. Then
\[
e^{-\lambda\tau} V(X_\tau) = V(X_0) + \int_0^\tau e^{-\lambda s}\{\mathscr L V(X_s) - \lambda V(X_s)\}\,ds + \int_0^\tau \cdots\,dW_s.
\]
Substituting this into the previous expression with $\tau \le \tau^*$ (for which the inequality becomes
an equality), and assuming that the stochastic integral has zero expectation, we find that
$E[\int_0^\tau e^{-\lambda s}\{\mathscr L V(X_s) - \lambda V(X_s) + w(X_s)\}\,ds\,|\,X_0] = 0$; as this should hold for every
such $\tau$, we expect that $\mathscr L V(x) - \lambda V(x) + w(x) = 0$ for $x \in D$.
Now consider the case that τ is arbitrary. Proceeding in the same way as above, we
obtain L V (x) − λV (x) + w(x) ≥ 0; on the other hand J[τ ∗ ] ≤ J[0] (we can do at
least as well as stopping immediately), so that in particular $V(x) \le z(x)$. Hence
\[
\min\{\mathscr L V(x) - \lambda V(x) + w(x),\; z(x) - V(x)\} = 0.
\]
This is not a PDE in the usual sense; it is called a variational inequality. Just like
in the optimal control case, where the Bellman equation reduces the optimization
problem to a pointwise minimization over all possible control actions, the variational
inequality reduces the optimal stopping problem to a pointwise minimization over
our two options: to continue, or to stop. It is important to note that if V (x) is a
unique solution to the variational inequality, then we can completely reconstruct the
continuation region D: it is simply D = {x : V (x) < z(x)}. Hence it suffices, as in
the optimal control setting, to solve a (nonlinear, variational inequality) PDE for the
value function, in order to be able to construct the optimal strategy.
Remark 8.1.2. There are much more elegant treatments of the optimal stopping the-
ory which can be made completely rigorous with some effort. One of these methods,
due to Snell, is closely related to the martingale dynamic programming principle in
optimal control (see remark 6.1.7) and works in a general setting. Another method,
due to Dynkin, characterizes the optimal cost of stopping problems for the case where
Xt is a Markov process. Both these methods are extremely fundamental to optimal
stopping theory and are well worth studying; see, e.g., [PS06].
A verification theorem
The previous discussion is only intended as motivation. We have made various en-
tirely unfounded assumptions, which you should immediately discard from this point
onward. Rather than making the story above rigorous, we proceed, as in chapter 6,
in the opposite direction: we will assume that we have found a sufficiently smooth
solution to the appropriate variational inequality, and prove a verification theorem that
guarantees that this solution does indeed give rise to an optimal stopping rule.
Proposition 8.1.3. Let K ⊂ Rn be a set such that Xt ∈ K for all t. Suppose there is
a function $V : K \to \mathbb R$, which is sufficiently smooth to apply Itô's rule, such that
\[
\min\{\mathscr L V(x) - \lambda V(x) + w(x),\; z(x) - V(x)\} = 0 \qquad\text{for all } x \in K,
\]
and $|E(V(X_0))| < \infty$. Define the set $D = \{x \in K : V(x) < z(x)\}$, and denote by
$K$ the class of admissible stopping rules $\tau$ such that $\tau < \infty$ a.s. and
\[
E\!\left[\sum_{i=1}^n\sum_{k=1}^m \int_0^\tau e^{-\lambda s}\,\frac{\partial V}{\partial x^i}(X_s)\,\sigma^{ik}(X_s)\,dW_s^k\right] = 0.
\]
Then, provided that $\tau^* = \inf\{t : X_t \notin D\}$ belongs to $K$, we have $J[\tau^*] = E(V(X_0)) \le J[\tau]$
for every $\tau \in K$.

Proof. Applying Itô's rule to $e^{-\lambda t}V(X_t)$ and using the assumption on the stochastic integral, we find
\[
E(V(X_0)) = E\!\left[\int_0^\tau e^{-\lambda s}\{\lambda V(X_s) - \mathscr L V(X_s)\}\,ds + e^{-\lambda\tau}V(X_\tau)\right]
\]
for $\tau \in K$. But the variational inequality implies $V(x) \le z(x)$ and $\lambda V(x) - \mathscr L V(x) \le w(x)$,
so we find that $E(V(X_0)) \le J[\tau]$. On the other hand, for $\tau = \tau^*$, these inequalities
become equalities, so we find that $J[\tau^*] = E(V(X_0)) \le J[\tau]$. This establishes the claim.
In this verification result, we required that V (x) “is sufficiently smooth to apply
Itô’s rule”. It would seem that we should just assume that V is in C 2 , as this is the
requirement for Itô’s rule. Unfortunately, hardly any optimal stopping problem gives
rise to a value function in C 2 . The problems occur on the boundary ∂D of D: often
V (x) is C 2 on K\∂D, but on ∂D it is only C 1 . We thus need to extend Itô’s rule to
this situation. There are various technical conditions under which this is possible; the
most elegant is the following result, which holds only in one dimension.
Proposition 8.1.5 (Relaxed Itô rule in one dimension). Suppose that V : R → R is
C 1 and admits a (not necessarily continuous) second derivative in the sense that there
exists a measurable function $\partial^2 V/\partial x^2$ such that
\[
\frac{\partial V}{\partial x}(x) - \frac{\partial V}{\partial x}(0) = \int_0^x \frac{\partial^2 V}{\partial x^2}(y)\,dy, \qquad x \in \mathbb R.
\]
Then Itô's rule still applies to $V(X_t)$, with this $\partial^2 V/\partial x^2$ playing the role of the usual second derivative.
For the proof of this statement, see [RW00b, lemma IV.45.9]. The proof is a little
too difficult for us; it requires, in essence, showing that $X_t$ does not spend any time at
the discontinuity points of ∂ 2 V /∂x2 (i.e., the amount of time spent at the discontinuity
points has Lebesgue measure zero). For generalizations to the multidimensional case,
see [Kry80, section 2.10] and particularly [PS93] (see also [Øks03, theorem 10.4.1]
and [Fri75, theorem 16.4.1]). For our purposes proposition 8.1.5 will suffice, as we
will restrict ourselves to one-dimensional examples for simplicity.
Remark 8.1.6 (The principle of smooth fit). The fact that V (x) is generally not C 2
is not surprising; on the other hand, the fact that V (x) should be C 1 is not at all obvi-
ous! Nonetheless, in many cases it can be shown that the gradient of V (x) does indeed
need to be continuous on the boundary ∂D; this is called the principle of smooth fit
(see, e.g., [PS06, DK03] for proofs). This turns out to be an extremely useful tool
for finding an appropriate solution of the variational inequality. In general, there are
many solutions to the variational inequality, each leading to a different continuation
set D; however, it is often the case that only one of these solutions is continuously
differentiable. Only this solution, then, satisfies the conditions of the verification the-
orem, and thus the principle of smooth fit has helped us find the correct solution to the
optimal stopping problem. We will shortly see this procedure in action.
Let us treat an interesting example (from [Øks03]).
Example 8.1.7 (Optimal resource extraction). We are operating a plant that extracts
natural gas from an underground well. The total amount of natural gas remaining
in the well at time t is denoted Rt (so the total amount of extracted natural gas is
R0 − Rt ). Moreover, the rate at which we can extract natural gas from the well is
proportional to the remaining amount: that is, when the plant is in operation, the
amount of natural gas in the well drops according to the equation
\[
\frac{d}{dt}R_t = -\lambda\, R_t,
\]
where λ > 0 is the proportionality constant. After the gas has been extracted, it is
sold on the market at the current market price $P_t$, which is given by the equation
\[
dP_t = \mu\, P_t\,dt + \sigma\, P_t\,dW_t,
\]
where $\mu > 0$. However, it costs money to operate the plant: in order to keep the plant
running we have to pay K dollars per unit time. The total amount of money made by
time t by extracting natural gas and selling it on the market is thus given by
\[
\int_0^t P_s\,d(R_0 - R_s) - Kt = \int_0^t (\lambda R_s P_s - K)\,ds.
\]
It seems inevitable that at some point in time it will no longer be profitable to keep the
plant in operation: we will not be able to extract the natural gas sufficiently rapidly to
be able to pay for the operating costs of the plant. We would like to determine when
would be the best time to call it quits, i.e., we would like to find a stopping time τ ∗
which maximizes the expected profit −J[τ ] up to time τ . We thus seek to minimize
\[
J[\tau] = E\!\left[\int_0^\tau (K - \lambda R_s P_s)\,ds\right].
\]
This is precisely an optimal stopping problem of the type we have been considering.
The problem can be simplified by noting that the cost functional depends only on
the quantity St = Rt Pt . Using Itô’s rule, we find that
\[
dS_t = (\mu-\lambda)\,S_t\,dt + \sigma\, S_t\,dW_t, \qquad
J[\tau] = E\!\left[\int_0^\tau (K - \lambda S_s)\,ds\right].
\]
As St ≥ 0 a.s. for all t, we can apply proposition 8.1.3 with K = [0, ∞[. The
variational inequality for this problem can be written as
\[
\min\left\{\frac{\sigma^2 x^2}{2}\,\frac{\partial^2 V(x)}{\partial x^2} + (\mu-\lambda)\,x\,\frac{\partial V(x)}{\partial x} + K - \lambda x,\; -V(x)\right\} = 0.
\]
Thus on Dc , we must have V (x) = 0 and L V (x) + K − λx = K − λx ≥ 0; in
particular, if x ∈ D c , then x ≤ K/λ, so we conclude that ]K/λ, ∞[ ⊂ D. Let us now
try to solve for V (x) on D. To this end, consider the PDE
\[
\frac{\sigma^2 x^2}{2}\,\frac{\partial^2 V(x)}{\partial x^2} + (\mu-\lambda)\,x\,\frac{\partial V(x)}{\partial x} + K - \lambda x = 0.
\]
Let us try a solution of the form
\[
V_c(x) = -\frac{K\log(x)}{\mu - \lambda - \sigma^2/2} + \frac{\lambda x}{\mu - \lambda} + c.
\]
If V (x) = Vc (x) on D, then it must be that Vc (x) < 0 on ]K/λ, ∞[; in particular,
this means that we must require that µ < λ. Intuitively this makes sense: if the price
of natural gas were to grow at a faster rate than the rate at which we deplete our well,
then it would always pay off to keep extracting more natural gas!
Let us thus assume that µ < λ, and we are seeking a solution of the form V (x) =
Vc (x) on D and V (x) = 0 on D c . To determine the appropriate c and D, we will try
to paste the solutions Vc (x) and 0 together in such a way that the result is C 1 —i.e.,
we are going to use the principle of smooth fit. To this end, note that the derivative of
V must vanish on the boundary of $D$ (as $V(x) = 0$ on $D^c$). But
\[
\frac{\partial V_c(x)}{\partial x} = -\frac{K}{(\mu - \lambda - \sigma^2/2)\,x} + \frac{\lambda}{\mu - \lambda} = 0
\quad\Longleftrightarrow\quad
x = x^* = \frac{K(\mu - \lambda)}{\lambda\,(\mu - \lambda - \sigma^2/2)},
\]
and the constant $c = c^*$ is then fixed by requiring continuity, i.e., $V_{c^*}(x^*) = 0$.
We have thus shown that the variational inequality is solved by the value function
V (x) = 0 for x ≤ x∗ , and V (x) = Vc∗ (x) for x > x∗ ; note that V (x) is C 1 on [0, ∞[
and $C^2$ on $[0,\infty[\,\setminus\{x^*\}$. Our candidate stopping rule is thus $\tau^* = \inf\{t : S_t \le x^*\}$.
To conclude that τ ∗ is indeed optimal, it remains to show that τ ∗ ∈ K. This is
indeed possible whenever µ < λ using a more refined version of the optional stopping
theorem than we have discussed; see [RW00a, theorems II.69.2 and II.77.5]. For sake
of simplicity, let us verify that τ ∗ ∈ K under the more restrictive assumption that
µ − λ + σ 2 /2 < 0. Recall that we must assume that E(V (S0 )) is finite, and we will
also assume without loss of generality that S0 ≥ x∗ a.s.
First we will establish that $E(\tau^*) < \infty$. To this end, note that
\[
\log(S_{t\wedge\tau^*}) = \log(S_0) + (\mu - \lambda - \sigma^2/2)\,(t\wedge\tau^*) + \sigma\, W_{t\wedge\tau^*}.
\]
As E(V (S0 )) is finite, E(log(S0 )) is finite also, and we find that E(log(St∧τ ∗ )) =
E(log(S0 )) + (µ − λ − σ 2 /2) E(t ∧ τ ∗ ). In particular, by monotone convergence,
\[
E(\tau^*) = \frac{E(\log(S_0)) - \lim_{t\to\infty} E(\log(S_{t\wedge\tau^*}))}{\sigma^2/2 + \lambda - \mu}.
\]
But $E(\log(S_{t\wedge\tau^*})) \ge \log x^*$, so $E(\tau^*) < \infty$. Next, we need to show that
\[
E\!\left[\int_0^{\tau^*} \frac{\partial V}{\partial x}(S_s)\,\sigma S_s\,dW_s\right]
= E\!\left[\int_0^{\tau^*} (C_1 S_s + C_2)\,dW_s\right] = 0,
\]
where C1 and C2 are the appropriate constants. The integral over C2 has zero expec-
tation by lemma 6.3.4. To deal with the integral over St , note that for m < n
\[
E\!\left[\left(\int_{m\wedge\tau^*}^{n\wedge\tau^*} S_s\,dW_s\right)^{\!2}\right]
= E\!\left[\int_{m\wedge\tau^*}^{n\wedge\tau^*} (S_s)^2\,ds\right]
\le E\!\left[\int_0^\infty (S_s)^2\,ds\right].
\]
But you can verify using Itô’s rule that the term on the right is finite whenever we have
µ − λ + σ 2 /2 < 0. Hence if this is the case, we find using dominated convergence
\[
\int_0^{n\wedge\tau^*} S_s\,dW_s \;\longrightarrow\; \int_0^{\tau^*} S_s\,dW_s \ \text{ in } L^2
\quad\Longrightarrow\quad
E\!\left[\int_0^{\tau^*} S_s\,dW_s\right] = 0
\]
(use that the integral is a Cauchy sequence in L2 ). This is what we set out to show.
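To get a feeling for the resulting policy, here is a small Monte Carlo sketch (Python, not from the notes). It evaluates the smooth-fit threshold $x^*$ discussed above and compares the expected profit of the rule $\tau^* = \inf\{t : S_t \le x^*\}$ with stopping immediately; the parameter values and the Euler discretization are arbitrary choices satisfying $\mu < \lambda$ and $\mu - \lambda + \sigma^2/2 < 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, lam, sigma, K = 0.05, 0.5, 0.3, 1.0     # arbitrary values with mu < lambda and mu - lambda + sigma^2/2 < 0
S0, dt, t_max, n_runs = 5.0, 1e-2, 50.0, 1000

# Smooth-fit threshold: x* = K (mu - lambda) / (lambda (mu - lambda - sigma^2/2))
x_star = K * (mu - lam) / (lam * (mu - lam - sigma**2 / 2))
print("stopping threshold x* =", x_star, " (compare K/lambda =", K / lam, ")")

profits = []
for _ in range(n_runs):
    S, profit, t = S0, 0.0, 0.0
    while S > x_star and t < t_max:
        # dS_t = (mu - lambda) S_t dt + sigma S_t dW_t, profit rate lambda S_t - K
        S += (mu - lam) * S * dt + sigma * S * np.sqrt(dt) * rng.normal()
        profit += (lam * S - K) * dt
        t += dt
    profits.append(profit)

print("mean profit of the threshold rule:", np.mean(profits), " (stopping immediately gives 0)")
```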
It is often useful to be able to introduce an additional constraint in the optimal
stopping problem; we would like to find the optimal stopping time prior to the time
when the system exits a predetermined set K. We will see an example of this below.
The corresponding extension of proposition 8.1.3 is immediate, and we omit the proof.
Proposition 8.1.8. Let $K \subset \mathbb R^n$ be a fixed open set, with closure $\bar K$ and boundary
$\partial K = \bar K\setminus K$, and assume that $X_0 \in K$ a.s. Suppose there is a function $V : \bar K \to \mathbb R$
which is sufficiently smooth to apply Itô's rule, such that $V(x) = z(x)$ on $\partial K$,
\[
\min\{\mathscr L V(x) - \lambda V(x) + w(x),\; z(x) - V(x)\} = 0 \qquad\text{for all } x \in K,
\]
and $|E(V(X_0))| < \infty$. Define $D = \{x \in K : V(x) < z(x)\}$, and let $K$ be the class
of admissible stopping rules $\tau$ with $\tau < \infty$ a.s., $\tau \le \tau_K = \inf\{t : X_t \notin K\}$, and
\[
E\!\left[\sum_{i=1}^n\sum_{k=1}^m \int_0^\tau e^{-\lambda s}\,\frac{\partial V}{\partial x^i}(X_s)\,\sigma^{ik}(X_s)\,dW_s^k\right] = 0.
\]
Then, provided that $\tau^* = \inf\{t : X_t \notin D\}$ belongs to this class, we have
$J[\tau^*] = E(V(X_0)) \le J[\tau]$ for every such $\tau$.
Example 8.1.9 (American put option). Suppose that we hold one unit of a stock whose price
$S_t$ is given by $dS_t = \mu S_t\,dt + \sigma S_t\,dW_t$.
At some point in the future we might want to sell our stock, but this is risky: by that
point the stock price may have tanked, in which case we would not be able to sell the
stock for a reasonable price on the stock market. To mitigate this risk, we may take
out a form of insurance on our stock: a put option. This is a contract which guarantees
that we will be able to sell our stock at some time in the future for a predetermined
price K. A European put option works precisely in this way: we fix T and K, and the
contract guarantees that we may sell our stock for the price K at time T . Hence the
payoff from such an option is (K − ST )+ (because if the stock price is larger than K,
we retain the right to sell our stock on the stock market instead).
European put options are not our only choice, however; there are options which
allow us more flexibility. In this example, we will investigate an American put option.
Like in the European case, we fix a price K and a terminal time T . In contrast with
the European option, however, an American put option can be exercised at any point
in time in the interval [0, T ]: that is, after purchasing the option at time zero, we may
decide at any stopping time τ ≤ T to sell our stock for the price K. If we choose
to exercise at time τ , then the payoff from the option is (K − Sτ )+ . The question
now becomes: when should we choose to exercise to maximize our payoff from the
option? This problem is naturally formulated as an optimal stopping problem.
In general, there is also a bank account involved which gives an interest rate r. It
is customary to try to maximize the discounted payoff, i.e., we will normalize all our
prices by the amount of money we could have made by investing our money in the
bank account rather than in the risky stocks. This gives rise to the following optimal
stopping problem: minimize the discounted cost
\[
J[\tau] = E\!\left[-e^{-r\tau}\,(K - S_\tau)^+\right]
\]
over stopping times $\tau \le T$.
(convince yourself that this is true!), and we can shift time by ∆ in the second term
without modifying the ∆ → 0 limit (we will shortly see why this is desirable). Sub-
stituting into the variational inequality and rearranging gives
\[
V_{\delta,\Delta}(t-\Delta, x) = \min\biggl\{ -(K-x)^+,\;
\Bigl(1 - \tfrac{\Delta\sigma^2 x^2}{\delta^2} - \tfrac{\Delta\mu x}{\delta} - \Delta r\Bigr)\, V_{\delta,\Delta}(t,x)
+ \Bigl(\tfrac{\Delta\sigma^2 x^2}{2\delta^2} + \tfrac{\Delta\mu x}{\delta}\Bigr)\, V_{\delta,\Delta}(t,x+\delta)
+ \tfrac{\Delta\sigma^2 x^2}{2\delta^2}\, V_{\delta,\Delta}(t,x-\delta) \biggr\}.
\]
Note that this is a backwards in time recursion for Vδ,∆ (t, x)! (It is for this reason
that we shifted the terminal cost term in the variational inequality by ∆: if we did not
do this, the right hand side would depend on Vδ,∆ (t − ∆, x)). It remains to specify
boundary conditions, but this follows directly from proposition 8.1.8: we should set
Vδ,∆ (t, x) = −(K − x)+ on the boundary of K 0 , i.e., whenever t = T or x = R.
We now claim that this discretized equation is itself the dynamic programming
equation for an optimal stopping problem for a discrete time Markov chain on a finite
state space, provided that ∆ is sufficiently small that 1 − ∆σ 2 M 2 − ∆µM − ∆r ≥ 0.
Proposition 8.1.11. Let xk , k = 0, . . . , N be a Markov chain on the state space
{nδ : n = 0, . . . , M } with the following transition probabilities for n < M :
\[
\begin{aligned}
P(x_k = n\delta \,|\, x_{k-1} = n\delta) &= \frac{1 - \Delta\sigma^2 n^2 - \Delta\mu n - \Delta r}{1 - \Delta r},\\
P(x_k = (n+1)\delta \,|\, x_{k-1} = n\delta) &= \frac{\Delta\sigma^2 n^2 + 2\Delta\mu n}{2 - 2\Delta r},\\
P(x_k = (n-1)\delta \,|\, x_{k-1} = n\delta) &= \frac{\Delta\sigma^2 n^2}{2 - 2\Delta r},
\end{aligned}
\]
and all other transition probabilities are zero. For the state n = M (so nδ = R), let
$P(x_k = R\,|\,x_{k-1} = R) = 1$ (so the boundary $R$ is an absorbing state). Moreover, let
\[
H[\tau] = E\!\left[-(1-\Delta r)^{\tau}\,(K - x_\tau)^+\right]
\]
for any stopping time $\tau \le N$ for the filtration generated by $x_k$ (so $\tau$ is a $\{0,\dots,N\}$-
valued random variable). Denote $D = \{(k, n\delta) : V_{\delta,\Delta}(k\Delta, n\delta) + (K - n\delta)^+ < 0\}$.
Then τ ∗ = inf{k : (k, xk ) 6∈ D} is an optimal stopping rule for the cost H[τ ] in the
sense that H[τ ∗ ] = E(Vδ,∆ (0, x0 )) ≤ H[τ ] for any stopping time τ ≤ N .
Figure 8.1. Numerical solution of example 8.1.9 with T = 1, K = 100, σ = .3, µ = r = .05,
and R = 200. In the left plot, the boundary ∂D of the continuation region D is plotted in
blue, while the contract price K is shown in green (the horizontal axis is time, the vertical axis
is price). A single sample path of the stock price, started at S0 = 100, is shown in red; the
optimal stopping rule says that we should stop when the stock price hits the curve ∂D. In the
right plot, the value function −V (t, x) is plotted in blue for t = 0 (the horizontal axis is stock
price, the vertical axis is payoff). For an initial stock price of 100 dollars, we see that the option
should be priced at approximately ten dollars. Note that the exercise boundary ∂D intersects
the line t = 0 precisely at the point where −V (0, x) and (K − x)+ (shown in green) diverge.
where we have used the equation for Vδ,∆ ((k − 1)∆, mδ). In particular, we find that
by the Markov property, where Fk = σ{x` : ` ≤ k}. Now note that Iτ ≥k is Fk−1 -measurable,
as τ is a stopping time. Multiplying by Iτ ≥k (1 − ∆r)k−1 and taking the expectation gives
where we have used the equation for Vδ,∆ (t, x) again. But repeating the same argument with
τ ∗ instead of τ , the inequalities are replaced by equalities and we find that E(Vδ,∆ (0, x0 )) =
H[τ ∗ ]. Thus τ ∗ is indeed an optimal stopping time for the discrete problem.
The numerical solution of the problem is shown in figure 8.1. Evidently the bound-
ary ∂D of the continuation region is a curve, and D is the area above the curve. The
optimal time to exercise the option is the time at which the stock price first hits ∂D
(provided that the inital stock price lies above the curve), and we can read off the
optimal cost (and hence the fair price of the option) from the value function V (0, x).
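For concreteness, the backward recursion for $V_{\delta,\Delta}$ is easy to implement; the following Python sketch (not part of the notes) does so for the parameter values of figure 8.1, with the grid sizes chosen to respect the stability condition $1 - \Delta\sigma^2 M^2 - \Delta\mu M - \Delta r \ge 0$. It prints the resulting approximation of the option price $-V_{\delta,\Delta}(0, S_0)$ for $S_0 = 100$, which can be compared with the right panel of figure 8.1.

```python
import numpy as np

# Parameter values as in figure 8.1
T, K, sigma, mu, r, R = 1.0, 100.0, 0.3, 0.05, 0.05, 200.0

M = 200                          # spatial grid: x = n*delta, n = 0..M, with M*delta = R
delta = R / M
n = np.arange(M + 1)
x = n * delta

# Choose Delta so that 1 - Delta*sigma^2*M^2 - Delta*mu*M - Delta*r >= 0
Delta = 0.9 / (sigma**2 * M**2 + mu * M + r)
steps = int(np.ceil(T / Delta))
Delta = T / steps

payoff = -np.maximum(K - x, 0.0)          # stopping/terminal cost -(K - x)^+

# Coefficients of the explicit backward recursion (they depend only on n = x/delta)
stay = 1.0 - Delta * sigma**2 * n**2 - Delta * mu * n - Delta * r
up = Delta * sigma**2 * n**2 / 2.0 + Delta * mu * n
down = Delta * sigma**2 * n**2 / 2.0

V = payoff.copy()                         # terminal condition V(T, x) = -(K - x)^+
for _ in range(steps):
    cont = stay * V
    cont[:-1] += up[:-1] * V[1:]          # contribution of V(t, x + delta)
    cont[1:] += down[1:] * V[:-1]         # contribution of V(t, x - delta)
    V = np.minimum(payoff, cont)          # min over "stop now" and "continue"
    V[-1] = payoff[-1]                    # boundary condition at x = R

idx = int(round(100.0 / delta))
print("American put price at S0 = 100:", -V[idx])
```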
8.2. Partial observations: the modification problem

We now turn to the partially observed counterpart of the optimal stopping problem: the
signal $X_t$ is not observed directly, but only through noisy observations $Y_t$. We consider
the same cost
\[
J[\tau] = E\!\left[\int_0^\tau e^{-\lambda s}\, w(X_s)\,ds + e^{-\lambda\tau}\, z(X_\tau)\right],
\]
and assume that $\tau$ is an $\mathcal F_t^Y$-stopping time. As before, we would like to use the tower
property of the conditional expectation to express this cost directly in terms of the
filter. Consider first the integral term. If w is nonnegative, for example (so we can
apply Fubini’s theorem—clearly this can be weakened), we obtain
\[
\begin{aligned}
E\!\left[\int_0^\tau e^{-\lambda s}\, w(X_s)\,ds\right]
&= E\!\left[\int_0^\infty I_{s<\tau}\,e^{-\lambda s}\, w(X_s)\,ds\right]\\
&= \int_0^\infty E(I_{s<\tau}\,e^{-\lambda s}\, w(X_s))\,ds
= \int_0^\infty E(I_{s<\tau}\,e^{-\lambda s}\, E(w(X_s)|\mathcal F_s^Y))\,ds\\
&= E\!\left[\int_0^\infty I_{s<\tau}\,e^{-\lambda s}\, E(w(X_s)|\mathcal F_s^Y)\,ds\right]
= E\!\left[\int_0^\tau e^{-\lambda s}\, E(w(X_s)|\mathcal F_s^Y)\,ds\right],
\end{aligned}
\]
where we have used that Is<τ is FsY -measurable. The second term is more difficult,
however. Define πt (f ) = E(f (Xt )|FtY ). Ultimately, we would like to write
\[
J[\tau] = E\!\left[\int_0^\tau e^{-\lambda s}\,\pi_s(w)\,ds + e^{-\lambda\tau}\,\pi_\tau(z)\right];
\]
if this is true, then the partially observed problem reduces to a completely observed
optimal stopping problem for the filter. However, it is not at all obvious that
E(e−λτ z(Xτ )) = E(e−λτ πτ (z)) = E(e−λτ E(z(Xt )|FtY )|t=τ ).
In fact, at this point, this expression is neither true nor false—it is meaningless!
To understand this point, let us revisit the definition of the conditional expectation.
Recall that for any random variable X ∈ L1 and σ-algebra F, the conditional expec-
tation E(X|F) is defined uniquely up to a.s. equivalence. In particular, there may
well be two different random variables A and B, both of which satisfy the definition
of E(X|F); however, in this case, we are guaranteed that A = B a.s.
Now consider the stochastic process t 7→ πt (f ). For every time t separately, πt (f )
is defined uniquely up to a.s. equivalence. But this means that the process π t (f ) is only
defined uniquely up to modification; in particular, two processes A t and Bt may both
satisfy the definition of πt (f ), but nonetheless P(At = Bt ∀ t) < 1 (after all, we are
only guaranteed that P(At = Bt ) = 1 for all t ∈ [0, ∞[, and [0, ∞[ is an uncountable
set). In this case, there may well be a stopping time τ such that Aτ 6= Bτ : modi-
fication need not preserve the value of a process at stopping times. See section 2.4
for a discussion on this point. Unfortunately, this means that E(z(X t )|FtY )|t=τ is a
meaningless quantity; it may take very different values, even with nonzero probability,
depending on how we choose to define our conditional expectations.
Does this mean that all is lost? Not in the least; it only means that we need
to do a little more work in defining the process πt (f ). As part of the definition of
that process, we will select a particular modification which has the following special
property: πτ (f ) = E(f (Xτ )|FτY ) for all FtY -stopping times τ (with τ < ∞ a.s.).
The process πt (f ) is then no longer “just” the conditional expectation process; this
particular modification of the conditional expectation process is known as the optional
projection of the process f (Xt ) onto the filtration FtY . Provided that we work with
this particular modification, we can complete the separation argument. After all,
\[
E(e^{-\lambda\tau} z(X_\tau)) = E\bigl(E(e^{-\lambda\tau} z(X_\tau)\,|\,\mathcal F_\tau^Y)\bigr)
= E\bigl(e^{-\lambda\tau}\, E(z(X_\tau)\,|\,\mathcal F_\tau^Y)\bigr)
= E(e^{-\lambda\tau}\,\pi_\tau(z)),
\]
where we have used that τ is FτY -measurable. Hence the problem is now finally
reduced to an optimal stopping problem for the filter—that is, if the filter does indeed
compute the optional projection πt (z) (which, as we will see, is the case).
Remark 8.2.1. A general theory of optional projections is developed in detail in Del-
lacherie and Meyer [DM82, section VI.2] (a brief outline can be found in [RW00b,
section VI.7]). We will have no need for this general theory, however; instead, we will
follow a simple argument due to Rao [Rao72], which provides everything we need.
Let us begin by recalling the definition of the σ-algebra FτY of events up to and in-
cluding time τ . We have encountered this definition previously: see definition 2.3.16.
Definition 8.2.2. We define FτY = {A : A ∩ {τ ≤ t} ∈ FtY for all t ≤ ∞} for any
FtY -stopping time τ . Then, in particular, τ is FτY -measurable.
To demonstrate where we want to be going, consider our usual observation model
In the previous chapter, we found filtering equations for πt (f ) = E(f (Xt )|FtY ) for
several different signal models; in all these filters, πt (f ) was expressed as the sum of
E(f (X0 )), a time integral, and a stochastic integral with respect to the observations.
But recall that we have defined both the time integral and the stochastic integrals
to have continuous sample paths; thus the πt (f ) obtained by solving the filtering
equations of the previous chapter is not just any version of the conditional expectation:
it is the unique modification of the conditional expectation process that has continuous
sample paths (uniqueness follows from lemma 2.4.6). We are going to show that it is
precisely this modification that is the optional projection.
Proposition 8.2.3. Let Xt be a process with right-continuous sample paths, and let
f be a bounded continuous function. Suppose there is a stochastic process π t (f )
with continuous sample paths such that πt (f ) = E(f (Xt )|FtY ) for every t. Then
πτ (f ) = E(f (Xτ )|FτY ) for all FtY -stopping times τ < ∞.
Remark 8.2.4. The “suppose” part of this result is superfluous: it can be shown that
a continuous modification of E(f (Xt )|FtY ) always exists in this setting [Rao72]. We
will not need to prove this, however, as we have already explicitly found a continuous
modification of the conditional expectation process, viz. the one given by the filtering
equations, in all cases in which we are interested.
To prove this statement, we will begin by proving it for the special case that τ
only takes a countable number of values. In this case, the result is independent of
modification: after all, the problem essentially reduces to discrete time, where two
modifications are always indistinguishable (see section 2.4).
Lemma 8.2.5. Proposition 8.2.3 holds for any modification of πt (f ) whenever τ takes
values only in a countable set of times {t1 , t2 , . . .}.
Proof. We need to show that E(πτ (f ) IB ) = E(f (Xτ ) IB ) for every B ∈ FτY , and that πτ (f )
is $\mathcal F_\tau^Y$-measurable. This establishes the claim by the Kolmogorov definition of the conditional
expectation. We begin by demonstrating the first claim. Note that $B = \bigcup_{i\ge1} B\cap\{\tau = t_i\}$, so
\[
E(\pi_\tau(f)\,I_B) = \sum_{i=1}^\infty E(\pi_\tau(f)\,I_{B\cap\{\tau=t_i\}}), \qquad
E(f(X_\tau)\,I_B) = \sum_{i=1}^\infty E(f(X_\tau)\,I_{B\cap\{\tau=t_i\}}).
\]
Hence it suffices to prove E(πτ (f ) IB∩{τ =ti } ) = E(f (Xτ ) IB∩{τ =ti } ) for every B ∈ FτY
and i ≥ 1. But by definition, Bi = B ∩ {τ = ti } ∈ FtYi , so we do indeed find that
E(πτ (f ) IBi ) = E(πti (f ) IBi ) = E(f (Xti ) IBi ) = E(f (Xτ ) IBi ).
To establish the second claim, note that $\{\pi_\tau(f)\in A\}\cap\{\tau=t_j\} = \{\pi_{t_j}(f)\in A\}\cap\{\tau=t_j\}$
for every Borel set $A$; hence $\{\pi_\tau(f)\in A\}\cap\{\tau=t_j\}\in\mathcal F_{t_j}^Y\subset\mathcal F_{t_i}^Y$ for every $j\le i$, so it
follows easily that {πτ (f ) ∈ A} ∈ FτY (take the union over j ≤ i). We are done.
To prove proposition 8.2.3 in its full glory, we can now proceed as follows. Even
though τ does not take a countable number of values, we can always approximate
it by a sequence of stopping times τn such that every τn takes a countable number
of values and τn & τ . We can now take limits in the previous lemma, and this is
precisely where the various continuity assumptions will come in. Before we complete
the proof, we need an additional lemma which helps us take the appropriate limits.
\[
E((F_m - F_n)^2) = E(F_m^2) + E(F_n^2) - 2\,E(F_n F_m)
= E(F_m^2) + E(F_n^2) - 2\,E(F_n\, E(F_m|\mathcal F_n)) = E(F_m^2) - E(F_n^2).
\]
But then we find, in particular, that
\[
\sum_{k=m+1}^{n} E((F_{k-1} - F_k)^2) = E(F_m^2) - E(F_n^2) \le 2\,E(X_\infty^2) < \infty,
\]
We now have to show that E(f (Xτ )|G) = E(f (Xτ )|FτY ) a.s. Clearly FτY ⊂ G (as τ < τn for
all n), so it suffices to show that E(E(f (Xτ )|G)|FτY ) = E(f (Xτ )|G) a.s. But we know that
E(f (Xτ )|G) = πτ (f ) a.s. Hence we are done if we can show that πτ (f ) is FτY -measurable.
To see that this is the case, define the stopping times σn = τn − 2−n . Then σn ≤ τ , and
σn → τ as n → ∞. But πσn (f ) is FσYn -measurable (see the proof of lemma 8.2.5), so it is FτY -
measurable for every n (as σn ≤ τ implies FσYn ⊂ FτY ). But then πτ (f ) = limn→∞ πσn (f )
(by the continuity of the sample paths) must be FτY -measurable.
8.3. Changepoint detection
where we have used the result of the previous section (note that I τ ≤t does not have
continuous sample paths, but it does have right-continuous sample paths). We can
thus apply proposition 8.1.3 with K = [0, 1], and the variational inequality reads
\[
\min\left\{\frac{\gamma^2 x^2(1-x)^2}{2\sigma^2}\,\frac{\partial^2 V(x)}{\partial x^2}
+ \lambda(1-x)\,\frac{\partial V(x)}{\partial x} + cx,\; 1 - x - V(x)\right\} = 0.
\]
Perhaps remarkably, this problem has a (somewhat) explicit solution.
To begin, recall that once we have obtained a suitable solution V (x) to this problem,
the interval [0, 1] is divided into the continuation region D = {x : V (x) < 1 − x}
and the stopping region D c = {x : V (x) = 1 − x}. On the former, we must have
L V (x)+cx = 0, while on the latter we must have L V (x)+cx ≥ 0. In particular, we
can use the latter requirement to find a necessary condition on the set D: substituting
V (x) = 1 − x into the inequality, we find that it must be the case that x ≥ λ/(c + λ)
for any x ∈ Dc . In particular, this implies that [0, λ/(c + λ)[ ⊂ D.
Let us now try to solve for $V(x)$ on $D$. Note that $\mathscr L V(x) + cx = 0$ gives
\[
\frac{\partial U(x)}{\partial x} = -\frac{2\sigma^2}{\gamma^2}\left[\frac{\lambda}{x^2(1-x)}\,U(x) + \frac{c}{x(1-x)^2}\right]
\quad (x > 0), \qquad U(0) = 0,
\]
where $U(x) = \partial V(x)/\partial x$.
Here is another very useful property of U (x), which is not immediately obvious.
1 We have to require this to apply proposition 8.1.3, as we know that 0 ∈ D and V (x) must be C 1 .
\[
\frac{\partial U(x)}{\partial x} < 0 \quad\Longleftrightarrow\quad -U(x) < \frac{c}{\lambda}\,\frac{x}{1-x}.
\]
We would like to show that this is in fact the case. Using the expression for U (x), this becomes
\[
\frac{2\sigma^2\lambda}{\gamma^2}\,
e^{-\frac{2\sigma^2\lambda}{\gamma^2}\left(\log\left(\frac{x}{1-x}\right)-\frac1x\right)}
\int_0^x \frac{e^{\frac{2\sigma^2\lambda}{\gamma^2}\left(\log\left(\frac{y}{1-y}\right)-\frac1y\right)}}{y(1-y)^2}\,dy
\;<\; \frac{x}{1-x}.
\]
The trick is to note that we have the identity
\[
\frac{d}{dy}\, e^{\frac{2\sigma^2\lambda}{\gamma^2}\left(\log\left(\frac{y}{1-y}\right)-\frac1y\right)}
= \frac{2\sigma^2\lambda}{\gamma^2}\,
\frac{e^{\frac{2\sigma^2\lambda}{\gamma^2}\left(\log\left(\frac{y}{1-y}\right)-\frac1y\right)}}{y^2(1-y)},
\]
so that it evidently remains to prove that
\[
e^{-\frac{2\sigma^2\lambda}{\gamma^2}\left(\log\left(\frac{x}{1-x}\right)-\frac1x\right)}
\int_0^x \frac{y}{1-y}\,\frac{d}{dy}\, e^{\frac{2\sigma^2\lambda}{\gamma^2}\left(\log\left(\frac{y}{1-y}\right)-\frac1y\right)}\,dy
\;<\; \frac{x}{1-x}.
\]
But this is clearly true for all 0 < x < 1, and thus the proof is complete.
We can now finally complete the solution of the optimal stopping problem. We
need to use the principle of smooth fit to determine D. Note that for x on the boundary
of D, we must have U (x) = −1 in order to make V (x) be C 1 . But the previous
lemmas demonstrate that there is a unique point π ∗ ∈ ]0, 1[ such that U (π ∗ ) = −1:
there is at least one such point (as U (0) ≥ −1 ≥ U (1)), and uniqueness follows
as U (x) is decreasing. We have thus established, in particular, that the continuation
region must be of the form D = [0, π ∗ [, and the remainder of the argument is routine.
Theorem 8.3.4 (Changepoint detection). Let π ∗ ∈ ]0, 1[ be the unique point such
that U (π ∗ ) = −1, and define the concave function V (x) as
\[
V(x) = \begin{cases}
1 - \pi^* + \int_{\pi^*}^x U(y)\,dy & \text{for } x \in [0,\pi^*[,\\[1mm]
1 - x & \text{for } x \in [\pi^*,1].
\end{cases}
\]
Then the stopping rule $\vartheta^* = \inf\{t : \pi_t \notin [0,\pi^*[\}$ minimizes the cost $J[\vartheta]$.
definition of U (x), while that V (x) < 1 − x on [0, π ∗ ] follows from the concavity of V (x).
Hence we have established that the variational inequality is satisfied on [0, 1].
We now invoke proposition 8.1.3 with K = [0, 1]. Clearly V (x) is sufficiently smooth,
and as V (x) and both its derivatives are bounded, it remains by lemma 6.3.4 to show that
$E(\vartheta^*) < \infty$. To this end, define $\alpha_t = -\log(1-\pi_t)$. Then Itô's rule gives
\[
d\alpha_t = \frac{\gamma}{\sigma}\,\pi_t\,d\bar B_t + \lambda\,dt + \frac{\gamma^2}{2\sigma^2}\,\pi_t^2\,dt.
\]
Hence we obtain, in particular,
\[
E(\alpha_{t\wedge\vartheta^*}) \ge \alpha_0 + \lambda\, E(t\wedge\vartheta^*).
\]
But $\pi_{t\wedge\vartheta^*} \le \pi^*$, so $\alpha_{t\wedge\vartheta^*} \le -\log(1-\pi^*) < \infty$; hence $E(t\wedge\vartheta^*)$ is bounded uniformly in $t$,
and letting $t\to\infty$ gives $E(\vartheta^*) < \infty$ by monotone convergence.
We now have a complete solution to the basic changepoint detection problem for
our model. The rest of this section discusses some variations on this theme.
Example 8.3.5 (Variational formulation). We have discussed what is known as the
“Bayesian” form of the changepoint detection problem: we have quantified the trade-
off between false alarm probability and expected delay by minimizing a weighted sum
of the two. Sometimes, however, a “variational” form of the problem is more appro-
priate. The latter asks the following: in the class of FtY -stopping rules ϑ with a false
alarm probability of at most α ∈ ]0, 1[, what is the stopping rule that minimizes the
expected delay? With the solution to the Bayesian problem in hand, we can now solve
the variational problem using a method similar to the one used in example 6.4.7.
Corollary 8.3.6. Let α ∈ ]0, 1[. Then amongst those FtY -stopping times ϑ such that
P(ϑ < τ ) ≤ α, the expected delay is minimized by ϑ∗ = inf{t : πt 6∈ [0, 1 − α[}.
Proof. First, we claim that we can choose the Bayesian cost J[ϑ] in such a way that π ∗ = 1−α
in the previous theorem, i.e., there exists a constant cα > 0 such that U (1 − α) = −1 for
c = cα . This is trivially seen, however, as U (x) is directly proportional to c. Denote by Jα [ϑ]
the cost with c = cα ; evidently ϑ∗ = inf{t : πt 6∈ [0, 1 − α[} minimizes Jα [ϑ].
Next, we claim that P(ϑ∗ < τ ) = α whenever π0 < 1 − α. To see this, note that using
proposition 8.2.3, we can write P(ϑ∗ < τ ) = 1 − E(P(τ ≤ ϑ∗ |FϑY∗ )) = 1 − E(πϑ∗ ) = α.
Whenever π0 ≥ 1 − α, it must be the case that ϑ∗ = 0, so we find P(ϑ∗ < τ ) = 1 − π0 .
For the latter case, however, the result holds trivially. To see this, note that ϑ∗ = 0 satisfies
P(ϑ < τ ) ≤ α, while the expected delay is zero for this case. As a smaller delay is impossible,
ϑ∗ = 0 is indeed optimal in the class of stopping rules in which we are interested.
It thus remains to consider the case when π0 < 1 − α. To this end, let ϑ be an arbitrary
$\mathcal F_t^Y$-stopping time with $P(\vartheta < \tau) \le \alpha$. Then $J_\alpha[\vartheta^*] \le J_\alpha[\vartheta]$ gives
\[
\alpha + c_\alpha\, E((\vartheta^*-\tau)^+) \le P(\vartheta < \tau) + c_\alpha\, E((\vartheta-\tau)^+) \le \alpha + c_\alpha\, E((\vartheta-\tau)^+).
\]
Thus E((ϑ∗ −τ )+ ) ≤ E((ϑ−τ )+ ), so ϑ cannot have a smaller delay than ϑ∗ . We are done.
Note that the variational problem is, in some sense, much more intuitive than its
Bayesian counterpart: given that we can tolerate a fixed probability of false alarm α,
it is best not to stop until the conditional probability of being in error drops below α.
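The following Monte Carlo sketch (Python, not from the notes) illustrates corollary 8.3.6: it simulates the observations and the filter, stops when $\pi_t$ first reaches $1-\alpha$, and estimates the false alarm probability and the expected delay. The filter recursion used below is the Shiryaev–Wonham equation $d\pi_t = \lambda(1-\pi_t)\,dt + (\gamma/\sigma)\,\pi_t(1-\pi_t)\,d\bar B_t$ with innovation $d\bar B_t = (dY_t - \gamma\pi_t\,dt)/\sigma$, which is taken as given from the filtering chapter (it is consistent with the expression for $d\alpha_t$ in the proof above); all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, sigma, lam, pi0 = 1.0, 1.0, 1.0, 0.0    # arbitrary parameter choices
alpha = 0.25                                   # tolerated false alarm probability
pi_star = 1.0 - alpha
dt, t_max, n_runs = 2e-3, 50.0, 500

false_alarms, delays = 0, []
for _ in range(n_runs):
    # changepoint: tau = 0 with probability pi0, otherwise exponential with rate lambda
    tau = 0.0 if rng.random() < pi0 else rng.exponential(1.0 / lam)
    pi, t = pi0, 0.0
    while pi < pi_star and t < t_max:
        dY = gamma * (t >= tau) * dt + sigma * np.sqrt(dt) * rng.normal()
        # Shiryaev-Wonham filter step (assumed form, see lead-in)
        pi += lam * (1 - pi) * dt + gamma / sigma**2 * pi * (1 - pi) * (dY - gamma * pi * dt)
        pi = min(max(pi, 0.0), 1.0)            # numerical safeguard for the Euler step
        t += dt
    if t < tau:
        false_alarms += 1
    delays.append(max(t - tau, 0.0))

print("false alarm probability:", false_alarms / n_runs, " (target alpha =", alpha, ")")
print("expected delay:", np.mean(delays))
```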
Example 8.3.7 (Expected miss criterion). The cost J[ϑ] is quite general, and con-
tains seemingly quite different cost functionals as special cases. To demonstrate this
point, let us show how to obtain a stopping rule ϑ∗ that minimizes the expected miss
criterion J 0 [ϑ] = E(|ϑ − τ |). In the absence of explicit false alarm/expected delay
preferences, this is arguably the most natural cost functional to investigate!
The solution of this problem is immediate if we can rewrite the cost J 0 [ϑ] in terms
of J[ϑ] (for some suitable c). This is a matter of some clever manipulations, combined
with explicit use of the Shiryaev-Wonham filter. We begin by noting that
\[
J'[\vartheta] = E(\tau + \vartheta - 2\,\tau\wedge\vartheta)
= \frac{1-\pi_0}{\lambda} + E\!\left[\int_0^\vartheta \{1 - 2\, I_{s<\tau}\}\,ds\right],
\]
where we have used $E(\tau) = (1-\pi_0)/\lambda$. Using the tower property, we obtain
\[
J'[\vartheta] = \frac{1-\pi_0}{\lambda} + E\!\left[\int_0^\vartheta \{2\, I_{\tau\le s} - 1\}\,ds\right]
= \frac{1-\pi_0}{\lambda} + E\!\left[\int_0^\vartheta \{\pi_s - (1-\pi_s)\}\,ds\right].
\]
In particular, if we restrict to stopping times ϑ with E(ϑ) < ∞ (this is without loss of
generality, as it is easy to see that $J'[\vartheta] = \infty$ if $E(\vartheta) = \infty$), then lemma 6.3.4 gives
\[
\lambda\, J'[\vartheta] = 1 - \pi_0 + E\!\left[\int_0^\vartheta \lambda\pi_s\,ds + \pi_0 - \pi_\vartheta\right] = J_\lambda[\vartheta],
\]
where Jλ [ϑ] is our usual cost J[ϑ] with c = λ. Evidently J 0 [ϑ] and Jλ [ϑ] differ only
by a constant factor, and hence their minima are the same. It thus remains to invoke
theorem 8.3.4 with c = λ, and we find that ϑ∗ = inf{t : πt 6∈ [0, π ∗ [} for suitable π ∗ .
Let us complete this section with an interesting application from [RH06].
Example 8.3.8 (Optimal stock selling). The problem which we wish to consider is
how to best make money off a “bubble stock”. Suppose we own a certain amount of
stock in a company that is doing well—the stock price increases on average. However,
at some random point in time τ the company gets into trouble (the “bubble bursts”),
and the stock price starts falling rapidly from that point onward. Concretely, you can
imagine a situation similar to the dot-com bubble burst in early 2000.
It seems evident that we should sell our stock before the price has dropped too far,
otherwise we will lose a lot of money. However, all we can see is the stock price: if
the stock price starts dropping, we are not sure whether it is because the bubble burst
or whether it is just a local fluctuation in the market (in which case the stock price will
go up again very shortly). The problem is thus to try to determine, based only on the
observed stock prices, when is the best time to sell our stock.
Let us introduce a simple model in which we can study this problem. The total
amount of money we own in stock is denoted St , and satisfies
dSt = µt St dt + σSt dWt .
Prior to the burst time τ , the stock is making money: we set µt = a > 0 for t < τ .
After the burst, the stock loses money: we set µt = −b < 0 for τ ≤ t. In particular,
dSt = (a It<τ − b Iτ ≤t )St dt + σSt dWt .
Denote by FtS = σ{Ss : s ≤ t} the filtration generated by the stock price, and we
choose τ to be an exponentially distributed random variable, i.e., P(τ = 0) = π 0 and
P(τ > t|τ > 0) = e−λt , which is independent of the Wiener process Wt . Our goal
is to maximize the expected utility E(u(Sϑ )) from selling at time ϑ, i.e., we seek to
minimize the cost J 0 [ϑ] = E(−u(Sϑ )) in the class of FtS -stopping rules (see example
6.2.4 for a discussion of utility). For simplicity we will concentrate here on the Kelly
criterion u(x) = log(x) (the risk-neutral case can be treated as well, see [RH06]).
We begin by rewriting the cost in a more convenient form. We will restrict our-
selves throughout to stopping times with E(ϑ) < ∞ (and we seek an optimal stopping
rule in this class), so that we can apply lemma 6.3.4. Using Itô's rule, we then obtain
\[
J'[\vartheta] = E\!\left[\int_0^\vartheta \left\{\frac{\sigma^2}{2} - a\, I_{s<\tau} + b\, I_{\tau\le s}\right\} ds\right] - \log(S_0),
\]
[Figure: the value function $V(x)$ with the threshold $\pi^*$ and the line $1-x$ (left); a sample path of the stock price $S_t$ with the burst time $\tau$ and the selling time $\vartheta^*$, and the corresponding filter $\pi_t$ with the threshold $\pi^*$ (right).]
Things are starting to look up—this is almost the changepoint detection problem!
To proceed, we need to distinguish between two cases. The first case is when
$2a \le \sigma^2$. In this case, the problem becomes essentially trivial; indeed, you can read
off from the expression for $J'[\vartheta]$ above that the optimal stopping rule for this case tries
to simultaneously minimize the expected delay, and maximize the probability of false
alarm. This is easily accomplished by setting ϑ∗ = 0, and this is indeed the optimal
stopping rule when 2a ≤ σ 2 . It thus remains to consider the nontrivial case 2a > σ 2 .
Define the constants q = (2a − σ 2 )/2λ and c = (σ 2 + 2b)/2q, and note in
particular that $q, c > 0$ when $2a > \sigma^2$ (which we now assume). Then we can write
\[
q^{-1} J'[\vartheta] = E\!\left[1 - \pi_\vartheta + \int_0^\vartheta c\,\pi_s\,ds\right] - q^{-1}\log(S_0) + \pi_0 - 1,
\]
where $\pi_t = P(\tau \le t\,|\,\mathcal F_t^S)$ is the conditional probability that the bubble has burst.
Up to constants that do not depend on $\vartheta$, this is precisely the changepoint detection cost
with weight $c$, so theorem 8.3.4 yields the optimal selling rule $\vartheta^* = \inf\{t : \pi_t \notin [0,\pi^*[\}$
for the corresponding threshold $\pi^*$.
8.4. Hypothesis testing

Suppose that we observe the process
\[
dY_t = \gamma\, X\,dt + \sigma\,dB_t,
\]
where X (the bit) is either zero or one. We would like to determine the value of X on
the basis of the observations, i.e., we would like to accept one of the two hypotheses
X = 0 or X = 1, and we would like to do this in such a way that the probabilities of
selecting the wrong hypothesis (we accept the hypothesis X = 0 when in fact X = 1,
and vice versa) are as small as possible. On a fixed time horizon [0, T ], it is well
known how to do this: the Neyman-Pearson test characterizes this case completely.
The problem becomes more interesting, however, when we do not fix the obser-
vation interval [0, T ], but allow ourselves to decide when we have collected enough
information from the observations to accept one of the hypotheses with sufficient con-
fidence. A decision rule in this problem consists of two quantities: an F tY -stopping
time τ (the decision time) and a {0, 1}-valued FτY -measurable random variable H
(the accepted hypothesis, i.e., H = 1 means we think that X = 1, etc.) We are now
faced with the competing goals of minimizing the following quantities: the probabil-
ity that H = 1 when in fact X = 0; the probability that H = 0 when in fact X = 1;
and the observation time τ required to determine our accepted hypothesis H. The
question is, of course, how to choose (τ, H) to achieve these goals.
Remark 8.4.1. A simple-minded application might help clarify the idea. We wish to
send a binary message through a noisy channel using the following communication
scheme. At any point in time, we transmit the current bit “telegraph style”, i.e., the
receiver observes yt = γ X +σ ξt , where ξt is white noise and X is the value of the bit
(zero or one). One way of transmitting a message is to allocate a fixed time interval
of length ∆ for every bit: i.e., we send the first bit during t ∈ [0, ∆[, the second
bit during [∆, 2∆[, etc. The Neyman-Pearson test then provides the optimal way for
the receiver to determine the value of each bit in the message, and the probability of
error depends purely on ∆. If we thus have an upper bound on the acceptable error
probability, we need to choose ∆ sufficiently large to attain this bound.
Now suppose, however, that we allow the receiver to signal back to the transmitter
when he wishes to start receiving the next bit (e.g., by sending a pulse on a feedback
channel). Given a fixed upper bound on the acceptable error probability, we should
now be able to decrease significantly the total amount of time necessary to transmit
the message. After all, for some realizations of the noise the observations may be rela-
tively unambiguous, while for other realizations it might be very difficult to tell which
bit was transmitted. By adapting the transmission time of every bit to the random
fluctuations of the noise, we can try to optimize the transmission time of the mes-
sage while retaining the upper bound on the probability of error. This is a sequential
hypothesis testing problem of the type considered in this section.
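To make the comparison in this remark concrete, here is a small Monte Carlo sketch (Python, not from the notes). It computes the posterior $\pi_t = P(X=1|\mathcal F_t^Y)$ through the likelihood ratio $\Lambda_t = \exp(\gamma Y_t/\sigma^2 - \gamma^2 t/(2\sigma^2))$ and compares a fixed-length test with a sequential test that stops as soon as $\pi_t$ leaves an interval $[p_0,p_1]$; the thresholds $p_0, p_1$ (and the prior) are ad hoc choices, not the optimal thresholds $\pi^0, \pi^1$ constructed later in this section.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, sigma, prior = 1.0, 0.4, 0.5    # gamma, sigma as in figure 8.3; prior P(X = 1) is arbitrary
p0, p1 = 0.05, 0.95                    # ad hoc decision thresholds for the sequential test
t_fixed, dt, n_runs = 1.0, 2e-3, 1000

def posterior(Y, t):
    # pi_t = prior * Lambda_t / (prior * Lambda_t + 1 - prior)
    L = np.exp(gamma * Y / sigma**2 - gamma**2 * t / (2 * sigma**2))
    return prior * L / (prior * L + 1 - prior)

err_fixed = err_seq = 0
times_seq = []
for _ in range(n_runs):
    X = 1 if rng.random() < prior else 0
    Y, t, pi = 0.0, 0.0, prior
    decided, pi_fixed = False, None
    while ((not decided) or (pi_fixed is None)) and t < 20.0:
        Y += gamma * X * dt + sigma * np.sqrt(dt) * rng.normal()
        t += dt
        pi = posterior(Y, t)
        if not decided and (pi <= p0 or pi >= p1):
            err_seq += int((1 if pi >= p1 else 0) != X)   # sequential decision
            times_seq.append(t)
            decided = True
        if pi_fixed is None and t >= t_fixed:
            pi_fixed = pi                                  # posterior at the fixed horizon
    if not decided:                                        # safety cutoff (rarely triggered)
        err_seq += int((1 if pi >= 0.5 else 0) != X)
        times_seq.append(t)
    if pi_fixed is None:
        pi_fixed = pi
    err_fixed += int((1 if pi_fixed >= 0.5 else 0) != X)

print(f"fixed-length test (T = {t_fixed}): error rate {err_fixed / n_runs:.3f}")
print(f"sequential test: error rate {err_seq / n_runs:.3f}, mean decision time {np.mean(times_seq):.3f}")
```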
Bayesian problem
As in the changepoint detection problem, we will begin by solving the “Bayesian”
problem and deduce the variational form of the problem at the end of the section.
To define the Bayesian problem, we suppose that X is in fact a random variable,
independent of Bt , such that P(X = 1) = π0 (where π0 ∈ ]0, 1[, otherwise the
problem is trivial!). For any decision rule (τ, H), we can then introduce the cost
\[
\tilde J[\tau, H] = E(\tau) + a\, P(X = 1 \text{ and } H = 0) + b\, P(X = 0 \text{ and } H = 1),
\]
where a > 0 and b > 0 are constants that determine the tradeoff between the two
types of error and the length of the observation interval. The goal of the Bayesian
problem is to select a decision rule (τ∗, H∗) that minimizes J̃[τ, H].
Remark 8.4.2. Nothing is lost by assuming that E(τ ) < ∞, as otherwise the cost is
infinite. We will thus always make this assumption throughout this section.
To convert this problem into an optimal stopping problem, our first goal is to
eliminate H from the problem. For any fixed stopping rule τ , it is not difficult to
find the hypothesis Hτ∗ that minimizes H ↦ J̃[τ, H]. If we substitute this optimal
hypothesis into the cost above, the problem reduces to a minimization of the cost
functional J[τ] = J̃[τ, Hτ∗] over τ only. Let us work out the details.
Lemma 8.4.3. Denote by πt the stochastic process with continuous sample paths such
that πt = P(X = 1|FtY ) for every t. Then for any fixed FtY -stopping time τ with
E(τ) < ∞, the cost J̃[τ, H] is minimized by accepting the hypothesis
\[
H_\tau^* = \begin{cases} 1 & \text{if } a\pi_\tau \ge b(1 - \pi_\tau), \\ 0 & \text{if } a\pi_\tau < b(1 - \pi_\tau). \end{cases}
\]
Moreover, the optimal cost is given by
\[
J[\tau] = \tilde{J}[\tau, H_\tau^*] = \mathrm{E}\big(\tau + a\pi_\tau \wedge b(1 - \pi_\tau)\big).
\]
Proof. As τ is fixed, it suffices to find an FτY-measurable H that minimizes E(a I{X=1} I{H=0} + b I{X=0} I{H=1}). But using the tower property of the conditional expectation and the optional projection, we find that we can equivalently minimize E(a πτ (1 − I{H=1}) + b (1 − πτ) I{H=1}). But this expression is clearly minimized by Hτ∗, as a πτ (1 − I{Hτ∗=1}) + b (1 − πτ) I{Hτ∗=1} ≤ a πτ (1 − I{H=1}) + b (1 − πτ) I{H=1} a.s. for any other H. The result now follows directly.
Figure 8.3. Illustration of lemma 8.4.4. If π 0 is chosen too small then V (x) does not touch the
line b(1 − x) at all (first figure), while if π 0 is too large then the maximum of V (x) − b(1 − x)
lies above zero (second figure). For the correct choice of π 0 , the curve V (x) will be precisely
tangent to b(1 − x) (third figure). The final construction of the value function (as in theorem
8.4.5) is shown in the last figure. For these plots γ = 1, σ = .4, a = 1 and b = 1.5.
Lemma 8.4.4. There is a unique pair π⁰, π¹ with 0 < π⁰ < b/(a + b) < π¹ < 1 such that V(π¹) = b(1 − π¹) and ∂V(x)/∂x|x=π¹ = −b.
Proof. Consider the function W (x) = V (x) − b (1 − x) (recall that this equation depends on
π 0 ). Note that W (x) is strictly concave for every π 0 and satisfies W (x) → −∞ as x & 0 or
x % 1. Hence W (x) has a maximum in the interval ]0, 1[ for every π 0 .
For π 0 = b/(a + b) the maximum lies above zero: after all, in this case W (π 0 ) = 0,
while ∂W(x)/∂x|x=π⁰ = a + b is positive. As π⁰ → 0, however, the maximum of W(x)
goes below zero. To see this, note that W (x) attains its maximum at the point x∗ such that
∂W (x)/∂x|x=x∗ = ψ(x∗ ) − ψ(π 0 ) + a + b = 0, and ψ(π 0 ) → ∞ as π 0 → 0; hence x∗ → 0
as well. On the other hand, W (x) ≤ ax − b (1 − x) everywhere by concavity, so as x∗ → 0 we
obtain at least W (x∗ ) ≤ −b/2. Now note that ψ(x∗ ) − ψ(π 0 ) + a + b = 0 implies that x∗ is
strictly decreasing as π⁰ decreases. Hence there must be a unique 0 < π⁰ < b/(a+b) such
that the maximum W(x∗) is precisely zero. But then for that π⁰, V(x∗) = b(1−x∗) and ∂V(x)/∂x|x=x∗ = −b,
so we have found the desired π 0 and π 1 = x∗ . Note that π 1 > b/(a + b) necessarily, as
V (x) ≤ ax everywhere (so V (x∗ ) = b(1 − x∗ ) means b(1 − x∗ ) < ax∗ ).
\[
V(x) = \begin{cases}
a x & \text{for } 0 \le x < \pi^0, \\
\Psi(x) - \Psi(\pi^0) + (a - \psi(\pi^0))\,(x - \pi^0) + a\,\pi^0 & \text{for } \pi^0 \le x \le \pi^1, \\
b\,(1 - x) & \text{for } \pi^1 < x \le 1,
\end{cases}
\]
where 0 < π⁰ < b/(a + b) < π¹ < 1 are the unique points such that V(π¹) = b(1 − π¹)
and ψ(π⁰) − ψ(π¹) = a + b. Then V(x) is C¹ on [0, 1], C² on [0, 1]\{π⁰, π¹}, and
\[
\min\left\{ \frac{\gamma^2 x^2 (1 - x)^2}{2\sigma^2}\, \frac{\partial^2 V(x)}{\partial x^2} + 1,\;\; a x \wedge b(1 - x) - V(x) \right\} = 0.
\]
Variational problem
We now consider the variational version of the problem. Rather than minimizing a
cost functional, which trades off between the error probabilities and the length of the
observation interval, we now specify fixed upper bounds on the probability of error.
We then seek a decision strategy that minimizes the observation time within the class
of strategies with acceptable error probabilities. In many situations this is the most
natural formulation, and we will see that this problem, too, admits an explicit solution.
We first need to define the problem precisely. To this end, let us denote by ∆ α,β
the class of decision rules (τ, H) such that P(H = 0|X = 1) ≤ α, P(H = 1|X = 0) ≤ β, and E(τ) < ∞.
For fixed α, β, our goal is to find a decision rule (τ ∗ , H ∗ ) that minimizes E(τ )
amongst all decision rules in ∆α,β . We will need the following lemmas.
Lemma 8.4.6. When π0 ∈ ]π 0 , π 1 [, the rule (τ ∗ , H ∗ ) of theorem 8.4.5 satisfies
\[
\mathrm{P}(H^* = 0 \,|\, X = 1) = \frac{\pi^0}{\pi_0}\,\frac{\pi^1 - \pi_0}{\pi^1 - \pi^0}, \qquad
\mathrm{P}(H^* = 1 \,|\, X = 0) = \frac{1 - \pi^1}{1 - \pi_0}\,\frac{\pi_0 - \pi^0}{\pi^1 - \pi^0}.
\]
Proof. We will consider P(H ∗ = 0|X = 1); the remaining claim follows identically. Note
that P(H ∗ = 0|X = 1) = P(H ∗ = 0 and X = 1)/P(X = 1) = P(H ∗ = 0 and X = 1)/π0 .
Using the tower property of the conditional expectation and the optional projection, we find that
P(H ∗ = 0 and X = 1) = E(IH ∗ =0 πτ ∗ ) = E(Iπτ ∗ =π0 πτ ∗ ) = π 0 P(πτ ∗ = π 0 ). To evaluate
the latter, note that πτ ∗ is a {π 0 , π 1 }-valued random variable; hence
\[
\mathrm{E}(\pi_{\tau^*}) = \pi^0\, \mathrm{P}(\pi_{\tau^*} = \pi^0) + \pi^1 \big(1 - \mathrm{P}(\pi_{\tau^*} = \pi^0)\big) = \pi_0
\;\Longrightarrow\;
\mathrm{P}(\pi_{\tau^*} = \pi^0) = \frac{\pi^1 - \pi_0}{\pi^1 - \pi^0},
\]
as πt is a bounded martingale. This establishes the result.
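The optional stopping step in this proof is easy to check by simulation. The sketch below assumes the filter dynamics dπ_t = (γ/σ) π_t (1 − π_t) dB̄_t between the stopping boundaries (this particular form is an assumption here; it is consistent with the generator γ²x²(1−x)²/2σ² appearing in the free boundary equation above, but it is not restated on this page), runs the process until it leaves ]π⁰, π¹[, and compares the empirical frequency of hitting π⁰ with (π¹ − π₀)/(π¹ − π⁰). The boundary and prior values are illustrative only.

```python
import numpy as np

# Monte Carlo check of the optional stopping identity. Assumption: between the
# boundaries, pi_t evolves as d(pi_t) = (gamma/sigma) pi_t (1 - pi_t) dB_t.
# The boundaries pi_low < pi_high and the prior pi_prior are illustrative values.
rng = np.random.default_rng(0)
gamma, sigma = 1.0, 0.4
pi_low, pi_high = 0.2, 0.8
pi_prior = 0.5
dt, n_paths = 1e-3, 2000

hits_low = 0
for _ in range(n_paths):
    p = pi_prior
    while pi_low < p < pi_high:
        p += (gamma / sigma) * p * (1.0 - p) * np.sqrt(dt) * rng.standard_normal()
    hits_low += (p <= pi_low)

print("empirical P(pi_tau* = pi^0):", hits_low / n_paths)
print("martingale prediction      :", (pi_high - pi_prior) / (pi_high - pi_low))
```

Since the Euler increments have mean zero, the simulated process is itself a bounded discrete-time martingale, so the comparison does not hinge on the exact form of the filter equation.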
Lemma 8.4.7. Given 0 < α + β < 1 and π₀ ∈ ]0, 1[, there are unique constants
a, b > 0 in the Bayesian cost J̃[τ, H] such that (τ∗, H∗) of theorem 8.4.5 satisfies
P(H∗ = 0|X = 1) = α and P(H∗ = 1|X = 0) = β; moreover, for these a, b we find
\[
\pi^0 = \frac{\pi_0\,\alpha}{(1 - \pi_0)(1 - \beta) + \pi_0\,\alpha}, \qquad
\pi^1 = \frac{\pi_0\,(1 - \alpha)}{(1 - \pi_0)\,\beta + \pi_0\,(1 - \alpha)}.
\]
so it is indeed the case that E(τ ∗ ) ≤ E(τ ). This establishes the claim.
Though this result does, in principle, solve the variational problem, the true struc-
ture of the problem is still in disguise. It is illuminating to remove the somewhat
strange dependence of the stopping boundaries π 0 , π 1 on π0 = P(X = 1) through a
change of variables. To this end, let us define the likelihood ratio
\[
\varphi_t \equiv \frac{\pi_t}{1 - \pi_t}\,\frac{1 - \pi_0}{\pi_0}
= \exp\!\left( \frac{\gamma}{\sigma^2}\, Y_t - \frac{\gamma^2}{2\sigma^2}\, t \right),
\]
where the latter equality can be read off from example 7.1.9. As x/(1 − x) is strictly
increasing, the stopping rule (τ ∗ , H ∗ ) of lemma 8.4.8 can be equivalently written as
\[
\tau^* = \inf\left\{ t : \varphi_t \notin \left] \frac{\alpha}{1 - \beta},\; \frac{1 - \alpha}{\beta} \right[ \right\}, \qquad
H^* = \begin{cases} 1 & \text{if } \varphi_{\tau^*} \ge (1 - \alpha)/\beta, \\ 0 & \text{if } \varphi_{\tau^*} \le \alpha/(1 - \beta). \end{cases}
\]
Evidently the optimal variational decision rule (τ ∗ , H ∗ ) can be computed without any
knowledge of π0 ; after all, both the functional ϕt and the stopping boundaries no
longer depend on π0 . Hence we find that unlike in the Bayesian problem, no prob-
abilistic assumption needs to be made on the law of X in the variational problem.
Indeed, we can consider the value of X simply as being “unknown”, rather than “ran-
dom”. This is very much in the spirit of the Neyman-Pearson test, and the two methods
are in fact more closely related than our approach indicates (see [Shi73, section IV.2]).
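In this form the sequential test is straightforward to simulate. The sketch below generates the observations as dY_t = γX dt + σ dB_t (the observation model of remark 8.4.1 in integrated form), updates ϕ_t, and stops when it leaves ]α/(1 − β), (1 − α)/β[. The parameter values are illustrative only.

```python
import numpy as np

# Sequential probability ratio test for the observation model
# dY_t = gamma * X dt + sigma * dB_t (parameter values are illustrative).
rng = np.random.default_rng(1)
gamma, sigma = 1.0, 0.4
alpha, beta = 0.01, 0.01                          # target error probabilities
lo, hi = alpha / (1 - beta), (1 - alpha) / beta   # stopping boundaries for phi_t
dt = 1e-3

def sprt(X):
    """Run one sequential test; X in {0, 1} is the true (unknown) bit."""
    t, Y = 0.0, 0.0
    phi = 1.0
    while lo < phi < hi:
        Y += gamma * X * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        phi = np.exp(gamma * Y / sigma**2 - gamma**2 * t / (2 * sigma**2))
    return t, int(phi >= hi)   # decision time and accepted hypothesis H

times, errors = [], 0
for _ in range(200):
    X = rng.integers(0, 2)
    t, H = sprt(X)
    times.append(t)
    errors += (H != X)

print("average decision time:", np.mean(times))
print("empirical error rate :", errors / 200)
```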
8.5. Impulse control
conditions that ensure existence and uniqueness of the solution. At the stopping time
τ₁, we impulsively change the system state from X^u_{τ₁−} to X^u_{τ₁} = Γ(X^u_{τ₁−}, ζ₁), where
Γ : Rn × U → Rn is a given control action function and U is the control set. The
control ζ1 is assumed to be Fτ1 -measurable, i.e., the control strategy is adapted.
Remark 8.5.1. As the state of the system jumps at the intervention time, we need to
have notation that distinguishes between the state just prior and just after the interven-
tion. In the following, we will denote by Xτ − the state just prior to the intervention
time τ , and by Xτ the state just after the intervention time τ . We will thus always
have Xτ = Γ(Xτ − , ζ), where ζ is the control applied at time τ .
2 An interesting economic application is the following. Due to various economic factors, the exchange
rate between two currencies (say, the dollar and some foreign currency) fluctuates randomly in time. It is
not a good idea, however, to have the exchange rate be too far away from unity. As such, the central bank
tries to exert its influence on the exchange rates to keep them in a certain “safe” target zone. One way in
which the central bank can influence the exchange rate is by buying or selling large quantities of foreign
currency. The question then becomes: at which points in time should the central bank make a large
transaction in foreign currency, and for what amount, in order to keep the exchange rate in the target zone?
This is an impulse control problem. See [Kor99] for a review of impulse control applications in finance.
We left off just after the first intervention time τ1 . For times after τ1 but before the
second intervention time τ2 > τ1 , we solve the stochastic differential equation
\[
X_t^u = X_{\tau_1}^u + \int_{\tau_1}^t b(X_s^u)\, ds + \int_{\tau_1}^t \sigma(X_s^u)\, dW_s.
\]
We will assume that such an equation has a unique solution starting from every finite
stopping time, so that we can solve for Xt between every pair τi < τi+1 .
Remark 8.5.2. This is indeed the case when b, σ satisfy the usual Lipschitz condi-
tions; this follows from the strong Markov property, but let us not dwell on this point.
At time τ2 we apply another control action ζ2 , etc. We now have the following.
Definition 8.5.3. An impulse control strategy u consists of
1. a sequence of stopping times {τj }j=1,2,... such that τj < ∞ a.s. and τj < τj+1 ;
2. a sequence {ζj }j=1,2,... such that ζj ∈ U and ζj is Fτj -measurable.
The strategy u is called admissible if the intervention times τj do not accumulate, and
Xtu has a unique solution on the infinite time interval [0, ∞[.
Let us investigate the discounted version of the impulse control problem (a time-average cost can also be investigated; you can try to work this case out yourself or consult [JZ06]). We introduce the following discounted cost functional:
\[
J[u] = \mathrm{E}\!\left[ \int_0^\infty e^{-\lambda s}\, w(X_s^u)\, ds + \sum_{j=1}^\infty e^{-\lambda \tau_j}\, v(X_{\tau_j-}^u, \zeta_j) \right].
\]
Assume that |E(V (X0 ))| < ∞, and denote by K the class of admissible strategies u
such that E(e^{−λτ_j} V(X^u_{τ_j−})) → 0 as j → ∞ and such that
\[
\mathrm{E}\!\left[ \sum_{i=1}^n \sum_{k=1}^m \int_{\tau_{j-1}}^{\tau_j} e^{-\lambda s}\, \frac{\partial V}{\partial x^i}(X_s^u)\, \sigma^{ik}(X_s^u)\, dW_s^k \right] = 0 \quad \text{for all } j.
\]
Define the continuation set D = {x ∈ K : K V (x) > V (x)} and strategy u∗ with
\[
\tau_j^* = \inf\{ t > \tau_{j-1}^* : X_t^{u^*} \notin D \}, \qquad
\zeta_j^* \in \operatorname{argmin}_{\alpha \in U} \left\{ V\big(\Gamma(X_{\tau_j^*-}^{u^*}, \alpha)\big) + v(X_{\tau_j^*-}^{u^*}, \alpha) \right\}.
\]
If u∗ defines an admissible impulse control strategy in K, then J[u∗ ] ≤ J[u] for any
u ∈ K, and the optimal cost can be written as E(V (X0 )) = J[u∗ ].
Proof. We may assume without loss of generality that w and v are either both nonnegative or both nonpositive; this can always be accomplished by shifting the cost by a constant. We
now begin by applying Itô’s rule to e−λt V (Xtu ). Familiar manipulations give
Now let t, j → ∞, using monotone convergence on the right and the assumption on u ∈ K on
the left (recall that as u is admissible, the intervention times cannot accumulate so τj % ∞).
This gives E(V (X0 )) ≤ J[u]. Repeating the same arguments with u∗ instead of u gives
E(V (X0 )) = J[u∗ ], so the claim is established.
Remark 8.5.5. The equation for the value function V (x) is almost a variational in-
equality, but not quite; in a variational inequality, the stopping cost z(x) was indepen-
dent of V (x), while in the current problem the intervention cost K V (x) very much
depends on the value function (in a nontrivial manner!). The equation for V (x) in
proposition 8.5.4 is known as a quasivariational inequality.
Let us treat an interesting example, taken from [Wil98].
Example 8.5.6 (Optimal forest harvesting). We own a forest which is harvested for
lumber. When the forest is planted, it starts off with a (nonrandom) total biomass
x0 > 0; as the forest grows, its biomass Xt evolves according to the equation
\[
dX_t = \mu X_t\, dt + \sigma X_t\, dW_t.
\]
At some time τ1 , we can decide to cut the forest and sell the wood; we then replant
the forest so that it starts off again with biomass x0 . The forest can then grow freely
until we decide to cut and replant again at time τ2 , etc. Every time τ we cut the forest,
we obtain Xτ − dollars from selling the wood, but we pay a fee proportional to the
total biomass αXτ − (0 ≤ α < 1) for cutting the forest, and a fixed fee Q > 0 for
replanting the forest to its initial biomass x0 . When inflation, with rate3 λ > µ, is
taken into account, the expected future profit from this operation is given by
\[
\mathrm{E}\!\left[ \sum_{j=1}^\infty e^{-\lambda \tau_j}\, \big( (1 - \alpha)\, X_{\tau_j-} - Q \big) \right].
\]
For this impulse control problem, U consists of only one point (so we can essentially
ignore it) and the control action is Γ(x, α) = x0 for any x. Note, moreover, that
Xt > 0 always, so we can apply proposition 8.5.4 with K = ]0, ∞[.
To solve the impulse control problem, we consider the quasivariational inequality
\[
\min\left\{ \frac{\sigma^2 x^2}{2}\, \frac{\partial^2 V(x)}{\partial x^2} + \mu x\, \frac{\partial V(x)}{\partial x} - \lambda V(x),\;\; Q - \beta x + V(x_0) - V(x) \right\} = 0,
\]
where β = 1−α. The first thing to note is that in order to obtain a meaningful impulse
control strategy, the initial biomass x0 must be in the continuation set D; if this is not
the case, then replanting the forest is so cheap that you might as well immediately
cut down what has just been planted, without waiting for it to grow. To avoid this
possibility, note that x0 ∈ D requires x0 < Q/β. We will assume this from now on.
Now consider L V (x) − λV (x) = 0, which must hold on the continuation region
D. The general solution to this equation is given by V (x) = c+ xγ+ + c− xγ− , where
\[
\gamma_\pm = \frac{\sigma^2 - 2\mu \pm \sqrt{(\sigma^2 - 2\mu)^2 + 8\sigma^2 \lambda}}{2\sigma^2}.
\]
Note that γ+ > 1 (due to λ > µ), while γ− < 0.
We could proceed to analyze every possible case, but let us make a few educated
guesses at this point. There is nothing lost by doing this: if we can find one solution
that satisfies the conditions of the verification theorem, then we are done; otherwise
we can always go back to the drawing board! We thus guess away. First, it seems
unlikely that it will be advantageous to cut the forest when there is very little biomass;
this will only cause us to pay the replanting fee Q, without any of the benefit of selling
the harvested wood. Hence we conjecture that the continuation region has the form
D = ]0, y[ for some y > x0. In particular, this means that V(x) = c₊ x^{γ₊} + c₋ x^{γ₋} for
3 If the inflation rate were lower than the mean growth rate of the forest, then it never pays to cut down
the forest—if we are patient, we can always make more money by waiting longer before cutting the forest.
x ∈ ]0, y[. But as the second term has a pole at zero, we must choose c − = 0; after all,
c− > 0 is impossible as the cost is bounded from above, while c− < 0 would imply
that we make more and more profit the less biomass there is in the forest; clearly this
cannot be true. Collecting these ideas, we find that V (x) should be of the form
\[
V(x) = \begin{cases} c\, x^{\gamma_+} & \text{for } x < y, \\ Q - \beta x + c\, x_0^{\gamma_+} & \text{for } x \ge y. \end{cases}
\]
The constants c and y remain to be determined. To this end, we apply the principle of
smooth fit. As V (x) should be C 1 at y, we require
\[
\gamma_+\, c\, y^{\gamma_+ - 1} = -\beta, \qquad c\, y^{\gamma_+} = Q - \beta y + c\, x_0^{\gamma_+}.
\]
Eliminating c from these two equations and rearranging gives
\[
y = \frac{\gamma_+ Q - \beta y\, (x_0/y)^{\gamma_+}}{\beta\, (\gamma_+ - 1)}.
\]
To complete the argument, it remains to show (i) that there does exist a solution y >
x0 ; and (ii) that the conditions of proposition 8.5.4 are satisfied for V (x).
Let us first deal with question (i). Let f (z) = β(γ+ − 1)z + βz (x0 /z)γ+ − γ+ Q;
then y satisfies f (y) = 0. It is easily verified that f (z) is strictly convex and attains
its minimum at z = x0 ; furthermore, f (x0 ) < 0 as we have assumed that x0 < Q/β.
Hence f (z) has exactly two roots, one of which is larger than x0 . Hence we find that
there exists a unique y > x0 that satisfies the desired relation.
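For concrete parameter values, y is easily computed numerically. The following sketch uses the parameters quoted in the caption of figure 8.4 (µ = σ = 1, λ = Q = 2, α = .1, x₀ = 1) and finds the root of f above x₀ by bisection; the bracketing step relies on the convexity of f established above.

```python
import numpy as np

# Compute gamma_+ and the cutting threshold y for the forest harvesting example.
# Parameters as in figure 8.4: mu = sigma = 1, lambda = Q = 2, alpha = 0.1, x0 = 1.
mu, sigma, lam, Q, alpha, x0 = 1.0, 1.0, 2.0, 2.0, 0.1, 1.0
beta = 1.0 - alpha

gamma_p = (sigma**2 - 2*mu + np.sqrt((sigma**2 - 2*mu)**2 + 8*sigma**2*lam)) / (2*sigma**2)

def f(z):
    # f(z) = beta*(gamma_+ - 1)*z + beta*z*(x0/z)**gamma_+ - gamma_+*Q; y solves f(y) = 0.
    return beta*(gamma_p - 1)*z + beta*z*(x0/z)**gamma_p - gamma_p*Q

# f is strictly convex with f(x0) < 0 (since x0 < Q/beta), so its root above x0
# can be bracketed by doubling and then located by bisection.
a, b = x0, 2*x0
while f(b) < 0:
    b *= 2
for _ in range(60):
    m = 0.5*(a + b)
    a, b = (m, b) if f(m) < 0 else (a, m)

y = 0.5*(a + b)
print("gamma_+ =", gamma_p, " threshold y =", y)
```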
We now verify (ii). Note that V (x) is, by construction, C 1 on ]0, ∞[ and C 2
on ]0, ∞[\{y}. Hence V (x) is sufficiently smooth. The running cost w is zero in our
case, while the intervention cost v is bounded from above. By construction, L V (x)−
λV (x) = 0 on D = ]0, y[, while V (x) = K V (x) on D c . It is easily verified by
explicit computation that L V (x) − λV (x) ≥ 0 on D c and that V (x) < K V (x) on
D. Hence the quasivariational inequality is satisfied. It thus remains to show that the
candidate optimal strategy u∗ is admissible, and in particular that it is in K.
To show that this is the case, we proceed as follows. First, we claim that τ j <
∞ a.s. for every j. To see this, it suffices to note that for the uncontrolled process
E(Xt ) = x0 eµt → ∞, so there exists a subsequence tn such that Xtn → ∞ a.s., so
that in particular Xt must eventually exit D a.s. Furthermore, due to the continuity of
the sample paths of Xt , we can immediately see that τj−1 < τj . Now note that our
process Xtu restarts at the same, non-random point at every intervention time τ j . In
particular, this means that τj − τj−1 are independent of each other for every j, and as
τj −τj−1 > 0, there must exist some ε > 0 such that P(τj −τj−1 > ε) > 0. These two
facts together imply that P(τj − τj−1 > ε i.o.) = 1 (see, for example, the argument in
the example at the end of section 4.1). But then we conclude that the stopping times
τj cannot accumulate, and in particular τj % ∞. Thus u∗ is admissible.
Figure 8.4. Simulation of the optimal impulse control strategy of example 8.5.6. One sample
path of the controlled process Xtu is shown in blue; the intervention threshold y is shown in
green. The parameters for this simulation were µ = σ = 1, λ = Q = 2, α = .1, and x0 = 1.
To show that u∗ is also in K is now not difficult. Indeed, as D has compact closure,
Xtu is a bounded process. Hence E(e−λτj V (Xτuj − )) → 0, while
\[
\mathrm{E}\!\left[ \sum_{i=1}^n \sum_{k=1}^m \int_{\tau_{j-1}}^{\tau_j} e^{-\lambda s}\, \frac{\partial V}{\partial x^i}(X_s^u)\, \sigma^{ik}(X_s^u)\, dW_s^k \right] = 0
\]
follows from the fact that the integrand is square-integrable on the infinite time horizon
(being the product of a decaying exponential and a bounded process). Thus all the
requirements of proposition 8.5.4 are satisfied, and we are convinced that we have
indeed found an optimal impulse control strategy, as we set out to do (see figure 8.4).
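A simulation in the spirit of figure 8.4 is sketched below: between interventions the biomass follows the geometric Brownian motion of this example (Euler-Maruyama discretization), and it is reset to x₀ whenever it reaches the threshold y. The numerical value of y is the approximate root obtained from the root-finding sketch above; all other parameters are the ones quoted in the caption of figure 8.4.

```python
import numpy as np

# Simulate the optimal harvesting strategy: grow the biomass as a geometric
# Brownian motion and replant (reset to x0) whenever it reaches the threshold y.
rng = np.random.default_rng(2)
mu, sigma, x0 = 1.0, 1.0, 1.0
y = 5.5                 # approximate threshold from the previous root-finding sketch
T, dt = 10.0, 1e-3

n = int(T / dt)
X = np.empty(n + 1)
X[0] = x0
cut_times = []
for k in range(n):
    X[k + 1] = X[k] + mu * X[k] * dt + sigma * X[k] * np.sqrt(dt) * rng.standard_normal()
    if X[k + 1] >= y:           # intervention: cut the forest and replant
        cut_times.append((k + 1) * dt)
        X[k + 1] = x0

print("number of interventions on [0, 10]:", len(cut_times))
```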
8.6. Further reading

More elaborate optimal harvesting models can be found in [Wil98] and in [Alv04].
An in-depth study of the optional projection and friends (which we did not introduce) can be found in Dellacherie and Meyer [DM82]. For more on the separation
principle in the optimal stopping setting see, e.g., Szpirglas and Mazziotto [SM79].
Our treatment of both the changepoint detection problem and the hypothesis testing problem comes straight from Shiryaev [Shi73]; see also [PS06]. For more on the
expected miss criterion see Karatzas [Kar03], while the stock selling problem is from
Rishel and Helmes [RH06], where the risk-neutral version can also be found. Many
interesting applications of changepoint detection can be found in Basseville and Niki-
forov [BN93]. An extension of the hypothesis testing problem to time-varying signals
can be found in Liptser and Shiryaev [LS01b, section 17.6].
Our discussion of the impulse control problem is inspired by Øksendal and Sulem
[ØS05] and by Brekke and Øksendal [BØ94]. An extension of example 8.1.7 to the
impulse control setting can be found in the latter. The time-average cost criterion is
discussed, e.g., in Jack and Zervos [JZ06]. Finally, the classic tome on quasivaria-
tional inequalities, and a rich source of examples of impulse control applications in
management problems, is Bensoussan and Lions [BL84]. Markov chain approxima-
tions for the solution of impulse control problems can be found in Kushner [Kus77].
APPENDIX A

Problem sets
A.1. Problem set 1
Q under which X −a is a Gaussian random variable with zero mean and unit variance,
where a ∈ R is some fixed (non-random) constant.
1. Is it true that Q ≪ P, and if so, what is the Radon-Nikodym derivative dQ/dP? Similarly, is it true that P ≪ Q, and if so, what is dP/dQ?
We are running a nuclear reactor. That being a potentially dangerous business, we
would like to detect the presence of a radiation leak, in which case we should shut
down the reactor. Unfortunately, we only have a noisy detector: the detector generates
some random value ξ when everything is ok, while in the presence of a radiation leak
the detector generates the offset value a + ξ. Based on the value returned by the detector, we
need to make a decision as to whether to shut down the reactor.
In our setting, the value returned by the detector is modelled by the random vari-
able X. If everything is running ok, then the outcomes of X are distributed according
to the measure P. This is called the null hypothesis H0 . If there is a radiation leak,
however, then X is distributed according to Q. This is the alternative hypothesis H 1 .
Based on the value X returned by the detector, we decide to shut down the reactor if
f (X) = 1, with some f : R → {0, 1}. Our goal is to find a suitable function f .
How do we choose the decision function f ? What we absolutely cannot toler-
ate is that a radiation leak occurs, but we do not decide to shut down the reactor—
disaster would ensue! For this reason, we fix a tolerance threshold: under the measure
corresponding to H1 , the probability that f (X) = 0 must be at most some fixed
value α (say, 10−12 ). That is, we insist that any acceptable f must be such that
Q(f (X) = 0) ≤ α. Given this constraint, we now try to find an acceptable f that
minimizes P(f (X) = 1), the probability of false alarm (i.e., there is no radiation leak,
but we think there is).
Claim: an f ∗ that minimizes P(f (X) = 1) subject to Q(f (X) = 0) ≤ α is
\[
f^*(x) = \begin{cases} 1 & \text{if } \dfrac{dQ}{dP}(x) > \beta, \\[1ex] 0 & \text{otherwise}, \end{cases}
\]
where β > 0 is chosen such that Q(f ∗ (X) = 0) = α. This is called the Neyman-
Pearson test, and is a very fundamental result in statistics (if you already know it, all
the better!). You are going to prove this result.
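To see what the test looks like concretely, suppose (assuming, as in the setup of this problem, that X is a standard Gaussian random variable under P) that under Q the variable X is Gaussian with mean a and unit variance, so that dQ/dP(x) = exp(ax − a²/2) is increasing in x. The Neyman-Pearson test then reduces to a threshold test on x itself; the sketch below computes the threshold and the resulting false alarm probability for illustrative values of a and α.

```python
from math import exp
from statistics import NormalDist

# Neyman-Pearson test for a Gaussian mean shift. Assumption (for illustration):
# X ~ N(0, 1) under P (no leak) and X ~ N(a, 1) under Q (leak), so that
# dQ/dP(x) = exp(a*x - a**2/2) is increasing in x.
a = 10.0        # offset caused by a radiation leak (hypothetical value)
alpha = 1e-12   # tolerated probability of missing a leak

nd = NormalDist()
c = a + nd.inv_cdf(alpha)        # threshold on x such that Q(f*(X) = 0) = alpha
beta = exp(a * c - a**2 / 2)     # the corresponding likelihood ratio threshold
false_alarm = 1 - nd.cdf(c)      # P(f*(X) = 1): shut down although there is no leak

print("decision threshold c =", c)
print("likelihood threshold beta =", beta)
print("false alarm probability =", false_alarm)
```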
2. Let f : R → {0, 1} be an arbitrary measurable function s.t. Q(f (X) = 0) ≤ α.
Using Q(f (X) = 0) ≤ α and Q(f ∗ (X) = 0) = α, show that
Then x∗ is stable. (Note: as ξn are i.i.d., the condition does not depend on n.)
3. (Inverted pendulum in the rain) A simple discrete time model for a controlled,
randomly forced overdamped pendulum is
Find some control law un+1 = g(xn , yn ) that makes the inverted position θ = 0
stable. (Try an intuitive control law and a linear Lyapunov function; you might
want to use your favorite computer program to plot k(·).)
4. Bonus question: The previous results can be localized to a neighborhood.
Prove the following modifications of the previous theorems:
Theorem A.2.3. Suppose there is a continuous function V : S → [0, ∞[ with
V (x∗ ) = 0 and V (x) > 0 for x 6= x∗ , and a neighborhood U of x∗ , such that
Then x∗ is stable.
Theorem A.2.4. Suppose there is a continuous function V : S → [0, ∞[ with
V (x∗ ) = 0 and V (x) > 0 for x 6= x∗ , and a neighborhood U of x∗ , such that
Hint. Define a suitable stopping time τ , and apply the previous results to x n∧τ .
You can now show that the controlled pendulum is asymptotically stable.
1. Consider the annulus D = {x : r < ‖x‖ < R} for some 0 < r < R < ∞, and define the stopping time τ_x = inf{t : W_t^x ∉ D}. For which functions h : Rⁿ → R is h(W^x_{t∧τ_x}) a martingale for all x ∈ D? You may assume that h is C² in some neighborhood of D. (Such functions are called harmonic.)

2. Using the previous part, show that h(x) = |x| is harmonic for n = 1, h(x) = log ‖x‖ is harmonic for n = 2, and h(x) = ‖x‖^{2−n} is harmonic for n ≥ 3.

3. Let us write τ_x^R = inf{t : ‖W_t^x‖ ≥ R} and τ_x^r = inf{t : ‖W_t^x‖ ≤ r}. What is P(τ_x^r < τ_x^R) for n = 1, 2, 3, . . . ? [Hint: ‖W^x_{τ_x}‖ can only take the values r or R.]

4. What is P(τ_x^r < ∞)? Conclude that Brownian motion is recurrent for dimensions 1 and 2, but not for 3 and higher. [Hint: {τ_x^r < ∞} = ⋃_{R>r} {τ_x^r < τ_x^R}.]
where αt and βt are the simple integrands that take the values αti and βti on
the interval [ti , ti+1 ], respectively. [Assume that αti and βti are Fti -measurable
(obviously!) and sufficiently integrable.]
The integral expression for Xt still makes sense for continuous time strategies with
αt St and βt Rt in L2 (µT × P) (which we will always assume). Hence we can define
a self-financing strategy to be a pair αt , βt that satisfies this expression (in addition to
Xt = αt St + βt Rt , of course). You can see this as a limit of discrete time strategies.
In a sensible model, we should not be able to find a reasonable strategy α t , βt that
makes money for nothing. Of course, if we put all our money in the bank, then we
will always make money for sure just from the interest. It makes more sense to study
the normalized market, where all the prices are discounted by the interest rate. So we
will consider the discounted wealth X̄_t = X_t/R_t and stock price S̄_t = S_t/R_t. We
want to show that there does not exist a trading strategy with X̄_0 = a, X̄_t ≥ a a.s.,
and P(X̄_t > a) > 0. Such a money-for-nothing opportunity is called arbitrage.
We are going to do some simple option pricing theory. Consider something called
a European call option. This is a contract that says the following: at some predeter-
mined time T (the maturity), we are allowed to buy one unit of stock at some prede-
termined price K (the strike price). This is a sort of insurance against the stock price
going very high: if the stock price is below K at time T we can still buy stock at
the market price, and we only lose the money we paid to take out the option; if the
stock price is above K at time T, then we make money as we can buy the stock
below the market price. The total payoff for us is thus (ST − K)+ , minus the option
price. The question is what the seller of the option should charge for that service.
5. If we took out the option, we would make (ST − K)+ dollars (excluding the
option price). Argue that we could obtain exactly the same payoff by imple-
menting a particular trading strategy αt , βt , a hedging strategy, provided that
we have sufficient starting capital (i.e., for some X0 , αt , βt , we actually have
XT = (ST − K)+ ). Moreover, show that there is only one such strategy.
6. Argue that the starting capital required for the hedging strategy is the only fair
price for the option. (If a different price is charged, either we or the seller of the
option can make money for nothing.)
7. What is the price of the option? [Hint: use the equivalent martingale measure.]
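A crude way to sanity-check an answer to this question is by Monte Carlo under the equivalent martingale measure. The sketch below assumes, for illustration only, the standard Black-Scholes dynamics dS_t = µS_t dt + σS_t dW_t with a constant interest rate r (the specific market model of this problem set is defined earlier and is not reproduced here); under the martingale measure the drift of the stock becomes r, and the price is the expectation of the discounted payoff.

```python
import numpy as np

# Monte Carlo sketch of risk-neutral option pricing. Assumption (for illustration
# only): Black-Scholes dynamics dS_t = mu S_t dt + sigma S_t dW_t with constant
# interest rate r; under the equivalent martingale measure the drift becomes r.
rng = np.random.default_rng(3)
S0, K, T = 1.0, 1.0, 1.0
r, sigma = 0.05, 0.2
n_paths = 200_000

# Under the martingale measure, S_T = S0 * exp((r - sigma^2/2) T + sigma * W_T).
W_T = np.sqrt(T) * rng.standard_normal(n_paths)
S_T = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * W_T)
payoff = np.maximum(S_T - K, 0.0)

price = np.exp(-r * T) * payoff.mean()
print("Monte Carlo price of the European call:", price)
```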
This is a much more advanced topic than we are going to deal with in this course. As
we have the necessary tools to get started, however, I can’t resist having you explore
some of the simplest ideas (for fun and extra credit—this is not a required problem!).
We work on (Ω, F, P), on which is defined a Wiener process Wt with its natural
filtration Ft = σ{Ws : s ≤ t}. We restrict ourselves to a finite time interval t ∈ [0, T ].
An FT -measurable random variable X is called cylindrical if it can be written as
X = f (Wt1 , . . . , Wtn ) for a finite number of times 0 < t1 < · · · < tn ≤ T and some
function f ∈ C0∞ . For such X, the Malliavin derivative of X is defined as
\[
D_t X = \sum_{i=1}^n \frac{\partial f}{\partial x^i}(W_{t_1}, \ldots, W_{t_n})\, I_{t \le t_i}.
\]
2. Let ut be bounded and Ft -adapted, and let ε ∈ R. Prove the invariance formula
\[
\mathrm{E}(f(W_\cdot)) = \mathrm{E}\!\left[ f\!\left( W_\cdot - \varepsilon \int_0^\cdot u_s\, ds \right)
\exp\!\left( \varepsilon \int_0^T u_s\, dW_s - \frac{\varepsilon^2}{2} \int_0^T (u_s)^2\, ds \right) \right].
\]
3. Show that this definition of Dt X coincides with our previous definition for
cylindrical random variables X.
4. Let X = f (W· ), and assume for simplicity that f (x· ) and f 0 (t, x· ) are bounded.
By taking the derivative with respect to ε, at ε = 0, of the invariance formula
above, prove the Malliavin integration by parts formula
" Z # "Z #
T T
E X us dWs = E us Ds X ds
0 0
5. Using the Itô representation theorem, prove that there is a unique F t -adapted
process Ct such that for any bounded and Ft -adapted process ut
" Z # "Z #
T T
E X us dWs = E us Cs ds .
0 0
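As mentioned in part 4, the integration by parts formula is easy to test numerically for a simple cylindrical functional. The sketch below takes X = sin(W_T), for which D_s X = cos(W_T) I_{s≤T}, and a deterministic integrand u_s = cos(s), and compares the two sides of the formula by Monte Carlo; both choices are illustrative only (sin is not compactly supported, but the comparison still works for this smooth bounded function).

```python
import numpy as np

# Monte Carlo check of E[ X * int_0^T u_s dW_s ] = E[ int_0^T u_s D_s X ds ]
# for the cylindrical functional X = sin(W_T), for which D_s X = cos(W_T) 1_{s<=T}.
# The integrand u_s = cos(s) is deterministic; both choices are illustrative only.
rng = np.random.default_rng(4)
T, n_steps, n_paths = 1.0, 200, 20_000
dt = T / n_steps
t = np.linspace(0.0, T, n_steps + 1)
u = np.cos(t[:-1])                          # u evaluated at the left endpoints

dW = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
W_T = dW.sum(axis=1)

stoch_int = dW @ u                          # int_0^T u_s dW_s (Ito sum)
lhs = np.mean(np.sin(W_T) * stoch_int)
rhs = np.mean(np.cos(W_T)) * u.sum() * dt   # E[cos(W_T)] * int_0^T u_s ds

print("E[X * int u dW]   ~", lhs)
print("E[int u D_s X ds] ~", rhs)
```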
where the latter is the solution of the n-dimensional SDE every component of which
satisfies the equation for Xtr above.
3. Use the Euler-Maruyama method to compute several sample paths of X_t and of Y_t in the interval t ∈ [0, 10], with (r₁, . . . , rₙ) = (−3, −2.5, −2, . . . , 3) and with step size ∆t = .001. Qualitatively, what do you see? (A generic Euler-Maruyama sketch is given after this problem.)
Apparently the SDEs for X_t^r and Y_t^r are qualitatively different, even though for every initial condition their solutions have precisely the same law! These SDEs generate
the same Markov process, but a different flow r 7→ Xtr , r 7→ Ytr . Stochastic flows are
important in random dynamics (they can be used to define Lyapunov exponents, etc.),
and have applications, e.g., in the modelling of ocean currents.
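For part 3 of the previous problem, the Euler-Maruyama scheme itself is only a few lines. The sketch below is generic: the drift and diffusion shown are placeholders, to be replaced by the coefficients of the SDEs for X_t^r and Y_t^r given in the problem (they are not reproduced here). Note that, to see the flow r ↦ X_t^r, all initial conditions are driven by the same Wiener increments.

```python
import numpy as np

# Generic Euler-Maruyama sketch for dX_t = b(X_t) dt + s(X_t) dW_t. To study the
# flow r -> X_t^r, all initial conditions are driven by the SAME Wiener increments.
# The coefficients b, s below are placeholders; substitute the drift and diffusion
# of X_t^r (and Y_t^r) from the problem statement.

def euler_maruyama(b, s, x0, dW, dt):
    x = np.empty(len(dW) + 1)
    x[0] = x0
    for k, dw in enumerate(dW):
        x[k + 1] = x[k] + b(x[k]) * dt + s(x[k]) * dw
    return x

b = lambda x: -x          # placeholder drift
s = lambda x: 1.0         # placeholder diffusion

T, dt = 10.0, 1e-3
rng = np.random.default_rng(5)
dW = np.sqrt(dt) * rng.standard_normal(int(T / dt))   # one Wiener path, shared by all r

rs = np.arange(-3.0, 3.01, 0.5)                        # r_1, ..., r_n as in part 3
paths = {r: euler_maruyama(b, s, r, dW, dt) for r in rs}
print("simulated", len(paths), "paths, each with", len(dW) + 1, "points")
```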
Q. 9. We are going to investigate the inverted pendulum of example 6.6.5, but with a
different cost functional. Recall that we set
\[
d\theta_t^u = c_1 \sin(\theta_t^u)\, dt - c_2 \cos(\theta_t^u)\, u_t\, dt + \sigma\, dW_t.
\]
As the coefficients of this equation are periodic in θ, we may interpret its solution
modulo 2π (i.e., θtu evolves on the circle, which is of course the intention).
Our goal is to keep θtu as close to the up position θ = 0 as possible on some
reasonable time scale. We will thus investigate the discounted cost
\[
J_\lambda[u] = \mathrm{E}\!\left[ \int_0^\infty e^{-\lambda s}\, \{ p\, (u_s)^2 + q\, (1 - \cos(\theta_s^u)) \}\, ds \right].
\]
This problem does not lend itself to analytic solution, so we approach it numerically.
1. Starting from the appropriate Bellman equation, develop a Markov chain ap-
proximation to the control problem of minimizing Jλ [u] following the finite-
difference approach of section 6.6. Take the fact that θtu evolves on the circle
into account to introduce appropriate boundary conditions.
[Hint: it is helpful to realize what the discrete dynamic programming equa-
tion for a discounted cost looks like. If x^α_n is a controlled Markov chain with transition probabilities P^α_{i,j} from state i to state j under the control α, and
\[
K_\varrho[u] = \mathrm{E}\!\left[ \sum_{n=0}^\infty \varrho^n\, w(x_n^u, u_{n+1}) \right], \qquad 0 < \varrho < 1,
\]
then the value function satisfies V(i) = min_{α∈U} {ϱ Σ_j P^α_{i,j} V(j) + w(i, α)}.
You will prove a verification theorem for such a setting in part 2.]
2. To which discrete optimal control problem does your numerical method corre-
spond? Prove an analog of proposition 6.6.2 for this case.
3. Using the Jacobi iteration method, implement the numerical scheme you developed, and plot the optimal control and the value function. (A minimal sketch of such an iteration is given after this problem.)
You can try, for example, c1 = c2 = σ = .5, p = q = 1, λ = .1; a grid which
divides [0, 2π[ into 100 points; and 500 iterations of the Jacobi method (but play
around with the parameters and see what happens, if you are curious!)
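A minimal version of such a scheme is sketched below. It uses a standard upwind finite-difference (Kushner-type) Markov chain approximation on a periodic grid; this is one common choice and is not necessarily identical to the scheme of section 6.6, and the control grid is an arbitrary discretization of the control set. The loops are pure Python, so the sketch is slow but simple.

```python
import numpy as np

# Jacobi (value) iteration for a finite-difference Markov chain approximation of
# the discounted pendulum problem. This is one common upwind (Kushner-type)
# discretization; it is a sketch, not necessarily the exact scheme of section 6.6.
c1, c2, sigma = 0.5, 0.5, 0.5
p_cost, q_cost, lam = 1.0, 1.0, 0.1

n = 100                                   # grid points on the circle [0, 2*pi[
h = 2 * np.pi / n
theta = np.arange(n) * h
controls = np.linspace(-3.0, 3.0, 61)     # candidate control values (illustrative)

V = np.zeros(n)
for _ in range(500):                      # Jacobi iterations
    V_new = np.full(n, np.inf)
    for i in range(n):
        ip, im = (i + 1) % n, (i - 1) % n     # periodic boundary conditions
        for u in controls:
            b = c1 * np.sin(theta[i]) - c2 * np.cos(theta[i]) * u
            denom = sigma**2 + h * abs(b)
            dt = h**2 / denom                  # interpolation time step
            p_up = (sigma**2 / 2 + h * max(b, 0.0)) / denom
            p_dn = (sigma**2 / 2 + h * max(-b, 0.0)) / denom
            w = p_cost * u**2 + q_cost * (1 - np.cos(theta[i]))
            cand = w * dt + np.exp(-lam * dt) * (p_up * V[ip] + p_dn * V[im])
            V_new[i] = min(V_new[i], cand)
    V = V_new

print("value at theta = 0 (up position)   :", V[0])
print("value at theta = pi (down position):", V[n // 2])
```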
A.5. Problem set 5
\[
dx_t = \gamma\, I_{\tau \le t}\, dt + \sigma\, dB_t, \qquad x_0 = x,
\]
where σ determines the vigorousness of the butterfly’s fluttering, τ is the time at which
it decides to fly away, γ is the speed at which it flies away, and Bt is a Wiener process.
We will assume that τ is exponentially distributed, i.e., that P(τ > t) = e −λt .
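For intuition, and as a test bed for the filter derived below, the butterfly's path is easy to simulate: draw τ from the exponential distribution and apply an Euler scheme to the equation above. The parameter values in the sketch are illustrative.

```python
import numpy as np

# Simulate the butterfly position dx_t = gamma * 1_{tau <= t} dt + sigma * dB_t,
# where tau is exponential with rate lam. Parameter values are illustrative.
rng = np.random.default_rng(6)
gamma, sigma, lam = 1.0, 0.5, 0.5
x0, T, dt = 0.0, 10.0, 1e-3

tau = rng.exponential(1.0 / lam)          # time at which the butterfly flies away
n = int(T / dt)
x = np.empty(n + 1)
x[0] = x0
for k in range(n):
    drift = gamma if (k * dt >= tau) else 0.0
    x[k + 1] = x[k] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()

print("tau =", tau, "  x_T =", x[n])
```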
Besides the butterfly, the forest also features a biologist, who has come equipped
with a butterfly net and a Segway. The biologist can move around at will on his
Segway by applying some amount of power ut ; his position ztu is then given by
\[
\frac{dz_t^u}{dt} = \beta\, u_t, \qquad z_0 = z.
\]
Mesmerized by the colorful butterfly, the biologist hatches a plan: he will try to in-
tercept the butterfly at a fixed time T , so that he can catch it and bring it back to his
laboratory for further study. However, he would like to keep his total energy consump-
tion low, because he knows from experience that if he runs the battery in the Segway
dry he will flop over (and miss the butterfly). As such, the biologist wishes to pursue
the butterfly using a strategy u that minimizes the cost functional
" Z #
T
J[u] = E P (ut )2 dt + Q (xT − zTu )2 , P, Q > 0,
0
where the first term quantifies the total energy consumption and the second term quan-
tifies the effectiveness of the pursuit. The entire setup is depicted in figure A.1.
Note that this is a partially observed control problem: the control ut is allowed to
be Ftx = σ{xs : s ≤ t}-adapted, as the biologist can see where the butterfly is, but
the biologist does not know the time τ at which the butterfly decides to leave.
2. Prove that for s > t, we have 1 − P(τ ≤ s|Ftx ) = e−λ(s−t) (1 − P(τ ≤ t|Ftx )).
Now obtain an explicit expression for rt in terms of xt and πt = P(τ ≤ t|Ftx ).
3. Using Itô’s rule and the appropriate filter, find a stochastic differential equation
for (rt , πt ) which is driven by the innovations process B̄t and in which τ no
longer appears explicitly.
Bibliography

[Alv04] L. H. R. Alvarez, Stochastic forest stand value and optimal timber harvesting, SIAM J. Control Optim. 42 (2004), 1972–1993.
[Apo69] T. M. Apostol, Calculus. Volume II, Wiley, 1969.
[Arn74] L. Arnold, Stochastic differential equations: Theory and applications, Wi-
ley, 1974.
[Bac00] L. Bachelier, Théorie de la spéculation, Ann. Sci. E.N.S. Sér. 3 17 (1900),
21–86.
[BD98] M. Boué and P. Dupuis, A variational representation for certain functionals
of Brownian motion, Ann. Probab. 26 (1998), 1641–1659.
[Ben82] A. Bensoussan, Stochastic control by functional analysis methods, North
Holland, 1982.
[Ben92] A. Bensoussan, Stochastic control of partially observable systems, Cambridge University Press, 1992.
[BHL99] D. Brigo, B. Hanzon, and F. Le Gland, Approximate nonlinear filtering by
projection on exponential manifolds of densities, Bernoulli 5 (1999), 495–
534.
[Bic02] K. Bichteler, Stochastic integration with jumps, Cambridge University
Press, 2002.
[Bil86] P. Billingsley, Probability and measure, second ed., Wiley, 1986.
[Bil99] P. Billingsley, Convergence of probability measures, second ed., Wiley, 1999.
[Bir67] G. Birkhoff, Lattice theory, third ed., AMS, 1967.
[Bis81] J.-M. Bismut, Mécanique aléatoire, Lecture Notes in Mathematics 866,
Springer, 1981.
[BJ87] R. S. Bucy and P. D. Joseph, Filtering for stochastic processes with appli-
cations to guidance, second ed., Chelsea, 1987.
[BK29] S. Banach and C. Kuratowski, Sur une généralisation du problème de la
mesure, Fund. Math. 14 (1929), 127–131.