Stochastic Models

Contents

1 Probability Theory
  1.1 Probabilities
  1.4 Conditioning
2 Poisson Processes
  2.1.1 Introduction
3 Extensions
4 Renewal Processes
  4.1 Definition
George Box
Someone who wishes to understand the world around them has two options:
they either become religious or a mathematician.
Our physical world is so complex that trying to model even the simplest processes can be quite difficult. One reason for this is that a vast number of variables can influence the outcome of an experiment: temperature, pressure, air humidity, wind, electromagnetic signals, and so on. These variables need to be taken into account in some way or another.
The main way to deal with this is to keep the covariates fixed: experiments are often done in a sterile lab or in a vacuum. However, there is one variable that we will never be able to fix, namely time. Unfortunately, time is correlated with almost everything, so we cannot simply ignore it as a variable.
Since we can neither keep it fixed nor ignore it, we are forced to take time into account more directly. For example, suppose we want to model the evolution of a stock price. We do not consider this evolution as a single value, e.g.

S = $14,

but rather as a quantity S_t that depends on the time t at which it is observed.
Just like nature, stock prices have a huge number of variables that drive their evolution. Even if one could theoretically pinpoint what all these variables are, it would be practically impossible to determine their values¹ and how they impact the stock price. This is where we enter the world of randomness and probabilities: instead of trying to find a deterministic function for S_t, one defines a function that tells us how likely it is to observe a given value for S_t:
P(S_t) : R≥0 × R≥0 → [0, 1] : (t, x) ↦ probability that the stock at time t has value x.
Even though stochastic processes are crucial modeling tools, we hope that the reader will also come to appreciate them as a beautiful topic in their own right. We will see in the following chapters that stochastic processes have fascinating properties and results that, in contrast to many other domains of mathematical modeling, require relatively little technical machinery. The proofs do not rely on deep topological, algebraic, or analytical results, which makes the topic surprisingly accessible.
However, as the name and the introduction suggest, the reader does need at least some basic probability theory. We will therefore start by covering some probability theory, focusing mainly on the concepts and results that we will need in later chapters.
We will then look at Poisson processes and their extensions. They are among the most well-known and well-studied stochastic processes, and for good reason: they arise naturally in many different scenarios.
¹ For example, stock prices are heavily determined by human thinking, which is arguably the most unpredictable variable of them all.
We will also look at Markov processes, with which many readers will probably already be familiar. Our focus is mainly on the dynamical properties of Markov processes.
Probability Theory
Joseph Bertrand
Probability theory is the natural mathematical setting for dealing with uncertainty and randomness. The first results in this field were already obtained in the sixteenth century, when most of the effort went into modeling gambling games. However, it was only in the twentieth century that the first formal mathematical framework for probability was constructed by Kolmogorov. In his celebrated work Foundations of the Theory of Probability, he developed the axiomatic system of probability theory that is still used to this day.
We will use Kolmogorov's axiomatic system as the starting point for this chapter. We will look at some important concepts, such as conditional probabilities and random variables. This chapter is mainly intended for those who have never had a mathematical course on probability theory or those who need a refresher.
1.1 Probabilities
Many scientific experiments are also experiments in the above sense. Even though one often tries to account for the covariates that influence the outcome, it is often not feasible, or even impossible, to control everything. This can lead to small fluctuations in the set-up or the procedure of the experiment, which can in turn lead to large differences in the outcomes.
Definition 1.2. The sample space Ω of an experiment is the set of all possible
outcomes ω of the experiment.
Example 1.1.
1. Consider the experiment where we flip a coin. Then the sample space is
given by
Ω = {H, T }
Here, the outcome H denotes the outcome where the coin lands on heads.
2. Consider the experiment where we flip a coin three times in a row. Then,
the sample space is given by
Ω = {HHH, HHT , HT H, HT T , T HH, T HT , T T H, T T T } .
3. In a clinical trial, we want to consider the time elapsed between the moment a patient is treated and the moment the patient is cured. Then, the sample space is given by
Ω = R≥0 .
Notice that in the first two examples, the sample space is finite. In the last
example, the set is uncountable.
Example 1.2.
3. Given a sample space Ω, there are a total of 2^|Ω| possible events, where |Ω| denotes the cardinality of the sample space.
A σ-algebra F on Ω is a collection of events satisfying:

1. ∅ ∈ F
2. If A ∈ F then A^C ∈ F
3. If A_i ∈ F, i = 1, 2, ..., then ∪_{i=1}^∞ A_i ∈ F.

A probability function is then a mapping

P : F → [0, 1] : A ↦ P(A)

such that

• P(Ω) = 1
• P is σ-additive: for pairwise disjoint events A_1, A_2, ... we have P(∪_i A_i) = Σ_i P(A_i).
1. P(A^C) = 1 − P(A)
2. P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

The second point follows from writing A ∪ B as the disjoint union of A and B \ (A ∩ B):

P(A ∪ B) + P(A ∩ B) = P(A) + P(B \ (A ∩ B)) + P(A ∩ B) = P(A) + P(B).

3. Left as an exercise for the reader. (Tip: show this using induction.)
We have seen that, by the axioms of probability theory, any probability function P satisfies P(Ω) = 1. Other events A can also have probability one; such events get a special name.

Definition 1.8. Let A ∈ F be such that P(A) = 1. Then we say that A holds almost surely, denoted A a.s. The complement of an event that holds almost surely is called a null set.
Alternatively, we have
P (A ∩ B) = P (A | B) P (B) .
Suppose we have that P (A | B) > P (A). This implies that if we know that B
happened, it becomes more probable to also observe A. For example, suppose
A denotes whether a person is infected by some disease and B denotes the
outcome of a test. Then generally speaking we would say that the person has a
higher probability of being infected when the test is positive.
If the probability remains the same, we say that the events are independent.
Consider, for example, the experiment where we throw a die twice.

• The events A = "the sum of the two throws is 12" and B = "the first throw is a six" are dependent.

• The events A = "the first throw is a six" and B = "the second throw is a four" are independent.
The following result concerning conditionals is called the law of total prob-
ability. It will be used a lot in the following chapters.
Property 1.11 (law of total probability). Let {B_i} be a collection of events such that

1. B_i ∩ B_j = ∅ for all i ≠ j,
2. ∪_i B_i = Ω.

Then for any event A,

P(A) = Σ_i P(A | B_i) P(B_i).

Proof. Since the B_i are disjoint, the sets A ∩ B_i are also disjoint. Hence, by the σ-additive property,

Σ_i P(A | B_i) P(B_i) = Σ_i P(A ∩ B_i) = P(∪_i (A ∩ B_i)) = P(A ∩ ∪_i B_i) = P(A ∩ Ω) = P(A).
Two events A and B are said to be conditionally independent given an event C if

P(A ∩ B | C) = P(A | C) P(B | C).
Since both systems share component B, the events that the two brake systems fail are not independent. However, conditioning on the state of B, they are independent. Denoting by W_i (resp. F_i) the event that component or system i works (resp. fails), we get
We are often not interested in the exact outcome of the experiment but rather
in some value that is associated with the outcome. This brings us to the concept
of random variables. We start by giving an example.
However, our main interest is often a specific (numerical) feature, such as salary, height, age, etc. For each feature, we have a corresponding random variable, e.g.

X : Ω → R : ω ↦ salary of ω.
Notation 1.15. In the following, we will often omit the random variable from the
notation, meaning that we will write PX as P.
Remark. Since P(A) is only defined for events A ∈ F, the induced probability measure in property 1.14 only exists if, for every event B in the associated σ-algebra on R, we have

X^{-1}(B) ∈ F.

In measure theory, we then call X F-measurable. In the remainder of this book, we will always assume that the random variables we consider are F-measurable.
F_X(x) = P_X(X ≤ x) = P({ω ∈ Ω | X(ω) ≤ x}).
This distribution function gives the probability that the observed random variable is at most a given value x. Notice that

P(a < X ≤ b) = F_X(b) − F_X(a).

Hence, using the distribution function we can find the probability for X on any bounded half-open interval as well.
The tail function of X is defined as

F̄(x) = 1 − F_X(x).

Thus, the tail function of a random variable X is the tail of its distribution function. In order to understand these concepts a bit better, we look at some examples of random variables.
Remark. Examples of countable subsets are all finite sets, but also countably in-
finite sets. A set is countably infinite if there exists a bijection (i.e a one-to-one
correspondence) between the set itself and the natural numbers N.
In other words, countable sets are those sets that are either finite or ’as large as
N’.
Example 1.9.
The reason for their simplicity lies in the fact that probability functions satisfy
σ −additivity. Indeed, we then have the following result.
Property 1.19. Suppose X is a discrete random variable. For any event A in the σ-algebra F of the probability space, we have

P(X ∈ A) = Σ_{x∈A} p_x,
There are many different types of discrete random variables. In the next
sections, we will cover some of the most important ones.
Definition 1.20. Let X denote the number of successes in one experiment with given
success probability p. A success is denoted by X = 1. We denote the probability of a
failure as P (X = 0) = q = 1 − p. Then, we say that X is Bernoulli distributed with
success probability p.
Notation 1.21. We write X ∼ Bernoulli(p).
P(X = k) = (1 − p)^k · p.

The Poisson distribution is often used to model the number of events that happen in a given time interval.
Luckily, we still have an alternative called the density function. Just like
one sums the probability over the outcomes in an event for the countable case,
the density function recovers the probability by taking the integral over the
outcomes in the event.
Figure 1.6: The density function and distribution function of continuous random
variables
For any interval [a, b], we find that the probability of a < X ≤ b is given by

P(a < X ≤ b) = F(b) − F(a) = ∫_a^b f_X(u) du.
Instead of working with the probability function, we often work with the
density function.
Definition 1.31. A function f_X is a density function if the following conditions hold:

• f_X is non-negative: f_X(x) ≥ 0
• ∫_{−∞}^{∞} f_X(u) du = 1.
One can show that the distribution function of a uniformly distributed random variable is given by

F_X(x) = (x − a) / (b − a),   a < x < b.
Notation 1.33. We write X ∼ U (a, b).
Example 1.10. Suppose that exactly every 12 minutes, a bus passes your bus
stop. Without checking the schedule, you wait at the bus stop for the next bus
Property 1.36. The exponential distribution has the lack-of-memory property, i.e. if X is exponentially distributed, then

P(X > s + t | X > t) = P(X > s)   for all s, t ≥ 0.
already waited for t units, the probability that you have to wait an additional s
time units is the same as if you had not been waiting at all. For exponentially
distributed times, it hence does not make sense to expect something to happen
just because it hasn’t happened for a long period.
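To make this concrete, here is a minimal simulation sketch (assuming numpy is available; the rate, the offsets s and t, and the sample size are arbitrary choices) comparing the empirical conditional tail P(X > s + t | X > t) with the unconditional tail P(X > s):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, s, t = 0.5, 2.0, 3.0                       # rate and the two time offsets (arbitrary)
x = rng.exponential(scale=1 / lam, size=1_000_000)

# Conditional tail: among samples that exceeded t, how many exceed t + s?
survived_t = x[x > t]
p_cond = np.mean(survived_t > t + s)

# Unconditional tail P(X > s), empirical and exact.
p_uncond = np.mean(x > s)
p_exact = np.exp(-lam * s)

print(f"P(X > s+t | X > t) = {p_cond:.4f}")
print(f"P(X > s)           = {p_uncond:.4f} (exact: {p_exact:.4f})")
```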
f_X(x) = (1 / √(2πσ²)) e^{−(x−µ)² / (2σ²)},   −∞ < x < ∞.
Then we say that X is normally distributed with mean µ and standard deviation σ .
Figure 1.9: The distribution function and density function for a standard nor-
mally distributed random variable
The distribution of Y is denoted by χ2 (n). In this context, we often call n the degrees
of freedom.
⁴ Just like events, random variables can be independent. We will see what this means in a later section.
Figure 1.10: The distribution function and density function for a χ2 (3)-
distributed random variable
f_X(x) = (λ^N x^{N−1} / Γ(N)) e^{−λx}.

For integer N, since Γ(N) = (N − 1)!, this becomes

f_X(x) = (λ^N x^{N−1} / (N − 1)!) e^{−λx}.
Remark. Some authors define the scale parameter to be 1/λ. Especially when using programming packages, it is important to always check which convention is used.
Figure 1.11: The distribution function and density function for a Gamma(2, 1)-
distributed random variable
Gamma(1, λ) = exp(λ).
2. If N = k/2 and λ = 1/2, we recover the χ²(k) distribution:

Gamma(k/2, 1/2) ~ χ²(k).

Proof. Exercise. You can use the fact that the density of the χ²(k) distribution is given by

f_{χ²(k)}(x) = (1 / (2^{k/2} Γ(k/2))) x^{k/2 − 1} e^{−x/2}.
Definition 1.45. A random variable is Pareto distributed with scale k > 0 and shape α if it has the following density function:

f_X(x) = α k^α / x^{α+1}   if x ≥ k,
f_X(x) = 0                 if x < k.
The Pareto distribution is often used to model quantities with very fat right
tails.
Figure 1.12: The distribution function and density function for a Pareto(1, 2)-distributed random variable
In practice, we are often interested in more than one feature of a given obser-
vation. For example, when studying how wealth is distributed within a given
population we might want to record both the age and the salary of the members
of the population. In order to record this information, we can use a stochastic
vector.
Definition 1.48. Let X1 , X2 , ..., Xn be random variables defined on the same prob-
ability space (Ω, F , P). Define the mapping
Notation 1.49. Instead of writing X = x, we often write out the vectors in full:

(X_1 = x_1, X_2 = x_2, ..., X_n = x_n).
Here, we write xi = Xi (ω). In most cases, symbols written in capital denote the
random variable itself whilst those written in lowercase denote their realization,
x ∈ R.
Just like in the univariate case, a stochastic vector gives rise to a probability
function.
Definition 1.50. Let X_1, X_2, ..., X_N be discrete random variables. Then the joint (probability) density function is given by

f(x_1, x_2, ..., x_N) = P(X_1 = x_1, X_2 = x_2, ..., X_N = x_N).

For any X_i, i = 1, ..., N, we can recover the marginal density from the joint density via

f_{X_i}(x_i) = Σ_{x_1} ··· Σ_{x_{i−1}} Σ_{x_{i+1}} ··· Σ_{x_N} P(X_1 = x_1, ..., X_i = x_i, ..., X_N = x_N).
Definition 1.51. The joint density function of a set of continuous random variables X_1, X_2, ..., X_N is such that

P((X_1, ..., X_N) ∈ A) = ∫_A f(u_1, ..., u_N) du_1 ··· du_N.
Just like in the univariate case, we can also define the distribution function.
Definition 1.53. For any two random variables X, Y, we say that they are independent if

P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)

for any two sets A, B in the σ-algebra F.
P (X = x, Y = y) = P (X = x) P (Y = y) , ∀x, y
P(X + Y = n) = Σ_{m=0}^{n} P(X = m) P(Y = n − m)
             = Σ_{m=0}^{n} e^{−λ_X} (λ_X^m / m!) e^{−λ_Y} (λ_Y^{n−m} / (n − m)!)
             = e^{−(λ_X + λ_Y)} (1/n!) Σ_{m=0}^{n} (n choose m) λ_X^m λ_Y^{n−m}
             = e^{−(λ_X + λ_Y)} (λ_X + λ_Y)^n / n!,

where the last step uses the binomial theorem. Hence X + Y is Poisson(λ_X + λ_Y) distributed.
Proof. Since g is strictly increasing and continuous, the inverse is defined on the image of g. Furthermore, the events {X ≤ g^{−1}(y)} and {g(X) ≤ y} coincide, so that

F_Y(y) = P(Y ≤ y)
       = P(g(X) ≤ y)
       = P(X ≤ g^{−1}(y))
       = F_X(g^{−1}(y)).
Differentiating the above equality and using the chain rule, we find

f_Y(y) = f_X(g^{−1}(y)) · (d/dy) g^{−1}(y).
F_Y(y) = P(Y ≤ y)
       = P(−√y ≤ X ≤ √y)
       = P(X ≤ √y) − P(X ≤ −√y)
       = F_X(√y) − F_X(−√y).

Differentiating (with X standard normal), we find

f_Y(y) = (1/√(2π)) e^{−y/2} / √y.
f_Y(y) = (1 / (√(2π) y σ)) e^{−(log(y)−µ)² / (2σ²)}.
Associated with a random variable are its moments. These values contain
some interesting information regarding the behavior of the variable. Probably
the most well-known one is the expected value.
Definition 1.56. The expected value of a discrete random variable X with distribution f_X(k) is given by

E[X] = Σ_k k f_X(k),

provided that Σ_k |k| f_X(k) < ∞. Similarly, the expected value of a continuous random variable X with density function f_X(x) is given by

E[X] = ∫_R x f_X(x) dx,

provided that ∫_R |x| f_X(x) dx < ∞.
Notation 1.58. The second central moment µc2 is often called the variance and will
be denoted as Var (X).
= (∫_{−∞}^{∞} x f_X(x) dx) E[Y]
= E[X] E[Y].
Intuitively, the covariance measures the linear relationship between the two
variables. If the covariance is positive, large values of X tend to correspond
with large values of Y . A negative covariance shows the opposite effect: large
values of X correspond to small values of Y .
Example 1.16. Assume X ~ Poisson(λ). Then the mean and the variance are both equal to λ.

Furthermore, we find

E[X²] = Σ_{j=0}^{∞} j² e^{−λ} λ^j / j!
      = λ Σ_{j=1}^{∞} j e^{−λ} λ^{j−1} / (j − 1)!
      = λ Σ_{j=0}^{∞} (j + 1) e^{−λ} λ^j / j!
      = λ (1 + Σ_{j=0}^{∞} j e^{−λ} λ^j / j!)
      = λ(1 + λ),

where the last sum is E[X] = λ. Hence Var(X) = E[X²] − E[X]² = λ(1 + λ) − λ² = λ.
Example 1.17. Assume X ∼ B(n, p). Then, the mean is np and the variance is
npq = np(1 − p).
For this, notice that a Binomial random variable can be written as the sum
of n independent Bernoulli random variables. Using this, try to give a proof
using property 1.61.
The reason why this is interesting follows from the following observation.
Property 1.65. Let A be any event in the σ-algebra F and let I(A) be the associated indicator function, which is a random variable. Then the probability function associated with this random variable is

P(I(A) = i) = P(A) if i = 1, and P(I(A) = i) = 1 − P(A) if i = 0.

Thus in particular,

E[I(A)] = P(A).
Proof. Since I(A) is either 0 or 1, it suffices to check the pre-image of those two
values. Notice that
I(A)−1 (1) = {ω ∈ Ω | I(A)(ω) = 1} = A,
and
I(A)−1 (0) = {ω ∈ Ω | I(A)(ω) = 0} = Ω \ A.
Therefore
P (I(A) = 1) = P ({ω ∈ Ω | I(A)(ω) = 1}) = P (A)
and
P (I(A) = 0) = P ({ω ∈ Ω | I(A)(ω) = 0}) = P (Ω \ A) .
Notice that indeed
E [I(A)] = 1 · P (A) + 0 · (1 − P (A)) = P (A) .
For the expected value of non-negative random variables, we have the following very handy result.

Property 1.66. Let X be a non-negative continuous random variable. Then

E[X] = ∫_0^∞ x f_X(x) dx = ∫_0^∞ F̄(x) dx,

where F̄(x) = 1 − F(x) is the tail function. For non-negative integer-valued variables, this becomes

E[X] = Σ_{k=0}^{∞} F̄(k).
Proof. We first give the proof for discrete, integer-valued, non-negative random variables. We have that

E[X] = Σ_{k=0}^{∞} k p_k.

On the other hand,

Σ_{k=0}^{∞} F̄(k) = Σ_{k=0}^{∞} Σ_{j=k+1}^{∞} p_j = Σ_{j=0}^{∞} Σ_{k=1}^{j} p_j = Σ_{j=0}^{∞} j p_j = E[X].
For the continuous case, write E[g(X)] = ∫_0^∞ g(u) dF(u) for a differentiable function g with g(0) = 0. Integration by parts gives

∫_0^∞ g(u) dF(u) = −∫_0^∞ g(u) d(1 − F(u))
                 = [−g(u)(1 − F(u))]_0^∞ − ∫_0^∞ (1 − F(u)) d(−g(u))
                 = ∫_0^∞ F̄(u) g′(u) du,

since the boundary term vanishes. Taking g(u) = u yields E[X] = ∫_0^∞ F̄(u) du.
To demystify this equality, we give the following visualization for the discrete case.

Suppose X takes on the values 0 < x_1 < ... < x_5. Then, notice that for any value x < x_1, we have P(X > x) = 1. For x_1 < x ≤ x_2, we have P(X > x) = 1 − P(x_1). For values x_2 < x ≤ x_3, we have P(X > x) = 1 − P(x_1) − P(x_2), and so forth.

Thus, the area under the tail function can be written as a sum of rectangles:

∫_0^∞ F̄(s) ds = Σ_{i=1}^{5} x_i P(x_i) = E[X].
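As a quick numerical sanity check, the following sketch (assuming numpy; a geometric random variable with an arbitrary success probability) compares the sample mean with the tail sum Σ_k F̄(k):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3
# Number of failures before the first success; takes values 0, 1, 2, ...
x = rng.geometric(p, size=500_000) - 1

sample_mean = x.mean()

# Tail sum: E[X] = sum_{k>=0} P(X > k), truncated at a large K.
tail_sum = sum(np.mean(x > k) for k in range(200))

print(f"sample mean         : {sample_mean:.4f}")
print(f"sum of tail probs   : {tail_sum:.4f}")
print(f"theoretical (1-p)/p : {(1 - p) / p:.4f}")
```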
Proof. Since X, Y are independent, so are e^{tX} and e^{tY}. By proposition 1.62, we find

φ_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] E[e^{tY}] = φ_X(t) φ_Y(t),

which proves the claim.
Notice that the generating function fully describes the probability distribution of the random variable, since for any k ∈ N,

P(X = k) = (1/k!) (d^k/ds^k) γ_X(s) |_{s=0}.

Hence, if two random variables have the same generating function, they have the same distribution.
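As an illustration, here is a small symbolic sketch (assuming sympy is available; it uses the Poisson generating function γ(s) = e^{−λ(1−s)} that appears later in the text) recovering the first few probabilities from the generating function:

```python
import sympy as sp

s, lam = sp.symbols("s lambda", positive=True)
gamma = sp.exp(-lam * (1 - s))   # generating function of a Poisson(lambda) variable

for k in range(4):
    p_k = sp.diff(gamma, s, k).subs(s, 0) / sp.factorial(k)
    print(f"P(X = {k}) =", sp.simplify(p_k))
# Each line prints exp(-lambda)*lambda**k/k!, i.e. the Poisson probability mass function.
```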
1.4 Conditioning
For jointly continuous random variables X, Y the conditional density of X given y is defined as

f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)   (for all y with f_Y(y) > 0).
Proof. This proof is straightforward and is left as an exercise for the reader.
Another result we will need later on is the so-called chain rule for conditional probabilities: for events A_0, A_1, ..., A_k,

P(A_0 ∩ A_1 ∩ ··· ∩ A_k) = P(A_0) P(A_1 | A_0) P(A_2 | A_0 ∩ A_1) ··· P(A_k | A_0 ∩ ··· ∩ A_{k−1}).
Proof. We will use induction. First, assume k = 1, i.e. we have two events A_0, A_1. Then it follows from the definition of conditional probability that

P(A_0 ∩ A_1) = P(A_1 | A_0) P(A_0).

Suppose now that the property holds for all values 1, ..., k − 1. Then

P(∩_{i=0}^{k} A_i) = P(A_k ∩ ∩_{i=0}^{k−1} A_i).

Since ∩_{i=0}^{k−1} A_i is an event in F, we can use the base case to write

P(∩_{i=0}^{k} A_i) = P(A_k | ∩_{i=0}^{k−1} A_i) P(∩_{i=0}^{k−1} A_i).

Applying the induction hypothesis to the last factor completes the proof.
Poisson Processes
Margaret Drabble
2.1.1 Introduction
These processes are not just a mathematical toy but arise naturally in all kinds of domains.

Just like we have continuous and discrete random variables, we can distinguish different types of stochastic processes.
Definition 2.1. Let {Xt | t ∈ T } be a stochastic process, then
Perhaps the simplest of stochastic processes are the IID (Independent Identically
Distributed) processes. They are defined as follows.
Definition 2.2. Let {Xn | n ∈ N} be a discrete process. Then this stochastic process
is called an IID process if the following hold:
Hence, we can consider the process as a sequence of independent repeats of the same
experiment.
Using these processes, one can generate a new type of process called random
walks.
Definition 2.3. Let {Xn | n ∈ N} be an IID process. Define the discrete process
{Sn | n ∈ N} via
Sn+1 = Sn + Xn+1 , S0 = 0,
i.e. the cumulative sum with increments X_n. We call such a discrete process the random walk generated by X.
P (Sn+1 ≤ s | S0 = 0, S1 = s1 , ..., Sn = sn ) = P (X ≤ s − sn ) .
Notice that the value of Sn+1 only depends on the value of the previous obser-
vation Sn , and not how it got there. We call this form of dependency Markov
dependency.
X_t = Σ_{i=1}^{t} B_i = number of successes in the first t trials.
Example 2.2. A basketball coach knows from experience that each team
member has the same probability of scoring a free throw, namely p. Each
minute, a player attempts a free throw. Each such attempt can be seen as a
Bernoulli experiment with success probability p.
The following property shows why this process is called a binomial process. It
also shows that the time between two jumps in the random walk is geometrically
distributed.
Property 2.6. Let Xt be a binomial process generated by the IID process Bt . Then
Proof. The first property follows immediately from the fact that a binomial random variable is a sum of IID Bernoulli random variables. Let us now show the second point. For this, let k ≥ 0 and denote

h = min_n {n | X_n = k},

i.e. the time of the k-th arrival. We will show that the time until the (k + 1)-th arrival is geometrically distributed with parameter p.

The time between the k-th arrival and the (k + 1)-th arrival is given by

T_{k+1} = min_j {j | X_{h+j} − X_h = 1} = min_j {j | Σ_{i=h+1}^{h+j} B_i = 1}.
Notice that in the case of binomial processes, it is assumed that the time
between experiments is fixed: each time unit should consist of one and only
one experiment. As a consequence, the value of the binomial process can only
change at discrete times t = 1, 2, .... This is shown by the fact that the time
between jumps has a discrete distribution.
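A minimal simulation sketch (assuming numpy; the success probability p and the horizon are arbitrary choices) of a binomial process, checking that the times between jumps look geometric with mean 1/p:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_trials = 0.2, 200_000

bernoullis = rng.random(n_trials) < p        # B_1, B_2, ..., one per time unit
X = np.cumsum(bernoullis)                    # binomial process X_t

jump_times = np.flatnonzero(bernoullis) + 1  # times t at which X_t jumps
waiting_times = np.diff(jump_times)          # times between consecutive jumps

print("X_T / T          :", X[-1] / n_trials, " expected p =", p)
print("mean waiting time:", waiting_times.mean(), " expected 1/p =", 1 / p)
print("P(T = 1)         :", np.mean(waiting_times == 1), " expected p =", p)
```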
Many processes in everyday life that count the events up to a particular point
in time can be adequately modeled using the Poisson distribution we have seen
in chapter one. These special counting processes, which we will call Poisson
processes, possess many desirable properties as we will soon discover.
In the next sections, we will give three different, but ultimately equivalent
definitions of Poisson processes. Each definition gives a characterization of a
different flavor. Knowing and understanding these definitions can not only lead
to deeper understanding but can also greatly simplify calculations.
For the first definition, we take a look at the case of random variables. There,
we have seen in proposition 1.29 that a Poisson random variable can be seen as
the limit of Bernoulli random variables. We will start by considering the analog
in the case of processes.
In the limit, one can show that this process converges to a process with the
following properties, which we will call a Poisson process.
Definition 2.7. The counting process {Nt | t ≥ 0} is a Poisson process with rate
λ > 0 if it satisfies the following requirements
The following result shows why these processes are called Poisson processes.

Property 2.8. For a Poisson process N_t with rate λ, the number of events observed in any interval of length t satisfies

P(N_t = k) = P(N_{s+t} − N_s = k) = e^{−λt} (λt)^k / k!,   k = 0, 1, ....

In other words, N_{s+t} − N_s is Poisson(λt) distributed.
f_k(t) = P(N_t = k),   t ≥ 0.

The probability that no event happens in the time interval [0, t + h] is then given by f_0(t + h). Using all three properties of the Poisson process, we get

f_0(t + h) = f_0(t)(1 − λh + o(h)).

Hence,

(f_0(t + h) − f_0(t)) / h = −λ f_0(t) + (o(h)/h) f_0(t).
Taking the limit h → 0, we find

d f_0(t)/dt = −λ f_0(t).
For k ≥ 1, we will use the law of total probability (property 1.11). We obtain

f_k(t + h) = P(N_{t+h} = k | N_t = k) P(N_t = k)
           + P(N_{t+h} = k | N_t = k − 1) P(N_t = k − 1)
           + Σ_{j=2}^{k} P(N_{t+h} = k | N_t = k − j) P(N_t = k − j).
However, by the third property of Poisson processes, for small h this becomes

(f_k(t + h) − f_k(t)) / h = λ(f_{k−1}(t) − f_k(t)) + o(h)/h.

Taking the limit, this becomes

d f_k(t)/dt = λ(f_{k−1}(t) − f_k(t)),   k ≥ 1.
We thus need to solve these differential equations for all f_k. For this, we use a clever trick: we first consider the generating function of N_t,

γ_{N_t}(s) = E[s^{N_t}] = Σ_{k=0}^{∞} s^k f_k(t).

Since the generating function uniquely determines the distribution, it suffices to show that this generating function coincides with the generating function of a Poisson(λt) random variable,

γ_X(s) = e^{−λt(1−s)}.
Differentiating with respect to t and using the differential equations above, we find

∂γ_{N_t}(s)/∂t = Σ_{k=0}^{∞} s^k d f_k(t)/dt
              = −λ f_0(t) + Σ_{k=1}^{∞} s^k λ (f_{k−1}(t) − f_k(t))
              = −λ Σ_{k=0}^{∞} s^k f_k(t) + λ s Σ_{k=0}^{∞} s^k f_k(t)
              = −λ(1 − s) γ_{N_t}(s).

Solving this differential equation with initial condition γ_{N_t=0}(s) = 1 yields γ_{N_t}(s) = e^{−λt(1−s)}, which is precisely the Poisson(λt) generating function.
We will now show that the second definition implies the first. The opposite direction has already been shown in property 2.8 in the previous section.

For this, consider the Taylor expansion of P(N_t = k). This can easily be shown to be equal to

P(N_t = k) = (λt)^k / k! − (λt)^{k+1} / k! + ...

Then for h small, we get

P(N_h = k) = 1 − λh + o(h)   if k = 0,
P(N_h = k) = λh + o(h)       if k = 1,
P(N_h = k) = o(h)            if k > 1.
In the third and final characterization, we will focus on the time between the
events instead of the counting process directly. These times are called jump (or
arrival) times.
Definition 2.10. Let N_t be a counting process. We then define the k-th jump (or arrival) time as

t_k = min_{t ≥ 0} {t | N_t ≥ k}.

Thus, the k-th arrival time denotes the time at which we observe the k-th event.
Definition 2.11. Let N_t be a counting process. Then define the i-th waiting time as

T_i = t_i − t_{i−1}.
Just like the arrival times, the waiting times are random variables. They are
easily seen to be non-negative.
Some counting processes allow multiple events to happen at the same time with non-zero probability. We call these clustered processes. A counting process is called stable if the number of events observed up to any finite time C is finite almost surely:

P(N_C < ∞) = 1.
Notice that if we know the arrival times, we know all the information about
the counting process, and vice versa. Hence, we know that there must be a
characterization using only the arrival times. For this, we have the following
result.
Property 2.14. Let Nt be a Poisson process. Then the waiting times Ti are i.i.d
exponential(λ) distributed.
From this, we conclude that the waiting times are exponential(λ) distributed.
We now show how we can define a Poisson process using the waiting times T_i. Let T_1, T_2, ... be i.i.d. exponential(λ) random variables and let t_n = T_1 + ... + T_n. Then

N_t = max {n | t_n ≤ t}

is a Poisson process with rate λ.

Since t_n is a sum of n independent exponential(λ) variables, it is Gamma(n, λ) distributed with density

f_{t_n}(t) = (λ^n t^{n−1} / (n − 1)!) e^{−λt}.
We first show that this indeed defines a Poisson process that corresponds to
the other definitions.
Notice that

F_{t_n}(t) = ∫_0^t (λ^n x^{n−1} / (n − 1)!) e^{−λx} dx.
Integrating by parts repeatedly, we find

F_{t_n}(t) = (λ^n / (n − 1)!) ∫_0^t x^{n−1} e^{−λx} dx
           = (λ^n / (n − 1)!) ( [−x^{n−1} e^{−λx} / λ]_0^t + ((n − 1)/λ) ∫_0^t x^{n−2} e^{−λx} dx )
           = −((λt)^{n−1} / (n − 1)!) e^{−λt} + (λ^{n−1} / (n − 2)!) ∫_0^t x^{n−2} e^{−λx} dx
           = ···
           = 1 − Σ_{i=0}^{n−1} ((λt)^i / i!) e^{−λt}.
Hence,

P(N_t ≥ n) = P(t_n ≤ t) = F_{t_n}(t) = 1 − Σ_{i=0}^{n−1} e^{−λt} (λt)^i / i!,

so that P(N_t = n) = P(N_t ≥ n) − P(N_t ≥ n + 1) = e^{−λt} (λt)^n / n!, i.e. N_t is Poisson(λt) distributed.
Notice that from this result, we can show that Poisson processes behave quite nicely: they are stable and not clustered.

Proof. We first show stability. For this, notice that for any C,

P(N_C = k) = e^{−λC} (λC)^k / k! → 0   as k → ∞.

To show that Poisson processes are not clustered, recall that the waiting time T is exponentially distributed, which is a continuous distribution. Hence, P(T = 0) = 0.
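The construction via exponential waiting times also gives a simple way to simulate a Poisson process. A minimal sketch (assuming numpy; the rate and horizon are arbitrary) that checks the count N_t against the Poisson(λt) mean and variance:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, t_max, n_paths = 1.5, 10.0, 20_000

counts = np.empty(n_paths, dtype=int)
for i in range(n_paths):
    # Arrival times are cumulative sums of exponential(lam) waiting times.
    arrivals = np.cumsum(rng.exponential(scale=1 / lam, size=int(5 * lam * t_max) + 50))
    counts[i] = np.searchsorted(arrivals, t_max, side="right")  # N_{t_max}

print("empirical mean of N_t:", counts.mean(), " theory:", lam * t_max)
print("empirical var  of N_t:", counts.var(), " theory:", lam * t_max)
```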
Consider, for example, the expectation

E[Σ_{i=1}^{N_t} (t − t_i)].

One would be tempted to use linearity of the expectation on the sum, but notice that this cannot be done directly since N_t is itself a random variable. We will see that one way to circumvent this issue is by using the Tower Rule, for which we will need conditionals.
Theorem 2.18. Suppose that Nt = 1 for some given t. Then conditionally on that
information, the time of arrival t1 is uniformly distributed over (0, t].
Proof. Using the definition of conditional probability, we have for any s ∈ (0, t]

P(t_1 < s | N_t = 1) = P(t_1 < s, N_t = 1) / P(N_t = 1).
Let us now revisit the example, and notice how this easy result makes a
complicated exercise almost trivial.
Example 2.4. With the same setting as before, notice that we can use the linearity of the expected value if we condition on the count. Using the Tower Rule,

E[Σ_{i=1}^{N_t} (t − t_i)] = E[ E[Σ_{i=1}^{N_t} (t − t_i) | N_t = n] ]
                          = E[ N_t t − N_t t/2 ] = λt²/2.

Here, we use the fact that E[Σ_{i=1}^{n} t_i | N_t = n] = E[Σ_{i=1}^{n} U_{[0,t]}] = n t/2, where U_{[0,t]} is a uniformly distributed random variable on [0, t].
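A quick Monte Carlo sketch (assuming numpy; λ and t are arbitrary) of this identity, simulating the Poisson process and averaging Σ_i (t − t_i):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, t, n_paths = 2.0, 5.0, 50_000

totals = np.empty(n_paths)
for i in range(n_paths):
    arrivals = np.cumsum(rng.exponential(1 / lam, size=int(10 * lam * t) + 50))
    arrivals = arrivals[arrivals <= t]
    totals[i] = np.sum(t - arrivals)

print("simulated E[sum (t - t_i)]:", totals.mean())
print("theoretical lam*t^2/2     :", lam * t**2 / 2)
```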
In this section, we study how Poisson processes can be decomposed into other
Poisson processes. We start by considering a motivating example.
Example 2.5. An insurance company has modeled the car crashes of their
clients as a Poisson process. They found that the rate at which these crashes
happen is given by λ and that this rate is constant through time.
However, they would like to make a distinction between heavy car crashes and lighter ones. Suppose they observe that the probability of a crash being heavy is p_1, so that the probability of a crash being light is 1 − p_1 = p_2. It would be natural for the insurance company to split the original Poisson process into two different Poisson processes. This is called thinning.
Property 2.19. Let N(t) be a Poisson process with rate λ. Suppose that the events are either of type 1 or type 2, and that the probability of an event being of type 1 is p_1. Then the process N_1(t) counting the type 1 events is also a Poisson process, with rate p_1 λ, and a similar result holds for the process N_2(t).
Proof. Since events are either of type 1 or type 2, it is clear that N(t) = N_1(t) + N_2(t). We start by calculating the joint probability distribution of N_1(t) and N_2(t):

P(N_1(t) = m, N_2(t) = n)
  = P(N_1(t) = m, N_2(t) = n | N(t) = m + n) P(N(t) = m + n)
  = (m + n choose m) p_1^m p_2^n P(N(t) = m + n)
  = (m + n choose m) p_1^m p_2^n e^{−λt} (λt)^{m+n} / (m + n)!
  = ((m + n)! / (m! n!)) (p_1 λt)^m (p_2 λt)^n e^{−(p_1 + p_2)λt} / (m + n)!
  = e^{−λ p_1 t} ((p_1 λt)^m / m!) · e^{−λ p_2 t} ((p_2 λt)^n / n!).

Thus, the marginal distribution of N_1(t) is given by

P(N_1(t) = m) = Σ_{n=0}^{∞} P(N_1(t) = m, N_2(t) = n)
             = Σ_{n=0}^{∞} e^{−λ p_1 t} ((p_1 λt)^m / m!) e^{−λ p_2 t} ((p_2 λt)^n / n!)
             = e^{−λ p_1 t} ((p_1 λt)^m / m!) Σ_{n=0}^{∞} e^{−λ p_2 t} ((p_2 λt)^n / n!)
             = e^{−λ p_1 t} (p_1 λt)^m / m!.
We find that N1 (t) and N2 (t) are independent and Poisson processes with rates
p1 λ and p2 λ respectively.
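A small simulation sketch (assuming numpy; λ, p_1 and t are arbitrary) of thinning, checking that the type 1 counts look Poisson(p_1 λ t) and are uncorrelated with the type 2 counts:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, p1, t, n_paths = 3.0, 0.25, 4.0, 50_000

n_total = rng.poisson(lam * t, size=n_paths)   # N(t) for each path
n_type1 = rng.binomial(n_total, p1)            # each event is type 1 with prob p1
n_type2 = n_total - n_type1

print("mean/var of N1:", n_type1.mean(), n_type1.var(), " theory:", p1 * lam * t)
print("mean/var of N2:", n_type2.mean(), n_type2.var(), " theory:", (1 - p1) * lam * t)
print("corr(N1, N2)  :", np.corrcoef(n_type1, n_type2)[0, 1], " theory: 0 (independence)")
```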
More generally, suppose events can change from type 2 to type 1. Once an event is of type 1, it remains of type 1. Denote by H the random variable representing the time between the event's arrival and the conversion occurring. In queuing, this is called the service time. Denote by F_H its cumulative distribution function. Then F_H(x) represents the proportion of customers that have been serviced within an elapsed time x, or more generally the probability of the conversion happening before time x.
Similar to before, we find two processes N_1(t) and N_2(t). Suppose that the customers arrive according to a Poisson process with constant rate λ. Then the distributions of the processes N_1(t) and N_2(t) are given by

P(N_1(t) = m) = ((λt p_1(t))^m / m!) e^{−λt p_1(t)}

and

P(N_2(t) = m) = ((λt(1 − p_1(t)))^m / m!) e^{−λt(1 − p_1(t))},

where p_1(t) is the probability that an arrival in [0, t] has converted to type 1 by time t.
Notice that the processes no longer have constant arrival rates. The expected number of customers served by time t is given by

E[N_1(t)] = λt p_1(t) = λ ∫_0^t F_H(s) ds,

and similarly

E[N_2(t)] = λ ∫_0^t (1 − F_H(s)) ds.

Letting t → ∞, the integral in the second expression becomes the integral of the tail function, which we know equals the mean of H (property 1.66). Hence, we find that in the long run, the expected number of people waiting to be serviced is equal to λ E[H].
Just like we can split a Poisson process into several Poisson processes, we
can also combine Poisson processes. This is called a superposition of Poisson
processes. We need the following theorem.
Theorem 2.20. Let s < t and 0 ≤ m ≤ n. Then the conditional distribution of N_s given N_t = n is the binomial distribution B(n, s/t).
Extensions
Thomas Hardy
In the previous chapter, we have briefly covered a Poisson process with a non-
constant arrival rate. These occur quite naturally in many situations.
Figure 3.1: Most customers go to restaurants during breakfast, dinner, and lunch
times
2. One has

P(N_{t+h} − N_t = k) = 1 − λ(t)h + o(h)   if k = 0,
P(N_{t+h} − N_t = k) = λ(t)h + o(h)       if k = 1,
P(N_{t+h} − N_t = k) = o(h)               if k ≥ 2.
Notice that the stationary increment property no longer holds, since the
distribution depends on λ(t).
P(N_t − N_s = k) = e^{−m(s,t)} m(s, t)^k / k!,   where m(s, t) = ∫_s^t λ(u) du.
f_0(t + h) = P(N_{t+h} − N_s = 0)
           = P(N_t − N_s = 0) P(N_{t+h} − N_t = 0)   (independent increments)
           = f_0(t)(1 − λ(t)h + o(h)).
Recall that the generating function of a Poisson(λ) random variable Y is

γ_Y(s) = e^{−λ(1−s)}.

In our case, we find
∂γ_{N_t − N_s}(v)/∂t = Σ_{k=0}^{∞} v^k d f_k(t)/dt
                     = λ(t) v Σ_{k=1}^{∞} v^{k−1} f_{k−1}(t) − λ(t) Σ_{k=0}^{∞} v^k f_k(t)
                     = −λ(t)(1 − v) γ_{N_t − N_s}(v).
One can show that the solution of this differential equation is given by

γ_{N_t − N_s}(v) = e^{−(1−v) ∫_s^t λ(u) du},

so that N_t − N_s is Poisson distributed with mean ∫_s^t λ(u) du. Writing Λ(t) = ∫_0^t λ(u) du for the cumulative rate, we find in particular

E[N_t] = Λ(t).
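Non-homogeneous Poisson processes can be simulated by thinning a homogeneous one: generate events at a constant rate λ_max ≥ λ(t) and keep an event at time t with probability λ(t)/λ_max. A minimal sketch (assuming numpy; the rate function λ(t) = 2 + sin(t) is an arbitrary example) that checks E[N_t] against Λ(t):

```python
import numpy as np

rng = np.random.default_rng(6)

def rate(t):
    return 2.0 + np.sin(t)          # example intensity function lambda(t)

lam_max, t_max, n_paths = 3.0, 6.0, 20_000
counts = np.empty(n_paths, dtype=int)

for i in range(n_paths):
    # Homogeneous Poisson(lam_max) arrivals on [0, t_max] ...
    arrivals = np.cumsum(rng.exponential(1 / lam_max, size=int(10 * lam_max * t_max)))
    arrivals = arrivals[arrivals <= t_max]
    # ... thinned: keep each arrival at time t with probability rate(t) / lam_max.
    keep = rng.random(arrivals.size) < rate(arrivals) / lam_max
    counts[i] = keep.sum()

big_lambda = 2.0 * t_max + (1 - np.cos(t_max))   # Lambda(t) = int_0^t (2 + sin u) du
print("empirical E[N_t]:", counts.mean(), " theory Lambda(t):", big_lambda)
```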
We have already seen that the number of car crashes can be modeled using a Poisson distribution. Here, λ equals the expected number of car crashes for a given population in one time unit. However, this λ need not be representative of every member of the population: some people are better drivers and hence have a much lower λ.
Every person has a different driving skill, and hence we could consider λ
as a random variable. Similarly, we could consider a bond portfolio where
we model the number of defaults. Companies with an investment-grade credit
rating will have a lower probability of default than others. Once again, we could
see the expected rate of defaults as a random variable.
This Θ is used as a multiplier that encodes the risk that a certain member of the population has relative to the average of the population. A good driver will have θ < 1, whilst a bad driver will have θ > 1. In order to preserve the population average, it is common to choose E[Θ] = 1.
Definition 3.6. Let Θ be a positive random variable with density f_Θ and E[Θ] = 1, and let λ > 0. Then we say that X is mixed Poisson distributed if

P(X = k) = ∫_0^∞ e^{−λtθ} ((λtθ)^k / k!) f_Θ(θ) dθ.

We write N_t ~ MPoisson(λt, Θ).
Let us consider what happens with the moments of mixed Poisson processes.
Proposition 3.9. The mean and the variance of a mixed Poisson process are given
by
E [Nt ] = λt
Var (Nt ) = λt + λ2 t 2 Var (Θ)
Proof. We start by showing the equality for the mean. Using the Tower Rule, we find

E[N_t] = E[E[N_t | Θ]] = E[λtΘ] = λt E[Θ] = λt.
Notice that the variance is larger than the mean, which we call overdisper-
sion. This is due to the fact that we add another source of variance by allowing
the parameter to vary over the population.
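A small numerical sketch (assuming numpy; a Gamma-distributed Θ with mean 1 is an arbitrary choice of mixing distribution) illustrating the overdispersion formula Var(N_t) = λt + λ²t² Var(Θ):

```python
import numpy as np

rng = np.random.default_rng(7)
lam, t, n = 1.0, 5.0, 200_000

shape = 2.0                                          # Gamma(shape, scale) with mean 1
theta = rng.gamma(shape, scale=1 / shape, size=n)    # E[Theta] = 1, Var(Theta) = 1/shape
counts = rng.poisson(lam * t * theta)                # N_t | Theta ~ Poisson(lam*t*Theta)

var_theory = lam * t + (lam * t) ** 2 * (1 / shape)
print("mean:", counts.mean(), " theory:", lam * t)
print("var :", counts.var(), " theory:", var_theory)
print("a plain Poisson process would have variance", lam * t)
```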
For the next property, we will need an important inequality from measure
theory called Jensen’s inequality.
Definition 3.10. A function g is convex if for all x, y,

g(x) ≥ g(y) + (x − y) g′(y).
Intuitively, it means that if you connect two points with a straight line, the
function will lie below the straight line. When the function always lies above
the straight line, the function is called concave.
An easy way to check the convexity of a (twice differentiable) function is via its second derivative: g is convex if

g″(x) ≥ 0 for all x.

Jensen's inequality then states that for a convex function g and any random variable X, E[g(X)] ≥ g(E[X]).
Property 3.13. Let {N_t | t ≥ 0} be a mixed Poisson process with rate λ and risk level Θ. Then this process has an excess of zeroes compared to the Poisson process {Ñ_t | t ≥ 0} with constant rate λ:

P(N_t = 0) ≥ P(Ñ_t = 0).
Proof. Notice that the function θ ↦ e^{−λtθ} is convex, since

d²/dθ² e^{−λtθ} = (λt)² e^{−λtθ} ≥ 0.
Notice that

P(N_t = 0) = E[P(N_t = 0 | Θ)] = ∫_0^∞ e^{−λtθ} f_Θ(θ) dθ.

Using Jensen's inequality, we find

P(N_t = 0) = E[e^{−λtΘ}] ≥ e^{−λt E[Θ]} = e^{−λt} = P(Ñ_t = 0),

where we used E[Θ] = 1. This proves the desired result.
In practice, we indeed often find that models that do not account for the
variability in the population tend to underestimate the number of zero-claims.
Of course, the difficulty of using a mixed Poisson model is that one needs to
find a good model for fθ .
Just like in the previous section, we can also use the notion of mixtures in a
Bernoulli setting. In particular, suppose we have a stochastic vector X of size
N consisting of independent (but not identically distributed) Bernoulli random
variables. The probability distribution function is then given by

P(X | p) = Π_{i=1}^{N} p_i^{x_i} (1 − p_i)^{1−x_i}.
Suppose now that we have a mixture of two populations. Just like before, we
would then have two probability vectors, one for each population. We denote
these by p1 and p2 , and their k-th component by p1k and p2k respectively.
The component pik represents the probability that someone from population
i has a success for the experiment associated with the k-th random variable in
the stochastic vector. We illustrate this with an example.
Example 3.2. To evaluate whether a person has a mental illness, many experts
use questionnaires. Suppose we have such a questionnaire where the patient
is presented with N statements and records whether they agree or disagree.
Each question can then be seen as a Bernoulli experiment with X = 1 if the
respondent agrees and X = 0 otherwise.
Example 3.3. The MNIST database is a widely-used data set that contains
handwritten digits ranging from 0 to 9. It is often used to test the performance
of classifiers that are designed to recognize these digits. Each image in the
database consists of a 28x28 grid of pixels, which are either black or white. We
can treat each pixel as a Bernoulli experiment, where a value of 1 indicates that
the pixel is black and a value of 0 indicates that it is white.
In this context, we can consider a mixed model where each digit is its own
population. This gives us 10 probability vectors, one for each digit, with a size
of 784 (corresponding to the number of pixels in each image). For example,
the probability vector p3i represents the probability of the ith pixel being black
given that the digit is a 3.
Renewal Processes
Wilson Mizner
Renewal processes are a special type of counting process for which the
waiting times are independent and identically distributed. They have many
interesting properties, whilst still being quite general. In fact, a lot of the
processes covered so far are examples of renewal processes.
4.1 Definition
Definition 4.2. Let X_n be a renewal process. Consider the random walk generated by this process, i.e. the partial sums

S_n = Σ_{i=1}^{n} X_i.

Then the renewal counting process N_t generated by X_n is the counting process defined by

N_t = max {n | S_n ≤ t}.
The reason that this process is called a renewal process is because the
process loses all its memory (i.e it is ’renewed’) whenever an event occurs.
m(t) = E [Nt ] .
The classical strong law of large numbers states that, under mild conditions, the long-term average of a sequence of IID random variables X_i converges almost surely to the expected value µ of the distribution as the number of elements in the sequence approaches infinity:

P( lim_{n→∞} (Σ_{i=1}^{n} X_i) / n = µ ) = 1.
This yields the following result for renewal counting processes: the sample average time between events, t / N_t, converges almost surely to µ as t → ∞.

Proof. Let S_n denote the associated random walk and write t_n = S_n for the n-th arrival time. For all t, we have

t_{N_t} ≤ t ≤ t_{N_t + 1}.

Therefore,

t_{N_t} / N_t ≤ t / N_t < t_{N_t + 1} / N_t = (t_{N_t + 1} / (N_t + 1)) · ((N_t + 1) / N_t).

By the strong law of large numbers,

P( lim_{t→∞} t_{N_t} / N_t = µ ) = 1   and   P( lim_{t→∞} t_{N_t + 1} / (N_t + 1) = µ ) = 1.

Additionally, since N_t → ∞ as t → ∞,

lim_{t→∞} (N_t + 1) / N_t = 1.

The result then follows from the sandwich theorem.
We can interpret t / N_t as the average time between events in our sample. The theorem thus states that, in the long run, the sample average time between events converges almost surely to the true average time between events µ.
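A minimal simulation sketch (assuming numpy; the inter-arrival distribution is an arbitrary choice, here uniform on [0, 2] so that µ = 1) illustrating N_t / t → 1/µ:

```python
import numpy as np

rng = np.random.default_rng(8)
mu = 1.0                                   # mean of Uniform(0, 2) inter-arrival times

for t in (10, 100, 1_000, 10_000):
    inter_arrivals = rng.uniform(0.0, 2.0, size=int(2 * t / mu) + 100)
    arrival_times = np.cumsum(inter_arrivals)
    n_t = np.searchsorted(arrival_times, t, side="right")
    print(f"t = {t:6d}:  N_t / t = {n_t / t:.4f}   (1/mu = {1 / mu:.4f})")
```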
Another limiting result that we will briefly mention is the central limit theorem for renewal counting processes.

Theorem 4.6. Let N_t be a renewal counting process and suppose that the associated inter-arrival times X_n satisfy E[X] = µ < ∞ and Var(X) = σ² < ∞. Then

(N_t − t/µ) / (σ √(t/µ³)) →_D N(0, 1).
In other words, the value of the stopping time is determined solely by the events that have occurred up until that point and is not influenced by events that occur afterwards.
Example 4.2. Consider a gambler with an initial budget B0 , and let {Bn | n ≥ 0}
be the process representing the gambler’s budget after playing the n-th game.
• Playing until the budget is at its highest is not a stopping time, since it
depends on the future values of the budget.
• Playing until 10 games have been played is a stopping time, as it is deter-
mined solely by the events that have occurred up to that point.
• Playing until the budget runs out is also a stopping time, as it is deter-
mined by the events that have occurred up to that point.
Notice that I_i = I(N > i − 1) = 1 − I(N ≤ i − 1) (recall that N is integer-valued), and the latter is fully determined by X_1, ..., X_{i−1} since N is a stopping time. Hence, I_i is fixed by the conditioning variables and we can write

E[Σ_{i=1}^{N} X_i] = Σ_{i=1}^{∞} E[I_i E[X_i | X_1, ..., X_{i−1}]] = Σ_{i=1}^{∞} E[I_i E[X_i]].
Using the formula for the tail from property 1.66, this becomes

E[Σ_{i=1}^{N} X_i] = E[X] E[N].
Example 4.3. Suppose we play a game in which we throw a six-sided die. The number the die shows is the number of additional dice we will throw. For example, if our first die lands on three, we throw another die three times. Not counting the initial throw, what is the expected total of the subsequent throws?

Notice that N is a stopping time for the second batch of throws, and hence we can use Wald's equation. This gives

E[Σ_{i=1}^{N} X_i] = E[N] E[X] = (7/2) · (7/2) = 49/4 = 12.25.
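A quick Monte Carlo sketch (assuming numpy) of this dice game, checking the Wald prediction of 49/4:

```python
import numpy as np

rng = np.random.default_rng(9)
n_games = 200_000

totals = np.empty(n_games)
for g in range(n_games):
    n = rng.integers(1, 7)                    # first throw: how many dice to throw next
    totals[g] = rng.integers(1, 7, size=n).sum()

print("simulated mean  :", totals.mean())
print("Wald's equation : 49/4 =", 49 / 4)
```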
The above game could very well be played in a gambling setting. Then we
would not be so interested in the outcome of the game itself but more in the
payouts associated with the gamble. In the next section, we consider so-called
renewal-reward processes that are more appropriate to use in this setting.
Just like the renewal counting process Nt counts the number of events
through time, we can define a reward process Rt to count the (total) reward
due to the events through time. We define it as follows.
Definition 4.9. Let N_t be a renewal counting process for the inter-arrival times X_n. Let R_n be a sequence of IID random variables. We call R_n the reward of the renewal X_n whenever R_i is independent of X_j for j ≠ i.
It is worth noting that, in general, the rewards Rn and the inter-arrival times
Xn may be dependent on each other.
Example 4.4.
• Let Xi denote the duration of the i-th taxi ride. Denote the associated
fare of this ride by Ri , then Xi and Ri are dependent.
• Let X_i denote the inter-arrival time of buses at a given bus stop, and R_i the number of passengers entering the associated bus. Then again R_i and X_i are dependent.
The cumulative reward process is then defined as

C_t = Σ_{i=1}^{N_t} R_i.
Example 4.5.

• Suppose X_n denotes the duration of the n-th taxi ride and R_n the fare of the associated ride. As the fare is only paid at the end of the ride, the cumulative reward process is given by

C_t = Σ_{i=1}^{N_t} R_i.
• The owner of an internet café bills its users continuously using a credit system. For every time unit (e.g. a minute), the customer pays 10 cents. Denote by X_n the duration for which a customer uses the service. Assuming that there is only one computer available, the associated cumulative reward process is given by

C_t = Σ_{i=1}^{N_t} 0.1 · (t_i − t_{i−1}) + 0.1 · (t − t_{N_t}),

where t_i is the time of the i-th renewal.
Theorem 4.11 (renewal-reward theorem). Let C_t be a cumulative reward process with E[R] < ∞ and E[X] < ∞. Then, with probability 1,

lim_{t→∞} C_t / t = E[R] / E[X].

Proof. We can rewrite C_t / t as

C_t / t = (Σ_{i=1}^{N_t} R_i / N_t) · (N_t / t).

By the strong law of large numbers for renewal processes, we know that with probability 1,

lim_{t→∞} N_t / t = 1 / E[X].

Furthermore, by the classical law of large numbers, we know that with probability 1,

lim_{t→∞} (Σ_{i=1}^{N_t} R_i) / N_t = E[R].

Combining the two limits, we find that with probability 1,

lim_{t→∞} C_t / t = E[R] / E[X].
Since E[R] is strictly positive, we thus find that, in the long run, C_t grows at a strictly positive rate as well. Evidently, the larger E[R], the faster C_t grows on average. On the other hand, if E[R] is too large then the game is not as attractive for the gambler.
Remark. In one time unit, we have on average, by the law of large numbers, N_t/t → 1/E[X] games. For each of those 1/E[X] games, we have on average an associated cost E[R]. Thus, per time unit we would expect a cost of

E[R] × 1/E[X],

where E[R] is the average cost per game and 1/E[X] the average number of games per time unit. Theorem 4.11 states that, in the long run, the ratio C_t/t indeed tends to the quantity E[R]/E[X].
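A minimal simulation sketch (assuming numpy; exponential inter-arrival times and uniform rewards are arbitrary choices) of the renewal-reward theorem:

```python
import numpy as np

rng = np.random.default_rng(10)
mean_x, t_max = 2.0, 50_000.0                  # E[X] = 2

inter_arrivals = rng.exponential(mean_x, size=int(2 * t_max / mean_x))
arrival_times = np.cumsum(inter_arrivals)
n_t = np.searchsorted(arrival_times, t_max, side="right")

rewards = rng.uniform(0.0, 10.0, size=n_t)     # E[R] = 5, one reward per renewal
c_t = rewards.sum()

print("C_t / t     :", c_t / t_max)
print("E[R] / E[X] :", 5.0 / mean_x)
```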
Chapter 5
Markov Processes
Mahatma Gandhi
In almost all the stochastic processes we have seen so far, the dependency between the different states was quite weak. The main reason for this is that allowing stronger dependencies can quickly lead to incredibly complex processes for which even the most basic results become very difficult to prove.

The easiest case would be total independence, i.e. the state of some process X_n does not depend on any of the previous states:

P(X_{n+1} = j | X_n = i_n, ..., X_0 = i_0) = P(X_{n+1} = j).
In other words, any information regarding the past of the system is irrelevant
for the prediction of the next state. Of course, in many cases, this type of
simplification cannot be justified since complete independence is seldom present
in physical examples.
5.1 Definition
Definition 5.1. Let {X_n | n ≥ 0} be a discrete process. This process is a discrete-time Markov chain if it has the Markov property, i.e.

P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i)

for all j, i, i_{n−1}, ..., i_0 and n ≥ 0.
For the sake of notation, we will often just write pij even when the Markov
chain is not homogeneous.
P = (p_ij) =
( p_11  p_12  p_13  ...  p_1n )
( p_21  p_22  p_23  ...  p_2n )
( ...   ...   ...   ...  ...  )
( p_n1  p_n2  p_n3  ...  p_nn )

Thus the entry in the i-th row and j-th column, p_ij, gives the transition probability to go from state i to state j.
The following property tells us what transition matrices look like, and also
how we can construct new ones quite easily.
Property 5.6. The transition matrix P of a Markov chain is a stochastic matrix, i.e. a matrix that satisfies

Σ_j p_ij = 1   for all i = 1, ..., n,

and p_ij ≥ 0 for all i, j. The converse also holds: any stochastic matrix M gives rise to a unique Markov chain with homogeneous transition probabilities.
Thus the probability distribution of the chain is fully determined by P (X0 = k) and
P.
5.1.1 Examples
Example 5.1. Suppose we count the number of throws since we last rolled a 6 on a six-sided die. This can be modeled using a Markov chain with

p_{i,i+1} = 5/6,   p_{i,0} = 1/6.

Truncating the count at some maximum n, the transition matrix is given by

        0     1     2    ...    n
  0 ( 1/6   5/6    0    ...    0  )
  1 ( 1/6    0    5/6   ...    0  )
  2 ( 1/6    0     0    ...    0  )
  ⋮ ( 1/6    0     0    5/6    0  )
  n ( 1/6    0     0     0    5/6 )
Proof. Let X_n be a process with independent increments, i.e. the random variables

X_{n_1} − X_{n_0}, X_{n_2} − X_{n_1}, ..., X_{n_k} − X_{n_{k−1}}

are independent for any 0 ≤ n_0 < n_1 < ... < n_k. In particular, we have

P(X_n = j | X_{n−1} = j_{n−1}, X_{n−2} = j_{n−2}, ..., X_0 = j_0)
  = P(X_n − X_{n−1} = j − j_{n−1} | X_{n−1} = j_{n−1}, X_{n−2} − X_{n−1} = j_{n−2} − j_{n−1}, ..., X_0 − X_{n−1} = j_0 − j_{n−1}).

Since X_n − X_{n−1} is independent of the increments X_{n−1} − X_{n−i} for i ≥ 2, we have for any n

P(X_n = j | X_{n−1} = j_{n−1}, ..., X_0 = j_0) = P(X_n − X_{n−1} = j − j_{n−1} | X_{n−1} = j_{n−1}).

From this, we can deduce

P(X_n = j | X_{n−1} = j_{n−1}, X_{n−2} = j_{n−2}, ..., X_0 = j_0) = P(X_n = j | X_{n−1} = j_{n−1}),

and since n was arbitrary, this proves the desired result.
Example 5.3. Consider the Gambler’s ruin problem. A casino holds a game
in which one can either win $1 with probability p or lose $1 with probability
q = 1 − p.
In the case where N = 4 and p = 0.4, we have the following transition matrix:

       0    1    2    3    4
  0 (  1    0    0    0    0  )
  1 ( 0.6   0   0.4   0    0  )
  2 (  0   0.6   0   0.4   0  )
  3 (  0    0   0.6   0   0.4 )
  4 (  0    0    0    0    1  )
Using the tree diagram, we can also represent the game as follows:
Figure 5.7: Tree diagram of the Gambler’s ruin. Blue transitions have probability
0.4, red transitions have probability 0.6 and green transitions have probability
1
In the previous example, once we reached the 0-state or the N −state, we could
never leave. These are examples of absorbing states.
Definition 5.9. Let Xn be a discrete-time Markov chain. Suppose that there is a
state i such that pii = 1, then we call this state i absorbing.
Chains with absorbing states have interesting dynamics which we will cover
in more detail in a later section.
In the remainder of this section, we will use the language developed so far
to introduce and derive the Kelly criterion. This criterion is used to determine
the betting or investment strategy that maximizes the growth of wealth over
long periods.
There are several different investment strategies, each with its benefits and downsides. For example, the highest possible short-term return is achieved by investing the whole budget, but this of course carries the greatest risk of ruin. On the other hand, the risk of ruin is minimized by never investing at all.
Suppose that we have n bets, and each one of them is associated with a cash flow B_i. A loss creates a negative cash flow, −B_i. Then, the total wealth after the n bets is given by

S_n = S_0 + Σ_{i=1}^{n} X_i B_i,

where X_i = 1 if the i-th bet is won and X_i = −1 if it is lost.
Suppose that, according to the investor, the probability of winning a bet is p. Then the expected portfolio value after n bets is given by

E[S_n] = S_0 + Σ_{i=1}^{n} E[X_i B_i] = S_0 + (2p − 1) Σ_{i=1}^{n} E[B_i].
Instead of betting the full wealth (B_i = S_i), suppose we bet a fixed fraction f of the current wealth at every bet. The wealth after n bets is then S_n = S_0 (1 + f)^{n_W} (1 − f)^{n_L}, where n_W and n_L count the bets won and lost.

Taking the expectation of the logarithmic growth rate, we find that the expected exponential growth rate is given by

E[G_n(f)] = E[log (S_n / S_0)^{1/n}] = p log(1 + f) + (1 − p) log(1 − f).
Example 5.4. Suppose that the bet is uneven, i.e. the gain when winning is not the same as the loss when losing, even with the same stake f S_i. Denote by b the gain for every unit that is bet; we assume that in a loss the full stake is lost. For example, if the house offers 2-to-1 odds then b = 2, i.e. you win twice your stake. The wealth after n bets is then

S_n = S_0 (1 + bf)^{n_S} (1 − f)^{n_F},

where n_S and n_F count the successful and failed bets.
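To find the fraction that maximizes the expected growth rate, one can differentiate g(f) = p log(1 + bf) + (1 − p) log(1 − f) and set the derivative to zero, which gives the well-known Kelly fraction f* = (bp − (1 − p))/b (and f* = 2p − 1 for the even case b = 1). A minimal numerical sketch (assuming numpy; p and b are arbitrary) that checks this against a grid search:

```python
import numpy as np

p, b = 0.55, 1.0                       # win probability and payout per unit bet

def growth_rate(f):
    # expected exponential growth rate per bet
    return p * np.log(1 + b * f) + (1 - p) * np.log(1 - f)

f_grid = np.linspace(0.0, 0.99, 100_000)
f_best = f_grid[np.argmax(growth_rate(f_grid))]
f_kelly = (b * p - (1 - p)) / b

print("grid-search optimum:", round(f_best, 4))
print("Kelly fraction     :", round(f_kelly, 4))   # 2p - 1 = 0.10 when b = 1
print("growth at optimum  :", round(float(growth_rate(f_kelly)), 6))
```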
Example 5.5. Consider the random walk on the integers as in example 5.2. Suppose that

P(B_n = 1) = 0.5 = P(B_n = −1).

Let us try to calculate the two-step transition probability p_{00}^{(2)}, the probability of being back at zero after two moves.

The four possible paths of length two are (0 → 1 → 2), (0 → 1 → 0), (0 → −1 → 0) and (0 → −1 → −2). Of these, two return to zero, so p_{00}^{(2)} = 2 · (1/2)² = 1/2.
This shows that the two-step dynamics can be obtained by summing over the
relevant one-step dynamics.
Theorem 5.11 (Chapman-Kolmogorov). Let i, j be any two states and let m, n > 0. Then

p_ij^{(m+n)} = Σ_k p_ik^{(m)} p_kj^{(n)}.
Proof. By the law of total probability,

p_ij^{(m+n)} = P(X_{m+n} = j | X_0 = i) = Σ_k P(X_{m+n} = j, X_m = k | X_0 = i).

Using the definition of conditional probabilities, this can then be further rewritten as

p_ij^{(m+n)} = Σ_k P(X_{m+n} = j, X_m = k, X_0 = i) / P(X_0 = i)
            = Σ_k (P(X_m = k, X_0 = i) / P(X_0 = i)) · (P(X_{m+n} = j, X_m = k, X_0 = i) / P(X_m = k, X_0 = i))
            = Σ_k P(X_m = k | X_0 = i) P(X_{m+n} = j | X_m = k, X_0 = i)
            = Σ_k P(X_m = k | X_0 = i) P(X_{m+n} = j | X_m = k)
            = Σ_k p_ik^{(m)} p_kj^{(n)},

where we used the Markov property in the second-to-last step.
As we have already pointed out, all the dynamical information of the tran-
sitions of a Markov chain is contained in the transition matrix P. It should
therefore not be surprising that the Chapman-Kolmogorov equations can also
be formulated in terms of the transition matrix. This is shown in the next
proposition.
Proposition 5.12. The matrix of m-step transition probabilities P^(m) = (p_ij^{(m)}) is the m-th power of the transition matrix: P^(m) = P^m.
Proof. We will prove this using induction. Notice that if we set n = 1 in the Chapman-Kolmogorov equations, we obtain

p_ij^{(m+1)} = Σ_k p_ik^{(m)} p_kj.

Suppose now that the claim holds for m, and write P^(m) = (p_ij^{(m)}). Then the (i, j) entry of the matrix product P^(m) × P is given by

(P^(m) × P)_ij = Σ_k p_ik^{(m)} p_kj = p_ij^{(m+1)},

so that P^(m+1) = P^(m) × P = P^m × P = P^{m+1}.
Finally, we show that the k-step transition matrix is stochastic. For this, we use the following proposition.

Proposition 5.13. Suppose X and Y are two stochastic matrices of the same size. Then XY is also stochastic.

Proof. We need to show that Σ_j (XY)_ij = 1 for all i. For this, notice that

Σ_j (XY)_ij = Σ_j Σ_k x_ik y_kj = Σ_k x_ik Σ_j y_kj = Σ_k x_ik · 1 = 1.
Using this proposition, one can easily show the following result.
Example 5.6. In this example, we will consider the multi-step dynamics for the Gambler's Ruin, see example 5.3. Recall that the transition matrix was given by

P =
( 1    0    0    0    0    0   )
( 0.6  0    0.4  0    0    0   )
( 0    0.6  0    0.4  0    0   )
( 0    0    0.6  0    0.4  0   )
( 0    0    0    0.6  0    0.4 )
( 0    0    0    0    0    1   )

Using proposition 5.12, we can find the two-step transition probabilities (check this!):

P^(2) =
( 1     0     0     0     0     0    )
( 0.6   0.24  0     0.16  0     0    )
( 0.36  0     0.48  0     0.16  0    )
( 0     0.36  0     0.48  0     0.16 )
( 0     0     0.36  0     0.24  0.4  )
( 0     0     0     0     0     1    )

As an exercise, calculate P^(10) and interpret the results. What can you say about the transitions between non-absorbing states?
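A small numpy sketch that reproduces P^(2) and computes P^(10) for this chain:

```python
import numpy as np

P = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.0, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

P2 = np.linalg.matrix_power(P, 2)    # two-step transition probabilities
P10 = np.linalg.matrix_power(P, 10)  # ten-step transition probabilities

print(np.round(P2, 2))
print(np.round(P10, 3))
# In P10, most of the probability mass of the non-absorbing states
# has already drifted into the absorbing states 0 and 5.
```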
In this section, we briefly cover the basic steps of simulating a Markov chain. Recall from proposition 5.7 that it suffices to know the initial distribution α and the transition matrix P.

We assume that the state space S(X_n) is countable. Suppose that, after labeling, we can write the state space as

S(X_n) = {s_1, s_2, ...}.

The chain is then simulated by drawing X_0 from α and, at every step, drawing the next state from the row of P corresponding to the current state (a code sketch is given below).

As an exercise, check whether the obtained Markov chain has the same transition probabilities as governed by P and check whether the obtained chains satisfy the Markov property.
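A minimal simulation sketch (assuming numpy; the chain and initial distribution come from the Gambler's ruin example above, an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(11)

P = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.0, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
alpha = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # start in state 2

def simulate_chain(P, alpha, n_steps, rng):
    states = np.arange(P.shape[0])
    path = np.empty(n_steps + 1, dtype=int)
    path[0] = rng.choice(states, p=alpha)               # X_0 ~ alpha
    for n in range(n_steps):
        path[n + 1] = rng.choice(states, p=P[path[n]])  # X_{n+1} ~ P[X_n, :]
    return path

print(simulate_chain(P, alpha, 20, rng))
```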
We will assume that the transition probabilities are homogeneous. Suppose that we have observed the evolution of a Markov chain up to time n. Thus, we have as data the path

P_n = (X_0 = i_0, X_1 = i_1, ..., X_n = i_n).

Using proposition 5.7, one can easily show that the likelihood of this path is given by

L(P_n) = P(X_0 = i_0) Π_{k=0}^{n−1} p_{i_k, i_{k+1}}.
Denote by n_ij the number of times the one-step transition i → j is observed in the path P_n. Since we assumed homogeneous transition probabilities, we can rewrite the likelihood as

L(P_n) = P(X_0 = i_0) Π_{i,j} p_ij^{n_ij}.
Maximizing the log-likelihood under the constraints Σ_j p_ij = 1 (using a Lagrange multiplier λ_i for each state i), we obtain

p̃_ij = n_ij / λ_i.

The constraints give rise to

Σ_j p̃_ij − 1 = Σ_j n_ij / λ_i − 1 = 0   for all i ∈ S(X_n),

so that λ_i = Σ_j n_ij. Therefore, the maximum likelihood estimators for the transition probabilities p_ij are given by

p̃_ij = n_ij / Σ_j n_ij = (number of observed transitions i → j) / (number of transitions starting from i).
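A minimal sketch (assuming numpy; the chain is the transition matrix from example 5.11 and the path length is arbitrary) that estimates the transition probabilities from a simulated path by counting transitions:

```python
import numpy as np

rng = np.random.default_rng(12)

P_true = np.array([
    [0.75, 0.25, 0.0],
    [0.0,  0.5,  0.5],
    [0.6,  0.4,  0.0],
])
n_states, n_steps = 3, 200_000

# Simulate one long path.
path = np.empty(n_steps + 1, dtype=int)
path[0] = 0
for n in range(n_steps):
    path[n + 1] = rng.choice(n_states, p=P_true[path[n]])

# Count one-step transitions n_ij and normalize each row.
counts = np.zeros((n_states, n_states))
np.add.at(counts, (path[:-1], path[1:]), 1)
P_hat = counts / counts.sum(axis=1, keepdims=True)

print(np.round(P_hat, 3))   # should be close to P_true
```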
It is clear from the previous sections that Markov chains have a lot of structure. Due to the possibly large number of states and transitions between them, it can be quite difficult to fully understand this structure. In this section, we will develop tools and concepts that make this a lot easier.
In the following sections, we will consider the more global structure of Markov
chains. For this, we will need the following concepts.
Definition 5.15. Let X_n be a Markov chain. We then define the state space S(X_n) to be the set of all possible values of X_n.
Example 5.7.
We have put a lot of focus on states i and j that have paths between them.
However, it is just as interesting to consider states with no paths between the
two.
Definition 5.16. Let i, j ∈ S(X_n) be any two states in the state space. We say i communicates with state j if there is a non-zero probability of ever reaching j starting from i:

p_ij^{(n)} > 0 for some 0 ≤ n < ∞.

We say the states i and j intercommunicate if they communicate with each other.
Proof. Since i ~ j, we know that there exists an n such that p_ij^{(n)} > 0. Similarly, we can find an m such that p_jk^{(m)} > 0. From the Chapman-Kolmogorov equation, we find

p_ik^{(n+m)} = Σ_l p_il^{(n)} p_lk^{(m)} ≥ p_ij^{(n)} p_jk^{(m)} > 0.
It is easy to see that the relationship is also reflexive and symmetric. The
nice thing about equivalence relationships is that they can be used to decompose
a set into its equivalence classes. In this case, we can decompose the state space
S (Xn ) as follows:
7. Repeat the procedure until all the states have been assigned to a class.
By construction, this gives a disjoint and exhaustive set of equivalence
classes of the state space S (Xn ).
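A small Python sketch of this decomposition procedure is given below; it computes the reachability relation and groups intercommunicating states (the gambler's ruin matrix is used only as an example).
\begin{verbatim}
import numpy as np

def communicating_classes(P):
    """Decompose the state space into classes of intercommunicating states."""
    n = len(P)
    reach = ((P > 0) | np.eye(n, dtype=bool)).astype(float)
    for _ in range(n):                                  # transitive closure
        reach = ((reach @ reach) > 0).astype(float)
    reach = reach > 0
    classes, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            cls = sorted(j for j in range(n) if reach[i, j] and reach[j, i])
            classes.append(cls)
            assigned.update(cls)
    return classes

P = np.array([[1, 0, 0, 0, 0, 0],
              [0.6, 0, 0.4, 0, 0, 0],
              [0, 0.6, 0, 0.4, 0, 0],
              [0, 0, 0.6, 0, 0.4, 0],
              [0, 0, 0, 0.6, 0, 0.4],
              [0, 0, 0, 0, 0, 1]], dtype=float)
print(communicating_classes(P))   # [[0], [1, 2, 3, 4], [5]]
\end{verbatim}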
Example 5.8. In the case of the gambler’s ruin, we have three different
equivalence classes: {0} , {N }, and {1, ..., N − 1}.
Notice that in the previous example, the obtained equivalence classes have a
different dynamical flavor: we have the sets {0} , {N } that we can only enter but
never leave and we have the set {1, ..., N − 1} that we can leave.
Definition 5.19. Let Xn be a Markov chain. A subset A ⊂ S (Xn ) of the state space
is said to be closed (with respect to the Markov chain Xn ) if
\[ p_{ij} = 0, \qquad \forall i \in A,\ j \notin A. \]
In other words, it is impossible to leave A.
Some examples of closed sets are absorbing states and the state space itself.
Even though the state space S (Xn ) is always closed, it is not necessarily
irreducible. We introduce the following terminology.
Definition 5.21. Let Xn be a Markov chain. If the state space S (Xn ) is irreducible
then we call the Markov chain irreducible.
Example 5.9. The Markov chains in examples 5.1 and 5.2 are both irreducible.
The Markov chain in example 5.3 is not.
Definition 5.22. Let Xn be a discrete-time Markov chain. Then we say that the
chain is regular if there exists an n0 > 0 such that
\[ p_{ij}^{(n_0)} > 0 \qquad \forall i, j. \]
In other words, if we start in i and make n0 transitions, we can be at any possible
state in the state space.
To see why an irreducible chain need not be regular, we have the following
example.
Example 5.10. Consider the random walk discussed in example 5.2. This is
easily seen to be an irreducible Markov chain. However, notice that if i is even,
\[ p_{i,2k+1}^{(2n)} = 0 \quad \forall n, k > 0 \]
and
\[ p_{i,2k}^{(2n+1)} = 0 \quad \forall n, k > 0. \]
This implies that our chain can not be regular.
Just like all other dynamical properties, we can define the notion of chain
regularity using the transition matrix.
Property 5.23. A chain Xn with transition matrix P is regular if and only if there
exists an n0 > 0 such that all elements in Pn0 are strictly positive.
Example 5.11. Consider the Markov chain generated by the transition matrix
\[ P = \begin{pmatrix} 0.75 & 0.25 & 0 \\ 0 & 0.5 & 0.5 \\ 0.6 & 0.4 & 0 \end{pmatrix}. \]
Computing successive powers of this matrix, the rows of $P^n$ appear to converge as $n$ grows. We will see that this is a general result for regular chains in a later section.
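The following Python sketch checks regularity by looking for a power of $P$ with strictly positive entries, and illustrates the convergence of the rows.
\begin{verbatim}
import numpy as np

def is_regular(P, max_power=50):
    """Return the smallest n0 with all entries of P^n0 strictly positive, or None."""
    Pn = np.eye(len(P))
    for n0 in range(1, max_power + 1):
        Pn = Pn @ P
        if np.all(Pn > 0):
            return n0
    return None

P = np.array([[0.75, 0.25, 0.0],
              [0.0, 0.5, 0.5],
              [0.6, 0.4, 0.0]])
print(is_regular(P))                                   # smallest n0 (here 2)
print(np.round(np.linalg.matrix_power(P, 50), 4))      # rows converge to the same vector
\end{verbatim}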
Given a Markov chain $X_n$ with states $i, j \in S(X_n)$ such that $i ↷ j$, we only know that $i$ can reach $j$. We could instead be interested in how long it takes to go from $i$ to $j$, in terms of the number of transitions in the path from $i$ to $j$; we denote this first passage time by $T_{ij}$.
Notice that if $i$ does not communicate with $j$, then $T_{ij} = \infty$. However, even if $i ↷ j$, it can still happen that $T_{ij} = \infty$.
Example 5.12. Consider the Gambler’s ruin example discussed earlier. Notice
that if we start at state 1, we have that 1 communicates with 2 (1 ↷ 2). However,
if we have the transition 1 → 0, then (T12 | X1 = 0) = ∞ since 0 is an absorbing
state.
Definition 5.25. The passage probability $f_{ij}^{(n)}$ is the probability that the chain first visits state $j$ on the $n$-th step, starting from some state $i$:
\[ f_{ij}^{(n)} = P(X_1 \neq j, X_2 \neq j, \dots, X_{n-1} \neq j, X_n = j \mid X_0 = i). \]
The probability $f_{jj}$ that the chain starting in state $j$ ever returns to this state is then, by the law of total probability,
\[ f_{jj} = \sum_{n=1}^{\infty} f_{jj}^{(n)} = P(T_j < \infty) = 1 - P(T_j = \infty). \]
Again, by the law of total probability, we can also define the probability $f_{ij}$ that a chain starting in state $i$ ever reaches state $j$ as
\[ f_{ij} = \sum_{n=1}^{\infty} f_{ij}^{(n)} = P(T_{ij} < \infty). \]
Notice that $f_{ij} > 0$ for $i \neq j$ if and only if $i ↷ j$.
Using the above definition, we can define the average time of recurrence as
follows.
Definition 5.26. The mean recurrence time of a state j is
\[ \mu_j = \sum_{n=1}^{\infty} n\, f_{jj}^{(n)} = E[T_j]. \]
Example 5.13. In example 5.1, the mean recurrence time of the state 0 is given
by
\[ \mu_0 = \sum_{n \geq 1} n\, f_{00}^{(n)} = \sum_{n \geq 1} n \cdot \frac{1}{6} \cdot \left(\frac{5}{6}\right)^{n-1} = 6. \]
In the definition of first passage and first return, we deliberately do not count
the initial state X0 . When we are interested in knowing the time it takes for a
state to occur in the chain, we use the hitting time.
Hij = min {n ≥ 0 : Xn = j | X0 = i} .
Hi,A = min {n ≥ 0 : Xn ∈ A | X0 = i} .
Using the notation above, we can give a different formulation for concepts
we have already seen.
4 Recall that for an arithmetico-geometric series, we have $\sum_{k=0}^{\infty} k r^k = \frac{r}{(1-r)^2}$ for $|r| < 1$.
$\Longleftarrow$: Suppose $P(H_{i,A^c} = \infty) = 1$. Then for any $i \in A$, $j \notin A$ we have
\[ p_{ij} = P(H_{ij} = 1) \leq P(H_{i,j} < \infty) \leq P(H_{i,A^c} < \infty) = 1 - P(H_{i,A^c} = \infty) = 0. \]
i ↷ j ⇐⇒ hij > 0.
The hitting probability hij is the probability that the chain hits the state j
starting from a given state i in a finite amount of time.
Now that we have defined these concepts, the next question is how one can
work with them. We consider an example.
Example 5.14. We use the same setting as in example 5.3, and suppose that
p = q = 0.5. We want to answer the following questions:
• What is the probability of ruin, i.e. the probability that we enter the state 0?
• How long does the game last on average, i.e. what is the expected time until we hit one of the absorbing states 0 or $N$?
We will answer the first question using a method called first step analysis. Notice that the hitting probabilities of the boundary states satisfy
• $h_{0,0} = 1$,
• $h_{N,0} = 0$.
We first try to answer the question regarding the probability of ruin. We start from state $i$, and we use the law of total probability and the Markov property to obtain
\[ h_{i,0} = \frac{1}{2}\, h_{i-1,0} + \frac{1}{2}\, h_{i+1,0} \qquad \forall i = 1, \dots, N-1. \]
Here, we used the fact that the probability of ever hitting 0 from $i$, given that we transitioned to $i - 1$, is the same as the probability of ever hitting 0 starting from $i - 1$ (and similarly for $i + 1$).
Rewriting this set of equations gives us the equivalent set of equations
\[ h_{i,0} = \frac{h_{0,0}}{i+1} + \frac{i}{i+1}\, h_{i+1,0} \qquad \forall i = 1, \dots, N-1. \]
Notice that we have h0,0 = 1 and hN ,0 = 0. Hence, we find that
\[ h_{N-1,0} = \frac{1}{N} + \frac{(N-1)\, h_{N,0}}{N} = \frac{1}{N}. \]
Thus, we find that
\[ h_{N-2,0} = \frac{1}{N-1} + \frac{N-2}{N-1} \cdot \frac{1}{N} = \frac{N + N - 2}{(N-1)N} = \frac{2(N-1)}{(N-1)N} = \frac{2}{N}. \]
Using the same reasoning above, one can show (do this!) that
\[ h_{i,0} = \frac{N-i}{N}. \]
This solves the first problem. Notice how conditioning on the first step allowed
us to find a system of equations in terms of the hitting probabilities. This
first-step analysis can be used for the next problem as well.
We again condition on the first step. Notice that during this conditioning we increase the hitting times by one. We get
\[ H_{i,0\cup N} = 1 + \frac{1}{2} H_{i-1,0\cup N} + \frac{1}{2} H_{i+1,0\cup N} \qquad \forall i = 1, \dots, N-1, \]
with boundary conditions $H_{0,0\cup N} = H_{N,0\cup N} = 0$. Rewriting as before yields
\[ H_{i,0\cup N} = i + \frac{i}{i+1} H_{i+1,0\cup N}, \]
so that
\[ H_{N-1,0\cup N} = N - 1. \]
Thus, we find
\[ H_{N-2,0\cup N} = N - 2 + \frac{(N-2)\, H_{N-1,0\cup N}}{N-1} = N - 2 + \frac{(N-2)(N-1)}{N-1} = 2(N-2). \]
Remark. As an exercise, try to show that in the Gambler's ruin problem, the hitting probabilities satisfy $h_{i,N} = \frac{i}{N}$. We thus find that $h_{i,N} + h_{i,0} = 1$, or in other words that the Markov chain always ends up in an absorbing state. Is this surprising?
The first step analysis used in the previous example can be generalized to
the following result.
Theorem 5.31. The hitting probabilities hij are the minimal non-negative solution
of
\[ h_{ij} = \sum_{k \in S(X_n)} p_{ik}\, h_{kj}, \quad i \neq j, \qquad h_{jj} = 1. \]
The expected hitting times Hi,j are the minimal non-negative solution of
\[ H_{ij} = 1 + \sum_{k \neq j,\ k \in S(X_n)} p_{ik}\, H_{kj}, \quad i \neq j, \qquad H_{jj} = 0. \]
Example 5.15. Consider a rat in a maze with four cells. We label each cell
with an integer from 1 to 4. The maze has an exit which we denote by 0, the
’free’ state. The rat is placed randomly in a cell and then moves throughout the
maze until it finds its way out. We will assume that the rat is ’Markovian’, in
the sense that the transitions between cells chosen by the rat are not influenced
by past choices. We also assume that the rat has no preference when choosing
the next cell, in the sense that at each move the rat has an equal probability to
go to any of the neighboring cells. We denote by Xn the cell visited right after
the n−th move. We then have that S (Xn ) = {0, 1, 2, 3, 4}.
Labeling the rows and columns by the states $0, 1, 2, 3, 4$, the transition matrix is
\[ P = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.5 & 0.5 & 0 \\
0 & 0.5 & 0 & 0 & 0.5 \\
0 & 0.5 & 0 & 0 & 0.5 \\
\tfrac{1}{3} & 0 & \tfrac{1}{3} & \tfrac{1}{3} & 0
\end{pmatrix}. \]
We will calculate the expected escape time for each of the starting positions
using theorem 5.31. We obtain the following set of equations
\begin{align*}
H_{1,0} &= 1 + \frac{H_{2,0}}{2} + \frac{H_{3,0}}{2} \\
H_{2,0} &= 1 + \frac{H_{1,0}}{2} + \frac{H_{4,0}}{2} \\
H_{3,0} &= 1 + \frac{H_{1,0}}{2} + \frac{H_{4,0}}{2} \\
H_{4,0} &= 1 + \frac{H_{2,0}}{3} + \frac{H_{3,0}}{3}
\end{align*}
Solving these equations gives us the solution $H_{1,0} = 13$, $H_{2,0} = H_{3,0} = 12$, and $H_{4,0} = 9$. Note that even though a rat starting in the fourth cell can free itself in only one move, it will on average need around 9 moves in order to exit the maze!
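These expected escape times can also be obtained numerically. The sketch below solves the linear system $(I - Q)H = \mathbf{1}$ from theorem 5.31, where $Q$ is the transition matrix restricted to the cells $1, \dots, 4$.
\begin{verbatim}
import numpy as np

# Transition probabilities among the transient cells 1, 2, 3, 4
Q = np.array([[0,   0.5, 0.5, 0  ],
              [0.5, 0,   0,   0.5],
              [0.5, 0,   0,   0.5],
              [0,   1/3, 1/3, 0  ]])

H = np.linalg.solve(np.eye(4) - Q, np.ones(4))   # (I - Q) H = 1
print(H)   # expected number of moves to reach the exit from cells 1, 2, 3, 4
\end{verbatim}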
Remark. It is not true that a recurrent state must be hit by all chains almost surely.
Can you find an example?
Example 5.16. It can be shown that all states in example 5.2 are null recurrent.
The proof of this lies outside the scope of this book.
Example 5.17. In example 5.3 the only recurrent states are the absorbing
states {0} , {N }. All other states are transient.
Just like we have done previously, we can use a first-step analysis to calculate the values of $f_{ij}^{(n)}$.
Theorem 5.34. The passage probabilities satisfy the following recurrence relation
\[ f_{ij}^{(n)} = \begin{cases} p_{ij} & n = 1, \\ \sum_{k \neq j} p_{ik}\, f_{kj}^{(n-1)} & n > 1. \end{cases} \]
\[ f_{ij}^{(n)} = \sum_{k \neq j} P(X_n = j \mid X_{n-1} = k)\, P(X_1 \neq j, X_2 \neq j, \dots, X_{n-1} = k \mid X_0 = i). \]
Then the formulas for $f_{ij}^{(n)}$ can be rewritten in vector form as
\[ f_j^{(n)} = P_{(j)}\, f_j^{(n-1)}, \]
where $P_{(j)}$ is the transition matrix with the $j$-th column set equal to zero, in order to ensure $k \neq j$ in the sums.
We get
\[ f_j^{(n)} = \underbrace{P_{(j)} \cdots P_{(j)}}_{n-1 \text{ times}}\, f_j^{(1)} = P_{(j)}^{\,n-1} f_j^{(1)}. \]
As an illustration, consider a model for occupational mobility with three levels, ranging from upper-level positions (state 1) down to unskilled labor (state 3). The main interest was to see how transitions between the different levels occurred.
The first question can be answered by calculating $f_{31}^{(3)}$. Notice that we have for $n = 1$
\[ f_1^{(1)} = \begin{pmatrix} p_{11} & p_{21} & p_{31} \end{pmatrix}^T = \begin{pmatrix} 0.45 & 0.05 & 0.01 \end{pmatrix}^T. \]
For $n = 2$, this becomes
\[ f_1^{(2)} = P_{(1)} f_1^{(1)} = \begin{pmatrix} 0 & 0.48 & 0.07 \\ 0 & 0.7 & 0.25 \\ 0 & 0.5 & 0.49 \end{pmatrix} \begin{pmatrix} 0.45 \\ 0.05 \\ 0.01 \end{pmatrix} = \begin{pmatrix} 0.0247 \\ 0.0375 \\ 0.0299 \end{pmatrix}. \]
Similarly, we obtain
\[ f_1^{(3)} = P_{(1)} f_1^{(2)} = \begin{pmatrix} 0.02009 \\ 0.03372 \\ 0.0334 \end{pmatrix}. \]
Thus, we obtain that $f_{31}^{(3)} = 0.0334$: around 3.34% of the low-skilled workers make a transition to an upper-level position in exactly three years.
The second question, the probability that a worker ever reaches an upper-level position, is numerically less straightforward. Using theorem 5.34, this can be obtained by calculating
\[ f_1 = \sum_{n=0}^{\infty} P_{(1)}^{\,n} f_1^{(1)}. \]
One way to evaluate this is by truncating the sum and evaluating the partial sums. The results of this can be found in figure 5.9.
Another way to solve this numerically is by using the fact that if the largest eigenvalue of $P_{(1)}$ is smaller than one (check that this is the case), we have that
\[ f_1 = \sum_{n=0}^{\infty} P_{(1)}^{\,n} f_1^{(1)} = (I - P_{(1)})^{-1} f_1^{(1)} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}. \]
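The following sketch in Python verifies both computations numerically (the matrix $P_{(1)}$ and the vector $f_1^{(1)}$ are taken from the example above).
\begin{verbatim}
import numpy as np

P1 = np.array([[0.0, 0.48, 0.07],
               [0.0, 0.70, 0.25],
               [0.0, 0.50, 0.49]])     # transition matrix with the first column set to zero
f1_1 = np.array([0.45, 0.05, 0.01])    # first-passage probabilities for n = 1

f1_3 = np.linalg.matrix_power(P1, 2) @ f1_1     # f_1^{(3)}
f1 = np.linalg.solve(np.eye(3) - P1, f1_1)      # (I - P_(1))^{-1} f_1^{(1)}
print(np.round(f1_3, 5))   # approximately [0.02009, 0.03372, 0.03340]
print(np.round(f1, 5))     # [1, 1, 1]: every state eventually reaches the upper level
\end{verbatim}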
Example 5.19. It is not difficult to show that the following are examples of
stopping times.
• Recurrence times:
\[ \{T_j = n\} \iff \{X_0 = j, X_1 \neq j, \dots, X_{n-1} \neq j, X_n = j\}. \]
• Hitting times:
\[ \{H_{ij} = n\} \iff \{X_0 \neq j, X_1 \neq j, \dots, X_{n-1} \neq j, X_n = j\}. \]
Definition 5.36. Let Xn be a Markov chain and let A ⊂ S (Xn ) be any set of states.
Then the last exit time LA is the time
LA = sup {n ≥ 0 | Xn ∈ A} .
The following theorem, often called the Strong Markov Property (SMP),
shows the importance of stopping times.
Theorem 5.37. Let Xn be a Markov chain, and suppose T is a stopping time for Xn .
Then, conditionally on T < ∞ and XT = j, any other information about X0 , ..., XT
is irrelevant for predicting the future and the sequence {XT +n | n ≥ 0} is a Markov
chain that behaves like X started at j.
\[ p_{jk} = \frac{P(X_{T+1} = k, X_T = j, T = n)}{P(T = n, X_T = j)} = P(X_{T+1} = k \mid X_T = j, T = n). \]
Notice that the strong Markov property does not hold when $T$ is not a stopping time. Indeed, suppose that $p_{ii} > 0$ and consider $L_{\{i\}}$, the last passage time of $i$. Then
\[ P(X_{L_{\{i\}}+1} = i \mid X_{L_{\{i\}}} = i, L_{\{i\}} = n) = 0, \]
by definition of the last passage time. Thus, the chain after time $L_{\{i\}}$ does not behave like a copy of the chain started at $i$, since such a copy would return to $i$ with positive probability $p_{ii} > 0$.
Using the strong Markov property, we can show the following interesting
result.
Property 5.38. The probability that a chain returns $n$ times to a given state $j$ is given by $P(T_j < \infty)^n$.
Another interesting result that uses the SMP is the following, which relates
the first-passage probabilities with the multi-step transition probabilities.
Notice that even though it seems like a straightforward and intuitive result,
proving it depends crucially on the Strong Markov Property.
For n ≥ 1, we condition on the first passage time Tij = k. Using the law of total
probability, we get
\begin{align*}
p_{ij}^{(n)} &= P(X_n = j \mid X_0 = i) \\
&= \sum_{k=1}^{\infty} P(X_n = j, T_{ij} = k \mid X_0 = i).
\end{align*}
Notice that for $k > n$ it follows by definition of the first passage time that
\[ P(X_n = j, T_{ij} = k \mid X_0 = i) = 0. \]
Using the chain rule for conditionals (property 1.75) and the strong Markov
property, we obtain
\begin{align*}
p_{ij}^{(n)} &= \sum_{k=1}^{n} P(X_n = j, T_{ij} = k \mid X_0 = i) \\
&= \sum_{k=1}^{n} P(X_n = j \mid \underbrace{T_{ij} = k}_{X_k = j}, X_0 = i)\, P(T_{ij} = k \mid X_0 = i) \\
&= \sum_{k=1}^{n} P(X_n = j \mid X_k = j)\, f_{ij}^{(k)} \\
&= \sum_{k=1}^{n} p_{jj}^{(n-k)}\, f_{ij}^{(k)},
\end{align*}
We can express the previous result using generating functions (see definition 1.70). For the first passage times $T_{ij}$, this generating function is given by
\[ \gamma_{T_{ij}}(s) = E\!\left[s^{T_{ij}}\right] = \sum_{n=0}^{\infty} P(T_{ij} = n)\, s^n = \sum_{n=0}^{\infty} f_{ij}^{(n)} s^n. \]
We denote γTij (s) = Fij (s). Similarly, we write the generating function for the
multi-step transition probabilities as
\[ P_{ij}(s) = \sum_{n=0}^{\infty} p_{ij}^{(n)} s^n. \]
Proof. Suppose $F_{jj}(1) = 1$; then $\sum_{n=1}^{\infty} f_{jj}^{(n)} = 1$. Hence
\[ P(T_{jj} < \infty) = 1, \]
and thus $j$ is recurrent.
If a state is recurrent, then we know by definition that our chain will revisit this
state almost surely in a finite amount of time. Intuitively, this should imply that
we expect to revisit a recurrent state infinitely often. In this section, we will
make this concrete. We will need the following definition.
Definition 5.43. Let Xn be a Markov chain, then we define the random variable
\[ N_j = \sum_{n=1}^{\infty} I(X_n = j) = \sum_{n=1}^{\infty} I_n. \]
Remark. The random variable Nj essentially counts the number of times the chain
visits a given state j.
An immediate consequence of proposition 5.44 is that the expected number of visits $E[N_j \mid X_0 = i]$ to a transient state $j$ is finite. The next proposition tells us how we can calculate this quantity.
Proposition 5.45. Suppose $j$ is a transient state. Then the expected number of visits to $j$ is given by
\[ E[N_j \mid X_0 = i] = \frac{P(T_{ij} < \infty)}{1 - P(T_j < \infty)}. \]
Moreover,
\[ P(N_j = k \mid X_0 = j) = (1 - f_{jj})\, f_{jj}^{\,k}. \]
Proof. Denote by $T_{ij}^{(k)}$ the time of the $k$-th visit to the state $j$ for the chain starting in $i$. For the first visit $k = 1$, we have
\[ P\big(T_{ij}^{(1)} < \infty\big) = f_{ij}, \]
and for the later visits
\[ P\big(T_{ij}^{(k+1)} = n + m \mid T_{ij}^{(k)} = n\big) = P(T_{jj} = m) = f_{jj}^{(m)}. \]
Conditioning on the stopping time $T_{ij}^{(k)} = n$ essentially resets the Markov chain at $j$, so that each further return happens with probability $f_{jj}$, independently of the past. Hence
\[ E[N_j \mid X_0 = i] = \sum_{k=1}^{\infty} f_{ij}\, f_{jj}^{\,k-1} = \frac{f_{ij}}{1 - f_{jj}}, \]
and for a chain starting in $j$,
\[ P(N_j \geq k \mid X_0 = j) = f_{jj}^{\,k}, \]
so that
\[ N_j \mid X_0 = j \;\sim\; \text{geometric}(1 - f_{jj}). \]
5.5.6 Periodicity
We saw in example 5.10 that for some chains, we have some kind of periodic
behavior in the chain: a state could only be revisited using an even amount
of transitions. In this section, we will make this interesting restriction on the
dynamics more concrete.
Definition 5.46. Let Xn be a Markov chain, and let i ∈ S (Xn ) be any state. Then,
the period of the state d(i) is
\[ d(i) = \gcd\left\{ n \geq 1 \mid p_{ii}^{(n)} > 0 \right\}. \]
Example 5.20. Consider the example of the rat in the maze, example 5.15.
What are the periods of the different states?
The periodicity of the states in a chain tells us a lot about the dynamics. We
will show that the period is constant on equivalence classes. For this, we need
the following algebraic results.
Proof. We first show the first statement. For this, since a, b are divisible by d we
can write a = αd and b = βd. Then a + b = (α + β)d, which is again divisible by
d.
For the second statement, since $a + b$ is divisible by $d$, we can write $a + b = \gamma d$. Since $a$ is divisible by $d$, we can write $a = \alpha d$. Notice that $b = (a + b) - a = (\gamma - \alpha)d$, and hence $b$ is also divisible by $d$.
Proof. Since $i \sim j$, we know that there must be a finite path $p_1$ from $i$ to $j$ and a finite path $p_2$ from $j$ to $i$. Let the lengths of these paths be $a$ and $b$, respectively. The concatenation of these paths $p_2 \circ p_1$ yields a path from $i$ to $i$ of length $a + b$, and by definition we have that $a + b$ is divisible by $d(i)$.
Then, choose any path $q$ from $j$ to $j$, and denote its length by $c$. The concatenation $p_2 \circ q \circ p_1$ is a path from $i$ to $i$. Hence, $a + b + c$ is divisible by $d(i)$, and from the previous proposition it follows that $c$ is divisible by $d(i)$. Since the path $q$ was chosen arbitrarily, we find that $d(i)$ is a common divisor of all $n$ with $p_{jj}^{(n)} > 0$ and therefore must also be a divisor of $d(j)$. Since the argument can be repeated with the roles of $i$ and $j$ swapped, it follows that $d(i) = d(j)$.
Property 5.49. Let Xn be an irreducible Markov chain. Then all states have the
same period.
In example 5.10, we noticed that the states had period 2, and this was used
to argue that the chain was not regular. The following result shows that the
aperiodicity and irreducibility of a chain are equivalent to its regularity. We
omit the proof.
Theorem 5.50. Let Xn be a Markov chain. Then this chain is regular if and only
if it is aperiodic and irreducible.
Proposition 5.51. Suppose that $i$ is recurrent and $i ↷ j$. Then:
• $i \sim j$
• State $j$ is recurrent
• $f_{ij} = f_{ji} = 1$.
Proof. We start by proving the first statement. For this, denote by $m$ the smallest integer such that $p_{ij}^{(m)} > 0$, which exists because $i ↷ j$. Then we can choose a corresponding path
\[ i \to j_1 \to j_2 \to \dots \to j_{m-1} \to j, \]
where
\[ p_{i j_1} p_{j_1 j_2} \cdots p_{j_{m-1} j} > 0. \]
Notice that
\[ P(T_i = \infty) \geq p_{i j_1} p_{j_1 j_2} \cdots p_{j_{m-1} j}\, P(T_{ji} = \infty). \]
However, since $P(T_i = \infty) = 0$ (because $i$ is recurrent), this can only hold if $P(T_{ji} = \infty) = 0$. Hence, $j ↷ i$.
For the second statement, we will show the contrapositive: if $i$ is transient and $i \sim j$, then $j$ is transient. For this, notice that since $i \sim j$, we can find an $n$ and an $m$ such that $p_{ij}^{(m)} > 0$ and $p_{ji}^{(n)} > 0$. By the Chapman-Kolmogorov equations, we have that for all $k \geq 1$,
\[ p_{ii}^{(m+k+n)} \geq p_{ij}^{(m)}\, p_{jj}^{(k)}\, p_{ji}^{(n)}. \]
The contrapositive of the above proposition gives us another way to think about transient states: if we can go from $i$ to $j$ but not back, then $i$ cannot be recurrent.
Proposition 5.52. If $i ↷ j$ and $j$ does not reach $i$, then $i$ is transient.
We now relate the results above with the closedness property of a set.
Proposition 5.53. In any finite closed set, there is at least one recurrent state.
Corollary 5.54. A finite Markov chain has at least one recurrent state.
Proof. Since the Markov chain is finite, $S(X_n)$ is a finite and closed set. The result then follows from proposition 5.53.
Using the previous results, we can show the following theorem. It relates the
property of irreducibility with recurrence in the case of finite sets.
Theorem 5.55. Suppose A is a finite closed and irreducible set. Then all states in
A are recurrent.
Proof. Since $A$ is finite and closed, we know by proposition 5.53 that there is at least one recurrent state, which we will denote by $r$. Since $A$ is irreducible, we have that $i ↷ j$ for any two states $i, j \in A$. In particular, any state $i$ in $A$ communicates with $r$ and is thus recurrent by proposition 5.51, which shows the desired result.
Theorem 5.56. Suppose we have a finite state space S (Xn ), then we can write
S (Xn ) = T ⊔ R1 ⊔ R2 ⊔ ... ⊔ Rk ,
where T is a set of transient states and where each Ri is a closed irreducible set of
recurrent states.
Proof. Define T as the set of all states in i ∈ S (Xn ) for which there exists a
j such that i ↷ j and jH ↷i.
H Then by proposition 5.52, it follows that i is
transient, and thus all states in T are transient.
Now pick any state $i \in S(X_n) \setminus T$ and set
\[ R_1 = \{ j \in S(X_n) \mid i ↷ j \}. \]
If there is a state $s$ outside $T \cup R_1$, set
\[ R_2 = \{ j \in S(X_n) \mid s ↷ j \}, \]
and continue in this way until every state has been assigned.
Consider now the following relabeling of the states in the transition matrix:
• States in the same set Ri are given consecutive labels and are thus grouped.
• Given an ordering R1 , R2 , ...., we give the states in R1 the labels 1, 2, ...., |R1 |,
the states in R2 the labels |R1 | + 1, ..., |R1 | + |R2 |, ... .
After this relabeling, the transition matrix takes a block form in which each $P_i$ is a matrix of size $|R_i| \times |R_i|$ describing the transitions within $R_i$. Indeed, we can make the following observations:
• If $s \in R_i$ and $t \notin R_i$, then $p_{st} = 0$, since $R_i$ is closed. Hence, the off-diagonal blocks are zero for rows corresponding to elements in one of the $R_i$.
Hence transitions within the Ri are governed by Pi and transitions from T are
governed by Qj .
This relabeling of the states is only possible because of the canonical decom-
position theorem. As one might expect, working with this relabeled transition
matrix makes calculations a lot easier. In the next section, we will use this result
extensively in order to calculate absorption probabilities.
In the Gambler’s ruin problem (example 5.3), we found that the probability of
ending up in one of the absorbing states was equal to one. In this section, we
focus more on the dynamics underlying the absorption of chains. We start with
a definition.
Definition 5.57. A Markov chain is absorbing if it has the following two properties:
• it has at least one absorbing state, and
• from every state, an absorbing state can be reached with positive probability.
In other words, we can transition from any non-absorbing state to an absorbing state in a finite number of steps.
Suppose we have a Markov chain that starts in the transient class T . In some
sense, we can think of this process as still being undecided. Indeed, as soon
as the chain enters one of the Ri , we know that it will remain there since these
are closed sets. It is interesting to consider the time it takes for the chain to
transition from T into one of the closed sets Ri . We hence define the following
random variable.
Definition 5.58. Given a Markov chain Xn , we can define the exit time of transient
states T for a chain starting in i ∈ T as
\[ \tau_i = \min\{ n \geq 0 : X_n \notin T \mid X_0 = i \in T \}. \]
Exiting the transient states can be due to the chain entering any of the Ri .
We will call this transition an absorption, but this should not be confused with
a transition to absorbing states6 . The associated probabilities are defined as
follows.
Definition 5.59. The (absorption) probability that the chain starting from $i \in T$ leaves $T$ because of absorption at state $j \notin T$ is denoted
\[ u_{ij} = P(X_{\tau_i} = j \mid X_0 = i). \]
Using the canonical decomposition, we can write the (relabeled) transition matrix as
\[ P = \begin{pmatrix} P^* & 0 \\ C & Q \end{pmatrix}. \]
Furthermore, we denote (Q)i,j = qij , (P∗ )i,j = pij and (C)i,j = cij . Notice that
the absorption transitions are those governed by the probabilities in C.
Intuitively, this comes from the fact that the chain gets absorbed in one
step (with probability pij ), or that the chain first moves to another state k (with
probability qik ) and later gets absorbed in j (with probability ukj ).
Using the canonical decomposition theorem, we can rewrite this as a set of sums
\[ u_{ij} = \sum_{k \in T} P(X_{\tau_i} = j, X_1 = k \mid X_0 = i) + \sum_{k \in R_1} P(X_{\tau_i} = j, X_1 = k \mid X_0 = i) + \sum_{k \in R_2} P(X_{\tau_i} = j, X_1 = k \mid X_0 = i) + \dots \]
Hence, we obtain
\[ u_{ij} = \sum_{k \in T} P(X_{\tau_i} = j, X_1 = k \mid X_0 = i) + p_{ij}. \]
U = (I − Q)−1 C.
We provide an interpretation of this result. For this, denote by $u_{ij}^{(n)}$ the probability that a chain starting at $i$ has been absorbed in $j$ in $n$ or fewer transitions. Denote by $U^{(n)}$ the matrix $(u_{ij}^{(n)})$. It is not difficult to see that $\lim_{n\to\infty} U^{(n)} = U$.
Notice that U(1) = C, since the probability that it goes from i straight to j
is exactly pij . For U(2) , this is given by
U(2) = C + QC = (I + Q)C.
Thus, we have all the two-step transitions within T and then the transitions
from T to j. This gives
U(3) = C + QC + Q(2) C,
where Q(2) is the two-step transition matrix. We have already seen that this
equals Q2 , and hence we get
U(3) = C + QC + Q2 C = (I + Q + Q2 )C.
In the rest of this section, we will assume without proof7 that the matrices Q we
encounter are sufficiently well-behaved. This leads to the solution
U = (I − Q)−1 C.
Example. Consider the Gambler's ruin with $N = 4$ and probability $0.4$ of winning each round. Ordering the states as $1, 2, 3, 0, 4$ (transient states first), the transition matrix becomes
\[ P = \begin{pmatrix}
0 & 0.4 & 0 & 0.6 & 0 \\
0.6 & 0 & 0.4 & 0 & 0 \\
0 & 0.6 & 0 & 0 & 0.4 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}, \]
and hence
\[ Q = \begin{pmatrix} 0 & 0.4 & 0 \\ 0.6 & 0 & 0.4 \\ 0 & 0.6 & 0 \end{pmatrix}, \qquad C = \begin{pmatrix} 0.6 & 0 \\ 0 & 0 \\ 0 & 0.4 \end{pmatrix}, \]
and
\[ U = (I - Q)^{-1} C = \begin{pmatrix} 0.876923 & 0.123077 \\ 0.692308 & 0.307692 \\ 0.415385 & 0.584615 \end{pmatrix}. \]
Hence, if you start with 1 unit of money the probability that you end up with
nothing is around 0.87 and even if you start with 3 units, the probability of
ruin is still around 40%!
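A minimal numerical check of this computation in Python is given below; the last line also computes the expected absorption times $(I - Q)^{-1}\mathbf{1}$ discussed in the next subsection.
\begin{verbatim}
import numpy as np

Q = np.array([[0.0, 0.4, 0.0],
              [0.6, 0.0, 0.4],
              [0.0, 0.6, 0.0]])
C = np.array([[0.6, 0.0],
              [0.0, 0.0],
              [0.0, 0.4]])

U = np.linalg.solve(np.eye(3) - Q, C)     # U = (I - Q)^{-1} C
print(np.round(U, 6))                      # absorption probabilities into states 0 and 4

# Expected absorption times: W = (I - Q)^{-1} G with g = 1
print(np.linalg.solve(np.eye(3) - Q, np.ones(3)))
\end{verbatim}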
7 See the remark at the end of this section for more information.
By setting $g \equiv 1$, we get
\[ w_i = E\!\left[ \sum_{n=0}^{\tau_i - 1} 1 \,\Big|\, X_0 = i \right] = E[\tau_i], \]
the expected absorption time. If we instead set $g(k) = \delta_{jk}$, the indicator of a fixed transient state $j$, then $w_i$ is the expected number of visits to $j$ when starting from $i$.
Proposition 5.61. The matrix W = (wij )ij of expected rewards is given by
W = (I − Q)−1 G,
where the ith entry of the column matrix (G)i is given by g(i).
Proof. We can prove this by conditioning on the first step and adopting a similar strategy as in proposition 5.60 (try this!).
However, we will prove this using a different strategy. Denote by $W_{ij}^{(n)}$ the total expected reward of the chains starting in $i$ that have been absorbed in $j$ in $n$ or fewer transitions, and let $W^{(n)}$ be the corresponding matrix. Notice that
\[ W^{(1)} = G, \]
since for these chains, the only transient state visited is the initial state $i$; hence $w_{ij}^{(1)} = g(i)$. For $W^{(2)}$, it is easy to see that we get
\[ w_{ij}^{(2)} = g(i) + \sum_{k \in T} p_{ik}\, g(k), \]
or in matrix notation
W (2) = G + QG.
Just like we did in the case of absorption probabilities, one can show that
W (n) = (I + Q + ... + Qn−1 )G.
It follows that
\[ \lim_{n\to\infty} W^{(n)} = W = \sum_{n=0}^{\infty} Q^n G. \]
Therefore, we indeed obtain
W = (I − Q)−1 G.
Remark. The observant reader might argue that it has not been shown that $\sum_{n=0}^{\infty} Q^n$ exists. The reason for this is that proving this requires quite a bit of machinery. For the sake of completeness, we give a brief overview of the argument but omit a lot of the details.
5.7 Stationarity
We start this section by linking Markov chains to the theory of renewal pro-
cesses. For this, consider the following result.
Proposition 5.62. Suppose $X_n$ is a Markov chain and suppose $j \in S(X_n)$ is a recurrent state. Then the counting process
\[ N_{jj}(k) = \sum_{i=1}^{k} I(X_i = j \mid X_0 = j), \]
which counts the number of times the chain visits $j$ when starting from $j$, is a renewal process.
In this case, the increments are given by the random variables Tjj .
Using the strong law of renewal processes (theorem 4.5), we find the follow-
ing result.
Theorem 5.63. Let Xn be a Markov chain and suppose that j ∈ S (Xn ) is a recur-
rent state. Then
\[ \lim_{t\to\infty} \frac{N_{jj}(t)}{t} = \frac{1}{\overline{T}_{jj}} \quad \text{with probability } 1, \]
where $\overline{T}_{jj}$ is the mean recurrence time for the state $j$.
Proof. This is an immediate result from theorem 4.5 and proposition 5.62.
The theorem applies to chains that start at j. However, is this result still
true if we start from any i ↷ j?
Theorem 5.64. Suppose $i ↷ j$ and suppose furthermore that $j$ is recurrent. Define the process
\[ N_{ij}(k) = \sum_{n=1}^{k} I(X_n = j \mid X_0 = i). \]
Then
\[ \lim_{t\to\infty} \frac{N_{ij}(t)}{t} = \frac{1}{\overline{T}_{jj}} \quad \text{with probability } 1. \]
In what follows, we will show that some Markov chains have an interesting
property closely related to the above results. It all starts with the question:
Given some state j, what is the probability that a chain starting at i is at j at
time N >> 0? In general, this of course heavily depends on i. However, we
have the following result.
Proposition 5.65. Suppose the transition probabilities of a Markov chain $X_n$ converge to a limit that does not depend on the initial state, i.e.
\[ \lim_{n\to\infty} p_{ij}^{(n)} = \pi_j \qquad \forall i, j. \]
Then:
• $\sum_j \pi_j = 1$
• $\pi P = \pi$
• Suppose that $X_0$ has distribution $\pi$, i.e. $P(X_0 = i) = \pi_i$; then every position of the chain has the distribution $\pi$, not only in the limit.
Proof. We start by proving the first statement. This statement is easily seen to
be equivalent to stating that P(n) = Pn is stochastic for all n which we have
shown in property 5.14.
For the third statement, note that $P(X_1 = j) = \sum_i P(X_0 = i)\, p_{ij} = \sum_i \pi_i\, p_{ij} = (\pi P)_j$. Thus, it follows from the second statement that $P(X_1 = j) = \pi_j$. One can repeat this argument iteratively to obtain the result for all $n$.
The invariant measures that are also probability measures are of the form
\[ \nu = \begin{pmatrix} \frac{\alpha}{2} & \frac{\alpha}{2} & \frac{1-\alpha}{2} & \frac{1-\alpha}{2} \end{pmatrix}. \]
From the above example, we see that stationary distributions need not be unique. In fact, each closed and irreducible equivalence class $R_i$ may "support" its own distribution. This measure is given by $\pi|_i$, which satisfies $\pi|_i(j) = 0$ for all $j \notin R_i$. In the example above, we have
\[ \nu = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \end{pmatrix}, \qquad \nu' = \begin{pmatrix} 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}. \]
We briefly mention the following uniqueness theorem, but we omit the proof.
Theorem 5.68. Suppose that the Markov chain Xn is regular. Then it admits a
unique stationary distribution.
In this section, we provide three ways to obtain the stationary distribution. The
first two ways are using numerical approximations and the third approach uses
results from linear algebra.
Using this, the first approach calculates a high power of P(n) and takes the row
vector, which is going to be an approximation of the stationary distribution.
This has as steady state $\pi = [0, 0.6429, 0.3571]$. To check how fast the matrix converges, we consider the Euclidean distance
\[ d_n = \left\| P^{(n)} - \begin{pmatrix} \pi_1 & \pi_2 & \pi_3 \\ \pi_1 & \pi_2 & \pi_3 \\ \pi_1 & \pi_2 & \pi_3 \end{pmatrix} \right\|. \]
Of course, the quality of this method heavily depends on the size of the state
space, the sparseness of the stationary distribution, and the choice of simulation
amount M. Using methods such as parallel computing can heavily speed up
this procedure.
The third approach uses results from linear algebra. For this, we will need
the following definition.
\[ Ax = \lambda x \quad \text{for some } x \neq 0, \]
\[ \det(A - \lambda I) = 0, \]
\[ (A - \lambda I)\, v = 0. \]
\[ \begin{cases} 4v_1 - v_2 = 0 \\ 4v_2 - v_3 = 0 \\ -4v_1 + 17v_2 - 4v_3 = 0 \end{cases} \implies \begin{cases} v_1 = \frac{v_2}{4} \\ v_2 = \frac{v_3}{4} \end{cases} \]
which has as solution set $v = (\alpha, 4\alpha, 16\alpha)$. The unique eigenvector with norm 1 is given by
\[ v = [0.0605,\ 0.2421,\ 0.9684]. \]
\[ \pi = \pi P \]
implies that
\[ \pi^T = P^T \pi^T, \qquad \text{i.e.} \qquad 0 = (I - P^T)\, \pi^T. \]
For the example above, this becomes
\[ 0 = \begin{pmatrix} 0.3 & 0 & 0 \\ -0.2 & 0.5 & -0.9 \\ -0.1 & -0.5 & 0.9 \end{pmatrix} \begin{pmatrix} \pi_1 \\ \pi_2 \\ \pi_3 \end{pmatrix}. \]
From the first equation, we find that $\pi_1 = 0$. The second and third equations are equivalent, and we find that the normalized solution is given by
\[ \pi = \frac{[0,\ 1.8\alpha,\ \alpha]}{0 + 1.8\alpha + \alpha} = \left[0,\ \frac{1.8}{2.8},\ \frac{1}{2.8}\right] = [0,\ 0.6429,\ 0.3571]. \]
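The following Python sketch carries out the first and third approaches numerically. The matrix $P$ below is the one consistent with the linear system above (this reconstruction is an assumption on our part).
\begin{verbatim}
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.0, 0.9, 0.1]])

# Approach 1: a high power of P; every row approximates the stationary distribution
print(np.round(np.linalg.matrix_power(P, 100)[0], 4))

# Approach 3: solve pi P = pi together with sum(pi) = 1 as a linear system
n = len(P)
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(pi, 4))   # approximately [0, 0.6429, 0.3571]
\end{verbatim}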
To finish up this section, we show a special case where the stationary distribu-
tion is very nice.
Definition 5.70. Let Xn be a Markov chain with transition matrix P. We call this
matrix doubly stochastic if both the columns and the rows sum up to 1:
\[ \sum_j p_{ij} = \sum_i p_{ij} = 1. \]
If the transition matrix of a Markov chain with $N$ states is doubly stochastic, then the uniform distribution
\[ \pi_i = \frac{1}{N}, \qquad i = 1, \dots, N, \]
is stationary. Indeed,
\[ \sum_{i=1}^{N} \pi_i\, p_{ij} = \frac{1}{N} \sum_{i=1}^{N} p_{ij} = \frac{1}{N} = \pi_j. \]
5.8 Branching Processes
When we want to model something that has some kind of reproductive dynam-
ics, branching processes are often a good choice. This kind of behavior occurs
in many different domains such as epidemiology, biological processes, and even
nuclear physics.
\[ X_n = \sum_{l=1}^{X_{n-1}} Z_{n-1,l}. \]
• The branching processes start with a single member in the zeroth gener-
ation: X0 = 1.
• The number of children/offspring Zi,j are IID with finite mean µ for all
i, j. We will denote this random variable by Z.
We are interested in the following questions:
1. What is the expected size $E[X_n]$ of the $n$-th generation?
2. What is the probability $p$ that the population eventually goes extinct?
3. What is the distribution of the time $T$ until the extinction of the population?
We start with the first question. Using the tower law (property 1.73) and our
assumptions, we get
\[ E[X_n] = E\!\left[\sum_{i=1}^{X_{n-1}} Z\right] = E\!\left[E\!\left[\sum_{i=1}^{X_{n-1}} Z \,\Big|\, X_{n-1}\right]\right] = E[Z X_{n-1}] = E[Z]\, E[X_{n-1}] = \mu\, E[X_{n-1}]. \]
Since E [X0 ] = 1, one can easily see that
E [Xn ] = µn , ∀n ∈ N.
The extinction probability $p$ therefore satisfies
\[ p = \sum_{j=0}^{\infty} P(Z = j)\, p^j = \gamma_Z(p). \]
Indeed, we can simply consider each branch of children from the $j$ parents as their own independent branching process, see figure 5.12. Intuitively, this corresponds to the fact that for the whole population to die out, either no one lives in the first generation or all the branches from the members of the first generation die out, see figure 5.13. There, we indeed see that if the first generation has $j$ offspring, then the only way that the population can go extinct is if the branches of each of the $j$ members go extinct.
Suppose that we are given the distribution of Z, and can therefore find the
explicit generating function. We can then find p by looking for the solutions of
the equation γZ (x) = x. However, this might have more than one solution. The
following result tells us that p is not only a solution of this equation but it is the
smallest solution between 0 and 1.
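As a small numerical illustration (a sketch in Python; the offspring distribution below is hypothetical, chosen so that $\mu = 3/2$ as in the worked example at the end of this section), iterating $q \leftarrow \gamma_Z(q)$ starting from $q = 0$ converges to the smallest fixed point, i.e. the extinction probability $p$.
\begin{verbatim}
import numpy as np

def extinction_probability(pmf, n_iter=200):
    """Iterate q <- gamma_Z(q) from q = 0; converges to the smallest fixed point."""
    pmf = np.asarray(pmf, dtype=float)       # pmf[j] = P(Z = j)
    q = 0.0
    for _ in range(n_iter):
        q = np.sum(pmf * q ** np.arange(len(pmf)))   # gamma_Z(q)
    return q

# Hypothetical offspring distribution: P(Z=0)=1/6, P(Z=1)=1/2, P(Z=3)=1/3 (mu = 3/2)
print(extinction_probability([1/6, 1/2, 0.0, 1/3]))
\end{verbatim}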
Proof. Let $q$ be any solution of
\[ q = \gamma_Z(q), \qquad 0 \leq q \leq 1. \]
We will show that $q \geq P(X_n = 0)$ for every $n$; letting $n \to \infty$ then gives $q \geq p$. For $n = 1$, we have
\begin{align*}
q = \gamma_Z(q) &= \sum_{j=0}^{\infty} q^j\, P(Z = j) \\
&= P(Z = 0) + q P(Z = 1) + q^2 P(Z = 2) + \dots \\
&\geq P(Z = 0) = P(X_1 = 0).
\end{align*}
Assume now that $q \geq P(X_k = 0)$ for all $k = 1, \dots, n$. Using the law of total probability, we find
\[ P(X_{n+1} = 0) = \sum_{j=0}^{\infty} P(X_{n+1} = 0 \mid X_1 = j)\, P(Z = j) = \sum_{j=0}^{\infty} P(X_n = 0)^j\, P(Z = j) \leq \sum_{j=0}^{\infty} q^j\, P(Z = j) = \gamma_Z(q) = q, \]
where we used that the $j$ branches started by the first generation die out independently, each with probability $P(X_n = 0)$. By induction, $q \geq P(X_n = 0)$ for all $n$, and hence
\[ q \geq \lim_{n\to\infty} P(X_n = 0) = p. \]
For the second property, we need to prove that p = 1 if and only if µ = E [Z] ≤ 1.
Notice that
\[ \left.\frac{d\gamma_Z(s)}{ds}\right|_{s=1} = \sum_{j=0}^{\infty} j\, P(Z = j) = E[Z]. \]
Furthermore, for $s \in [0, 1]$,
\[ \gamma_Z'(s) = \sum_{j=1}^{\infty} j s^{j-1} P(Z = j) \geq 0, \qquad \gamma_Z''(s) = \sum_{j=2}^{\infty} j(j-1) s^{j-2} P(Z = j) \geq 0, \]
so that $\gamma_Z$ is non-decreasing and convex on $[0, 1]$.
We now consider the third question: the distribution of the time T until
extinction. This random variable is defined as follows:
{T = n} = {Xn = 0} \ {Xn−1 = 0} ,
Denote by $\gamma_n(s) = E\!\left[s^{X_n}\right]$ the generating function of $X_n$. Note that $\gamma_n(s)$ is not the same as $\gamma_Z(s)$, but they are related. Their exact relation is the content of the next property.
Property 5.73. Using the setting above, we have the recursive definition
γn+1 (s) = γn (γZ (s)).
Proof. Since $X_{n+1}$ is the number of children with parents in the $n$-th generation, i.e. $X_{n+1} = \sum_{k=1}^{X_n} Z_{n,k}$, this can be written as
\[ \gamma_{n+1}(s) = \sum_{j=0}^{\infty} E\!\left[ \prod_{k=1}^{j} s^{Z_{n,k}} \,\Big|\, X_n = j \right] P(X_n = j). \]
We assume that each of the families grows independently of the other families. We therefore obtain
\begin{align*}
\gamma_{n+1}(s) &= \sum_{j=0}^{\infty} \prod_{k=1}^{j} E\!\left[ s^{Z_{n,k}} \right] P(X_n = j) \\
&= \sum_{j=0}^{\infty} (\gamma_Z(s))^j\, P(X_n = j) \\
&= E\!\left[ \gamma_Z(s)^{X_n} \right] = \gamma_n(\gamma_Z(s)).
\end{align*}
From this, the result easily follows.
1. Notice that $\mu = \frac{1}{2} + 3 \cdot \frac{1}{3} = \frac{3}{2} > 1$, and hence the extinction probability is not 1.
Hidden Markov Models
Buddha
On a weekday morning, you wake up and want to choose your outfit for
the day. Instead of checking the weather application on your phone, you look
outside and see what passersby are wearing. However, this is insufficient infor-
mation: everyone knows at least one person that likes to wear shorts in freezing
temperatures, or no matter how hot it is refuses to wear less than a jacket.
We first discuss classification and the Gaussian Mixture Model as a solution to classification problems. Then, we consider Hidden Markov Models and how we can calibrate them in practice.
• Lab results and whether the patient is sick or not. This is a binary
classification problem, and we want to predict whether a patient is sick
or not using the lab results.
• Customer information and whether the customer might churn or not
• Banking and transaction information and whether transactions are fraud-
ulent or not
• E-mail information and whether it is spam or not
Many different algorithms are able to predict the class from the features in
the data. We call these algorithms classifiers. Well-known examples are logistic
regression, K-Nearest Neighbours, random forests, and the Gaussian Mixture
Model. We will focus on the latter.
Remark. Algorithms such as K-Nearest Neighbours are called hard classifiers be-
cause they return a single class for each observation. In contrast, soft classifiers such
as the Gaussian Mixture Model return for each class a (posterior) probability that
represents the probability that the given observation belongs to that class.
From any soft classifier, one can construct a hard classifier by simply taking the
class with the highest assigned probability. We call this the Bayes classifier (of the
soft classifier). It can be interpreted as being the ’best guess’ of the soft classifier.
We will assume that there are two classes c1 and c2 , and a single numerical
feature X which is correlated with the class. Assume that we have a data set
consisting of 10 observations in class c1 and 10 observations in class c2 . We will
first focus on the labeled case in which we know beforehand which class any of
the 20 observations belong to.
In a Gaussian Mixture Model, we will assume that within a given class the
feature X is normally distributed with class-dependent parameters:
X | ci ∼ N (µi , σi2 ).
Figure 6.1: The population curves estimated from the data, with observations
represented by points
\[ P(c_1 \mid x) = \frac{P(x \mid c_1)\, P(c_1)}{P(x \mid c_1)\, P(c_1) + P(x \mid c_2)\, P(c_2)}, \qquad P(c_2 \mid x) = \frac{P(x \mid c_2)\, P(c_2)}{P(x \mid c_1)\, P(c_1) + P(x \mid c_2)\, P(c_2)}. \]
Here, P (c1 ) and P (c2 ) are called the priors and P (ci | x) is called the posterior
probability for class ci .
However, in many cases, we do not have labeled data. This could for ex-
ample occur in medical tests, where we are not sure yet how to detect certain
diseases. In this case, we start by using clustering techniques. Intuitively, people
with a certain disease will have similar characteristics1 and will cluster together.
In the Gaussian Mixture Model setting, we will still assume that within
the classes the numerical feature X is normally distributed. Given the class
parameters (µi , σj ), the clustering algorithm would be quite easy: simply assign
each observation to the class with the highest posterior probability. However,
in order to estimate the class parameters we would need to know the classes for
each of the observations. We hence end up in a chicken-and-egg problem:
• Choose random initial parameters $(\mu_j^{(0)}, \sigma_j^{(0)})$ for each population $j = 1, \dots, K$; then alternate between assigning posterior class probabilities to every observation and re-estimating the class parameters from these assignments, as sketched in the code example below.
We see that the converged estimate lies relatively close to the true population
curves.
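A compact sketch of this iterative (EM-style) procedure for a two-component, one-dimensional Gaussian mixture is given below; the data and all function names are illustrative and not part of the original text.
\begin{verbatim}
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2)                      # random initial parameters
    sigma = np.array([np.std(x), np.std(x)])
    prior = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior class probabilities for every observation
        dens = np.stack([prior[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate parameters from the weighted observations
        for k in range(2):
            w = resp[k]
            mu[k] = np.sum(w * x) / np.sum(w)
            sigma[k] = np.sqrt(np.sum(w * (x - mu[k]) ** 2) / np.sum(w))
            prior[k] = np.mean(w)
    return mu, sigma, prior

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])
print(em_gmm_1d(x))
\end{verbatim}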
Recall the example given in the introduction, where one uses the observable
clothing of passersby to know what the weather is like, which in this case is
unobservable. In this section, we will construct a suitable model for this.
We will use a Markov chain in order to model the weather. For this example,
we will assume that a specific day is either a hot or a cold day. Furthermore,
we assume that a cold day is followed by another cold day 80% of the time
and similarly for hot days. We hence have a Markov chain with two states,
S (Xn ) = {hot,cold}, and a transition matrix A
\[ A = P = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}. \]
We furthermore distinguish three types of clothing that the passersby can wear:
a coat, a sweater, and a t-shirt. For each weather state, we can define the
probability distribution $b$ for the types of clothing that we can observe. Assume, for example, that we are given such emission distributions $b_{\text{hot}}$ and $b_{\text{cold}}$ for hot and cold days, respectively.
We call the bi the emission probabilities. Graphically, the above model can
be represented as in figure 6.3. This is an example of a Hidden Markov Model.
Figure 6.3: The graphical representation of the Markov model for the weather
example
We now consider the more general case. Suppose we have a series of hidden
states2 , denoted Q = q1 , q2 , ..., qN . The transition matrix between these hidden
states is denoted by A, and the initial probability distribution is denoted by π.
2 These correspond to the weather states in the previous example.
There are three possible scenarios for Hidden Markov Models, depending on
the available data and the application.
Suppose we are given the transition matrix A and the emission probabilities B.
From the previous section, we know that the probability of observing a sequence
of T observations O is given by
\[ P(O \mid A, B) = \sum_{Q} P(O \mid Q)\, P(Q). \]
If there are $N$ different possible hidden states in $Q$, then there are $N^T$ terms
in this sum. For example, if we have the sequence O = [Sweater, Sweater, Coat,
Sweater] in our weather example, we have a total of 24 = 16 possible combina-
tions. Whilst still manageable, for longer time windows or higher amounts of
hidden states this possible set of sequences quickly becomes too large to handle.
Luckily, we can use the Forward Algorithm which adopts a recursive strategy
to simplify the calculations immensely. We start by introducing the following
notation
3 These correspond to the clothing of the passerby in the previous example.
Conditioning on the $(t-1)$-th step and using the law of total probability, we get
\begin{align*}
\gamma_t(j) &= P(o_1, \dots, o_t, q_t = j \mid A, B) \\
&= P(o_1, \dots, o_{t-1}, q_t = j \mid A, B) \cdot P(o_t \mid q_t = j, A, B) \\
&= P(o_1, \dots, o_{t-1}, q_t = j \mid A, B) \cdot b_j(o_t) \\
&= b_j(o_t) \cdot \sum_i P(o_1, \dots, o_{t-1}, q_{t-1} = i, q_t = j \mid A, B) \\
&= b_j(o_t) \cdot \sum_i P(o_1, \dots, o_{t-1}, q_{t-1} = i \mid A, B) \cdot P(q_t = j \mid q_{t-1} = i, o_1, \dots, o_{t-1}, A, B) \\
&= b_j(o_t) \cdot \sum_i P(o_1, \dots, o_{t-1}, q_{t-1} = i \mid A, B) \cdot a_{ij} \\
&= b_j(o_t) \sum_i \gamma_{t-1}(i)\, a_{ij}.
\end{align*}
Using this, we have the following approach for calculating the likelihood.
1. Initialize $\gamma_1(j) = \pi_j\, b_j(o_1)$ for $j = 1, \dots, N$.
2. For $t = 2, \dots, T$, recursively compute $\gamma_t(j) = b_j(o_t) \sum_{i=1}^{N} \gamma_{t-1}(i)\, a_{ij}$.
3. Finally, calculate
\[ P(O \mid A, B) = \sum_{i=1}^{N} P(o_1, \dots, o_T, q_T = i \mid A, B) = \sum_{i=1}^{N} \gamma_T(i). \]
Notice that the values for γt are propagated from the beginning t = 1 to the
end t = T , i.e in a forward fashion. That is why this algorithm is called the
forward algorithm.
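A minimal Python sketch of the forward algorithm is given below. The emission probabilities for the weather example are hypothetical (they are not given in this section), so the numerical output is only illustrative.
\begin{verbatim}
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(O | A, B).

    pi[i]  : initial probability of hidden state i
    A[i,j] : transition probability i -> j
    B[i,o] : probability of emitting observation o from hidden state i
    """
    gamma = pi * B[:, obs[0]]               # gamma_1(j) = pi_j b_j(o_1)
    for o in obs[1:]:
        gamma = (gamma @ A) * B[:, o]       # gamma_t(j) = b_j(o_t) sum_i gamma_{t-1}(i) a_ij
    return gamma.sum()                       # sum_i gamma_T(i)

# Weather example with hypothetical emissions (columns: coat, sweater, t-shirt)
pi = np.array([0.5, 0.5])                    # [hot, cold]
A = np.array([[0.8, 0.2], [0.2, 0.8]])
B = np.array([[0.1, 0.3, 0.6],               # hot
              [0.6, 0.3, 0.1]])              # cold
print(forward_likelihood(pi, A, B, obs=[1, 1, 0]))   # O = [Sweater, Sweater, Coat]
\end{verbatim}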
Example 6.2. Consider once again the weather example. We are interested in
calculating the likelihood of the sequence O = [Sweater, Sweater, Coat].
Figure 6.4: Waveforms of different words (IPA notation), obtained from [WV]
At a given time $t$, this comes down to finding the hidden state $j$ which maximizes the probability of the most likely path ending in $j$,
\[ v_t(j) = \max_{q_1, \dots, q_{t-1}} P(q_1, \dots, q_{t-1}, o_1, \dots, o_t, q_t = j \mid A, B). \]
Suppose that we find that i maximizes vt−1 (i) for some t. Then it need
not be the case that vt (j) = bj (ot )vt−1 (i)aij . In other words, the most probable
sequence found at time t − 1 might look completely different from the most
probable sequence found at time t. This is shown in the following example.
Suppose that the employer finds [Neutral, Happy] as the sequence of moods.
Show that the most probable sequence found at t = 1 does not correspond to
the first state in the most probable sequence found at time t = 2.
This example shows that we can only start building the sequence Q after taking
into account all observations up until T . In this sense, the state at time t
determines the state at time t−1, and we need a procedure that goes backwards.
This procedure is obtained using the backpointer.
Given the state j at time t, it returns the state at t −1 coming from the most probable
sequence of states.
p2 (Low) = Elevated.
We now present the Viterbi algorithm, which uses the backpointer in order to
find the most likely sequence of hidden states.
1. Start by defining
\[ v_1(j) = \pi_j\, b_j(o_1), \qquad p_1(j) = 0. \]
2. Recursively define, for $t = 2, \dots, T$,
\[ v_t(j) = \max_i v_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad p_t(j) = \arg\max_i v_{t-1}(i)\, a_{ij}\, b_j(o_t). \]
3. Set $q_T = \arg\max_i v_T(i)$ and backtrack using the backpointer:
\[ q_{T-1} = p_T(q_T). \]
In general, we find
\[ q_{t-1} = p_t(q_t). \]
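A compact Python sketch of the Viterbi algorithm is shown below, reusing the same hypothetical weather HMM as in the forward-algorithm sketch.
\begin{verbatim}
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable sequence of hidden states for an observation sequence."""
    N, T = len(pi), len(obs)
    v = np.zeros((T, N))
    ptr = np.zeros((T, N), dtype=int)
    v[0] = pi * B[:, obs[0]]                              # v_1(j) = pi_j b_j(o_1)
    for t in range(1, T):
        scores = v[t - 1][:, None] * A * B[:, obs[t]]     # v_{t-1}(i) a_ij b_j(o_t)
        v[t] = scores.max(axis=0)
        ptr[t] = scores.argmax(axis=0)                    # backpointer p_t(j)
    path = [int(np.argmax(v[-1]))]                        # q_T
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))                # q_{t-1} = p_t(q_t)
    return path[::-1]

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
B = np.array([[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]])
print(viterbi(pi, A, B, obs=[1, 1, 0]))
\end{verbatim}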
We now show that the most probable sequence in the previous example is
indeed [Elevated, Normal].
Example 6.4. We apply the Viterbi algorithm for example 6.3. We assume
that the states all have the same initial probability. Then
\[ v_1(x) = p_{\text{Neutral} \mid x} = \begin{cases} 0.3 & x = \text{Low} \\ 1 & x = \text{Normal} \\ 0.4 & x = \text{Elevated} \end{cases} \]
It should now be obvious why we sometimes call the Viterbi algorithm the
Backward-algorithm.
In the previous sections, we always assumed that we knew both A,B, and π.
This is an assumption that is often not true. We will now show how to estimate
these parameters from a data set.
We make a distinction between two cases: the labeled case and the unlabeled
case. In the labeled case, the data is of the form (O, Q), and in the unlabeled
case, it is just O.
Suppose three different ice cream vendors measure their sales for three consec-
utive days. They make a distinction between busy days, normal days, and slow
days labeled 3, 2 and 1 respectively. They also look at the temperature outside,
labeled H if it is hot and C if it is cold. Their findings are reported in the
following data set:
\begin{align*}
O_1 &= (3, 3, 2) & Q_1 &= (H, H, C) \\
O_2 &= (1, 1, 2) & Q_2 &= (C, C, H) \\
O_3 &= (1, 2, 3) & Q_3 &= (C, H, H)
\end{align*}
They want to model this using a Hidden Markov model. How can we use this
data to estimate π, A and B?
We can easily estimate the initial distribution using the first element of the
Qi , i.e. q1i .
\[ \pi = [\pi_H, \pi_C] = \left[ \frac{|\{ q_1^i = H \mid i = 1, 2, 3 \}|}{3},\ \frac{|\{ q_1^i = C \mid i = 1, 2, 3 \}|}{3} \right] = \left[ \frac{1}{3},\ \frac{2}{3} \right]. \]
When the unobservable states are not part of the data set, the estimation pro-
cedure shown above no longer works. We will need to determine not only π, A,
and B, but also Q.
We have already seen in the section on likelihood estimation that the $\gamma_t$ can be computed using the forward algorithm:
\[ \gamma_1(j) = \pi_j\, b_j(o_1) \quad (j = 1, \dots, N), \qquad \gamma_t(j) = \left( \sum_{i=1}^{N} \gamma_{t-1}(i)\, a_{ij} \right) b_j(o_t) \quad (j = 1, \dots, N,\ t = 2, \dots, T). \]
\[ \sigma_T(i) = 1 \quad (i = 1, \dots, N), \qquad \sigma_t(i) = \sum_{j=1}^{N} \sigma_{t+1}(j)\, a_{ij}\, b_j(o_{t+1}) \quad (i = 1, \dots, N,\ t = 1, \dots, T-1). \]
\[ \zeta_t(i, j) = \frac{P(q_t = i, q_{t+1} = j, O \mid A, B)}{P(O \mid A, B)}. \]
We will first work out the denominator and then the numerator.
\[ P(O \mid A, B) = \sum_{j=1}^{N} P(o_1, \dots, o_T \mid q_t = j, A, B)\, P(q_t = j \mid A, B). \]
This equality is not very surprising: it essentially says that the likelihood of
observing O given A, B is the same as the likelihood of observing O given that
the t−th hidden state is j, and summing this over all j.
We introduce
\[ \delta_t(j) = P(q_t = j \mid O, A, B) = \frac{P(q_t = j, O \mid A, B)}{P(O \mid A, B)}. \]
One can show that
\[ \delta_t(j) = \frac{\gamma_t(j)\, \sigma_t(j)}{\sum_{j=1}^{N} \gamma_t(j)\, \sigma_t(j)}. \]
4. Calculate γt and σt
5. Estimate A and B
Gaussian Processes
William of Ockham
• Suppose $X_t$ is a Poisson process with rate $\lambda$. Let $0 = t_0 < t_1 < \dots < t_N$; then the increments $Y_i = X_{t_i} - X_{t_{i-1}}$ are independent. Hence,
\[ f_{X_0, X_1, X_2, \dots, X_N}(x_0, x_1, x_2, \dots, x_N) = f_{Y_0, Y_1, Y_1+Y_2, \dots, Y_1+Y_2+\dots+Y_N}(x_0, x_1, x_2, \dots, x_N), \]
which can be written down explicitly.
The main takeaway from this observation is that for the processes covered so
far, we are able to obtain the joint density functions for any (finite) subset
of states. The stochastic process we will define in this chapter is an extreme
version of this: not only is the joint probability distribution easy to find but it
is completely determined by just two functions, the mean and the covariance!
7.1 Introduction
Partition a jointly normal vector into two blocks $Z_1$ and $Z_2$, with means $\mu_1, \mu_2$ and covariance blocks $\Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}$. Then the conditional distribution of $Z_1$ given $Z_2 = z$ is again normal, with mean
\[ \tilde{\mu} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (z - \mu_2) \]
and covariance $\tilde{\Sigma} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$.
Notice that it does not hold that for any two normally distributed random
variables X and Y , (X, Y ) is jointly normal.
Proof. The fact that X + Y is normal follows from property 7.4. The rest is a
consequence of property 1.61.
7.2 Stationarity
(Xt1 , Xt2 , ..., Xtk ) ∼ (Xt1 +s , Xt2 +s , ..., Xtk +s ), (∀t1 , ..., tk ≥ 0, k ∈ N, s ≥ 0)
meaning that they have the same distribution. For general stochastic processes,
it is not straightforward to calculate the joint density function. This makes
checking whether the process is stationary very difficult. That is why one some-
times prefers to use weak stationarity. For this, we only need to be able to
calculate the first and second-order statistics of the process.
Definition 7.9. Let Xt be a stochastic process. We say that this process is weakly
stationary if
\[ E[X_t] = E[X_0] \quad (\forall t \geq 0) \]
and
\[ K(t, t+s) = K(0, s) \quad (\forall t, s \geq 0). \]
Consider, for instance, a process $X_n$, $n \in \mathbb{N}$, with
\[ E[X_i] = 1 \quad (\forall i \in \mathbb{N}), \qquad K(t, t+s) = K(0, s) = \delta_{s0} \quad (\forall t, s \in \mathbb{N}). \]
Such a process is weakly stationary, but need not be stationary.
Hence, the (easy-to-check) weak stationarity does not necessarily imply the
(hard-to-check) regular stationarity. However, as the next theorem shows, this
handy equivalence does hold for Gaussian processes.
In the remainder of this chapter, we will show how Gaussian processes can
be used in order to perform regression. Furthermore, in the next chapter, we will
focus on a particular example of a Gaussian process called Brownian motion.
In a general regression setting, one is given a set of features (often called covari-
ates) and target values {(xi , yi ) | i = 1, ..., N } and the goal is to find a function
f : X → Y that best describes/fits the relationship. Here, xi can be both one-
dimensional or multi-dimensional xi = (xi1 , ..., xim ). The quality of the fit can
be measured using a loss function, such as the quadratic loss
\[ L_2(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2. \]
The function that best describes the data (xi , yi ) is then the function f that
minimizes the loss function L(f ). The closer the predicted value f (xi ) lies to
the true value yi , the lower the loss corresponding to that observation xi .
Example 7.3. Given a data set (xi , yi ), we could be interested in finding the
linear function that best represents the relationship of interest. In terms of loss
functions, this is the linear function
\[ f(x_i) = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij} \]
that minimizes the loss L2 (f ). It can be shown that the least squares estimators1
for the coefficients β are given by
β̂ = (X T X)−1 X T Y ,
1 This is the estimator that minimizes the squared loss.
where $X$ is the feature matrix $X = (x_{ij})$ and $Y$ is the target vector $Y = (y_j)$.
Figure 7.2: Polynomial regression with degree 2 applied to a noisy data set
One could ask why one would perform linear regression if polynomial regression
often leads to better fits. The reason for this lies in the fact that in many cases
the data (xi , yi ) has noise on it, and the added flexibility can start to encode
the noise instead of the actual relationship.
In the previous examples, we see that the functions are parameterized using
a fixed set of coefficients. Finding the best fitting function in these reduced
spaces is therefore equivalent to a search within their corresponding coefficient
space. These methods are called parametric methods.
Instead of fixing the model structure ab ovo, we can also use non-parametric
methods. An example of such a method is the k-nearest neighbors regression
algorithm.
• Given a point x, we calculate the k nearest points in the data set, N Bk (x)
Figure 7.3: The K-Nearest Neighbour regressor for a noisy data set
Notice that the K-nearest neighbors algorithm also minimizes a certain loss
function (can you see how?). In the next section, we will consider a framework
in which we can fit a model without using a loss function, namely Bayesian
regression.
Instead of minimizing a loss function, one can also use the Bayesian framework
in the setting of model learning. The upshot of this is that we will end up with a
distribution over the space of all possible models instead of just ending up with
one model2 . In order to return a single model using this probabilistic approach,
one could then use the model that maximizes this probability distribution.
Before even looking at the data, we have prior beliefs about which models
are more or less likely. In general, incredibly complex models are less likely
to represent the true underlying distribution than simpler models. This follows
from Occam’s razor, the principle that the simplest explanation is most often
the correct one. This yields a prior distribution P (h) for h ∈ H.
Given a model h, we can also consider the likelihood P (D | h), which rep-
resents how likely it is that the data is generated by the model h. Models whose
errors are very large tend to have a lower likelihood than those with low errors.
We have already obtained the least squares estimator β̂LS for this specific
problem in the previous section. Let us now consider the MAP estimator under
the assumption that all models are equally likely, i.e P (h) is constant over H.
Notice that
\[ P(D \mid h) = \prod_{(x,y) \in D} P(y \mid x, h)\, P(x). \]
Because of the Gauss-Markov assumptions, we know that (check this!)
\[ P(y_i \mid x_i, h) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - x_i^T \beta)^2}{2\sigma^2}} \]
and hence
\[ P(D \mid h) = \prod_{(x,y) \in D} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - x^T \beta)^2}{2\sigma^2}}\, P(x). \]
Since the logarithm is a monotone function, maximizing P (D | h) is equivalent
to maximizing the log-likelihood
\[ \log P(D \mid h) = \sum_{(x,y) \in D} \left( -\log\!\left(\sqrt{2\pi\sigma^2}\right) - \frac{(y - x^T\beta)^2}{2\sigma^2} + \log P(x) \right). \]
Thus, under the assumption that the prior is constant over H, the least
squares estimator coincides with the MAP estimator, which itself coincides with
the maximum likelihood estimator.
Using the Bayesian learning framework, we will now use Gaussian processes
as tools for regression problems.
Recall that the dynamics of a Gaussian process are all contained within the
mean function µt and the covariance function K(s, t). Instead of going from
the process to these functions, we can also construct Gaussian processes by
choosing a function µt and a covariance function K(s, t). The mean function
does not really have any restrictions, but the covariance function needs to satisfy
some conditions. For example, we do not want K to be negative for any pair
s, t.
There are a lot of possible choices for $K$; a common one is the negative squared exponential kernel used in the examples below. A realization of a Gaussian process can be viewed as a function
\[ f : \mathbb{R}_{\geq 0} \to \mathbb{R} : t \mapsto f(t) \equiv x_t. \]
F = {f : R≥0 → R} .
Example 7.7. We will sample a function from the distribution defined by the
Gaussian process with zero mean and squared negative exponential covariance
function. Instead of sampling a full function, we will sample a finite amount of
function values xt and connect these with lines. The upshot of using Gaussian
processes comes from the fact that this sample will have a known distribution.
We want to sample from (Xt1 , ..., Xt250 ), which by definition has a multivari-
ate normal distribution. Thus, we sample from a multivariate normal distribu-
tion with mean 0 and covariance matrix
\[ K(t_i, t_j) = e^{-\left(\frac{|i-j|}{10}\right)^2}. \]
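A minimal Python sketch of this sampling procedure is shown below; it draws one finite-dimensional sample path from the multivariate normal distribution defined by the kernel above.
\begin{verbatim}
import numpy as np

t = np.arange(250)
# negative squared exponential covariance K(t_i, t_j) = exp(-(|i-j|/10)^2)
K = np.exp(-((t[:, None] - t[None, :]) / 10.0) ** 2)

rng = np.random.default_rng(0)
sample = rng.multivariate_normal(mean=np.zeros(250), cov=K)   # (X_{t_1}, ..., X_{t_250})
print(sample[:5])
\end{verbatim}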
Example 7.8. We repeat the setting above but now use different kernels.
Assuming that, for our candidate functions $f$, $(f(X), f(Z))$ has a joint multivariate normal distribution with zero mean and covariance matrix determined by $K$, conditioning on the observed values $y = f(X)$ gives
\[ \tilde{\mu}(Z) = K(Z, X)\, K(X, X)^{-1} y \]
and
\[ \tilde{K}(Z) = K(Z, Z) - K(Z, X)\, K(X, X)^{-1} K(X, Z). \]
This is the posterior distribution on the space of functions $f$. Notice that both
parameters are completely determined by our data and choice of the subset Z.
Furthermore, we have the following properties
• The variance K̃(Z) is lower than the variance of the marginal Z. The
magnitude of the decrease depends on the distance between X and Z.
Figure 7.8: Training data for the estimation of the sinc function
Using the above approach, we can then find the mean function µ̃ and we
can also sample from the posterior distribution. We used the negative squared
exponential kernel for our Gaussian process. The obtained functions are given
in figure 7.9.
We see that the mean function lies very close to the true function. Away
from the training data points, we see that the sampled values vary quite a lot.
On the training data itself, there is no error. Show that this is true in general.
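The following Python sketch reproduces this kind of Gaussian process regression for the sinc example, using the posterior formulas above; the training points and the kernel length scale are illustrative choices.
\begin{verbatim}
import numpy as np

def kernel(a, b, ell=1.0):
    # negative squared exponential kernel
    return np.exp(-((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, 15)                 # training inputs
y = np.sinc(X)                             # noise-free observations of the sinc function
Z = np.linspace(-5, 5, 200)                # prediction points

K_XX = kernel(X, X) + 1e-10 * np.eye(len(X))   # small jitter for numerical stability
K_ZX = kernel(Z, X)
K_ZZ = kernel(Z, Z)

mu_post = K_ZX @ np.linalg.solve(K_XX, y)                 # posterior mean
K_post = K_ZZ - K_ZX @ np.linalg.solve(K_XX, K_ZX.T)      # posterior covariance
print(mu_post[:5], np.diag(K_post)[:5])
\end{verbatim}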
Chapter 8
Brownian Motion
Aristotle
8.1 Definition
These are often denoted using W instead of the usual X. In figure 8.1, one
can see some sampled paths of standard Brownian motion.
• It starts at 0: W (0) = 0
• It has independent increments: for $t_1 \leq t_2 \leq t_3 \leq t_4$, $W_{t_2} - W_{t_1}$ and $W_{t_4} - W_{t_3}$ are independent random variables.
Proof. We show that $W_{t+s} - W_s \sim N(0, t)$. First, notice that for all $t$, $W_{t+s} - W_s$ is normally distributed with mean zero, being a linear combination of jointly normal random variables. For the variance, notice that
\begin{align*}
\mathrm{Var}(W_{t+s} - W_s) &= \mathrm{Var}(W_{t+s}) + \mathrm{Var}(W_s) + 2\,\mathrm{Cov}(W_{t+s}, -W_s) \\
&= \mathrm{Var}(W_{t+s}) + \mathrm{Var}(W_s) - 2\,\mathrm{Cov}(W_{t+s}, W_s) \\
&= t + s + s - 2\min(t+s, s) = t + 2s - 2s = t.
\end{align*}
We also have that standard Brownian motion satisfies the continuous Markov
property. In the case of continuous stochastic processes, the Markov property
can be defined as follows.
Definition 8.6. Let Wt be a standard Brownian motion and let µ be any real
number. Then we define a Brownian motion with drift µ and infinitesimal variance
σ 2 via
Xt = σ Wt + µt.
This stochastic process satisfies $X_t \sim N(\mu t, \sigma^2 t)$. For an example of a Brownian motion with positive drift, see figure 8.2.
Remark. Alternatively, one can define a Brownian motion with drift µ using Gaus-
sian processes. Show that the Gaussian process with mean function µt = µt and
covariance function K(s, t) = σ 2 min(s, t) corresponds to the process defined above.
Wt ∼ N (0, t),
or in other words
\[ f_t(w) = \frac{1}{\sqrt{2\pi t}}\, e^{-\frac{w^2}{2t}}. \]
Using property 7.3, we can also fully determine the conditional probabilities.
Property 8.9. Let Wt be a Brownian motion. Then for any s < t, we have
\[ W_s \mid W_t = x \;\sim\; N\!\left( \frac{s}{t}\, x,\ \frac{s}{t}(t - s) \right). \]
Using the fact that Brownian motions are Gaussian processes, it is quite easy
to generate sample paths. Another interesting way in which one can generate
sample paths is by using the continuous-time limit of a symmetric random
walk. Recall that a random walk could be obtained by considering the sum of
a sequence of IID random variables
\[ S_0 = 0, \qquad S_n = \sum_{i=1}^{n} X_i \quad (n \geq 1). \]
We assume that this walk is symmetric (i.e E [X] = 0) and that the underlying
process has unit variance Var (X) = 1.
Therefore,
\[ \lim_{n\to\infty} W_n(t) = \lim_{n\to\infty} \frac{S_{\lfloor nt \rfloor}}{\sqrt{n}} = \lim_{n\to\infty} \frac{S_{\lfloor nt \rfloor}}{\sqrt{\lfloor nt \rfloor}} \sqrt{\frac{\lfloor nt \rfloor}{n}} \;\to_d\; \sqrt{t}\, N(0, 1) = N(0, t). \]
Using a similar strategy, one can show that for $s < t$, the increment $W_n(t) - W_n(s)$ converges in distribution to $N(0, t - s)$, independently of $W_n(s)$. Hence, this scaled random walk converges in the limit to a Brownian motion. A graphical representation of this convergence can be found in figure 8.3.
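The following is a small sketch in Python of this construction: a symmetric random walk is scaled in time and space to approximate a standard Brownian motion path.
\begin{verbatim}
import numpy as np

def brownian_path(T=1.0, n=10_000, rng=np.random.default_rng(0)):
    """Approximate a standard Brownian motion on [0, T] by a scaled symmetric random walk."""
    steps = rng.choice([-1.0, 1.0], size=n)        # IID steps with mean 0 and variance 1
    S = np.concatenate([[0.0], np.cumsum(steps)])
    t = np.linspace(0.0, T, n + 1)
    W = S * np.sqrt(T / n)                         # W_n(t_k) = S_k * sqrt(T/n)
    return t, W

t, W = brownian_path()
print(W[-1])   # approximately N(0, T) distributed
\end{verbatim}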
In the next result, we list some of the interesting properties that Brownian
motion satisfies.
\[ P(W_{t+\Delta t} \in A \mid W_t = x) = P(W_{t+\Delta t} - W_t \in A - x \mid W_t = x), \]
\[ \{ W_{T + \Delta t} - W_T \mid \Delta t \geq 0 \}, \qquad -W \sim W. \]
Remark. Another interesting property of the paths of Brownian motion is that even
though it is continuous, it is nowhere differentiable! To gain intuition on why
Brownian motion is nowhere differentiable, consider the motion of a dust particle
in a liquid. For a small enough time frame, we know that the dust particle will
remain close to its original position, meaning that this is indeed continuous. For it
to be smooth, we need a stronger condition: there exists a small time frame where
the movement of the dust particle is approximately straight. However, the erratic
movement of a dust particle does not satisfy this condition as it can abruptly change
directions. Hence, these paths are not smooth.
8.2.1 Motivation
The core idea of ordinary calculus is that if we know how f changes and if we
know the initial value, we can determine the values of f . Indeed, if we know
f ′ (t) and f (0), then we can find f (t) for all t via
\[ f(t) = \int_0^t f'(u)\, du + f(0), \]
since
\[ \int_0^t f'(u)\, du = f(t) - f(0). \]
Of course, this assumes some smoothness and integrability conditions on f and
f ′.
Suppose now that the change through time of the quantity is not determinis-
tic anymore, but stochastic. This could be due to (measurement) noise or could
be intrinsic to the quantity we are trying to model. The question we want to
answer is: can we still recover the quantity if the changes become stochastic? It
is obvious that the recovered quantity itself will also be random. In fact, since
it changes through time we could model it as a stochastic process. We could
therefore write
Xt+∆t − Xt = Ct,t+∆t ,
where Ct,t+∆t is the random variable that indicates the change from t to t + ∆t
in the quantity Xt . Let us try to rewrite this equality in more familiar terms.
For now, assume that the randomness is due to noise. We can then split
the random variable Ct,t+∆t into a deterministic component (the ’signal’) and a
random component (the ’noise’). We will write the rate of change due to the
signal as µ(t, Xt) and the change due to the noise as Nt,t+∆t. For small ∆t, we can then write
$$X_{t+\Delta t} - X_t = \mu(t, X_t)\,\Delta t + N_{t,t+\Delta t}.$$
Let us now focus on the noise component Nt,t+∆t . In most cases, noise tends to
come from a large number of sources. For example, in communication systems,
the signal can be polluted due to noise coming from the sun, cosmic radiation,
and man-made sources. In general, we assume that the noises coming from
these sources in a given time interval [t, t + ∆t] are IID. We also assume that
the mean of the noise is zero, meaning that on average we do not expect any
noise. We can define the noise component Nt,t+∆t as the sum of these IID and zero-mean random variables. By the central limit theorem, it follows that we can model this noise term by
$$N_{t,t+\Delta t} \sim N\!\left(0,\; \sigma^2_{t,t+\Delta t}\right).$$
As ∆t grows, the number of noise terms in the interval [t, t + ∆t] increases. As a result, the variance σ²t,t+∆t of Nt,t+∆t grows with ∆t. We will assume that this can be modeled as
$$\sigma^2_{t,t+\Delta t} = \sigma^2(t, X_t)\,\Delta t.$$
Therefore, we can model this noise term in terms of a standard Brownian motion using definition 8.6, and in the limit ∆t → 0 we write
$$dX_t = \mu(t, X_t)\,dt + \sigma(t, X_t)\,dW_t,$$
where
$$dW_t = \lim_{\Delta t \to 0} (W_{t+\Delta t} - W_t)$$
and similarly
$$dX_t = \lim_{\Delta t \to 0} (X_{t+\Delta t} - X_t).$$
We now ask ourselves the question: Is it possible to solve this differential equa-
tion? In other words, can we find a stochastic process Xt whose evolution
satisfies the dynamics set by the right-hand side of the equation just like we
were able to do in the case of classical calculus? Since there is now a stochastic
component, this is no longer an ordinary differential equation. Instead, we call
such a differential equation a stochastic differential equation.
Luckily, the answer is yes!² Even better, the solution looks a lot like what we have in the classical case:
$$X_{t+s} = X_s + \int_s^{t+s} \mu(X_u, u)\,du + \int_s^{t+s} \sigma(X_u, u)\,dW_u.$$
In order to understand this, we first need to understand what we mean by $\int \sigma(X_u, u)\,dW_u$.
Remark. Recall that for real functions f and random variables X we have that f(X) is also a random variable.³ Similarly, applying a real function f to a stochastic process Xt yields a stochastic process. Thus, we will start by looking at an integral of the form
$$\int_0^t Y_u\,dW_u,$$
where Yu is a stochastic process.
² If the answer to this question was no, this section would have been a waste of time for everyone involved...
³ This actually only holds for so-called measurable functions.
In this section, we will define the notion of stochastic integrals. The way we will do this heavily resembles the way one would construct the Lebesgue integral:
• Define the integral for a class of easy processes.
• Show that these easy processes can approximate a large set of stochastic processes.
In the case of the Lebesgue integral, one uses step functions. We will use the continuous stochastic process analog, namely simple processes.
Definition 8.12. Let Yt be a stochastic process. We say that Yt is a simple process if we can find a finite set 0 = t0 < t1 < t2 < ... < tN < ∞ = tN+1 and random variables Z0, Z1, ..., ZN such that Yt = Zi for ti ≤ t < ti+1.
Still following the idea behind the construction of the Lebesgue integral, we
will define the stochastic integral or Itô integral as follows.
Definition 8.13. Let Yt be a simple process. Denote by Z0, Z1, ..., ZN the associated random variables and the corresponding times by 0 = t0 < t1 < ... < tN. Let Wt be a Brownian motion. Then we define the Itô integral Xt of the simple process Yt,
$$X_t = \int_0^t Y_u\,dW_u,$$
as
$$X_t = \sum_{i=0}^{k-1} Z_i\,(W_{t_{i+1}} - W_{t_i}) + Z_k\,(W_t - W_{t_k}), \qquad \text{where } t_k \le t < t_{k+1}.$$
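As a sanity check of this definition, the following sketch (illustrative Python, assuming numpy; the helpers sample_W and ito_integral_simple are our own names) evaluates the Itô integral of a simple process along one sampled Brownian path.

import numpy as np

rng = np.random.default_rng(1)

def sample_W(times):
    # Sample a Brownian motion at the given increasing times.
    times = np.asarray(times, dtype=float)
    increments = rng.normal(0.0, np.sqrt(np.diff(np.concatenate(([0.0], times)))))
    return np.cumsum(increments)

def ito_integral_simple(breakpoints, Z, t):
    # Itô integral at time t of the simple process Y_u = Z_i on [t_i, t_{i+1}).
    breakpoints = np.asarray(breakpoints, dtype=float)     # 0 = t_0 < t_1 < ... < t_N
    k = np.searchsorted(breakpoints, t, side="right") - 1  # t_k <= t < t_{k+1}
    grid = np.concatenate((breakpoints[:k + 1], [t]))
    W = np.concatenate(([0.0], sample_W(grid[1:])))        # W at t_0 = 0, ..., t_k, t
    return float(np.sum(Z[:k + 1] * np.diff(W)))

# Y_u = 1 on [0, 0.5) and Y_u = 3 on [0.5, 1): the integral is W_0.5 + 3 (W_1 - W_0.5).
print(ito_integral_simple([0.0, 0.5, 1.0], np.array([1.0, 3.0, 0.0]), 1.0))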
We show that this definition for the Itô integral on simple processes satisfies
some very desirable properties.
Property 8.14. Let Wt be a Brownian motion and suppose Yt and Vt are simple processes. Then the Itô integral Xt = ∫₀ᵗ Yu dWu has mean zero, E[Xt] = 0, and satisfies the Itô isometry
$$\mathrm{Var}(X_t) = \mathbb{E}\left[X_t^2\right] = \int_0^t \mathbb{E}\left[Y_u^2\right] du.$$
Proof. The proof of the first statement is straightforward. For the second state-
ment, notice that if Yt is a simple process, then so are the processes
Hence
$$\mathrm{Var}(X_t) = \mathbb{E}\left[X_t^2\right] = \mathbb{E}\left[\left(\sum_{i=0}^{k-1} Z_i\,(W_{t_{i+1}} - W_{t_i}) + Z_k\,(W_t - W_{t_k})\right)^{2}\right].$$
Using the fact that Brownian motion has independent increments and that E[(Wt − Ws)²] = t − s for s < t, one can easily show that this gives
$$\mathrm{Var}(X_t) = \sum_{i=0}^{k-1} \mathbb{E}\left[Z_i^2\right](t_{i+1} - t_i) + \mathbb{E}\left[Z_k^2\right](t - t_k).$$
Therefore, we have
$$\mathrm{Var}(X_t) = \sum_{i=0}^{k-1} \int_{t_i}^{t_{i+1}} \mathbb{E}\left[Z_i^2\right] ds + \int_{t_k}^{t} \mathbb{E}\left[Z_k^2\right] ds = \int_0^t \mathbb{E}\left[Y_u^2\right] du.$$
We now show that these processes can approximate a larger set of stochas-
tic processes. This larger set consists of all uniformly bounded, continuous
stochastic processes.
Proposition 8.15. Let Yt be a stochastic process such that
• Yt is uniformly bounded, and
• t ↦ Yt is continuous.
Then we can find a sequence of simple processes Yt(n) such that
$$\lim_{n\to\infty} \mathbb{E}\left[\int_0^t \left|Y_u^{(n)} - Y_u\right|^2 du\right] = 0$$
for all t.
Proof. We only give the construction of the sequence. Notice that it suffices to
show this for all t separately. We show this for t = 1. The sequence is given by
$$Y_s^{(n)} = \sum_{i=0}^{n} I\!\left(\frac{i-1}{n} < s < \frac{i}{n}\right) \cdot n \int_{\frac{i-1}{n}}^{\frac{i}{n}} Y_u\,du.$$
One can show that this sequence has the desired properties.
Using this result, one can define the integral for the more general case. We
will omit the technical details.
Definition 8.16. Let Yt be a stochastic process satisfying the conditions in proposition 8.15. Let Yt(n) be the associated sequence from the same proposition. Then we can define the Itô integral of Yt,
$$\int_0^s Y_u\,dW_u,$$
as the limit of the Itô integrals of the simple processes Yt(n).
One can show that all the properties in property 8.14 are also satisfied by
the Itô integral of the more general class of processes.
Remark. The integral can also be defined for unbounded stochastic processes Yt.
As an illustration, consider a deterministic integrand f. By definition, we have
$$\int_s^t f(u)\,dW_u = \lim_{n\to\infty} \sum_{k=0}^{2^n - 1} f(z_{n,k})\left(W_{z_{n,k+1}} - W_{z_{n,k}}\right),$$
where the points zn,k = s + k(t − s)/2^n partition [s, t] into 2^n equal pieces, so that
$$W_{z_{n,k+1}} - W_{z_{n,k}} \sim N\!\left(0,\; \frac{t - s}{2^n}\right).$$
Using the facts that Brownian motion has independent increments and f is
deterministic, the above result becomes, in the limit,
$$\int_s^t f(u)\,dW_u \;\sim\; \lim_{n\to\infty} N\!\left(0,\; \sum_{k=0}^{2^n - 1} f(z_{n,k})^2\,\frac{t - s}{2^n}\right).$$
We recognize the Riemann sum in the variance, and by the definition of the
classical integral
$$\lim_{n\to\infty} \sum_{k=0}^{2^n - 1} f(z_{n,k})^2\,\frac{t - s}{2^n} = \int_s^t f(u)^2\,du.$$
One can show that the limit of this sequence of normally distributed random
variables indeed converges in distribution:
$$\int_s^t f(u)\,dW_u \;\sim\; N\!\left(0,\; \int_s^t f(u)^2\,du\right).$$
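This distributional statement is easy to probe numerically. The sketch below (illustrative, assuming numpy; the integrand f is an arbitrary example) approximates the integral by the finite sum over a fine grid and compares the sample variance with ∫ₛᵗ f(u)² du.

import numpy as np

rng = np.random.default_rng(2)

def wiener_integral(f, s, t, n=2048, paths=5000):
    # Approximate int_s^t f(u) dW_u by sum_k f(z_k) (W_{z_{k+1}} - W_{z_k}).
    z = np.linspace(s, t, n + 1)
    dW = rng.normal(0.0, np.sqrt((t - s) / n), size=(paths, n))
    return (f(z[:-1]) * dW).sum(axis=1)

f = lambda u: np.exp(-u)
samples = wiener_integral(f, 0.0, 2.0)
print(samples.mean())                    # ~ 0
print(samples.var())                     # ~ int_0^2 exp(-2u) du
print((1 - np.exp(-4.0)) / 2)            # exact value of the variance integral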
We are now in the same position as in a first calculus class, just after learning about integrals via Riemann sums. Even though we
now understand a bit more about what is going on, we cannot really calculate
any integrals. In order to do this, we will need some more preparation.
Just like we looked at the construction of the Lebesgue integral for inspira-
tion for the previous section, we will now look at the fundamental theorem of
(ordinary) calculus for inspiration on how to actually solve integrals.
If the stochastic process is a Brownian motion we write VΠ and QΠ for the total
variation and the quadratic variation respectively.
Remark. In order to relax the notation, we do not refer to the interval bounds T1
and T2 . As the results that follow hold for any choice of T1 and T2 , this should not
cause any confusion.
Proposition 8.19. Let Wt be a standard Brownian motion, then its total variation
satisfies
E [VΠ ] = ∞.
Remark. The above proposition implies that the total variation of a Brownian motion is unbounded over any interval with non-zero Lebesgue measure!
Proposition 8.20. Let Wt be a standard Brownian motion, then its quadratic variation satisfies
$$\mathbb{E}[Q_\Pi] = T_2 - T_1.$$
In particular,
$$\mathbb{E}\left[dW_t^2\right] = dt,$$
and dW is of order √dt.
However, if we take into account the result from proposition 8.20, we would
get
$$d(W_t^2) = 2W_t\,dW_t + dW_t^2.$$
By integrating both sides we get
$$W_t^2 = 2\int_0^t W_s\,dW_s + \int_0^t dW_s^2 = 2\int_0^t W_s\,dW_s + t,$$
which yields
$$\int_0^t W_s\,dW_s = \frac{1}{2}\left(W_t^2 - t\right).$$
This correction term is precisely what we needed in order to make the expec-
tations of both sides of the equation equal.
Remark. One could wonder what happens with dWt³ and higher orders. It turns out that they can be taken to be zero: for example, E[|dWt|³] is proportional to dt^{3/2}, which is negligible compared to dt.
$$df = \left(\frac{\partial f}{\partial t} + \mu_t \frac{\partial f}{\partial x} + \frac{\sigma_t^2}{2}\frac{\partial^2 f}{\partial x^2}\right) dt + \sigma_t \frac{\partial f}{\partial x}\, dW_t.$$
Thus, we get
$$f(X_t, t) - f(X_s, s) = \int_s^t \sigma(X_u, u)\,\frac{\partial f}{\partial x}(X_u, u)\,dW_u + \int_s^t \left(\frac{\partial f}{\partial t}(X_u, u) + \mu(X_u, u)\frac{\partial f}{\partial x}(X_u, u) + \frac{\sigma^2(X_u, u)}{2}\frac{\partial^2 f}{\partial x^2}(X_u, u)\right) du.$$
The next example shows how one can use Itô’s lemma in order to solve
SDEs.
Example. Consider the SDE dXt = µXt dt + σXt dWt with X0 = 1 (a geometric Brownian motion). Applying Itô's lemma to f(x) = log(x) gives
$$d\log(X_t) = \left(\mu - \frac{\sigma^2}{2}\right) dt + \sigma\, dW_t.$$
Integrating both sides yields
$$\log(X_t) = \int_0^t \left(\mu - \frac{1}{2}\sigma^2\right) ds + \int_0^t \sigma\, dW_s = \left(\mu - \frac{1}{2}\sigma^2\right) t + \sigma W_t.$$
Since Xt = exp(log(Xt)), we find that
$$X_t = e^{\left(\mu - \frac{1}{2}\sigma^2\right)t + \sigma W_t}.$$
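As a quick numerical check of this solution, the sketch below (illustrative Python, assuming numpy; µ, σ and the grid are example choices) drives an Euler–Maruyama discretisation of dXt = µXt dt + σXt dWt and the closed-form solution with the same Brownian increments; the two agree up to discretisation error.

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, T, n = 0.1, 0.3, 1.0, 1000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))
t = np.linspace(0.0, T, n + 1)

# Exact solution X_t = exp((mu - sigma^2 / 2) t + sigma W_t), taking X_0 = 1.
X_exact = np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

# Euler-Maruyama discretisation of dX = mu X dt + sigma X dW on the same path.
X_euler = np.empty(n + 1)
X_euler[0] = 1.0
for i in range(n):
    X_euler[i + 1] = X_euler[i] * (1 + mu * dt + sigma * dW[i])

print(abs(X_exact[-1] - X_euler[-1]))    # small discretisation error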
Integrating this yields the same result as we have seen before, namely
$$W_t^2 = t + 2\int_0^t W_s\,dW_s.$$
In this section, we consider the notion of hitting times for Brownian motion.
Recall that we have already seen these in the chapter on Markov processes. We
first define the continuous counterpart.
Definition 8.24. Let Wt be a Brownian motion. The first-passage time (or hitting
time) of Wt for a > 0 is the first time Wt hits the value a:
Ta = inf {t ≥ 0 | Wt = a} .
For any hitting time Ta , we can define the reflected path Rt , which is ob-
tained by reflecting the portion of Wt after the hitting time Ta about the line
y = a.
Definition 8.25. Let Wt be a Brownian motion and Ta be the hitting time for a real value a. Then we define the reflected path
$$R_t = \begin{cases} W_t & t < T_a \\ 2a - W_t & t \ge T_a \end{cases}.$$
This reflected path mimics the original path Wt up until the random time
Ta , see figure 8.5.
For times that occur after the hitting time, we have the following property.
Property 8.26. If Ta < t, then
a + Wt − WTa ∼ N (a, t − Ta ).
Proof. Since Ta is evidently a stopping time, the Strong Markov Property for
Brownian motions (property 8.10) gives us that Vt = WTa +t − WTa is a Brownian
motion independent of Ws for s ≤ Ta. Hence, property 8.7 tells us that Vt ∼ N(0, t) for all t ≥ 0. In particular, for Ta < t we have Wt = a + Vt−Ta, which conditionally on Ta is distributed as N(a, t − Ta).
We now try to find the distribution of the hitting time, i.e P (Ta ≥ t).
Property 8.28. The distribution for the hitting time Ta is given by
P (Ta ≤ t) = 2P (Wt ≥ a) .
Thus,
$$P(T_a \le t) = 2\left(1 - \Phi\!\left(\frac{a}{\sqrt{t}}\right)\right),$$
where Φ is the cumulative distribution function for a standard normally distributed
random variable.
Proof. By the law of total probability,
$$P(W_t \ge a) = P(W_t \ge a \mid T_a \le t)\, P(T_a \le t) + P(W_t \ge a \mid T_a > t)\, P(T_a > t),$$
where the second term vanishes: by continuity of the paths, the event {Wt ≥ a} is impossible when the level a has not yet been hit. Using the result from property 8.27, we thus find
$$P(T_a \le t) = 2\, P(W_t \ge a).$$
The result now easily follows from the fact that Wt ∼ N (0, t).
As an immediate consequence, one can show (do this!) that the density
function will be given by
$$f_{T_a}(t) = \frac{|a|}{\sqrt{2\pi t^3}}\, e^{-\frac{a^2}{2t}}. \qquad (t \ge 0)$$
Examples of this density are given in figure 8.6 for several values of a.
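The hitting-time law can also be checked by simulation. The sketch below (illustrative, assuming numpy and scipy; the grid-based estimate of Ta is slightly biased upwards because the path is only observed at discrete times) compares the empirical P(Ta ≤ t) with 2(1 − Φ(a/√t)).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
a, t_max, n, paths = 1.0, 2.0, 2000, 5000
dt = t_max / n

dW = rng.normal(0.0, np.sqrt(dt), size=(paths, n))
W = np.cumsum(dW, axis=1)

hit = (W >= a).any(axis=1)                                # reached a before t_max?
T_a = np.where(hit, (W >= a).argmax(axis=1) + 1, n) * dt  # approximate hitting time

t = 1.5
print(np.mean(T_a <= t))                                  # empirical P(T_a <= t)
print(2 * (1 - norm.cdf(a / np.sqrt(t))))                 # 2 (1 - Phi(a / sqrt(t)))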
Since the probability of hitting a state a is one, the next natural question is
to ask what the expected hitting time is. In the following result, we show that
even though the probability to hit the state is 1, the expected hitting time for
any non-zero state is infinite.
Proposition 8.30. The expected hitting time for any nonzero state is infinite, i.e
∀a ∈ R \ {0},
E [Ta ] = ∞.
Proof. Denote by B the set B = {−a, +a} and denote tB = E[TB], ta = E[Ta]. Since a ≠ 0, it holds that tB ≠ ta. Furthermore, since a ∈ B, it is obvious that tB < ta. Suppose we reach −a first; then the expected hitting time starting from −a to a can be written as the expected hitting time from −a to 0 (denoted r0) plus the expected hitting time from 0 to a, which was ta. Thus⁴,
$$t_a = t_B + \frac{1}{2}\left(r_0 + t_a\right).$$
Since we have stationary transition probabilities (property 8.10), we find that ta = r0. Hence
$$t_a = t_B + t_a.$$
Since tB > 0, the above equality forces ta = ∞.
We have thus found the distribution and expected value of the hitting times
for a Brownian motion. In the following section, we will consider some other
interesting distributions that are related to the dynamics of a Brownian motion.
⁴ The equality ta = tB + ½(r0 + ta) follows from ta = ½ tB + ½ (tB + ta + r0), which is obtained by splitting ta into the two (equally likely) scenarios where Wt hits a first or Wt hits −a first.
Define the maximum process of a Brownian motion as
$$M_t = \max\,\{W_u \mid 0 \le u \le t\}.$$
Notice that
$$T_a \le t \iff M_t \ge a.$$
The above result allows us to use the results from the section on hitting
times in this case.
Proposition 8.33. The density function for the maximum process is given by
$$f_{M_t}(z) = \sqrt{\frac{2}{\pi t}}\, e^{-\frac{z^2}{2t}}. \qquad (z \ge 0)$$
Let us now consider the expected maximum level of Brownian motions. This
can easily be obtained since we have the density for the maximum process.
Property 8.34. For a Brownian motion, the expected value of the maximum is
$$\mathbb{E}[M_t] = \sqrt{\frac{2t}{\pi}}.$$
Proof. This follows from a simple calculation and is left as an exercise for the
reader.
Property 8.36. For a Brownian motion, the expected value of the minimum is
$$\mathbb{E}[m_t] = \mathbb{E}[-M_t] = -\sqrt{\frac{2t}{\pi}}.$$
Letting t → ∞, we thus find that E[Mt] → ∞ and E[mt] → −∞. For this to hold, the Brownian motion must cross the zero line infinitely often, and thus by property 8.10 any level is crossed infinitely many times. Thus, any level a ∈ R is recurrent.
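These expectations are easy to confirm by Monte Carlo; the following sketch (illustrative, assuming numpy; the discretised extremes slightly underestimate the true ones) does so for t = 1.

import numpy as np

rng = np.random.default_rng(5)
t, n, paths = 1.0, 1000, 10000
dW = rng.normal(0.0, np.sqrt(t / n), size=(paths, n))
W = np.cumsum(dW, axis=1)

M = np.maximum(W.max(axis=1), 0.0)        # running maximum over [0, t] (W_0 = 0)
m = np.minimum(W.min(axis=1), 0.0)        # running minimum over [0, t]
print(M.mean(), np.sqrt(2 * t / np.pi))   # E[M_t] = sqrt(2t / pi)
print(m.mean(), -np.sqrt(2 * t / np.pi))  # E[m_t] = -sqrt(2t / pi)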
In the previous section, we have observed that the zero state is expected to be
hit infinitely often. In this section, we are interested in studying the behavior of
these zeroes.
Proposition 8.37. Let Wt be a Brownian motion and let s < u. Then the probability that Wt has no zero in the interval (s, u) is given by
$$P\!\left(Z^C\right) = \frac{2}{\pi}\arcsin\!\left(\sqrt{\frac{s}{u}}\right).$$
Here, Z denotes the event Z = {Wt has at least one zero in (s, u)}.
Proof. Conditioning on Ws = a, a zero in (s, u) occurs exactly when the shifted Brownian motion hits −a within time u − s, i.e.
$$P(Z \mid W_s = a) = P(T_{-a} \le u - s).$$
Notice that the right integral is symmetric around 0, and hence we can further
rewrite this as
$$P(Z) = 2\int_0^{u-s}\!\!\int_0^{\infty} \frac{1}{2\pi\sqrt{s\,y^3}}\; a\, e^{-\frac{a^2}{2y}}\, e^{-\frac{a^2}{2s}}\; da\, dy.$$
We obtain
$$P(Z) = \frac{1}{\pi}\int_0^{u-s} \frac{1}{\sqrt{s\,y^3}}\,\frac{sy}{s+y}\, dy,$$
which evaluates to
$$P(Z) = \frac{2}{\pi}\arctan\!\left(\sqrt{\frac{u-s}{s}}\right).$$
Since arctan(√((u − s)/s)) = arccos(√(s/u)), this can be rewritten as
$$P(Z) = \frac{2}{\pi}\arccos\!\left(\sqrt{\frac{s}{u}}\right).$$
Thus,
$$P(Z) = \frac{2}{\pi}\left(\frac{\pi}{2} - \arcsin\!\left(\sqrt{\frac{s}{u}}\right)\right) = 1 - \frac{2}{\pi}\arcsin\!\left(\sqrt{\frac{s}{u}}\right).$$
Hence,
$$P\!\left(Z^C\right) = \frac{2}{\pi}\arcsin\!\left(\sqrt{\frac{s}{u}}\right).$$
Using this result, we can find the distribution of the time since the last zero.
Proposition 8.38. Let Lt denote the time of the last zero of a standard Brownian motion Wu in [0, t], i.e.
$$L_t = \sup\,\{s < t \mid W_s = 0\}.$$
Then
$$P(L_t < s) = \frac{2}{\pi}\arcsin\!\left(\sqrt{\frac{s}{t}}\right).$$
This means that Lt is arcsine-distributed with support (0, t). Its density is
$$f_{L_t}(s) = \frac{1}{\pi\sqrt{s(t-s)}}. \qquad (0 < s < t)$$
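The arcsine law can be observed empirically as well. In the sketch below (illustrative, assuming numpy; the last zero of each path is approximated by its last sign change on the grid), the empirical P(Lt < s) is compared with (2/π) arcsin(√(s/t)).

import numpy as np

rng = np.random.default_rng(6)
t, n, paths = 1.0, 1000, 10000
dt = t / n

dW = rng.normal(0.0, np.sqrt(dt), size=(paths, n))
W = np.concatenate((np.zeros((paths, 1)), np.cumsum(dW, axis=1)), axis=1)

# A zero in a grid cell is detected as a sign change of the discretised path.
sign_change = W[:, :-1] * W[:, 1:] <= 0
last_zero = sign_change.shape[1] - 1 - np.argmax(sign_change[:, ::-1], axis=1)
L = last_zero * dt

s = 0.3
print(np.mean(L < s))                            # empirical P(L_t < s)
print(2 / np.pi * np.arcsin(np.sqrt(s / t)))     # arcsine law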
We have already studied the maximum process Mt from a Brownian motion, but
in this section, we study the distribution of the times at which those maximums
are attained for the first time. In other words, we consider the first-passage
times of the maximum process associated with a Brownian motion.
Formally, we define
$$U(t) = \inf\,\{s \le t \mid W_s = M_t\}.$$
Its density turns out to be
$$f_{U(t)}(u) = \frac{1}{\pi\sqrt{u(t-u)}}, \qquad (0 < u < t)$$
so U(t) is also arcsine-distributed on (0, t).
Proof. Start by choosing values 0 ≤ a < y. The set of paths {Mt ≤ y, Ta ≤ u | u < t}
contains all paths of Wt which, at time t, have already hit a but have not ex-
ceeded y. Furthermore, since a ≥ 0 we have that at the first passage time Ta = s,
it holds that Ws = Ms = a. Thus, from property 8.10 we find that
P (Mt ≤ y | Ta = s) = P (Mt−s ≤ y − a) .
We now try to relate this to the density of the random variable U(t). Notice that Mt = y implies that U(t) = Ty (check this!). In the joint density equation above, we can hence set a = y and Ta = Ty = U(t), yielding
$$f_{M_t, U}(y, u) = \frac{y}{\pi u\sqrt{u(t-u)}}\, e^{-\frac{y^2}{2u}}. \qquad (0 < u < t,\; y > 0)$$
Notice that, for dt small enough, the increment dNt of a Poisson process with rate λ is roughly Bernoulli distributed:
$$dN_t \approx_d \mathrm{Bernoulli}(\lambda\, dt).$$
When no jumps are present, recall that an Itô process can be written as
$$dX_t = \mu_t\, dt + \sigma_t\, dW_t.$$
Processes of this form are also called diffusion processes, because of their ties to the motion of random particles.
We have seen that a function of an Itô process f (Xt ) is, under some mild
conditions, again an Itô process with associated stochastic differential equation
$$df(X_t) = \left(\frac{\partial f}{\partial t} + \mu_t \frac{\partial f}{\partial x} + \frac{1}{2}\sigma_t^2 \frac{\partial^2 f}{\partial x^2}\right) dt + \sigma_t \frac{\partial f}{\partial x}\, dW_t.$$
We will generalize this to stochastic processes with jump dynamics as well. This
generalization will be defined as jump-diffusion processes.
In order to make these processes a bit more tractable, we make the following
dependency assumptions.
The only jump-diffusion processes which are Itô processes are those with no
jump dynamics5 . As a consequence, we can’t apply Itô’s lemma to jump-
diffusion processes.
However, one can show that for a stochastic function f and a jump-diffusion
process Xt , the associated stochastic process f (Xt ) is still a jump-diffusion pro-
cess. In fact, we have the following generalization of Itô’s lemma.
Let Xt be a jump-diffusion process,
$$dX_t = \mu_t\, dt + \sigma_t\, dW_t + J_t\, dN_t,$$
and suppose f(Xt, t) is a stochastic function that satisfies some mild conditions. Then
$$df(X_t) = \left(\frac{\partial f}{\partial t} + \mu_t \frac{\partial f}{\partial x} + \frac{1}{2}\sigma_t^2 \frac{\partial^2 f}{\partial x^2}\right) dt + \sigma_t \frac{\partial f}{\partial x}\, dW_t + \bigl(f(X_t + J_t, t) - f(X_t, t)\bigr)\, dN_t.$$
With St denoting the price of the equity, σ representing the equity volatility and µ the expected return, the stochastic differential equation of St is given by
$$\frac{dS_t}{S_{t^-}} = \mu\, dt + \sigma\, dW_t + (J_t - 1)\, dN_t.$$
The jumps caused by dNt can be due to unannounced news (such as the start of
a war), or due to scheduled news such as the introduction of monetary policies
affecting the equity of interest.
The jump size Jt represents the recovery value: if it is 0, it means that the equity became worthless after the jump. If it is 0.9, it means that it dropped 10% in value. Some models also allow Jt to be larger than 1, so that positive news can be included in the jump-diffusion model.
Figure 8.12: Some sample paths from the Merton Jump-Diffusion model with
log-normal jumps
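A path of this model is straightforward to simulate with the Bernoulli approximation of dNt from the beginning of this section. The sketch below (illustrative Python, assuming numpy; all parameter values and the log-normal jump specification are example choices, not prescribed by the text) generates one sample path of the Merton jump-diffusion.

import numpy as np

rng = np.random.default_rng(7)

def merton_path(mu=0.1, sigma=0.2, lam=1.0, jump_mu=-0.1, jump_sigma=0.15,
                S0=100.0, T=1.0, n=1000):
    # dS / S_ = mu dt + sigma dW + (J - 1) dN with log-normal jump sizes J.
    dt = T / n
    S = np.empty(n + 1)
    S[0] = S0
    for i in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))
        dN = rng.random() < lam * dt          # dN_t is roughly Bernoulli(lambda dt)
        J = np.exp(rng.normal(jump_mu, jump_sigma)) if dN else 1.0
        S[i + 1] = S[i] * (1 + mu * dt + sigma * dW + (J - 1) * dN)
    return S

print(merton_path()[-1])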
8.6 Martingales
A stochastic process Xt is called a martingale if E[|Xt|] < ∞ for all t and, for any times t1 < ... < tn < t,
$$\mathbb{E}\left[X_t \mid X_{t_1}, \ldots, X_{t_n}\right] = X_{t_n}.$$
In other words, given all the information regarding a stochastic process up to some point tn, the best guess we have for any point in the future is the latest observation Xtn.
We show that some of the processes we have seen before are martingales.
Example 8.4. Consider the random walk Sn generated by the random vari-
ables Xi with distribution
$$P(X_i = k) = \begin{cases} \tfrac{1}{2} & k = 1 \\ \tfrac{1}{2} & k = -1 \end{cases}.$$
We claim that Sn is a martingale; indeed, E[Sn+1 | S1, ..., Sn] = Sn + E[Xn+1] = Sn.
Similarly, standard Brownian motion is a martingale: for s < t,
$$\mathbb{E}\left[W_t \mid W_u, u \le s\right] = \mathbb{E}\left[W_t - W_s + W_s \mid W_u, u \le s\right] = \underbrace{\mathbb{E}\left[W_t - W_s \mid W_u, u \le s\right]}_{=\,0} + \mathbb{E}\left[W_s \mid W_u, u \le s\right] = W_s.$$
This is only the case for Brownian motions without drift. Indeed, for a Brownian motion with drift
$$X_t = \mu t + \sigma W_t,$$
we have
$$\mathbb{E}\left[X_{t+s} \mid X_s\right] = \mu t + X_s \ne X_s.$$
Consider now the exponential process Me(t) = e^{aWt − a²t/2} for some a ∈ R. We claim that this is a martingale. For this, notice that for any s < t, we can decompose (check this!)
$$M_e(t) = M_e(s)\, e^{a(W_t - W_s) - \frac{a^2(t-s)}{2}}.$$
Hence,
$$\mathbb{E}\left[M_e(t) \mid M_e(s)\right] = M_e(s)\,\mathbb{E}\left[e^{a(W_t - W_s) - \frac{a^2(t-s)}{2}}\right] = M_e(s)\,\mathbb{E}\left[e^{X}\right],$$
where X ∼ N(−a²(t − s)/2, a²(t − s)). Notice that the second term on the right-hand side is exactly the moment-generating function of X evaluated at 1, ΦX(1), and using example 1.20 we see that this equals
$$\mathbb{E}\left[e^{X}\right] = \Phi_X(1) = e^{-\frac{a^2(t-s)}{2} + \frac{a^2(t-s)}{2}} = 1.$$
Hence,
E [Me (t) | Me (s)] = Me (s).
Proof. It suffices to show that W1 =d Xt. By property 8.10, we know that
$$\sqrt{c}\, W_t =_d W_{ct}.$$
Proof. We have
K(Wt , Ws ) = min(t, s)
it follows that
Proof. We need to show that for any 0 ≤ t1, ..., tN, it holds that (Xt1, ..., XtN) is jointly normally distributed. By property 7.4, this is equivalent to showing that any non-trivial linear combination is normally distributed, i.e.
$$\sum_{i=1}^{N} a_i X_{t_i} \sim N(\mu, \sigma^2)$$
for some µ and σ².
We write
$$\sum_{i=1}^{N} a_i X_{t_i} = \sum_{i=1}^{N} a_i\, e^{-t_i} W_{e^{2t_i}}.$$
Define the real numbers αi = ai e^{−ti} for all i = 1, ..., N and define the positive real numbers si = e^{2ti}. Using these, we can rewrite
$$\sum_{i=1}^{N} a_i X_{t_i} = \sum_{i=1}^{N} \alpha_i W_{s_i},$$
which is normal since Brownian motions are Gaussian processes. Since the choice of a1, ..., aN was arbitrary, the same holds for any non-trivial linear combination, proving the result.
Proof. This follows from property 8.48, proposition 8.49, and theorem 7.10.
The next result gives the stochastic differential equation of the Ornstein-
Uhlenbeck process.
Proposition 8.51. The stochastic differential equation
$$dX_t = -X_t\, dt + \sqrt{2}\, dW_t$$
has as solution
$$X_t = e^{-t} W_{e^{2t}}.$$
Proof. Denote by Xt the solution of the SDE dXt = −Xt dt + √2 dWt. This is an Itô process with drift µt(Xt, t) = −Xt and diffusion coefficient σt(Xt, t) = √2. Consider f(x, t) = eᵗ x. By Itô's lemma,
$$df = \left(\frac{\partial f}{\partial t} - X_t \frac{\partial f}{\partial x} + \frac{2}{2}\frac{\partial^2 f}{\partial x^2}\right) dt + \sqrt{2}\,\frac{\partial f}{\partial x}\, dW_t = \left(e^t X_t - e^t X_t + 0\right) dt + \sqrt{2}\, e^t\, dW_t = \sqrt{2}\, e^t\, dW_t.$$
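The representation Xt = e^{−t} W_{e^{2t}} and the SDE can also be compared numerically. In the sketch below (illustrative, assuming numpy), the marginal law at time T obtained from the representation is compared with an Euler–Maruyama discretisation of the SDE started from the matching initial law N(0, 1); both stay close to the stationary N(0, 1) distribution.

import numpy as np

rng = np.random.default_rng(8)
T, n, paths = 2.0, 2000, 10000
dt = T / n

# Representation: X_T = e^{-T} W_{e^{2T}} with W_{e^{2T}} ~ N(0, e^{2T}).
X_repr = np.exp(-T) * rng.normal(0.0, np.exp(T), size=paths)

# Euler-Maruyama for dX = -X dt + sqrt(2) dW, started from X_0 = W_1 ~ N(0, 1).
X = rng.normal(0.0, 1.0, size=paths)
for _ in range(n):
    X += -X * dt + np.sqrt(2.0 * dt) * rng.normal(size=paths)

print(X_repr.mean(), X_repr.var())        # ~ 0 and ~ 1
print(X.mean(), X.var())                  # ~ 0 and ~ 1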
When modeling the short rate r(t) using a stochastic process, most traders want the predicted model prices of the instruments to coincide, on average, with the prices observed on the current market. This means that, on average, our process for rt should coincide with the market-implied curve θ0(t). This can be forced using mean reversion. Assuming a deterministic function for σ, we can use theorem 8.17 to find the desired result,
$$\mathbb{E}[r_t] = \theta_0(t).$$
It turns out that these processes are all related to a special kind of stochas-
tic process, called geometric Brownian motions, which have some interesting
properties that make them incredibly useful. In fact, the Nobel prize-winning
Black-Scholes model crucially uses these stochastic processes. They are defined
as follows.
from which it follows that Xt is lognormally distributed, Xt ∼ LogN(0, t). The result then follows from example 1.19.
We have already seen some results on stopping times, such as the Strong Markov
Property. In this section, we introduce the related notions of stopped processes, and also introduce a new stopping time called the first exit time.
In other words, the stopped process is the process that stops changing as soon as the
stopping time is reached.
The next result tells us that a stopped martingale process is also a martin-
gale.
Theorem 8.55. Let Xt be a martingale, and let T be a stopping time for the process
Xt . Then the stopped process Zt = Xt∧T is also a martingale. Thus in particular for
all t,
E [Zt ] = E [Z0 ] = E [X0 ] .
Proof. Since Xt is a martingale, we have that E[|Xt|] < ∞ for all t. Hence, we have that E[|Zt|] < ∞ for all t as well.
Notice that, using the indicator function, we can split the stopped process
as
Zt = Zs + (Xt − Xs )I(s < T ).
Check that this coincides with Zt for all t ≥ 0. Then, we find that
E [Zt | Xu , u ≤ s] = Zs .
Theorem 8.56 (Optional stopping). Let Xt be a martingale and let T be a stopping time for Xt such that one of the following conditions holds:
• The stopping time is bounded almost surely, i.e. there exists a c ≥ 0 such that P(T ≤ c) = 1.
• The stopping time is almost surely finite (but not necessarily bounded) and there exists a positive number K such that for all t ≥ 0, P(|Xt| ≤ K) = 1.
Then E[XT] = E[X0].
Proof. We give an outline of the proof, see [DW] for more details. Since T is
almost surely finite, we can define the random variable XT on (Ω, F , P) via
XT (ω) = XT (ω) (ω), ∀ω such that T (ω) < ∞.
This defines XT on a set of probability one, which we denote by H. For the null
set where T is infinite, we can set it to 0⁶. Notice that for all ω ∈ H, we have a
pointwise convergence
lim Xt∧T (ω) = XT (ω).
t→∞
It remains to be shown that E [XT ∧t ] → E [XT ] , since the result then follows
from theorem 8.55. Using the additional assumptions, one can construct a
random variable that is both integrable and dominates |XT ∧t | for all t. From
this, the Dominated Convergence Theorem yields the desired result.
We will use this result in order to study first exit times. They are defined as
follows.
Definition 8.57. The first exit time from the interval (b, a) is the first time at which a Brownian motion Wt hits either a or b, where b < 0 < a:
$$T_{ab} = \inf\,\{t \ge 0 \mid W_t \notin (b, a)\}.$$
Since Brownian motion has continuous paths that start at 0, this is a non-zero
random variable almost surely.
It is evident that first exit times are stopping times. Just like we did with
hitting times, we now consider the probability that these random variables are
finite.
Property 8.58. The exit time Tab for a finite interval (b, a) satisfies
P (Tab < ∞) = 1.
Proof. It is evident that for the hitting time Ta , we have Tab ≤ Ta . The result
then follows from proposition 8.29.
⁶ Expectations are integrals, whose values are not influenced by events in null sets.
Consider now the stopped martingale with stopping time Tab . As soon as
the Brownian motion exits the interval (b, a), it will stay at the boundary that it
hit. Notice that the Brownian motion can leave through either a or b. What is
the probability of these scenarios?
Proposition 8.59. Suppose Wt is a standard Brownian motion and let b < 0 < a
be an interval. Then the probability of first exit via a is given by
$$P(T_{ab} = T_a) = \frac{b}{b - a}.$$
Hence, the probability of first exit via b is given by
$$P(T_{ab} = T_b) = \frac{a}{a - b}.$$
Proof. Notice that |Wt∧Tab| < a − b for all t, which together with property 8.58 allows us to apply theorem 8.56 to find
$$\mathbb{E}\left[W_{T_{ab}}\right] = \mathbb{E}[W_0] = 0.$$
Since WTab equals a with probability P(Tab = Ta) and b otherwise, this gives
$$0 = a\,P(T_{ab} = T_a) + b\,\bigl(1 - P(T_{ab} = T_a)\bigr),$$
so that
$$P(T_{ab} = T_a) = \frac{b}{b - a} = 1 - \frac{a}{a - b} = 1 - P(T_{ab} = T_b),$$
from which the required result follows.
Recall that the expected hitting time E [Ta ] was infinite for any state a , 0.
We now consider the expected exit time for the interval b < 0 < a. We have the
following result.
Property 8.60. Let Wt be a Brownian motion and let b < 0 < a be an interval.
The expected exit time is given by
E [Tab ] = −ab.
Proof. We will leverage the fact that the quadratic process is a martingale (see example 8.7):
$$X_t = 2\int_0^t W_s\, dW_s = W_t^2 - t.$$
With Tab being the exit time for Wt, it is easy to see that Tab is a stopping time for Xt. Therefore, by theorem 8.56 (check why the assumptions hold), it follows that
$$0 = \mathbb{E}[X_0] = \mathbb{E}\left[X_{t \wedge T_{ab}}\right] = \mathbb{E}\left[W_{T_{ab} \wedge t}^2\right] - \mathbb{E}\left[t \wedge T_{ab}\right].$$
Letting t → ∞, this yields
$$\mathbb{E}[T_{ab}] = \mathbb{E}\left[W_{T_{ab}}^2\right] = a^2\, P(T_{ab} = T_a) + b^2\, P(T_{ab} = T_b) = \frac{-a^2 b + a b^2}{a - b} = -ab.$$
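Both exit-time results are easy to verify by simulation. The sketch below (illustrative, assuming numpy; the step size and the number of paths are arbitrary, and the discrete walk slightly overshoots the boundary) estimates P(Tab = Ta) and E[Tab].

import numpy as np

rng = np.random.default_rng(9)
a, b = 1.0, -2.0
dt, paths = 1e-3, 2000

exit_at_a = np.zeros(paths, dtype=bool)
exit_time = np.zeros(paths)
for p in range(paths):
    w, time = 0.0, 0.0
    while b < w < a:                        # still inside the interval (b, a)
        w += rng.normal(0.0, np.sqrt(dt))
        time += dt
    exit_at_a[p] = w >= a
    exit_time[p] = time

print(exit_at_a.mean(), b / (b - a))        # P(exit via a) = b / (b - a)
print(exit_time.mean(), -a * b)             # E[T_ab] = -ab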
We will once again use a different stochastic process. This time, we will use
the exponential process. We have seen in example 8.8 that this process, given
by
$$X_t = e^{\alpha W_t - \frac{\alpha^2 t}{2}}, \qquad (\alpha > 0)$$
is a martingale. Denote by Ta the hitting time of Wt and by Tab the exit time of Wt. Applying theorem 8.56⁷, we find
$$1 = \mathbb{E}[X_0] = \mathbb{E}\left[X_{T_a}\right] = \mathbb{E}\left[e^{\alpha W_{T_a} - \frac{\alpha^2 T_a}{2}}\right] = e^{\alpha a}\,\mathbb{E}\left[e^{-\theta T_a}\right],$$
where we write θ = α²/2 and use that WTa = a. Notice that the expectation on the right-hand side is exactly the Laplace transform of the random variable Ta:
$$L_{T_a}(\theta) = \mathbb{E}\left[e^{-\theta T_a}\right] = \int_0^{\infty} e^{-\theta t} f_{T_a}(t)\, dt.$$
Recall that we have already derived the density of hitting times, namely
$$f_{T_a}(t) = \frac{|a|}{\sqrt{2\pi t^3}}\, e^{-\frac{a^2}{2t}}.$$
Using the fact that E[XTa] = 1 = e^{αa} LTa(θ), we easily find that
$$L_{T_a}(\theta) = \int_0^{\infty} e^{-\theta t} f_{T_a}(t)\, dt = e^{-\alpha a} = e^{-\sqrt{2\theta}\, a}.$$
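The closed form can be confirmed by integrating the hitting-time density numerically (illustrative sketch, assuming numpy and scipy; a and θ are arbitrary example values).

import numpy as np
from scipy.integrate import quad

a, theta = 1.5, 0.7

def f_Ta(t):
    # Density of the hitting time T_a for a standard Brownian motion.
    return abs(a) / np.sqrt(2 * np.pi * t**3) * np.exp(-a**2 / (2 * t))

laplace, _ = quad(lambda t: np.exp(-theta * t) * f_Ta(t), 0, np.inf)
print(laplace)                              # numerical Laplace transform
print(np.exp(-np.sqrt(2 * theta) * a))      # closed form e^{-sqrt(2 theta) a}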
Suppose that Tb < Ta , such as in figure 8.17. Then for the Brownian motion
to hit a, it first needs to move from b to a, which of course is independent
of {Ws | s ≤ Tb }. Applying the Strong Markov property, property 8.10, we can
rewrite
$$T_a \mid (T_b < T_a) = T_b + \inf\{s > 0 \mid W_{T_b + s} = a\} = T_b + \inf\{s > 0 \mid W_{T_b + s} - W_{T_b} = a - b\} = T_b + \tilde{T}_{a-b}$$
(the second equality uses the stationary transition probabilities of Brownian motion),
where T̃a−b is the hitting time for the state a−b of the Brownian motion W̃t that
starts in 0 at Tb , see figure 8.18.
Notice that the times Tb and T̃a−b are independent. Furthermore, since
b < 0 < a, notice that a − b > a.
$$e^{\sqrt{2\theta}\, b} = L_{T_b}(\theta) = \underbrace{\mathbb{E}\left[e^{-\theta T_b}\, I(T_b < T_a)\right]}_{x_2} + \underbrace{\mathbb{E}\left[e^{-\theta T_a}\, I(T_b \ge T_a)\right]}_{x_1}\, e^{-\sqrt{2\theta}(a-b)}.$$
Together with the analogous decomposition of LTa(θ), this gives the system
$$e^{-\sqrt{2\theta}\, a} = x_1 + x_2\, e^{-\sqrt{2\theta}(a-b)}, \qquad e^{\sqrt{2\theta}\, b} = x_2 + x_1\, e^{-\sqrt{2\theta}(a-b)}.$$
Solving this system yields
$$x_1 = \frac{e^{-\sqrt{2\theta}\, b} - e^{\sqrt{2\theta}\, b}}{e^{\sqrt{2\theta}(a-b)} - e^{-\sqrt{2\theta}(a-b)}} = \frac{\sinh(-\sqrt{2\theta}\, b)}{\sinh(\sqrt{2\theta}(a-b))}, \qquad x_2 = \frac{e^{\sqrt{2\theta}\, a} - e^{-\sqrt{2\theta}\, a}}{e^{\sqrt{2\theta}(a-b)} - e^{-\sqrt{2\theta}(a-b)}} = \frac{\sinh(\sqrt{2\theta}\, a)}{\sinh(\sqrt{2\theta}(a-b))}.$$
$$L_{T_{ab}}(\theta) = \mathbb{E}\left[e^{-\theta T_b}\, I(T_b < T_a) + e^{-\theta T_a}\, I(T_a < T_b)\right] = x_1 + x_2 = \frac{\sinh(-\sqrt{2\theta}\, b) + \sinh(\sqrt{2\theta}\, a)}{\sinh(\sqrt{2\theta}(a-b))}.$$
In order to find the density of the hitting times, one can apply an inverse
Laplace transform.
[AM] Andrey Markov, Wikimedia Commons. Retrieved August 20, 2022 from
https://commons.wikimedia.org/wiki/File:Andrei_Markov.jpg