
Stochastic Models

Gregory van Kruijsdijk


Jan De Spiegeleer
Wim Schoutens

June 18, 2023


Contents

1 Probability Theory
1.1 Probabilities
1.1.1 Probability Spaces
1.1.2 Conditional Probability
1.2 Random Variables
1.2.1 Discrete Random Variables
1.2.2 Examples of Discrete Random Variables
1.2.3 Continuous Random Variables
1.2.4 Examples of continuous random variables
1.2.5 Multivariate Random Variables
1.2.6 Transformations of Random Variables
1.3 Expectations, moments and Moment Generating Functions
1.4 Conditioning

2 Poisson Processes
2.1 Stochastic Processes
2.1.1 Introduction
2.2 Poisson processes
2.2.1 First Definition
2.2.2 Second Definition
2.2.3 Third Definition
2.2.4 Conditionals of Poisson Processes
2.2.5 Thinning and Superposition of Poisson Processes

3 Extensions
3.1 Non-homogeneous Poisson Processes
3.2 Mixture Models
3.2.1 Mixed Poisson Processes
3.2.2 Bernoulli Mixture Model

4 Renewal Processes
4.1 Definition
4.2 Long-term laws
4.2.1 Strong law of Large Numbers and the CLT
4.3 Stopping Times
4.3.1 Wald's Equation
4.4 Renewal-Reward Theorem

5 Markov Processes
5.1 Definition
5.1.1 Examples
5.2 Multi-step dynamics
5.2.1 Chapman-Kolmogorov Equations
5.3 Simulating Markov Chains
5.4 Calibration of Markov Chains
5.5 Structure of Markov Chains
5.5.1 Equivalence Classes
5.5.2 Hitting and Passage Times
5.5.3 Recurrence and Transience
5.5.4 Strong Markov property and Recurrence revisited
5.5.5 Expected Returns
5.5.6 Periodicity
5.5.7 Canonical Decomposition of Markov Chains
5.6 Absorption dynamics
5.7 Stationarity
5.7.1 Finding the stationary distribution
5.8 Branching Processes

6 Hidden Markov Models
6.1 Gaussian Mixture Models
6.2 Hidden Markov Models
6.2.1 Likelihood: The Forward Algorithm
6.2.2 Decoding: The Viterbi Algorithm
6.2.3 Learning: The Forward-Backward Algorithm

7 Gaussian Processes
7.1 Introduction
7.2 Stationarity
7.3 Gaussian Process Regression
7.3.1 Introduction to Regression
7.3.2 Bayesian Regression
7.3.3 Gaussian Process Regression

8 Brownian Motion
8.1 Definition
8.2 Stochastic Calculus
8.2.1 Motivation
8.2.2 Stochastic Integrals
8.2.3 Itô's lemma and variations
8.3 Reflection Principle and Hitting Times
8.4 Related distributions
8.4.1 Maximum of a Brownian Motion
8.4.2 Zeroes of Brownian Motion
8.4.3 Times of maximum processes
8.5 Jump-Diffusion Processes
8.6 Martingales
8.7 Related Processes
8.7.1 Ornstein-Uhlenbeck Process
8.7.2 Geometric Brownian Motion
8.8 Optional stopping and First Exits
8.9 Hitting and Exit time transforms
Introduction

All models are wrong, but some are useful.

George Box

Someone who wishes to understand the world around them has two options:
they either become religious or a mathematician.

Our physical world is so complex that trying to model even the simplest
processes can be quite difficult. One reason for this is that there is a vast amount
of variables that can influence the outcome of an experiment: temperatures,
pressure, air humidity, wind, electromagnetic signals, ... . These variables need
to be taken into account in some way or another.

The main way to deal with this is to keep the covariates fixed. Often ex-
periments are done in a sterile lab or a vacuum. However, there is always one
variable that we will never be able to fix, namely time. Unfortunately, time is
correlated with almost everything and hence we cannot just ignore time as a
variable.

Since we can’t keep it fixed nor ignore it we are forced to take it into account
more directly. For example, suppose we want to consider the evolution of a
stock price. We don’t consider this evolution as one value, e.g.

S = $14,


but we consider it instead as a function of time

\[ S_t : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0} : t \mapsto \text{price of the stock at time } t. \]

Just like nature, stock prices have a huge amount of variables that drive
their evolution. Even if one could theoretically pinpoint what all these vari-
ables are, it would be practically impossible to determine their values1 and how
they impact the stock prices. This is where we enter the world of randomness
and probabilities: instead of trying to find a deterministic function for St , one
defines a function that tells us how likely it is to observe a given value for St .

\[ P(S_t) : \mathbb{R}_{\geq 0} \times \mathbb{R}_{\geq 0} \to [0, 1] : (t, x) \mapsto \text{probability that the stock at time } t \text{ has value } x. \]

This random function St is, as we will see later, an example of a stochastic


process! Essentially, stochastic processes are "just" a sequence of variables all
of which are random. However, as the discussion above shows, they pop up in
many natural questions we ask ourselves when we try to model the world around
us. They form a fundamental tool for mathematicians, physicists, engineers,
quants, and many more.

Even though they are crucial tools, we hope that the reader will appreciate
their beauty as a topic in itself. We will see in the following chapters that
stochastic processes have fascinating properties and results that, in contrast
to a lot of other domains in mathematical modeling, often require little to no
technical results. The proofs do not require deep topological, algebraic, or
analytical results making the topic surprisingly accessible.

However, as the name and the introduction suggests, the reader does need
at least some basic probability theory. Therefore we will start by covering some
probability theory, mainly focusing on concepts and results which we will need
in later chapters.

We will then look at Poisson processes and their extensions. They are one of
the most well-known and studied stochastic processes, and for a reason. They
arise naturally in a lot of different scenarios.
1 For example, stock prices are heavily determined by human thinking, which is arguably the most unpredictable variable of them all.

We will also look at Markov processes, which a lot of us will probably already
be familiar with. Our focus is mainly on the dynamical properties of Markov
processes.

Afterwards, Gaussian processes are introduced. We will mainly focus on


Brownian motion. We will also consider some processes that are
closely related to Brownian motion, such as geometric Brownian motion and
Ornstein-Uhlenbeck processes.
Chapter 1

Probability Theory

How dare we speak of the laws of chance? Is not chance the antithesis of all law?

Joseph Bertrand

Probability theory is the natural mathematical setting when dealing with un-
certainties or randomness. The first results in this field were already obtained
in the sixteenth century when most of the efforts were made in modeling gam-
bling games. However, it was only in the twentieth century that the first formal
mathematical framework of probabilities was constructed by Kolmogorov. In
his celebrated paper Foundations of the theory of probability, he developed the
axiomatic system of probability theory that is still used to date.

We will use Kolmogorov’s axiomatic system as our starting point for this
chapter. We will look at some important concepts, such as conditional proba-
bilities and random variables. This chapter is mainly for those who never had
a mathematical course on probability theory or those that need a refresher.


Figure 1.1: Andrej Kolmogorov, founder of modern probability theory

1.1 Probabilities

1.1.1 Probability Spaces

When thinking about probabilities, we often think of throwing a die or flipping


a coin. The reason for this is that we can repeat the same trial but obtain
different, unpredictable outcomes. We will call such processes experiments.

Definition 1.1. An experiment is a process whose outcome is not known in advance.

Many scientific experiments are also experiments in the above sense. Even
though one often tries to account for the covariates that influence the outcome,
it is often not feasible or even impossible to control everything. This can lead
to small fluctuations in the set-up or the procedure of the experiment which can
in turn lead to large differences in the outcomes. 1

Given such an experiment, we want to be able to construct a probabilistic


model that is able to convey the uncertainty regarding the outcomes. In order
to do this, we will have to first define what exactly we mean by ’outcomes’. We
will call the set of all outcomes the sample space or universe.
1 If we know the exact weight distribution of a coin, the angular momentum of the flip, and even the density of the air, we could in theory predict the outcome of the coin flip. Since, in general, we do not possess this information, the outcome becomes uncertain.

Definition 1.2. The sample space Ω of an experiment is the set of all possible
outcomes ω of the experiment.

Example 1.1.

1. Consider the experiment where we flip a coin. Then the sample space is
given by
Ω = {H, T }
Here, the outcome H denotes the outcome where the coin lands on heads.
2. Consider the experiment where we flip a coin three times in a row. Then,
the sample space is given by
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

3. In a clinical trial, we want to consider the time elapsed from the moment
a patient was treated and the time when the patient is cured. Then, the
sample space is given by
Ω = R≥0 .

Notice that in the first two examples, the sample space is finite. In the last
example, the set is uncountable.

In the next step, we want to associate probabilities to the outcomes of the


experiments. For reasons that will become clear later, it is in general not suf-
ficient to only assign probabilities to singletons in the sample space. We also
want to obtain probabilities for sets of outcomes, which we will call events.
Definition 1.3. An event A is a subset of the sample space Ω.

Example 1.2.

1. Consider the game where we throw three coins. An example of an event


is ’the first toss is a head’. As a set, this is given by
A = {HHH, HHT, HTH, HTT}.

2. Suppose we roll a die. An example of an event is ’the die lands on an


even number’. As a set, this is given by
A = {2, 4, 6} .

3. Given a sample space, there are a total of $2^{|\Omega|}$ possible events, where $|\Omega|$ represents the cardinality of the sample space.

A somewhat surprising result from measure theory is that it is not always


possible to assign a probability to any possible subset of the sample space. For
the familiar Lebesgue measure2 , one can show that there are sets that cannot
be given a unique measure3 . Instead, we sometimes have to restrict ourselves to
a special substructure of the power set, called a σ −algebra. The exact details
are not that important to us right now, however, it is a necessary ingredient for
probability theory.
Definition 1.4. A σ-algebra F is a set of events such that

1. ∅ ∈ F

2. If A ∈ F then A^C ∈ F

3. If A_i ∈ F, i = 1, 2, ..., then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.

It is easily seen that a σ −algebra F is a subset of the power set.

Example 1.3. Consider the sample space Ω = {a, b, c, d} , then a possible


σ −algebra is
F = {∅, {a, b} , {c, d} , {a, b, c, d}} .

On these σ −algebras, we will assign probabilities. Evidently, we will use a


function to assign to each event A ∈ F a probability. These functions are called
probability functions or probability measures.
2 This is a measure function, a concept closely related to probability measures.
3 An example of such a set is the Vitali set.

Definition 1.5. A probability function P is a function
\[ P : \mathcal{F} \to [0, 1] : A \mapsto P(A) \]
such that

• P(Ω) = 1

• For any event A ∈ F, we have 0 ≤ P(A) ≤ 1

• For any countable sequence of disjoint events ($A_j \cap A_k = \emptyset$, $j \neq k$), we have
\[ P\left(\bigcup_i A_i\right) = \sum_i P(A_i). \]

This property is called the σ-additive property.

Intuitively, the properties of a probability function say:

• The probability of the experiment having any outcome is 1.

• For any event, the probability is between 0 and 1.

• If the two events consist of two non-overlapping sets of outcomes, then


the probability of either of them happening is the same as the sum of
their respective probabilities.

From this definition, we can immediately show that probability measures


satisfy the following properties.

Property 1.6. A probability function has the following properties.

1. $P(A^C) = 1 - P(A)$

2. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

3. $P\left(\bigcup_i A_i\right) \leq \sum_i P(A_i)$ (Boole's Inequality)

Figure 1.2: Visualization of property 1.6

Proof. 1. Using σ-additivity, we find
\[ P(A^C) + P(A) = P(A^C \cup A) = P(\Omega) = 1, \]
from which the result follows.

2. Using σ-additivity and the fact that $A \cup B = A \cup (B \setminus (A \cap B))$, one can show
\[ P(A \cup B) + P(A \cap B) = P(A) + P(B \setminus (A \cap B)) + P(A \cap B). \]
Using σ-additivity again, we find
\[ P(A) + P(B \setminus (A \cap B)) + P(A \cap B) = P(A) + P(B). \]
Combining the above equalities yields the desired result.

3. Left as an exercise for the reader. (Tip: Show this using induction.)

Combining the above concepts, we have a so-called probability space.

Definition 1.7. The probability space is the structure (Ω, F , P).

Thus, a probability space consists of a universe Ω of possible outcomes,


together with a set of events F for which we can assign probabilities using P.

We have seen that by the axioms of probability theory, any probability func-
tion P satisfies P(Ω) = 1. This can also hold for other events A, which we will
define as follows.
Definition 1.8. Let A ∈ F be such that P(A) = 1. Then we say that A holds almost surely, denoted A a.s. The complement of an event that holds almost surely is called a null set.

We will see examples of null sets later.

1.1.2 Conditional Probability

We start this section using an example from insurance.

Example 1.4. Insurance companies sometimes offer flood insurance to home-


owners. This is a contract in which the homeowner pays a premium but in
return is protected against possible damages from flooding. From historical
data, they have found that for two homes A and B, the individual flood risks
are quite low
P (AFlood ) ≈ 0 and P (BFlood ) ≈ 0.
However, for the insurance company, it is also important to know the probability
that both houses flood at the same time since this will lead to a large loss.
Suppose that A and B are neighbours, then

\[ P(A_{\text{Flood}} \cap B_{\text{Flood}}) \approx P(A_{\text{Flood}}) \gg P(A_{\text{Flood}})\, P(B_{\text{Flood}}). \]

This shows that for many applications, it is important to include information


regarding the relationship between events!

We now introduce the notion of conditional probability, which lies at the heart of Bayes' Theorem.

Definition 1.9. If P(B) > 0, the conditional probability that A occurs given that B occurs is given by
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \]

Alternatively, we have
P (A ∩ B) = P (A | B) P (B) .

Suppose we have that P (A | B) > P (A). This implies that if we know that B
happened, it becomes more probable to also observe A. For example, suppose
A denotes whether a person is infected by some disease and B denotes the
outcome of a test. Then generally speaking we would say that the person has a
higher probability of being infected when the test is positive.

If the probability remains the same, we say that the events are independent.

Definition 1.10. Two events A and B are independent if P (A | B) = P (A) .

Example 1.5. Suppose we throw two dice one after another.

• The events
A = The sum of the two throws is 12
and
B = The first throw is a six
are dependent.

• The events
A = The first throw is a six
and
B = The second throw is a four
are independent.
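Both claims can be verified by brute-force enumeration of the 36 equally likely outcomes of the two throws. The following sketch (our own illustration in Python, not part of the original text) does exactly that.

\begin{verbatim}
# Our own illustration: brute-force check of Example 1.5 over the 36
# equally likely outcomes of two dice throws.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))            # the sample space
def prob(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

# Dependent pair: A = "the sum is 12", B = "the first throw is a six"
A = lambda w: w[0] + w[1] == 12
B = lambda w: w[0] == 6
print(prob(lambda w: A(w) and B(w)), prob(A) * prob(B))     # 1/36 vs 1/216

# Independent pair: "first throw is a six", "second throw is a four"
A2 = lambda w: w[0] == 6
B2 = lambda w: w[1] == 4
print(prob(lambda w: A2(w) and B2(w)), prob(A2) * prob(B2)) # both 1/36
\end{verbatim}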

The following result concerning conditionals is called the law of total prob-
ability. It will be used a lot in the following chapters.

Property 1.11. Suppose $\{B_1, B_2, \ldots\}$ satisfies the following properties:

1. $B_i \cap B_j = \emptyset$ for all $i \neq j$

2. $\bigcup_i B_i = \Omega$.

Then, for any event A in the σ-algebra F we have
\[ P(A) = \sum_i P(A \mid B_i)\, P(B_i). \]

Figure 1.3: An example of an exhaustive set {B1 , ..., Bn }

Proof. Using the definition of conditional probability, we have
\[ \sum_i P(A \mid B_i)\, P(B_i) = \sum_i P(A \cap B_i). \]
Since the $B_i$ are disjoint, the sets $A \cap B_i$ are also disjoint. Hence, by the σ-additive property
\begin{align*}
\sum_i P(A \mid B_i)\, P(B_i) &= \sum_i P(A \cap B_i) \\
&= P\left(\bigcup_i (A \cap B_i)\right) = P\left(A \cap \bigcup_i B_i\right) \\
&= P(A \cap \Omega) = P(A).
\end{align*}

Sometimes, events are only independent conditional on another event. This


is called conditional independence.

Definition 1.12. Events A and B are conditionally independent given C if

P (A ∩ B | C) = P (A | C) P (B | C) .

Example 1.6. A trivial example of events that are unconditionally dependent


and conditionally independent is the following. Suppose we throw three dice
after each other, and we have the following events.

A = The sum of the first three dice is 18

B = The second throw is 6

C = The sum of the first two throws is 11.

Then notice that
\[ P(B) = \frac{1}{6} \qquad \text{and} \qquad P(B \mid A) = 1. \]

Hence, A and B are dependent. Notice, however, that $P(A \cap B \mid C) = 0 = P(A \mid C)$, so that $P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C)$, and hence the events are conditionally independent given C.

Example 1.7. Suppose a train's braking system consists of two brakes: a main brake A (system 1) and an emergency brake (system 2). Both brakes rely on a common component C, whose probability of failure is $p_C = P(F_C)$. Brake A itself fails with probability $p_A = P(F_A)$. Whenever component C fails, the emergency brake can still fall back on a component B, which fails with probability $p_B = P(F_B)$. We assume that component B is independent of component C. We are interested in computing the probability that system 1 fails but system 2 works.

Figure 1.4: Braking system

Since both systems rely on component C, the events that the two brakes fail are not independent. However, conditional on C they are independent. Denoting by $W_i$ (resp. $F_i$) the event that component or system $i$ works (resp. fails), we get
\begin{align*}
P(F_1 \cap W_2) &= P(F_1 \cap W_2 \mid W_C)\, P(W_C) + P(F_1 \cap W_2 \mid F_C)\, P(F_C) \\
&= p_A (1 - p_C) + (1 - p_B)\, p_C.
\end{align*}
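The conditioning argument can be checked numerically. The following sketch is our own illustration: it additionally assumes that component A fails independently of B and C, and it compares the closed-form answer with a Monte Carlo simulation for arbitrary example values of $p_A$, $p_B$ and $p_C$.

\begin{verbatim}
# Our own sketch of Example 1.7: exact answer via conditioning on C,
# compared with a Monte Carlo simulation. The failure probabilities are
# arbitrary example values, and A is assumed independent of B and C.
import numpy as np

p_A, p_B, p_C = 0.10, 0.05, 0.02
exact = p_A * (1 - p_C) + (1 - p_B) * p_C     # P(F1 and W2)

rng = np.random.default_rng(0)
n = 1_000_000
A_fails = rng.random(n) < p_A
B_fails = rng.random(n) < p_B
C_fails = rng.random(n) < p_C

sys1_fails = A_fails | C_fails                # brake A needs both A and C
sys2_works = ~C_fails | ~B_fails              # emergency brake: C, with B as backup
print(exact, np.mean(sys1_fails & sys2_works))
\end{verbatim}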

1.2 Random Variables

We are often not interested in the exact outcome of the experiment but rather
in some value that is associated with the outcome. This brings us to the concept
of random variables. We start by giving an example.

Example 1.8. When surveying a country’s population, the outcomes are

Ω = {People living in the country} .

However, our main interest is often a specific (numerical) feature, such as salary,
height, age, etc. For each feature, we have a corresponding random variable,
eg.
X : Ω → R : ω 7→ Salary of ω.

Definition 1.13. A random variable X on a probability space (Ω, F, P) is a function
\[ X : \Omega \to \mathbb{R} : \omega \mapsto X(\omega) \]
such that for all $a \in \mathbb{R}$,
\[ \{\omega \mid X(\omega) \leq a\} = X^{-1}((-\infty, a]) \in \mathcal{F}. \]

A random variable gives rise to a probability measure on R in a straightfor-


ward fashion.

Property 1.14. Given a probability space (Ω, F, P) and a random variable X, we can define the probability function $P_X$ on $\mathbb{R}$ via
\[ P_X(t) = P(\{\omega \in \Omega \mid X(\omega) = t\}) = P\left(X^{-1}(t)\right). \]

Notation 1.15. In the following, we will often omit the random variable from the
notation, meaning that we will write PX as P.

To make things more precise, we make the following technical remark.

Remark. Since P(A) is only defined for events A ∈ F, the induced probability measure in Property 1.14 only exists if for all events B in the associated σ-algebra on $\mathbb{R}$ we have
\[ X^{-1}(B) \in \mathcal{F}. \]
In measure theory, we then call X F-measurable. In the remainder of this book, we will always assume that the random variables we cover are F-measurable.

Associated with this induced probability function is the so-called distribu-


tion function.

Definition 1.16. The distribution function $F_X$ of X is defined by
\[ F_X(x) = P_X(X \leq x) = P(\{\omega \mid X(\omega) \leq x\}). \]

Figure 1.5: An example of a distribution function

This distribution function gives the probability that the observed random variable is bounded above by a given value x. Notice that
\[ P(a < X \leq b) = F_X(b) - F_X(a). \]
Hence, using the distribution function we can find the probability of X lying in any bounded half-open interval as well.

Equivalent to the distribution function is the tail function. It is defined as


follows.

Definition 1.17. The tail function $\bar{F}(x)$ of a random variable X is defined as
\[ \bar{F}(x) = 1 - F_X(x). \]

Thus, the tail function of a random variable X is the tail of the distribution
function. In order to understand these concepts a bit better, we look at some
examples of random variables.

1.2.1 Discrete Random Variables

We start by discussing the simplest of random variables, namely the discrete


random variables.

Definition 1.18. A discrete random variable is a random variable whose image is a


countable subset of R.

We first briefly discuss the notion of the countability of sets.

Remark. Examples of countable subsets are all finite sets, but also countably in-
finite sets. A set is countably infinite if there exists a bijection (i.e a one-to-one
correspondence) between the set itself and the natural numbers N.

In other words, countable sets are those sets that are either finite or ’as large as
N’.

Example 1.9.

• The rational numbers Q are countable.

• The real numbers R are not countable.

• Any non-empty open interval of R is not countable.

The reason for their simplicity lies in the fact that probability functions satisfy
σ −additivity. Indeed, we then have the following result.

Property 1.19. Suppose X is a discrete random variable. For any event A in the σ-algebra F of the probability space, we have
\[ P(X \in A) = \sum_{x \in A} p_x, \]
where $p_x = P(X = x)$. Thus, it suffices to define the probabilities of the singletons to define the full probability function of a discrete random variable.

Proof. The set of values of X that lie in A is contained in the image of X, which is countable, and is thus itself countable. We can therefore write the event $\{X \in A\}$ as the disjoint countable union of the events $\{X = x\}$ over this set. The result then follows from σ-additivity.

This seemingly innocent result is what separates discrete random variables


from their continuous counterpart, which we will cover later.

There are many different types of discrete random variables. In the next
sections, we will cover some of the most important ones.

1.2.2 Examples of Discrete Random Variables

Definition 1.20. Let X denote the number of successes in one experiment with given
success probability p. A success is denoted by X = 1. We denote the probability of a
failure as P (X = 0) = q = 1 − p. Then, we say that X is Bernoulli distributed with
success probability p.
Notation 1.21. We write X ∼ Bernoulli(p).

A straightforward extension is the number of successes in more than one


experiment. This is the Binomial distribution.
Definition 1.22. Let X denote the number of successes in n independent experiments
of the same experiment with success probability p. Then we say that X has a binomial
distribution.
Notation 1.23. We write X ∼ B(n, p) where n denotes the number of independent
experiments and p the success probability of the experiment.

The probability function of X ∼ B(n, p) is given by
\[ P(X = k) = \binom{n}{k}\, p^k (1-p)^{n-k}, \]
where
\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]
is the number of different possible ways to choose k items from a set of n items.

Definition 1.24. Suppose we have an experiment with success probability p. Let


X denote the number of failures before the first success occurs. Then we say that X
follows a geometric distribution.

Notation 1.25. We write X ∼ Geometric(p).

The probability function of X ∼ Geometric(p) is given by
\[ P(X = k) = (1-p)^k\, p. \]

Property 1.26. Suppose X is geometrically distributed with success parameter p. Then
\[ P(X \geq k) = (1-p)^k. \]

Proof. It suffices to show that $P(X < k) = 1 - (1-p)^k = 1 - q^k$. Notice that
\[ P(X < k) = \sum_{n=0}^{k-1} P(X = n) = \sum_{n=0}^{k-1} q^n p. \]
Hence
\begin{align*}
P(X < k) &= p + pq + \ldots + pq^{k-1} \\
&= p(1 + q + q^2 + \ldots + q^{k-1}) \\
&= (1-q)(1 + q + q^2 + \ldots + q^{k-1}) \\
&= 1 - q + q - q^2 + q^2 - \ldots - q^{k-1} + q^{k-1} - q^k \\
&= 1 - q^k.
\end{align*}
This shows the desired result.
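As a quick sanity check (our own sketch, not from the text), one can simulate the number of failures before the first success and compare the empirical tail probabilities with $(1-p)^k$:

\begin{verbatim}
# Our own check: empirical tail of a geometric variable versus (1-p)^k.
import numpy as np

p = 0.3
rng = np.random.default_rng(1)
# NumPy's geometric counts trials up to and including the first success,
# so subtracting 1 gives the number of failures, matching Definition 1.24.
X = rng.geometric(p, size=200_000) - 1

for k in range(5):
    print(k, np.mean(X >= k), (1 - p) ** k)
\end{verbatim}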

Definition 1.27. Let X have the following probability function:
\[ P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots \]
Then we say that X is Poisson distributed.

Notation 1.28. We write X ∼ P oisson(λ).

This distribution is often used to model the number of events that happen
in a given time interval.

The Poisson distribution also arises as a limit of the binomial distribution.


Proposition 1.29. Suppose X ∼ B(n, p) with p → 0 and np → λ as n → +∞. Then,
\[ P(X = k) \to e^{-\lambda} \frac{\lambda^k}{k!}. \]

Proof. Consider the sequence of random variables $X_n$ such that $X_n \sim B(n, p = \frac{\lambda}{n})$. Then
\[ P(X_n = 0) = \left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda} \qquad (n \to +\infty). \]
Furthermore,
\[ \frac{P(X_n = k+1)}{P(X_n = k)} = \frac{(n-k)\,\frac{\lambda}{n}}{(k+1)\left(1 - \frac{\lambda}{n}\right)} \to \frac{\lambda}{k+1} \qquad (n \to +\infty). \]
Combining the above equalities, one can iteratively show the desired result.
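A minimal numerical illustration of Proposition 1.29 (our own sketch) is to tabulate the $B(n, \lambda/n)$ probabilities for growing n and compare them with the Poisson probabilities:

\begin{verbatim}
# Our own illustration of Proposition 1.29: B(n, lambda/n) versus Poisson(lambda).
from math import comb, exp, factorial

lam = 2.0
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

for n in (10, 100, 1000):
    print(n, [round(binom_pmf(k, n, lam / n), 4) for k in range(5)])
print("Poisson", [round(poisson_pmf(k, lam), 4) for k in range(5)])
\end{verbatim}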

We now consider continuous random variables.

1.2.3 Continuous Random Variables

The idea that knowing P on the singletons is enough to determine P for all events A in the σ-algebra F, as in Property 1.19, no longer holds in the case of continuous random variables.

Luckily, we still have an alternative called the density function. Just like
one sums the probability over the outcomes in an event for the countable case,
the density function recovers the probability by taking the integral over the
outcomes in the event.

Definition 1.30. A random variable X is continuous if there exists a non-negative function $f_X(x)$, called the density function of X, such that
\[ P(X \leq x) = F_X(x) = \int_{-\infty}^{x} f_X(u)\, du. \]

Figure 1.6: The density function and distribution function of continuous random
variables

For any interval [a, b], we find that the probability of a < X ≤ b is given by
\[ P(a < X \leq b) = F(b) - F(a) = \int_{a}^{b} f_X(u)\, du. \]
Notice that in particular, we will have
\[ P(X = x) = \int_{x}^{x} f_X(u)\, du = 0. \]

Instead of working with the probability function, we often work with the
density function.
Definition 1.31. A function f is a density function if the following conditions hold:

• f is non-negative: $f_X(x) \geq 0$

• $\int_{-\infty}^{\infty} f_X(u)\, du = 1$.

A random variable is fully characterized by its density function.

In the next section, we consider some of the most important families of


continuous random variables that we will need in the later chapters.

1.2.4 Examples of continuous random variables

Definition 1.32. A random variable X is uniformly distributed on the interval (a, b) if it has as density function
\[ f_X(x) = \begin{cases} \frac{1}{b-a} & a < x < b \\ 0 & \text{everywhere else} \end{cases}. \]

One can show that the distribution function of a uniformly distributed random variable is given by
\[ F_X(x) = \frac{x-a}{b-a}, \qquad a < x < b. \]
Notation 1.33. We write X ∼ U (a, b).

Figure 1.7: Distribution function and density function of a U [0, 1]-distributed


random variable

Example 1.10. Suppose that exactly every 12 minutes, a bus passes your bus stop. Without checking the schedule, you wait at the bus stop for the next bus to arrive. Then the waiting time T can be modeled as a uniformly distributed random variable T ∼ U(0, 12).

Definition 1.34. A random variable X is exponentially distributed with (hazard) rate λ if it has as density function
\[ f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \geq 0 \\ 0 & \text{everywhere else} \end{cases}. \]

Notation 1.35. We write X ∼ exp(λ).

Exponentially distributed random variables are used when modeling waiting times between events, default times of bonds, and so on.

Figure 1.8: Distribution function and density function of an exp(1)-distributed


random variable

Property 1.36. The exponential distribution has the Lack Of Memory property, i.e

P (T > t + s | T > t) = P (T > s) .

Proof. This is straightforward and left as an exercise for the reader.

In the context of the waiting time between events, this means that if you have
already waited for t units, the probability that you have to wait an additional s
time units is the same as if you had not been waiting at all. For exponentially
distributed times, it hence does not make sense to expect something to happen
just because it hasn’t happened for a long period.
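The lack-of-memory property is easy to verify by simulation. The sketch below is our own illustration; note that NumPy parameterizes the exponential distribution by the scale $1/\lambda$ rather than by the rate.

\begin{verbatim}
# Our own check of the lack-of-memory property:
# P(T > t + s | T > t) should match P(T > s) = exp(-lambda * s).
import numpy as np

lam, t, s = 0.5, 2.0, 1.0
rng = np.random.default_rng(2)
T = rng.exponential(scale=1 / lam, size=500_000)   # NumPy uses scale = 1/lambda

conditional = np.mean(T[T > t] > t + s)            # empirical P(T > t+s | T > t)
print(conditional, np.mean(T > s), np.exp(-lam * s))
\end{verbatim}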

Definition 1.37. Let X be a random variable with density function
\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty. \]
Then we say that X is normally distributed with mean µ and standard deviation σ.

Notation 1.38. We write X ∼ N (µ, σ 2 ).

Definition 1.39. A random variable X is standard normally distributed if it is


normally distributed with µ = 0 and σ = 1. The corresponding distribution function
is often denoted
FX (x) = Φ(x).

Figure 1.9: The distribution function and density function for a standard nor-
mally distributed random variable

A distribution closely related to the standard normal distribution is the


χ2 −distribution.

Definition 1.40. Let $X_1, X_2, \ldots, X_n$ be independent (4) standard normally distributed random variables. Consider then the random variable
\[ Y = X_1^2 + X_2^2 + \ldots + X_n^2. \]
The distribution of Y is denoted by $\chi^2(n)$. In this context, we often call n the degrees of freedom.

4 Just like events, random variables can be independent. We will see what this means in a later section.

Figure 1.10: The distribution function and density function for a χ2 (3)-
distributed random variable

Using the definition, the following property is quite easy to show.

Property 1.41. Let X ∼ χ2 (n) and Y ∼ χ2 (m) be independent random variables.


Then X + Y ∼ χ2 (n + m).

Definition 1.42. A random variable X is said to be Gamma distributed with shape N > 0 and scale λ > 0 if its density function is given by
\[ f_X(x) = \frac{\lambda^N x^{N-1}}{\Gamma(N)}\, e^{-\lambda x}. \]
Here, Γ is the Gamma function.

When $N \in \mathbb{N}$, the density function takes the form
\[ f_X(x) = \frac{\lambda^N x^{N-1}}{(N-1)!}\, e^{-\lambda x}. \]

Notation 1.43. We write X ∼ Gamma(N, λ).

Remark. Some authors define the scale to be $\frac{1}{\lambda}$. Especially when using programming packages, it is important to always check the convention that is used.

Figure 1.11: The distribution function and density function for a Gamma(2, 1)-
distributed random variable

Property 1.44. The Gamma distribution satisfies:

1. If N = 1, we recover the exponential distribution:
\[ Gamma(1, \lambda) = \exp(\lambda). \]

2. If $N = \frac{k}{2}$ and $\lambda = \frac{1}{2}$, we recover the $\chi^2(k)$ distribution:
\[ Gamma\left(\frac{k}{2}, \frac{1}{2}\right) \sim \chi^2(k). \]

Proof. Exercise. You can use the fact that the density of the $\chi^2(k)$ distribution is given by
\[ f_{\chi^2(k)}(x) = \frac{1}{2^{k/2}\, \Gamma\!\left(\frac{k}{2}\right)}\, x^{\frac{k}{2}-1} e^{-\frac{x}{2}}. \]

Definition 1.45. A random variable is Pareto distributed with scale k > 0 and shape α if it has the following density function:
\[ f_X(x) = \begin{cases} \frac{\alpha k^{\alpha}}{x^{\alpha+1}} & \text{if } x \geq k \\ 0 & \text{if } x < k \end{cases}. \]

Notation 1.46. We write X ∼ Pareto(k, α).



The Pareto distribution is often used to model quantities with very fat right
tails.

Figure 1.12: The distribution function and density function for a P areto(1, 2)-
distributed random variable

Property 1.47. Let X be an exponentially distributed random variable with rate λ. Then $k\, e^X \sim Pareto(k, \lambda)$.

1.2.5 Multivariate Random Variables

In practice, we are often interested in more than one feature of a given obser-
vation. For example, when studying how wealth is distributed within a given
population we might want to record both the age and the salary of the members
of the population. In order to record this information, we can use a stochastic
vector.

Definition 1.48. Let $X_1, X_2, \ldots, X_n$ be random variables defined on the same probability space (Ω, F, P). Define the mapping
\[ X : \Omega \to \mathbb{R}^n : \omega \mapsto X(\omega) := (X_1(\omega), X_2(\omega), \ldots, X_n(\omega)). \]
Then X is called a stochastic vector of size n.

Notation 1.49. Instead of writing X = (x), we often write out the vectors in full:

(X1 , X2 , ..., Xn ) = (x1 , x2 , ..., xn )

or
(X1 = x1 , X2 = x2 , ..., Xn = xn ).

Here, we write xi = Xi (ω). In most cases, symbols written in capital denote the
random variable itself whilst those written in lowercase denote their realization,
x ∈ R.

Just like in the univariate case, a stochastic vector gives rise to a probability
function.

Definition 1.50. Let $X_1, X_2, \ldots, X_N$ be discrete random variables. Then the joint (probability) density function is given by
\[ f(x_1, x_2, \ldots, x_N) = P(X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N). \]
For any $X_i$, $i = 1, \ldots, N$, we can recover the marginal density from the joint density via
\[ f_{X_i}(x_i) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_N} P(X_1 = x_1, X_2 = x_2, \ldots, X_i = x_i, \ldots, X_N = x_N). \]

In the continuous case, we have the following definition.

Definition 1.51. The joint density function of a set of continuous random variables $X_1, X_2, \ldots, X_N$ is such that
\[ P((X_1, \ldots, X_N) \in A) = \int_A f(u_1, \ldots, u_N)\, du_1 \cdots du_N \]
for any set A in the σ-algebra F.

Just like in the univariate case, we can also define the distribution function.

Definition 1.52. The joint distribution function of a set of continuous random variables $X_1, \ldots, X_N$ is defined by
\[ F(x_1, \ldots, x_N) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_N} f(u_1, \ldots, u_N)\, du_1 \cdots du_N. \]

As mentioned before, random variables can be independent. We now make


this concrete.

Definition 1.53. For any two random variables X, Y, we say that they are independent if
\[ P(X \in A, Y \in B) = P(X \in A)\, P(Y \in B) \]
for any two sets A, B in the σ-algebra F.

In the discrete case, this boils down to checking
\[ P(X = x, Y = y) = P(X = x)\, P(Y = y), \qquad \forall x, y, \]
and in the continuous case, this boils down to checking
\[ f_{X,Y}(x, y) = f_X(x)\, f_Y(y), \qquad \forall x, y. \]

1.2.6 Transformations of Random Variables

In practice, one often considers transformations of random variables. These


yield new random variables. In this section, we will consider how this trans-
formation impacts the underlying probability distribution of the new random
variable.

We start by considering the sum of independent random variables.

Property 1.54. Let X, Y be independent random variables. Define Z = X + Y; then
\[ P(Z = z) = \sum_x P(X = x)\, P(Y = z - x), \qquad \text{if X, Y are discrete,} \]
\[ f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx, \qquad \text{if X, Y are continuous.} \]
We thus recover the convolution product of the two densities.

Example 1.11. If $X \sim Poisson(\lambda_X)$ and $Y \sim Poisson(\lambda_Y)$ are independent, then
\[ X + Y \sim Poisson(\lambda_X + \lambda_Y). \]
Indeed, by Property 1.54, we have
\begin{align*}
P(X + Y = n) &= \sum_{m=0}^{n} P(X = m)\, P(Y = n - m) \\
&= \sum_{m=0}^{n} e^{-\lambda_X} \frac{\lambda_X^m}{m!}\, e^{-\lambda_Y} \frac{\lambda_Y^{n-m}}{(n-m)!} \\
&= e^{-(\lambda_X + \lambda_Y)} \frac{1}{n!} \sum_{m=0}^{n} \binom{n}{m} \lambda_X^m \lambda_Y^{n-m}.
\end{align*}
Using the binomial theorem, we can rewrite this as
\[ P(X + Y = n) = e^{-(\lambda_X + \lambda_Y)} \frac{(\lambda_X + \lambda_Y)^n}{n!}. \]
This shows the desired result.
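The following sketch (our own check, with arbitrary example rates) compares the empirical distribution of a sum of two independent Poisson samples with the Poisson($\lambda_X + \lambda_Y$) probabilities:

\begin{verbatim}
# Our own check of Example 1.11 with arbitrary example rates.
import numpy as np
from math import exp, factorial

lam_x, lam_y, n_sim = 1.5, 2.5, 300_000
rng = np.random.default_rng(3)
Z = rng.poisson(lam_x, n_sim) + rng.poisson(lam_y, n_sim)

for n in range(6):
    exact = exp(-(lam_x + lam_y)) * (lam_x + lam_y) ** n / factorial(n)
    print(n, round(np.mean(Z == n), 4), round(exact, 4))
\end{verbatim}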

Example 1.12. Suppose T1 , T2 , ..., TN are independent and Exponential(λ)


distributed. Then,
T1 + T2 + ... + TN ∼ Gamma(N , λ).
This can be shown using induction and is left as an exercise for the reader.

Frequently, we will scale, square, or take logarithms of the random vari-


ables. The following result tells us how the density changes when applying
these transformations.
Property 1.55. Let X be a continuous random variable and define Y = g(X), where g is a continuous function defined on the range of X. Assume that g is strictly increasing. Then the density of Y in terms of $f_X$ is given by
\[ f_Y(y) = f_X(g^{-1}(y))\, \frac{d}{dy} g^{-1}(y). \]

Proof. Since g is strictly increasing and continuous, the inverse is defined on the image of g. Furthermore, we have that the events $\{X \leq g^{-1}(y)\}$ and $\{g(X) \leq y\}$ are the same. Hence
\begin{align*}
F_Y(y) &= P(Y \leq y) \\
&= P(g(X) \leq y) \\
&= P\left(X \leq g^{-1}(y)\right) \\
&= F_X(g^{-1}(y)).
\end{align*}
Differentiating the above equality and using the chain rule, we find
\[ f_Y(y) = f_X(g^{-1}(y))\, \frac{d}{dy} g^{-1}(y). \]

One can show a similar result for strictly decreasing functions.

Example 1.13. We can also consider more general functions g. Consider for example the function $g(z) = z^2$. This is not strictly increasing or strictly decreasing. In fact, it isn't even injective. However, notice that if $Y = X^2$ then
\begin{align*}
F_Y(y) &= P(Y \leq y) \\
&= P\left(-\sqrt{y} \leq X \leq \sqrt{y}\right) \\
&= P\left(X \leq \sqrt{y}\right) - P\left(X \leq -\sqrt{y}\right) \\
&= F_X(\sqrt{y}) - F_X(-\sqrt{y}).
\end{align*}
Differentiating the above result, we find using the chain rule
\[ f_Y(y) = \left(f_X(\sqrt{y}) + f_X(-\sqrt{y})\right) \frac{1}{2\sqrt{y}}. \]

Example 1.14. Suppose X is a standard normally distributed random variable, X ∼ N(0, 1). We have already seen that the square $Y = X^2$ is then $\chi^2(1)$-distributed. Notice that by Example 1.13, we have
\[ f_Y(y) = \left(f_X(\sqrt{y}) + f_X(-\sqrt{y})\right) \frac{1}{2\sqrt{y}}. \]
One can easily show that this gives
\[ f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y}{2}}\, \frac{1}{\sqrt{y}}. \]

Example 1.15. Suppose X is normally distributed, $X \sim N(\mu, \sigma^2)$. Let $Y = e^X$. It is easy to show (do this!) that
\[ f_Y(y) = \frac{1}{\sqrt{2\pi}\, y\, \sigma}\, e^{-\frac{(\log(y) - \mu)^2}{2\sigma^2}}. \]
We call Y log-normally distributed, $Y \sim \ell N(\mu, \sigma^2)$.
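As a sanity check of this density (our own sketch, with arbitrary example parameters), one can transform normal samples with the exponential function and compare an empirical bin probability with the density integrated over the same bin:

\begin{verbatim}
# Our own check of Example 1.15: exponentiate normal samples and compare an
# empirical bin probability with the log-normal density integrated over the bin.
import numpy as np

mu, sigma = 0.2, 0.4
rng = np.random.default_rng(4)
Y = np.exp(rng.normal(mu, sigma, size=500_000))

def logn_pdf(y):
    return np.exp(-(np.log(y) - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * y * sigma)

a, b, m = 1.0, 1.2, 1000
grid = np.linspace(a, b, m, endpoint=False)        # left Riemann sum over [a, b)
print(np.mean((Y > a) & (Y <= b)), np.sum(logn_pdf(grid)) * (b - a) / m)
\end{verbatim}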

1.3 Expectations, moments and Moment Generating Functions

Associated with a random variable are its moments. These values contain
some interesting information regarding the behavior of the variable. Probably
the most well-known one is the expected value.

Definition 1.56. The expected value of a discrete random variable X with distribution $f_X(k)$ is given by
\[ E[X] = \sum_k k\, f_X(k), \]
provided that $\sum_k |k|\, f_X(k) < \infty$. Similarly, the expected value of a continuous random variable X with density function $f_X(x)$ is given by
\[ E[X] = \int_{\mathbb{R}} x\, f_X(x)\, dx, \]
provided that $\int_{\mathbb{R}} |x|\, f_X(x)\, dx < \infty$.

There are generalizations of the expected value, called moments.

Definition 1.57. The k-th moment of a random variable X is defined as
\[ \mu_k = E\left[X^k\right]. \]
The k-th central moment is defined as
\[ \mu_k^c = E\left[(X - \mu_1)^k\right] = E\left[(X - E[X])^k\right]. \]

Notation 1.58. The second central moment $\mu_2^c$ is often called the variance and will be denoted as Var(X).

The variance has the following useful property.

Property 1.59. The variance Var(X) can be written as
\[ \mathrm{Var}(X) = E\left[X^2\right] - E[X]^2. \]

Proof. This is an exercise left for the reader.

Closely related to the variance is its multivariate cousin, the covariance.


Definition 1.60. The covariance of two random variables X and Y is given by
cov(X, Y ) = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ] .

Using the covariance function, we can state the following result.

Property 1.61. The expected value is a linear function. Thus, for any set of random variables $X_1, \ldots, X_n$ and constants $b, a_1, \ldots, a_n \in \mathbb{R}$, we have
\[ E\left[\sum_i a_i X_i + b\right] = \sum_i a_i\, E[X_i] + b. \]
The variance is not linear but satisfies the following identity:
\[ \mathrm{Var}\left(\sum_i a_i X_i + b\right) = \sum_i a_i^2\, \mathrm{Var}(X_i) + 2 \sum_{i < j} a_i a_j\, \mathrm{cov}(X_i, X_j). \]

Proof. Left as an exercise for the reader.

We also have a result relating independence with expected values.


Proposition 1.62. Suppose X, Y are independent random variables. Then $\mathrm{cov}(X, Y) = 0$, or equivalently
\[ E[XY] = E[X]\, E[Y]. \]

Proof. For continuous random variables, we have
\[ E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y\, f_{X,Y}(x, y)\, dx\, dy. \]
Since X and Y are independent, this becomes
\[ E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y\, f_X(x)\, f_Y(y)\, dx\, dy, \]
which can be rewritten as
\begin{align*}
E[XY] &= \int_{-\infty}^{\infty} x f_X(x) \int_{-\infty}^{\infty} y f_Y(y)\, dy\, dx \\
&= \int_{-\infty}^{\infty} x f_X(x)\, dx\; E[Y] \\
&= E[X]\, E[Y].
\end{align*}

Intuitively, the covariance measures the linear relationship between the two
variables. If the covariance is positive, large values of X tend to correspond
with large values of Y . A negative covariance shows the opposite effect: large
values of X correspond to small values of Y .

The magnitude of the covariance cannot easily be interpreted because it


strongly depends on the variance of the random variables. Hence, one often
normalizes the covariance.
Definition 1.63. The (Pearson) correlation coefficient between two random variables X and Y is given by
\[ \rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\, \mathrm{Var}(Y)}}. \]
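A small numerical illustration (our own sketch): for $Y = 2X + \varepsilon$ with X and ε independent standard normal variables, the covariance is 2 and the correlation is $2/\sqrt{5} \approx 0.894$, which the sample versions of the formulas above reproduce.

\begin{verbatim}
# Our own illustration of definitions 1.60 and 1.63 on simulated data.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=100_000)
Y = 2.0 * X + rng.normal(size=100_000)           # linearly related to X

cov = np.mean(X * Y) - np.mean(X) * np.mean(Y)   # E[XY] - E[X]E[Y]
rho = cov / np.sqrt(X.var() * Y.var())
print(cov, rho)                                   # about 2 and 2/sqrt(5)
\end{verbatim}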

We now consider the moments of distributions we have covered.

Example 1.16. Assume $X \sim Poisson(\lambda)$. Then the mean and the variance are both λ.

Indeed, notice that
\begin{align*}
E[X] &= \sum_{k=0}^{\infty} k\, e^{-\lambda} \frac{\lambda^k}{k!} \\
&= \sum_{k=1}^{\infty} e^{-\lambda} \frac{\lambda^k}{(k-1)!} \\
&= \lambda \sum_{k=1}^{\infty} e^{-\lambda} \frac{\lambda^{k-1}}{(k-1)!} \\
&= \lambda \sum_{r=0}^{\infty} e^{-\lambda} \frac{\lambda^r}{r!} = \lambda.
\end{align*}
The last equality follows from the Taylor expansion of $e^x$:
\[ e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}. \]
Furthermore, we find
\begin{align*}
E\left[X^2\right] &= \sum_{j=0}^{\infty} j^2\, e^{-\lambda} \frac{\lambda^j}{j!} = \lambda \sum_{j=1}^{\infty} j\, e^{-\lambda} \frac{\lambda^{j-1}}{(j-1)!} \\
&= \lambda \sum_{j=0}^{\infty} (j+1)\, e^{-\lambda} \frac{\lambda^j}{j!} = \lambda \Big( 1 + \underbrace{\sum_{j=0}^{\infty} j\, e^{-\lambda} \frac{\lambda^j}{j!}}_{E[X]} \Big) = \lambda(1 + \lambda).
\end{align*}
Hence, using Property 1.59, we find
\[ \mathrm{Var}(X) = E\left[X^2\right] - E[X]^2 = \lambda + \lambda^2 - \lambda^2 = \lambda. \]

Example 1.17. Assume X ∼ B(n, p). Then, the mean is np and the variance is
npq = np(1 − p).

For this, notice that a Binomial random variable can be written as the sum
of n independent Bernoulli random variables. Using this, try to give a proof
using property 1.61.

Example 1.18. Suppose X, Y are independent normally distributed variables with
\[ X \sim N(\mu_X, \sigma_X^2), \qquad Y \sim N(\mu_Y, \sigma_Y^2). \]
Then X + Y is normally distributed with mean $\mu_X + \mu_Y$ and variance $\sigma_X^2 + \sigma_Y^2$.

Example 1.19. Suppose $Y \sim \ell N(\mu, \sigma^2)$ as in Example 1.15. Show that
\[ E[Y] = e^{\mu + \frac{1}{2}\sigma^2} \]
and
\[ \mathrm{Var}(Y) = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}. \]

An important concept is that of the indicator function. This is a special function from the sample space to $\{0, 1\} \subset \mathbb{R}$, and hence a random variable. It is defined as follows.

Definition 1.64. Let A ∈ F be any event in the σ-algebra. The indicator function of A is defined as
\[ I(A) = \begin{cases} 1 & \text{if } A \text{ is true} \\ 0 & \text{everywhere else} \end{cases}. \]
In other words, we have
\[ I(A) : \Omega \to \{0, 1\} \subset \mathbb{R} : \omega \mapsto I(A)(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{everywhere else} \end{cases}. \]

The reason why this is interesting follows from the following observation.

Property 1.65. Let A be any event in the σ-algebra F and let I(A) be the associated indicator function, which is a random variable. Then the probability function associated with this random variable is
\[ P(I(A) = i) = \begin{cases} P(A) & \text{if } i = 1 \\ 1 - P(A) & \text{if } i = 0. \end{cases} \]
Thus in particular,
\[ E[I(A)] = P(A). \]

Proof. Since I(A) is either 0 or 1, it suffices to check the pre-image of those two values. Notice that
\[ I(A)^{-1}(1) = \{\omega \in \Omega \mid I(A)(\omega) = 1\} = A, \]
and
\[ I(A)^{-1}(0) = \{\omega \in \Omega \mid I(A)(\omega) = 0\} = \Omega \setminus A. \]
Therefore
\[ P(I(A) = 1) = P(\{\omega \in \Omega \mid I(A)(\omega) = 1\}) = P(A) \]
and
\[ P(I(A) = 0) = P(\{\omega \in \Omega \mid I(A)(\omega) = 0\}) = P(\Omega \setminus A). \]
Notice that indeed
\[ E[I(A)] = 1 \cdot P(A) + 0 \cdot (1 - P(A)) = P(A). \]

For the expected value of non-negative random variables, we have the following very handy result.

Property 1.66. Let X be a non-negative continuous random variable. Then
\[ E[X] = \int_0^{\infty} x f_X(x)\, dx = \int_0^{\infty} \bar{F}(x)\, dx, \]
where $\bar{F}(x) = 1 - F(x)$ is the tail function. For non-negative integer-valued variables, this becomes
\[ E[X] = \sum_{k=0}^{\infty} \bar{F}(k). \]

Proof. We first give the proof for discrete, integer-valued, non-negative random variables. We have that
\[ E[X] = \sum_{k=0}^{\infty} k\, p_k. \]
Notice now that
\[ \sum_{k=0}^{\infty} \bar{F}(k) = \sum_{k=0}^{\infty} P(X > k) = \sum_{k=0}^{\infty} \sum_{j=k+1}^{\infty} p_j = \sum_{j=0}^{\infty} \sum_{k=0}^{j-1} p_j = \sum_{j=0}^{\infty} j\, p_j = E[X]. \]
For continuous random variables, we proceed as follows. Let g be an integrable function defined on a random variable X with g(0) = 0. Then, using integration by parts,
\begin{align*}
E[g(X)] &= \int_0^{\infty} g(u) f(u)\, du = \int_0^{\infty} g(u)\, dF(u) \\
&= -\int_0^{\infty} g(u)\, d(1 - F(u)) \\
&= \underbrace{\left[-g(u)(1 - F(u))\right]_0^{\infty}}_{=0} - \int_0^{\infty} (1 - F(u))\, d(-g(u)) \\
&= -\int_0^{\infty} (1 - F(u))\, d(-g(u)) = \int_0^{\infty} \bar{F}(u)\, g'(u)\, du.
\end{align*}
Consider now in particular g(X) = X. Then, this reduces to
\[ E[X] = \int_0^{\infty} \bar{F}(u)\, du, \]
showing the desired result.

To demystify this equality, we give the following visualization for the discrete case.

Suppose X takes on the values $0 < x_1 < \ldots < x_5$. Then, notice that for any value $x < x_1$, we have $P(X > x) = 1$. For $x_1 \leq x < x_2$, we have $P(X > x) = 1 - P(X = x_1)$. For values $x_2 \leq x < x_3$, we have $P(X > x) = 1 - P(X = x_1) - P(X = x_2)$, and so forth.

We can subdivide the area under the curve of $\bar{F}$ into rectangles as in figure 1.13. Each rectangle has width $x_i$ and height $F(x_i) - F(x_{i-1}) = P(X = x_i)$.

Thus, the area under the curve is given by the sum of the rectangles:
\[ \int_0^{\infty} \bar{F}(s)\, ds = \sum_{i=1}^{5} x_i\, P(X = x_i) = E[X]. \]

Figure 1.13: Visualization of the tail function formula
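The discrete version of Property 1.66 is also easy to check numerically; the following sketch (our own illustration, using a Poisson variable as an arbitrary example) compares the sample mean with the sum of empirical tail probabilities:

\begin{verbatim}
# Our own check of property 1.66 for a non-negative integer-valued variable:
# the sample mean should match the sum of the empirical tail probabilities.
import numpy as np

rng = np.random.default_rng(6)
X = rng.poisson(3.0, size=400_000)               # arbitrary example variable

tail_sum = sum(np.mean(X > k) for k in range(X.max()))
print(X.mean(), tail_sum)                        # both approximately 3.0
\end{verbatim}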

Closely related to the moments of a random variable is the so-called moment-generating function (MGF).

Definition 1.67. The moment-generating function (MGF) of a random variable X is defined as
\[ \varphi_X(t) = E\left[e^{tX}\right], \]
for all $t \in \mathbb{R}$ for which the expectation is defined.

Moments up to any order can be found using differentiation, as shown by the next result.

Property 1.68. Let X be a random variable. Suppose that $\varphi_X(t)$ exists for some t. Then
\[ \mu_k = \frac{d^k}{dt^k} \varphi_X(t) \Big|_{t=0}. \]

Proof. Recall that the exponential has a Taylor expansion
\[ e^{tX} = \sum_{k=0}^{\infty} \frac{t^k X^k}{k!}. \]
Hence, the j-th derivative is given by
\[ \frac{d^j}{dt^j} \varphi_X(t) = E\left[\sum_{k=j}^{\infty} \frac{t^{k-j} X^k}{(k-j)!}\right]. \]
The result then easily follows by setting t = 0.

Another useful property of the moment-generating function is that it behaves nicely under sums of independent random variables.

Property 1.69. If X and Y are independent, then
\[ \varphi_{X+Y}(t) = \varphi_X(t)\, \varphi_Y(t). \]

Proof. Since X, Y are independent, so are $e^{tX}$ and $e^{tY}$. By Proposition 1.62, we find
\[ \varphi_{X+Y}(t) = E\left[e^{t(X+Y)}\right] = E\left[e^{tX} e^{tY}\right] = E\left[e^{tX}\right] E\left[e^{tY}\right] = \varphi_X(t)\, \varphi_Y(t), \]
proving the desired result.

For non-negative integer-valued random variables, the generating function


is sometimes preferred over the moment-generating function. The generating
function is defined as follows.

Definition 1.70. The generating function (GF) of a non-negative integer-valued X is given by
\[ \gamma_X(s) = E\left[s^X\right] = \sum_{k=0}^{\infty} s^k\, P(X = k). \]

Notice that the generating function fully describes the probability distribution of the random variable, since for any $k \in \mathbb{N}$,
\[ P(X = k) = \frac{1}{k!} \frac{d^k}{ds^k} \gamma_X(s) \Big|_{s=0}. \]

Hence, if two random variables have the same generating function, they have the same distribution.

Example 1.20. The moment-generating function of a normally distributed random variable $X \sim N(\mu, \sigma^2)$ is given by
\[ \varphi_X(t) = e^{t\mu + \frac{1}{2}\sigma^2 t^2}. \]
Indeed, notice that
\begin{align*}
\varphi_X(t) &= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{tx}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \\
&= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{\frac{-x^2 + 2x\mu - \mu^2 + 2\sigma^2 t x}{2\sigma^2}}\, dx.
\end{align*}
Completing the square gives us
\[ \varphi_X(t) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-\frac{(x - (\mu + \sigma^2 t))^2}{2\sigma^2}}\, e^{\frac{\sigma^4 t^2 + 2\mu\sigma^2 t}{2\sigma^2}}\, dx, \]
from which it evidently follows that
\[ \varphi_X(t) = e^{t\mu + \frac{1}{2}\sigma^2 t^2}. \]
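Property 1.68 can be illustrated by differentiating this MGF symbolically. The sketch below is our own illustration and assumes the SymPy package is available:

\begin{verbatim}
# Our own illustration of property 1.68, assuming SymPy is available:
# differentiate the normal MGF and evaluate at t = 0.
import sympy as sp

t, mu, sigma = sp.symbols('t mu sigma', positive=True)
M = sp.exp(t * mu + sp.Rational(1, 2) * sigma**2 * t**2)   # MGF of N(mu, sigma^2)

m1 = sp.diff(M, t, 1).subs(t, 0)                  # first moment: mu
m2 = sp.simplify(sp.diff(M, t, 2).subs(t, 0))     # second moment: mu^2 + sigma^2
print(m1, m2, sp.simplify(m2 - m1**2))            # variance: sigma^2
\end{verbatim}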

Example 1.21. Suppose X is a Poisson random variable, $X \sim Poisson(\lambda)$. Then the generating function is
\[ \gamma_X(s) = \sum_{k=0}^{\infty} s^k\, P(X = k) = e^{\lambda(s-1)}. \]
Indeed, notice that
\begin{align*}
\gamma_X(s) &= \sum_{k=0}^{\infty} s^k \frac{\lambda^k e^{-\lambda}}{k!} \\
&= \sum_{k=0}^{\infty} \frac{(s\lambda)^k e^{-\lambda}}{k!} \\
&= e^{-\lambda} \sum_{k=0}^{\infty} \frac{(s\lambda)^k}{k!} = e^{-\lambda} e^{s\lambda} = e^{\lambda(s-1)}.
\end{align*}

1.4 Conditioning

Just like events, we can talk about conditionals of random variables.


Definition 1.71. If X and Y are jointly discrete, then the conditional mass function of X given y is
\[ f_{X|Y}(x \mid y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \]
for all y such that $f_Y(y) \neq 0$.

The conditional expectation of X given Y = y is given by
\[ E[X \mid Y = y] = \sum_x x\, f_{X|Y}(x \mid y). \]

For jointly continuous random variables X, Y the conditional density of X given y is defined as
\[ f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \qquad (\forall y : f_Y(y) > 0). \]

The conditional expectation of X given that Y = y is
\[ E[X \mid Y = y] = \int_{\mathbb{R}} x\, f_{X|Y}(x \mid y)\, dx. \]

We now mention some handy results regarding conditional expectations.

Property 1.72. Given two jointly discrete random variables X, Y, we have
\[ f_X(x) = \sum_y f_{X|Y}(x \mid y)\, f_Y(y). \]
For any two jointly continuous random variables X, Y, we have
\[ f_X(x) = \int_{\mathbb{R}} f_{X|Y}(x \mid y)\, f_Y(y)\, dy. \]
This result is called the Partition Lemma.

The following result is often referred to as the Tower Rule.


Property 1.73. Let X and Y be random variables. Then
E [X] = E [E [X | Y ]] .

Proof. This proof is straightforward and is left as an exercise for the reader.

Using the tower rule, we can prove the following result.


Property 1.74. Let X, Y be two random variables. Then
Var (X) = E [Var (X | Y )] + Var (E [X | Y ]) .

Proof. Using the Tower Rule, we can write


h i
Var (X) = E X 2 − E [X]2
h h ii
= E E X 2 | Y − E [E [X | Y ]]2
h i
= E Var (X | Y ) + E [X | Y ]2 − E [E [X | Y ]]2
= E [Var (X | Y )] + Var (E [X | Y ]) .
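This decomposition can be checked by Monte Carlo. The sketch below is our own illustration, using a two-component normal mixture as an arbitrary example; the exact variance is 2.5 + 2.25 = 4.75.

\begin{verbatim}
# Our own Monte Carlo check of property 1.74, with Y a fair coin flip that
# selects between N(0,1) and N(3,4) for X.
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
Y = rng.integers(0, 2, size=n)
X = np.where(Y == 0, rng.normal(0.0, 1.0, n), rng.normal(3.0, 2.0, n))

p = np.array([np.mean(Y == y) for y in (0, 1)])
cond_mean = np.array([X[Y == y].mean() for y in (0, 1)])
cond_var = np.array([X[Y == y].var() for y in (0, 1)])

rhs = np.sum(p * cond_var) + (np.sum(p * cond_mean**2) - np.sum(p * cond_mean)**2)
print(X.var(), rhs)
\end{verbatim}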

Another result we will need later on is the so-called chain rule for conditional probabilities.

Property 1.75. Let $A_0, A_1, \ldots, A_k$ be events in the σ-algebra F. Then
\[ P\left(\bigcap_{i=0}^{k} A_i\right) = P\left(A_k \,\Big|\, \bigcap_{i=0}^{k-1} A_i\right) P\left(A_{k-1} \,\Big|\, \bigcap_{i=0}^{k-2} A_i\right) \cdots P(A_1 \mid A_0)\, P(A_0). \]

Proof. We will use induction to prove this. First, assume k = 1, i.e. we have two events $A_0, A_1$. Then it follows by definition that
\[ P(A_1 \cap A_0) = P(A_1 \mid A_0)\, P(A_0). \]
Suppose now that the property holds for all values $1, \ldots, k-1$. Then
\[ P\left(\bigcap_{i=0}^{k} A_i\right) = P\left(A_k \cap \bigcap_{i=0}^{k-1} A_i\right). \]
Since $\bigcap_{i=0}^{k-1} A_i$ is an event in F, we can use the base case to write
\[ P\left(\bigcap_{i=0}^{k} A_i\right) = P\left(A_k \,\Big|\, \bigcap_{i=0}^{k-1} A_i\right) P\left(\bigcap_{i=0}^{k-1} A_i\right). \]
By the inductive assumption, we can write
\[ P\left(\bigcap_{i=0}^{k-1} A_i\right) = P\left(A_{k-1} \,\Big|\, \bigcap_{i=0}^{k-2} A_i\right) \cdots P(A_1 \mid A_0)\, P(A_0). \]
Combining the above, we find
\[ P\left(\bigcap_{i=0}^{k} A_i\right) = P\left(A_k \,\Big|\, \bigcap_{i=0}^{k-1} A_i\right) P\left(A_{k-1} \,\Big|\, \bigcap_{i=0}^{k-2} A_i\right) \cdots P(A_1 \mid A_0)\, P(A_0), \]
which is the desired result.


Chapter 2

Poisson Processes

When nothing is sure, everything is possible.

Margaret Drabble

In this chapter, we introduce the notion of stochastic processes. In order to


get a feeling for this new concept, we look at some of the simplest examples of
stochastic processes. Afterwards, we look at Poisson processes. We will mainly
be interested in demystifying the underlying dynamics. The chapter serves as a
stepping stone for almost all later chapters, so understanding the main ideas is
crucial!

Figure 2.1: Baron Siméon Denis Poisson


2.1 Stochastic Processes

2.1.1 Introduction

In this chapter, we introduce the notion of stochastic processes. If we have a system with a feature (such as temperature) that changes deterministically over time, we often encode this information using a time-dependent function:
\[ T : \mathbb{R} \to \mathbb{R} : t \mapsto T(t) = \text{temperature at time } t. \]
When the feature of the system is no longer deterministic but instead stochastic, we obtain a stochastic process. Thus, at each time t we obtain a random variable $X_t$, and we encode the information with a time-dependent function
\[ X : \mathcal{T} \to \mathcal{X} : t \mapsto X_t. \]
Here, $\mathcal{X}$ represents the space of all random variables defined on the probability space and $\mathcal{T}$ is some index set. We will interchangeably denote these processes as $\{X_t \mid t \in \mathcal{T}\}$, $X_t$, or $X(t)$.

Figure 2.2: Day-to-day temperatures for Alpena, Michigan in December [C1]



These processes are not just some mathematical toy but arise naturally in
all kinds of domains.

Example 2.1. Some examples of stochastic processes are

• The price of a stock $S_t$

• The number of infections I(t) of a disease over time

• The number of bankruptcies $D_t$ in a given portfolio

• The number of deaths $M_t$ due to bush fires in each year

Just like we have continuous and discrete random variables, we can distinguish different types of stochastic processes.

Definition 2.1. Let $\{X_t \mid t \in \mathcal{T}\}$ be a stochastic process, then

• If $\mathcal{T}$ is countable, then the stochastic process is said to be discrete. We denote this by $\{X_n \mid n \geq 0\}$.

• If $\mathcal{T} = \mathbb{R}^+$, then the stochastic process is said to be continuous.

• If the state space of the process is discrete (e.g. $\mathbb{N}$), then the stochastic process is said to be a chain.

Perhaps the simplest stochastic processes are the IID (Independent Identically Distributed) processes. They are defined as follows.

Definition 2.2. Let $\{X_n \mid n \in \mathbb{N}\}$ be a discrete process. Then this stochastic process is called an IID process if the following hold:

• $\{X_n \mid n \in \mathbb{N}\}$ are mutually independent, i.e. $X_j$ and $X_k$ are independent random variables when $j \neq k$

• The $X_n$ all have the same distribution function for all n.

• X is not identically zero.



Hence, we can consider the process as a sequence of independent repeats of the same
experiment.

Using these processes, one can generate a new type of process called random walks.

Definition 2.3. Let $\{X_n \mid n \in \mathbb{N}\}$ be an IID process. Define the discrete process $\{S_n \mid n \in \mathbb{N}\}$ via
\[ S_{n+1} = S_n + X_{n+1}, \qquad S_0 = 0, \]
i.e. the cumulative sum with the elements $X_n$ as increments. We call such a discrete process a random walk generated by X.

Figure 2.3: Graphical representation of a random walk

Contrary to IID processes, the elements of a random walk are no longer independent. However, the dependence is still relatively weak:
\[ P(S_{n+1} \leq s \mid S_0 = 0, S_1 = s_1, \ldots, S_n = s_n) = P(S_{n+1} \leq s \mid S_n = s_n). \]
Since $S_{n+1} = S_n + X_{n+1}$, this becomes
\[ P(S_{n+1} \leq s \mid S_0 = 0, S_1 = s_1, \ldots, S_n = s_n) = P(X_{n+1} \leq s - s_n). \]
Notice that the value of $S_{n+1}$ only depends on the value of the previous observation $S_n$, and not on how it got there. We call this form of dependency Markov dependency.

A special family of stochastic processes is the counting process. As the


name suggests, these processes can be used to model the number (or count) of
events that happen in a given time interval.

Definition 2.4. A process Nt is said to be a counting process if it is non-negative,


integer-valued, and non-decreasing in t.

We will always assume that N0 = 0, meaning that we start counting from


0 onward. Notice that using this characterization, the number of events in the
time interval (s, t] is given by Nt − Ns , which is non-negative since the process
is non-decreasing.

An easy example of a counting process is the so-called binomial process.


These can be characterized as the random walk generated by an IID process
with as underlying distribution the Bernoulli distribution.

Definition 2.5. Let {Bn | n ∈ N} be an IID process with B ∼ Bernoulli(p). Define


the random walk {Xt | t ∈ N} generated by B, i.e

Xt = ∑_{i=1}^{t} Bi = number of successes in the first t trials.

Then we call Xt a Binomial process.

As an easy exercise, check that the Binomial process is indeed a counting


process.

Example 2.2. A basketball coach knows from experience that each team
member has the same probability of scoring a free throw, namely p. Each
minute, a player attempts a free throw. Each such attempt can be seen as a
Bernoulli experiment with success probability p.

As a function of time, the number of successful attempts is then modeled by


a Binomial process Xt .

Figure 2.4: Graphical representation of the counting process in example 2.2

The following property shows why this process is called a binomial process. It
also shows that the time between two jumps in the random walk is geometrically
distributed.

Property 2.6. Let Xt be a binomial process generated by the IID process Bt . Then

• Xt ∼ Bin(t, p), motivating the name binomial process.

• The time between jumps in X is geometrically distributed with success probability p.

Proof. The first property follows immediately from the fact that a binomial
random variable is a sum of IID Bernoulli random variables. Let us now show
the second point. For this, let k ≥ 0 and denote

h = min { n | Xn = k },

i.e. the time of the k-th arrival. We will show that the time until the (k + 1)-th
arrival is geometrically distributed with probability p. For this, we will once
again write Xh = ∑_{i=1}^{h} Bi with Bi the IID process generating X. In particular,
for any j ≥ 0, we have

Xh+j − Xh = ∑_{i=1}^{h+j} Bi − ∑_{i=1}^{h} Bi = ∑_{i=h+1}^{h+j} Bi .

The time between the k-th arrival and the (k + 1)-th arrival is then given by

Tk+1 = min { j | Xh+j − Xh = 1 } = min { j | ∑_{i=h+1}^{h+j} Bi = 1 }.

Since the Bi are independent Bernoulli(p) random variables, P (Tk+1 = j) = (1 − p)^{j−1} p
for j ≥ 1, so Tk+1 is indeed geometrically distributed with success probability p.

Notice that in the case of binomial processes, it is assumed that the time
between experiments is fixed: each time unit should consist of one and only
one experiment. As a consequence, the value of the binomial process can only
change at discrete times t = 1, 2, .... This is shown by the fact that the time
between jumps has a discrete distribution.
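As a quick numerical illustration of Property 2.6, one can simulate a binomial process and inspect the times between jumps. The sketch below (Python with numpy; names and parameters are illustrative) estimates the mean waiting time, which should be close to 1/p for a geometric distribution.

import numpy as np

rng = np.random.default_rng(seed=2)
p = 0.3
trials = rng.random(100_000) < p           # Bernoulli(p) experiments, one per time unit
X = np.cumsum(trials)                      # binomial process X_t = number of successes up to t
jump_times = np.flatnonzero(trials) + 1    # times at which X jumps by one
waiting_times = np.diff(jump_times)        # times between consecutive jumps

print(waiting_times.mean())                # should be close to 1/p ≈ 3.33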

In a lot of contexts where we use counting processes, we cannot ensure


that the experiments happen with fixed times between them. For example, we
would like to model the number of customers arriving at a store as a counting
process. However, customers can enter the store at any given time t, not just at
discrete times. In the next section, we will discuss a process that can deal with
continuous time intervals between events, namely the Poisson process.

2.2 Poisson processes

Many processes in everyday life that count the events up to a particular point
in time can be adequately modeled using the Poisson distribution we have seen
in chapter one. These special counting processes, which we will call Poisson
processes, possess many desirable properties as we will soon discover.

In the next sections, we will give three different, but ultimately equivalent
definitions of Poisson processes. Each definition gives a characterization of a
different flavor. Knowing and understanding these definitions can not only lead
to deeper understanding but can also greatly simplify calculations.

2.2.1 First Definition

For the first definition, we take a look at the case of random variables. There,
we have seen in proposition 1.29 that a Poisson random variable can be seen as
the limit of Bernoulli random variables. We will start by considering the analog
in the case of processes.

Consider a binomial process with time intervals ∆ and success probability
λ. The underlying IID process is a Bernoulli process. Just like for random
variables, we will consider a new process X^(k) generated by Bernoulli processes
Bk with success probability λ/k and change the time interval to ∆/k. Using these
scalings, the expected number of arrivals per unit time ∆ is then E [Ek ] = λ since
Ek ∼ Bin(k, λ/k).

We see that for any k,

P ( X∆^(k) = i ) = 1 − λ/k   for i = 0,
                 = λ/k       for i = 1,
                 = 0         for i > 1.

In the limit, one can show that this process converges to a process with the
following properties, which we will call a Poisson process.
Definition 2.7. The counting process {Nt | t ≥ 0} is a Poisson process with rate
λ > 0 if it satisfies the following requirements
2.2. POISSON PROCESSES 63

1. It has stationary increments, i.e

P (Nt+∆t − Nt = k) = P (Ns+∆t − Ns = k) ∀k ∈ N, ∀∆t > 0

2. It has independent increments:

Nt1 − Nt0 , Nt2 − Nt1 , ..., Ntk − Ntk−1

are independent for all 0 ≤ t0 ≤ t1 ≤ ... ≤ tk .

3. For small h, we have

   P (Nh = i) = 1 − λh + o(h)   for i = 0,
              = λh + o(h)       for i = 1,
              = o(h)            for i > 1.

The following result shows why these processes are called Poisson processes.
Property 2.8. For a Poisson process Nt with rate λ, the number of events observed
in any interval with length t satisfies

P (Nt = k) = P (Ns+t − Ns = k) = e^{−λt} (λt)^k / k! ,   k = 0, 1, ....
In other words, Ns+t − Ns is Poisson(λt) distributed.

Proof. We will show that Nt ∼ P oisson(λt). For this, we will write

fk (t) = P (Nt = k) , t ≥ 0.

The probability that no event happens in some time interval t + h is then given
by f0 (t + h). Using all three properties of the Poisson processes, we get

f0 (t + h) = P (Nt+h = 0) = P (Nt = 0, Nt+h − Nt = 0)


= P (Nt = 0) P (Nt+h − Nt = 0) = P (Nt = 0) P (Nh = 0)
= f0 (t)(1 − λh + o(h)).

Hence,

( f0 (t + h) − f0 (t) ) / h = −λ f0 (t) + (o(h)/h) f0 (t).

Taking the limit for h → 0, we obtain

df0 (t)/dt = −λ f0 (t).
For k ≥ 1, we will use the law of total probability (property 1.11). We obtain

fk (t + h) = P (Nt+h = k | Nt = k) P (Nt = k)
           + P (Nt+h = k | Nt = k − 1) P (Nt = k − 1)
           + ∑_{j=2}^{k} P (Nt+h = k | Nt = k − j) P (Nt = k − j) .

Rewriting this, one obtains

fk (t + h) = f0 (h) fk (t) + f1 (h) fk−1 (t) + ∑_{j=2}^{k} fj (h) fk−j (t).

However, by the third property of Poisson processes, for small h, this becomes

fk (t + h) = (1 − λh + o(h)) fk (t) + (λh + o(h)) fk−1 (t) + o(h).

Just like before, we write

( fk (t + h) − fk (t) ) / h = λ (fk−1 (t) − fk (t)) + o(h)/h .

Taking the limit, this becomes

dfk (t)/dt = λ (fk−1 (t) − fk (t)) ,   k ≥ 1.
We thus need to solve the differential equations for all fk .

For this, we use a clever trick. We first consider the generating function of
Nt :

γNt (s) = E [ s^{Nt} ] = ∑_{k=0}^{∞} s^k fk (t).
Since the generating function uniquely determines the distribution, it suffices to
show that the generating function coincides with the generating function for a

Poisson(λt) distributed random variable. We have already shown in example
1.21 that the latter is given by

γX (s) = e^{−λt(1−s)} .

Notice now that

∂γNt (s)/∂t = ∑_{k=0}^{∞} s^k dfk (t)/dt
            = −λ f0 (t) + ∑_{k=1}^{∞} s^k λ (fk−1 (t) − fk (t))
            = −λ ∑_{k=0}^{∞} s^k fk (t) + λ s ∑_{k=1}^{∞} s^{k−1} fk−1 (t)
            = −λ γNt (s) + λ s γNt (s)
            = −λ(1 − s) γNt (s).

It is easy to see that the solution of this differential equation is

γNt (s) = e−λt(1−s) ,

which is exactly the generating function of a Poisson(λt) distributed variable.

In fact, this result can be used to define Poisson processes in an alternative


way. This will be shown in the next section.

2.2.2 Second Definition

Using the previous result, we give a different definition of Poisson processes.


Then we will show that the new definition is equivalent to the previous defini-
tion.

Definition 2.9. Let {Nt | t ≥ 0} be a counting process. Then Nt is a Poisson process


with rate λ > 0 if the following conditions are satisfied.

1. It has independent increments


2. The number of events in any interval of length t is Poisson(λt) distributed.

Notice that the stationarity of the increments is a direct consequence of the


second property in the definition.

We will now show that the second definition implies the first. The opposite
direction has already been shown in property 2.8 in the previous section.

For this, consider the Taylor expansion of P (Nt = k). This can easily be
shown to be equal to

P (Nt = k) = (λt)^k / k! − (λt)^{k+1} / k! + ...

Then for h small, we get

P (Nh = k) = 1 − λh + o(h)   for k = 0,
           = λh + o(h)       for k = 1,
           = o(h)            for k > 1,

which is the same as in the first definition.

The upshot of this definition is of course that it is much more intuitive.

In the third and final characterization, we will focus on the time between the
events instead of the counting process directly. These times are called jump (or
arrival) times.

2.2.3 Third Definition

Suppose {Nt | t ∈ T } is any counting process. Instead of focusing on the incre-


ments, we can also focus on the time between two increments. We introduce
the following concept.

Definition 2.10. Let Nt be a counting process. We then define the k-th jump (or
arrival) time

tk = min_{t ≥ 0} { t | Nt ≥ k } .

Thus, the k-th arrival time denotes the time at which we observe the k-th event.

It is easy to see that the arrival times tk are random variables.

Closely related to arrival times are the waiting times.

Definition 2.11. Let Nt be a counting process. Then define the k-th waiting time

Ti = ti − ti−1 .

Just like the arrival times, the waiting times are random variables. They are
easily seen to be non-negative.

Some counting processes allow for multiple events to happen at the same
time with non-zero probability. We call these clustered processes.

Definition 2.12. Let Nt be a counting process. We say that a process is clustered if

P (Ti = 0) > 0 for some i > 0.

Equivalently, a process is clustered if

P (ti = ti−1 ) > 0 for some i > 0.

Some counting processes also allow an infinite amount of events to occur in


a finite time interval with non-zero probability. If this is the case, we say that
the process is explosive.

Definition 2.13. Suppose Nt is a counting process. We say that a counting process


is stable if ∀C > 0, we have

P (NC < ∞) = 1.

If a counting process is not stable, we call it explosive.



Notice that if we know the arrival times, we know all the information about
the counting process, and vice versa. Hence, we know that there must be a
characterization using only the arrival times. For this, we have the following
result.

Property 2.14. Let Nt be a Poisson process. Then the waiting times Ti are i.i.d
exponential(λ) distributed.

Proof. Setting s0 = 0, we find that

P (Ti > si − si−1 ) = P ( Nsi − Nsi−1 = 0 ) = e^{−λ(si −si−1 )} .

Hence,

P (Ti ≤ si − si−1 ) = 1 − P (Ti > si − si−1 ) = 1 − e^{−λ(si −si−1 )} .

From this, we conclude that the waiting times are exponential(λ) distributed. Their
independence follows from the independence of the increments of Nt .

We now show how we can define a Poisson process using the waiting times
Ti .

Definition 2.15. Let T1 , T2 , ... be i.i.d exponential(λ) random variables. Define
their partial sums

tn = ∑_{k=1}^{n} Tk ,   t0 = 0.

Then
Nt = max {n | tn ≤ t}
is a Poisson process with rate λ.

Recall that the sum of n independent exponential(λ) random variables is
Gamma(n, λ) distributed. Hence

ftn (t) = λ^n t^{n−1} e^{−λt} / (n − 1)! .

Figure 2.5: An example of a Poisson process and its associated times
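Definition 2.15 is also the standard recipe for simulating a Poisson process: draw i.i.d exponential(λ) waiting times, accumulate them into arrival times, and count. A minimal sketch in Python (numpy assumed; names and parameters are illustrative):

import numpy as np

rng = np.random.default_rng(seed=3)

def poisson_arrival_times(lam, horizon):
    # Draw exponential(lam) waiting times until the horizon is exceeded
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam)   # numpy parametrizes by the mean 1/lam
        if t > horizon:
            return np.array(times)
        times.append(t)

arrivals = poisson_arrival_times(lam=2.0, horizon=10.0)
N_10 = len(arrivals)                      # N_t at t = 10; Poisson(lam * t) distributed
print(N_10, arrivals[:5])

Repeating the experiment many times, the histogram of N_10 should match the Poisson(20) probabilities from Property 2.8.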

We first show that this indeed defines a Poisson process that corresponds to
the other definitions.

Proposition 2.16. Definition 3 and definition 2 are equivalent.

Proof. We have already shown that definition 2 implies definition 3. To show


the other direction, notice that

P (Nt = n) = P (tn ≤ t) − P (tn+1 ≤ t) = Ftn (t) − Ftn+1 (t).

Notice that

Ftn (t) = ∫_0^t ( λ^n x^{n−1} / (n − 1)! ) e^{−λx} dx.

We will give an alternative formula using repeated integration by parts.

Ftn (t) = ( λ^n / (n − 1)! ) ∫_0^t x^{n−1} e^{−λx} dx
        = ( λ^n / (n − 1)! ) [ −x^{n−1} e^{−λx} / λ ]_0^t + ( λ^n / (n − 1)! ) ( (n − 1)/λ ) ∫_0^t x^{n−2} e^{−λx} dx
        = −( (λt)^{n−1} / (n − 1)! ) e^{−λt} + ( λ^{n−1} / (n − 2)! ) ∫_0^t x^{n−2} e^{−λx} dx
        = ···
        = 1 − ∑_{i=0}^{n−1} ( (λt)^i / i! ) e^{−λt} .

Hence,

P (Nt = n) = Ftn (t) − Ftn+1 (t) = ( (λt)^n / n! ) e^{−λt} .

The independence of increments follows from the independence in definition


3.

Notice that from this result, we can show that Poisson processes behave quite
nicely.

Property 2.17. A Poisson process Nt is stable and is not clustered.

Proof. We first show that it is stable. For this, notice that for any C, NC is
Poisson(λC) distributed, so that

∑_{k=0}^{∞} P (NC = k) = ∑_{k=0}^{∞} e^{−λC} (λC)^k / k! = 1,

and hence P (NC < ∞) = 1.

To show that Poisson processes are not clustered, recall that T is exponentially
distributed, which is a continuous distribution. Hence, P (T = 0) = 0.

2.2.4 Conditionals of Poisson Processes

In this section, we consider the conditional distributions of Poisson processes.


To motivate this, consider the following example.

Example 2.3. Suppose travelers arrive at a bus stop according to a Poisson


process Nt with rate λ and that the bus leaves at time t. What is the total
cumulative waiting time of all the passengers that have arrived?

To answer this question, we would need to calculate

E [ ∑_{i=1}^{Nt} (t − ti ) ] .

One would be tempted to use the linearity on the sum in the expectation, but
notice that this cannot be done since Nt is a random variable. We will see that
one way to circumvent this issue is by using the Tower Rule, for which we will
need conditionals.

Suppose that the arrival of events is modeled by a Poisson process. The


following result tells us that if we have an event for which we only know that it
has taken place somewhere before t, all times in that interval are equally likely
to be the true event time.

Theorem 2.18. Suppose that Nt = 1 for some given t. Then conditionally on that
information, the time of arrival t1 is uniformly distributed over (0, t].

Proof. Using the definition of conditional probability we have for any s ∈ (0, t]

P (t1 < s | Nt = 1) = P (t1 < s, Nt = 1) / P (Nt = 1) .

This can be rewritten as

P (t1 < s | Nt = 1) = P (Ns = 1) P (Nt − Ns = 0) / P (Nt = 1)
                    = λs e^{−λs} e^{−λ(t−s)} / ( λt e^{−λt} )
                    = s/t .

Let us now revisit the example, and notice how this easy result makes a
complicated exercise almost trivial.

Example 2.4. With the same setting as before, notice that we can use the
linearity of the expected value if we condition on the count. Using the Tower
Rule,

E [ ∑_{i=1}^{Nt} (t − ti ) ] = E [ E [ ∑_{i=1}^{Nt} (t − ti ) | Nt ] ] = E [ Nt t − Nt (t/2) ] = λt^2 / 2 .

Here, we use the fact that E [ ∑_{i=1}^{n} ti | Nt = n ] = E [ ∑_{i=1}^{n} U[0,t] ] = n t/2, where U[0,t] is a
uniformly distributed random variable on (0, t].
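A quick Monte Carlo sanity check of this answer (a sketch in Python with numpy; the parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(seed=4)
lam, t, n_runs = 2.0, 5.0, 20_000
totals = []
for _ in range(n_runs):
    n = rng.poisson(lam * t)                # number of arrivals in (0, t]
    arrivals = rng.uniform(0.0, t, size=n)  # given N_t = n, arrivals are uniform on (0, t]
    totals.append(np.sum(t - arrivals))     # total waiting time of all passengers
print(np.mean(totals), lam * t**2 / 2)      # the two numbers should be close (here 25.0)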

2.2.5 Thinning and Superposition of Poisson Processes

In this section, we study how Poisson processes can be decomposed into other
Poisson processes. We start by considering a motivating example.

Example 2.5. An insurance company has modeled the car crashes of their
clients as a Poisson process. They found that the rate at which these crashes
happen is given by λ and that this rate is constant through time.

However, they would like to make a distinction between heavy car crashes
and lighter car crashes. Suppose that they see that the probability of a car crash
being a heavy car crash is around p1 , and thus the probability of the crash being
a small car crash is 1 − p1 = p2 . It would be logical for the insurance company
to split the original Poisson process into two different Poisson processes. This
is called thinning.

Property 2.19. Let N (t) be a Poisson process with rate λ. Suppose that the events
are either of type 1 or type 2. Suppose furthermore that the probability of an event
being type 1 is given by p1 . Then the process N1 (t) counting the type 1 events is also
a Poisson process with rate p1 λ, and a similar result holds for the process N2 (t).

Proof. Since events are either type 1 or 2, it is clear that N (t) = N1 (t) + N2 (t).
We start by calculating the joint probability distribution of N1 (t) and N2 (t).

P (N1 (t) = m, N2 (t) = n)
    = P (N1 (t) = m, N2 (t) = n | N (t) = m + n) P (N (t) = m + n)
    = ( (m + n)! / (m! n!) ) p1^m p2^n P (N (t) = m + n)
    = ( (m + n)! / (m! n!) ) p1^m p2^n e^{−λt} (λt)^{m+n} / (m + n)!
    = ( (p1 λt)^m / m! ) ( (p2 λt)^n / n! ) e^{−(p1 + p2 )λt}
    = e^{−p1 λt} (p1 λt)^m / m! · e^{−p2 λt} (p2 λt)^n / n! .

Thus, the marginal distribution of N1 (t) is given by

P (N1 (t) = m) = ∑_{n=0}^{∞} P (N1 (t) = m, N2 (t) = n)
              = e^{−p1 λt} ( (p1 λt)^m / m! ) ∑_{n=0}^{∞} e^{−p2 λt} (p2 λt)^n / n!
              = e^{−p1 λt} (p1 λt)^m / m! .

We find that N1 (t) and N2 (t) are independent and Poisson processes with rates
p1 λ and p2 λ respectively.
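Thinning is also simple to verify by simulation. The sketch below (Python with numpy; the parameters are illustrative) splits a simulated Poisson stream into heavy and light crashes and compares the empirical rates with the predicted rates p1 λ and p2 λ.

import numpy as np

rng = np.random.default_rng(seed=5)
lam, p1, t = 3.0, 0.2, 1000.0
n = rng.poisson(lam * t)                    # total number of crashes in (0, t]
heavy = rng.random(n) < p1                  # each crash is heavy with probability p1
print(heavy.sum() / t, (~heavy).sum() / t)  # close to p1 * lam = 0.6 and p2 * lam = 2.4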

In queuing theory, we often model the incoming customers using a Poisson


process with rate λ. Here, we also would like to make a distinction between
customers that have already been serviced, counted by N1 (t), and those that are yet to
be serviced, counted by N2 (t). Just like before, we would like to split the process into
two smaller processes. In this case, the split processes will no longer be homogeneous
(we will see more of this later). In what follows, we will assume that there is an
infinite number of servers so that customers do not have to wait for their service
to be started.

More generally, suppose events can change from type 2 to type 1. Once an event is of
type 1, it remains of type 1. Denote by H the random variable that represents the time
between the event's arrival and its conversion to type 1. In queuing,
this is called the service time. Denote by FH its cumulative distribution func-
tion. Then FH (x) represents the proportion of customers whose service is completed
within an elapsed time x, or more generally the probability of the conversion happening
before time x.

Denote now by p1 (t) the probability that an event is of type 1 at time t, given
that it arrived before time t. Using theorem 2.18, this is then given by

p1 (t) = P (Type 1 event at t | Arrival before t)
       = ∫_0^t P (Type 1 event at t | Arrival at s) P (Arrival at s | Arrival before t) ds
       = (1/t) ∫_0^t P (Type 1 event at t | Arrival at s) ds
       = (1/t) ∫_0^t FH (t − s) ds = (1/t) ∫_0^t FH (s) ds.

Similar to before, we find two processes N1 (t), N2 (t). Suppose that the
customers arrive according to a Poisson process with constant rate λ, then the
distributions of the processes N1 (t) and N2 (t) are given by

P (N1 (t) = m) = ( (λt p1 (t))^m / m! ) e^{−λt p1 (t)}

and

P (N2 (t) = m) = ( (λt (1 − p1 (t)))^m / m! ) e^{−λt(1−p1 (t))} .
Notice that the processes no longer have constant arrival rates. The expected
number of customers served at time t is given by
E [N1 (t)] = λt p1 (t) = λ ∫_0^t FH (s) ds,

and similarly

E [N2 (t)] = λ ∫_0^t (1 − FH (s)) ds.
Letting t → ∞, the right-hand side becomes the integral of the tail function
which we know is equal to the average. Hence, we find that in the long run, the
expected number of people waiting to be serviced is equal to

lim_{t→+∞} E [N2 (t)] = λ E [H] = E [H] / E [T ] = Average service time / Average inter-arrival time.

Indeed, recall that T ∼ exponential(λ) and the latter has mean 1/λ.

Just like we can split a Poisson process into several Poisson processes, we
can also combine Poisson processes. This is called a superposition of Poisson
processes. We need the following theorem.
Theorem 2.20. Let s < t and 0 ≤ m ≤ n, then the conditional distribution of Ns
given Nt = n is the binomial distribution B(n, s/t).

Proof. This proof is left as an exercise for the reader.


Definition 2.21. Let {N1 (t) | t ≥ 0} , {N2 (t) | t ≥ 0} , ..., {Nk (t) | t ≥ 0} be indepen-
dent Poisson processes with corresponding rates λ1 , ..., λk . Then the superposition

N (t) = N1 (t) + N2 (t) + ... + Nk (t)

is a Poisson process with rate λ = λ1 + ... + λk .

As an exercise, show that this process is indeed a homogeneous Poisson


process for k = 2. For this, use theorem 2.20.
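A numerical check of the superposition property for k = 2 is also instructive (a Python sketch with numpy; the parameters are arbitrary): merging two independent Poisson streams should give inter-arrival times that are exponential with rate λ1 + λ2.

import numpy as np

rng = np.random.default_rng(seed=6)
lam1, lam2, horizon = 1.0, 2.5, 5000.0
t1 = np.cumsum(rng.exponential(1 / lam1, size=int(3 * lam1 * horizon)))
t2 = np.cumsum(rng.exponential(1 / lam2, size=int(3 * lam2 * horizon)))
merged = np.sort(np.concatenate([t1[t1 < horizon], t2[t2 < horizon]]))
print(np.diff(merged).mean(), 1 / (lam1 + lam2))   # both close to 1/3.5 ≈ 0.286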
Chapter 3

Extensions

Time changes everything, except something within


us which is always surprised by change

Thomas Hardy

In this chapter, we consider a generalization of Poisson processes which al-


lows for a non-constant rate λ(t). We will describe their underlying probability
distributions as well, and give a way to go from a non-homogeneous Poisson
process to a homogeneous Poisson process and vice versa.

We will also consider compound Poisson processes and several mixture


models. In these models, our rate λ is not even deterministic.

3.1 Non-homogeneous Poisson Processes

In the previous chapter, we have briefly covered a Poisson process with a non-
constant arrival rate. These occur quite naturally in many situations.

Example 3.1. Some examples where a homogeneous Poisson process might


not be suitable is


• Counting the number of customers entering a supermarket. The rate is


much higher after working hours.

• The number of tropical storms and hurricanes. These tend to be more


frequent at higher temperatures and hence occur more often around sum-
mertime.

Figure 3.1: Most customers go to restaurants during breakfast, dinner, and lunch
times

We define these special stochastic processes.

Definition 3.1. Let {Nt | t ≥ 0} be a counting process, starting from N0 = 0. This


is a non-homogeneous Poisson process with (deterministic) rate function λ(t) if for
all t ≥ 0, h > 0, it satisfies the following requirements.

1. It has independent increments

2. One has

   P (Nt+h − Nt = k) = 1 − λ(t)h + o(h)   for k = 0,
                     = λ(t)h + o(h)       for k = 1,
                     = o(h)               for k ≥ 2.

Notice that the stationary increment property no longer holds, since the
distribution depends on λ(t).

In order to capture the dynamics of these processes, we would like to find


the distribution of P (Nt − Ns ). This is the content of the next theorem, which
shows that these increments follow a Poisson distribution with an integrated
rate function.
Theorem 3.2. Let Nt be a non-homogeneous Poisson process with rate function
λ(t). The number of events in any given interval (s, t] is Poisson distributed with
mean

m(s, t) = ∫_s^t λ(x) dx.

Thus, it follows that for any 0 < s < t and k ∈ N

P (Nt − Ns = k) = e^{−m(s,t)} m(s, t)^k / k! .

Proof. We adopt a similar strategy as for the homogeneous case. We again


denote
fk (t) = P (Nt − Ns = k) .
Then it is easy to see that

f0 (t + h) = P (Nt+h − Ns = 0)
           = P (Nt − Ns = 0) P (Nt+h − Nt = 0)    (independent increments)
           = f0 (t)(1 − λ(t)h + o(h)).

From this, it follows that


( f0 (t + h) − f0 (t) ) / h = −λ(t) f0 (t) + (o(h)/h) f0 (t)

which in the limit h → 0 gives


df0 (t)/dt = −λ(t) f0 (t).
In the next step, we will obtain the differential equations for fk with k ≥ 1. For
this, we once again use the law of total probability (property 1.11) to obtain
fk (t + h) = ∑_{i=0}^{k} P (Nt+h − Nt = i | Nt − Ns = k − i) P (Nt − Ns = k − i)
           = P (Nt+h − Nt = 0 | Nt − Ns = k) P (Nt − Ns = k)
           + P (Nt+h − Nt = 1 | Nt − Ns = k − 1) P (Nt − Ns = k − 1) + o(h)
           = (1 − λ(t)h) fk (t) + λ(t)h fk−1 (t) + o(h),

where we used that the terms with Nt+h − Nt > 1 have probability o(h). As before, one
can easily find that this leads to the following set of differential equations

dfk (t)/dt = λ(t)(fk−1 (t) − fk (t)) ,   k ≥ 1.
We will solve this system using the generating function of the increments Nt −Ns ,
which is given by

γNt−Ns (v) = ∑_{k=0}^{∞} v^k P (Nt − Ns = k) = ∑_{k=0}^{∞} v^k fk (t).

Using a similar approach as in the homogeneous case, we now consider the


partial derivative of the generating function with respect to t. Recall that the
generating function of a Poisson distributed variable Y with rate λ is given by

γY (s) = e^{−λ(1−s)} .
In our case, we find

∂γNt−Ns (v)/∂t = ∑_{k=0}^{∞} v^k dfk (t)/dt
               = λ(t) v ∑_{k=1}^{∞} v^{k−1} fk−1 (t) − λ(t) ∑_{k=0}^{∞} v^k fk (t)
               = −λ(t)(1 − v) γNt−Ns (v),

with initial condition

γNt−Ns (1) = ∑_{k=0}^{∞} P (Nt − Ns = k) = 1.

One can show that the solution of this differential equation is given by

γNt−Ns (v) = e^{−(1−v) ∫_s^t λ(u) du} ,

which shows the desired result.

Notation 3.3. In the financial literature, one often denotes


m(0, t) = ∫_0^t λ(u) du = Λ(t).

We call this the cumulative arrival rate. Notice that in particular,

E [Nt ] = Λ(t).

We now show the bridge between non-homogeneous Poisson processes and


their homogeneous counterpart. We first introduce the following concept.

Definition 3.4. Let Nt be a homogeneous Poisson process. Let H be any continuous


function. We define the H-accelerated process NtH by

{ NtH | t ≥ 0 } = { NH(t) | t ≥ 0 } .

An H-accelerated process is sometimes called a time-changed Poisson process.

We have the following result.

Proposition 3.5. Let {Nt | t ≥ 0} be a non-homogeneous Poisson process with cumu-
lative arrival rate Λ(t) and let {Ñt | t ≥ 0} be the homogeneous Poisson process with
constant rate λ = 1. Then the Λ-accelerated process ÑtΛ is equal to the process Nt
in distribution.

Proof. Notice that

P (Nt = k) = e^{−Λ(t)} Λ(t)^k / k! .

Similarly, the Λ-accelerated process ÑtΛ satisfies

P ( ÑtΛ = k ) = P ( ÑΛ(t) = k ) = e^{−Λ(t)} Λ(t)^k / k! .

Hence, the two processes are equal in distribution.
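Proposition 3.5 suggests a simple simulation recipe for a non-homogeneous Poisson process: simulate a unit-rate homogeneous process and map its arrival times back to the original clock through the inverse of Λ. A sketch in Python (numpy assumed; the rate function λ(t) = 2t is purely illustrative, chosen so that Λ(t) = t² has an explicit inverse):

import numpy as np

rng = np.random.default_rng(seed=7)

# Illustrative rate function lambda(t) = 2t, so Lambda(t) = t**2 and its inverse is sqrt.
def Lambda_inv(s):
    return np.sqrt(s)

horizon = 5.0
Lam_T = horizon**2                                   # Lambda(horizon)
unit_waits = rng.exponential(1.0, size=int(3 * Lam_T) + 50)
unit_arrivals = np.cumsum(unit_waits)                # arrivals of a rate-1 Poisson process
unit_arrivals = unit_arrivals[unit_arrivals <= Lam_T]
arrivals = Lambda_inv(unit_arrivals)                 # time-changed arrivals on [0, horizon]

print(len(arrivals), Lam_T)                          # E[N_horizon] = Lambda(horizon) = 25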

3.2 Mixture Models

We have already seen that the number of car crashes can be modeled using a
Poisson distribution. Here, λ equals the expected amount of car crashes for a
given population in one time unit. However, this λ need not be representative
of every member of the population: some people are better drivers and hence
have a much lower λ.

Every person has a different driving skill, and hence we could consider λ
as a random variable. Similarly, we could consider a bond portfolio where
we model the number of defaults. Companies with an investment-grade credit
rating will have a lower probability of default than others. Once again, we could
see the expected rate of defaults as a random variable.

This is the idea of a mixture model: our model consists of a mixture of


different underlying distributions (one for each member) that amalgamates into
one probability model.

3.2.1 Mixed Poisson Processes

We construct a Poisson mixture model as follows. Suppose a population has


an average rate λ. We then multiply this λ with a positive random variable Θ,
which is often called the risk level.

This Θ is used as a multiplier that encodes the risk that a certain member
of the population has relative to the average of the population. A good driver

will have θ < 1 whilst a bad driver will have θ > 1. In order to preserve the
population average, it is common to choose E [Θ] = 1.

Denote by fθ the density function of Θ, describing how the multipliers


are distributed among the population. We can then define the mixed Poisson
distribution as follows.

Definition 3.6. Let Θ be a positive random variable with E [Θ] = 1, and let
λ > 0. Then we say that X is mixed Poisson distributed if

P (X = k) = ∫_0^∞ e^{−λθ} ( (λθ)^k / k! ) fθ (θ) dθ.

Notation 3.7. We write X ∼ MPoisson(λ, Θ).

A straightforward generalization of Poisson processes can now be made.

Definition 3.8. Let Θ be a positive random variable and suppose E [Θ] = 1. A


process {Nt | t ≥ 0} is a mixed Poisson process with rate λ and multiplier Θ if

• conditionally on Θ, it has independent increments

• For all t > 0, we have

Nt ∼ MPoisson(λt, Θ).

Let us consider what happens with the moments of mixed Poisson processes.

Proposition 3.9. The mean and the variance of a mixed Poisson process are given
by

E [Nt ] = λt
Var (Nt ) = λt + λ^2 t^2 Var (Θ)

Proof. We start by showing the equality for the mean. Using the Tower Rule, we
find
E [Nt ] = E [E [Nt | Θ]] = E [λtΘ] = λt.

Furthermore, using the variance decomposition law (property 1.74), we have


Var (Nt ) = E [Var (Nt | θ)] + Var (E [Nt | Θ])
= E [λtΘ] + Var (λtΘ)
= λt + λ2 t 2 Var (Θ)

Notice that the variance is larger than the mean, which we call overdisper-
sion. This is due to the fact that we add another source of variance by allowing
the parameter to vary over the population.
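Overdispersion is easy to see in simulation. The sketch below (Python with numpy; a Gamma mixing distribution with mean 1 is chosen purely for illustration) compares the sample mean and variance of a mixed Poisson count with the plain Poisson case.

import numpy as np

rng = np.random.default_rng(seed=8)
lam, t, n = 2.0, 3.0, 100_000
theta = rng.gamma(shape=2.0, scale=0.5, size=n)   # E[Theta] = 1, Var(Theta) = 0.5
mixed = rng.poisson(lam * t * theta)              # mixed Poisson counts
plain = rng.poisson(lam * t, size=n)              # ordinary Poisson counts

print(mixed.mean(), mixed.var())   # close to 6 and 6 + (6**2) * 0.5 = 24
print(plain.mean(), plain.var())   # close to 6 and 6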

For the next property, we will need an important inequality from measure
theory called Jensen’s inequality.
Definition 3.10. A (differentiable) function g is convex if for all x, y

g(x) ≥ g(y) + (x − y) g′(y).

Figure 3.2: An example of a convex function

Intuitively, it means that if you connect two points with a straight line, the
function will lie below the straight line. When the function always lies above
the straight line, the function is called concave.

An easy way to check for the convexity of the function is by using the second
derivative.

Proposition 3.11. Let g be a smooth function. Then g is convex if for all x,

g′′(x) ≥ 0.

Proposition 3.12. Let X be a random variable, and suppose g is a convex function


defined on the image of X. Then

E [g(X)] ≥ g(E [X]).

This inequality is called Jensen’s inequality.

Proof. By the convexity of g, we know that

g(x) ≥ g(y) + (x − y) g′(y),   for all x, y.

Set x = X and y = E [X], then

g(X) ≥ g(E [X]) + (X − E [X]) g′(E [X]).

Taking the expectation of both sides, we obtain the desired result.

Using Jensen’s inequality, we can consider the probability that no event


happens. The next result shows that under a mixed Poisson model, a zero-claim
(i.e no events taking place) is more frequent than under a Poisson process.

Property 3.13. Let {Nt | t ≥ 0} be a mixed Poisson process with rate λ and risk
level Θ. Then this process has an excess of zeroes compared to the Poisson process
{Ñt | t ≥ 0} with constant rate λ:

P (Nt = 0) ≥ P ( Ñt = 0 ) .

Proof. Notice that the exponential function is convex, since for all x:

d²/dx² e^x = e^x ≥ 0.

Notice that

P (Nt = 0) = E [P (Nt = 0 | Θ)] = ∫_0^∞ e^{−λtθ} fθ (θ) dθ.

Using Jensen’s inequality, we find

P (Nt = 0) ≥ e^{−λt ∫_0^∞ θ fθ (θ) dθ} = e^{−λt E[Θ]} = e^{−λt} = P ( Ñt = 0 ) ,

proving the desired result.

In practice, we indeed often find that models that do not account for the
variability in the population tend to underestimate the number of zero-claims.
Of course, the difficulty of using a mixed Poisson model is that one needs to
find a good model for fθ .

Sometimes, mixed Poisson processes are called doubly stochastic processes


or Cox processes.

3.2.2 Bernoulli Mixture Model

Just like in the previous section, we can also use the notion of mixtures in a
Bernoulli setting. In particular, suppose we have a stochastic vector X of size
N consisting of independent (but not identically distributed) Bernoulli random
variables. The probability distribution function is then given by

P (X = x | p) = ∏_{i=1}^{N} pi^{xi} (1 − pi )^{1−xi} .

The mean of the stochastic vector X is the vector

E [X] = (p1 , p2 , ..., pN )

and the covariance matrix is the diagonal matrix with entries vii = pi (1 − pi ).

Suppose now that we have a mixture of two populations. Just like before, we
would then have two probability vectors, one for each population. We denote
these by p1 and p2 , and their k-th component by p1k and p2k respectively.

The component pik represents the probability that someone from population
i has a success for the experiment associated with the k-th random variable in
the stochastic vector. We illustrate this with an example.

Example 3.2. To evaluate whether a person has a mental illness, many experts
use questionnaires. Suppose we have such a questionnaire where the patient
is presented with N statements and records whether they agree or disagree.
Each question can then be seen as a Bernoulli experiment with X = 1 if the
respondent agrees and X = 0 otherwise.

Generally speaking, there are two sub-populations in the respondent pool:


people with the mental illness in group 1 and those without the mental illness
in group 2. The full questionnaire can be encoded as a stochastic vector of size
N . Then, p1i gives the probability that a person with the mental illness agrees
with statement i.

Another example of where Bernoulli mixture models occur is in classifica-


tion problems. A famous example is the MNIST database.

Example 3.3. The MNIST database is a widely-used data set that contains
handwritten digits ranging from 0 to 9. It is often used to test the performance
of classifiers that are designed to recognize these digits. Each image in the
database consists of a 28x28 grid of pixels, which are either black or white. We
can treat each pixel as a Bernoulli experiment, where a value of 1 indicates that
the pixel is black and a value of 0 indicates that it is white.

In this context, we can consider a mixed model where each digit is its own
population. This gives us 10 probability vectors, one for each digit, with a size
of 784 (corresponding to the number of pixels in each image). For example,
the probability vector p3i represents the probability of the ith pixel being black
given that the digit is a 3.

To classify an image, we can estimate these probability vectors using part of


the MNIST database and then use a classifier, such as a Naive Bayes classifier,
to determine the probability vector that is most likely to have produced a given
image. We will discuss this process in more detail in a later chapter.
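As a small illustration of the classification idea, the sketch below (Python with numpy; the probability vectors and the image are randomly generated stand-ins, not actual MNIST estimates) scores a binary image against each class by its Bernoulli log-likelihood and picks the most likely digit; with equal class priors this is exactly the Naive Bayes rule.

import numpy as np

rng = np.random.default_rng(seed=9)
n_classes, n_pixels = 10, 784
# Stand-in estimates of p_{ik} = P(pixel k is black | digit i), clipped away from 0 and 1
p = np.clip(rng.random((n_classes, n_pixels)), 0.05, 0.95)
image = (rng.random(n_pixels) < 0.3).astype(int)     # a stand-in binary image x

# Bernoulli log-likelihood of the image under each class
log_lik = (image * np.log(p) + (1 - image) * np.log(1 - p)).sum(axis=1)
print(np.argmax(log_lik))                            # most likely digit under this model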

Figure 3.3: A sample of handwritten digits in the MNIST database


Chapter 4

Renewal Processes

Gambling, the sure way of getting nothing for


something

Wilson Mizner

Renewal processes are a special type of counting process for which the
waiting times are independent and identically distributed. They have many
interesting properties, whilst still being quite general. In fact, a lot of the
processes covered so far are examples of renewal processes.

4.1 Definition

In the second chapter, we were introduced to binomial processes, which are


random walks that are generated by an independent and identically distributed
(IID) process. These processes involve independent, positive time intervals be-
tween jumps that follow a geometric distribution.

A renewal process is a more general version of this concept. It is defined as


follows.
Definition 4.1. A stochastic process {Xn | n ∈ N} is a renewal process if it is a


strictly positive IID process. In other words

• Xn > 0 for all n

• Xi and Xj are independent and have the same distribution.

As we have done before, we can construct counting processes using these


renewal processes.

Definition 4.2. Let Xn be a renewal process. Consider the random walk generated
by this process, i.e. the partial sums

Sn = ∑_{i=1}^{n} Xi .

Then the renewal counting process Nt generated by Xn is the counting process defined
by
Nt = max {n | Sn ≤ t} .

The reason that this process is called a renewal process is that the
process loses all its memory (i.e. it is 'renewed') whenever an event occurs.

Example 4.1. A Poisson process is a renewal counting process with X ∼


Exponential(λ). Therefore all the results that we will show in later sections
also apply to Poisson processes.

We also present an alternative definition.

Definition 4.3. A renewal counting process is a counting process {Nt | t ≥ 0} where


the sequence of inter-arrival times {Tn | n ≥ 1} forms a renewal process.

Notation 4.4. The average of the process is often denoted

m(t) = E [Nt ] .

It counts the expected amount of renewals that occurred up until time t.



4.2 Long-term laws

4.2.1 Strong law of Large Numbers and the CLT

In the classical strong law of large numbers, it is stated that under mild conditions the
long-term average of a sequence of IID random variables Xi converges almost
surely to the expected value of the distribution, µ, as the number of elements
in the sequence approaches infinity:

P ( lim_{n→∞} (1/n) ∑_{i=1}^{n} Xi = µ ) = 1.

In renewal theory, a similar result holds for a renewal counting process Nt


with inter-arrival times Xn , provided that the expected value of Xn is finite.
Theorem 4.5. Suppose Nt is a renewal counting process with inter-arrival times
Xn . Suppose E [Xi ] = µ < ∞, then
lim_{t→∞} Nt / t = 1/µ ,   with probability 1.

Proof. Let Sn denote the associated random walk. For all t, we have

SNt ≤ t < SNt+1 .

Therefore,

SNt / Nt ≤ t / Nt < SNt+1 / Nt = ( SNt+1 / (Nt + 1) ) · ( (Nt + 1) / Nt ) .

By the strong law of large numbers, it follows that, with probability 1,

lim_{t→∞} SNt / Nt = µ   and   lim_{t→∞} SNt+1 / (Nt + 1) = µ .

Additionally,

lim_{t→∞} (Nt + 1) / Nt = 1.

The result then follows from the sandwich theorem.

Remark. Notice that we can also write this theorem as

lim_{t→∞} t / Nt = µ ,   with probability 1.

We can interpret t / Nt as the average time between events in our sample. The theorem
thus states that, in the long run, the sample average time between events converges
almost surely to the true average time between events µ.
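Theorem 4.5 can be checked numerically for any positive inter-arrival distribution. A sketch in Python (numpy assumed; the uniform inter-arrival times are just an example) compares Nt /t with 1/µ:

import numpy as np

rng = np.random.default_rng(seed=10)
t_horizon = 50_000.0
X = rng.uniform(0.5, 1.5, size=int(2 * t_horizon))  # inter-arrival times with mean mu = 1
S = np.cumsum(X)                                     # arrival times of the renewal process
N_t = np.searchsorted(S, t_horizon, side='right')    # N_t = number of renewals up to t
print(N_t / t_horizon, 1 / X.mean())                 # both close to 1/mu = 1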

Another limiting result that we will briefly mention is the central limit the-
orem for renewal counting processes.
Theorem 4.6. Let Nt be a renewal counting process and suppose that the associated
inter-arrival times Xn satisfy E [X] = µ < ∞ and Var (X) = σ^2 < ∞. Then

( Nt − t/µ ) / ( σ √(t/µ^3) ) →D N (0, 1).

4.3 Stopping Times

A stopping time is a special kind of random variable associated with a process.


They are defined as follows.
Definition 4.7. An integer-valued random variable N is a stopping time for the
process {Xn | n ≥ 1} if the event N = n is independent of {Xn+m | m ≥ 1}.

In other words, the value of the stopping time is determined solely by the
events that have occurred up until that point and is not influenced by events
that occur afterwards.

Example 4.2. Consider a gambler with an initial budget B0 , and let {Bn | n ≥ 0}
be the process representing the gambler’s budget after playing the n-th game.

• Playing until the budget is at its highest is not a stopping time, since it
depends on the future values of the budget.
• Playing until 10 games have been played is a stopping time, as it is deter-
mined solely by the events that have occurred up to that point.

• Playing until the budget runs out is also a stopping time, as it is deter-
mined by the events that have occurred up to that point.

4.3.1 Wald’s Equation

For processes Xt with stopping times N and finite expectations, we have a


strong result called Wald’s equation. This equation allows us to more easily
calculate the expectation of a sum of a random quantity of random variables.
Theorem 4.8. Let {Xj | j ≥ 0} be a process of IID random variables with finite
expectation. Suppose N is a stopping time for Xn , with E [N ] < ∞. Then

E [ ∑_{i=1}^{N} Xi ] = E [N ] E [X] .

Proof. Start by defining

Ii = 1   if i ≤ N,
   = 0   if i > N.

Notice that

∑_{i=1}^{N} Xi = ∑_{i=1}^{∞} Xi Ii .

Taking the expectation yields

E [ ∑_{i=1}^{N} Xi ] = E [ ∑_{i=1}^{∞} Xi Ii ] = ∑_{i=1}^{∞} E [ E [Xi Ii | X1 , ..., Xi−1 ] ] .

Notice that Ii = 1 − I(N ≤ i − 1) (recall that N is integer-valued), and the latter is
fully described by X1 , ..., Xi−1 since N is a stopping time. Hence, Ii is determined in
the conditional so we can write

E [ ∑_{i=1}^{N} Xi ] = ∑_{i=1}^{∞} E [ Ii E [Xi | X1 , ..., Xi−1 ] ] = ∑_{i=1}^{∞} E [ Ii E [Xi ] ] .

This can be written as

E [ ∑_{i=1}^{N} Xi ] = ∑_{i=1}^{∞} E [Ii ] E [Xi ] = ∑_{i=1}^{∞} P (N ≥ i) E [Xi ] .

Using the formula for the tail from property 1.66, this becomes

E [ ∑_{i=1}^{N} Xi ] = E [X] E [N ] .

Example 4.3. Suppose we play a game in which we throw a six-sided die. The
number of pips the die lands on is the number of times N that we will throw another
die. For example, if our first die lands on three, we will throw another die three
times. Not counting the initial throw, what is the total number of pips that we
can expect?

Notice that N is a stopping time for the second batch of throws, and hence
we can use Wald's equation. This gives

E [ ∑_{i=1}^{N} Xi ] = E [N ] E [X] = (7/2) · (7/2) = 49/4 = 12.25.
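A small simulation of this game (Python with numpy; purely illustrative) confirms Wald's equation:

import numpy as np

rng = np.random.default_rng(seed=11)
totals = []
for _ in range(100_000):
    n = rng.integers(1, 7)                           # first throw: the stopping time N
    totals.append(rng.integers(1, 7, size=n).sum())  # sum of the next N throws
print(np.mean(totals))                               # close to 49/4 = 12.25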

The above game could very well be played in a gambling setting. Then we
would not be so interested in the outcome of the game itself but more in the
payouts associated with the gamble. In the next section, we consider so-called
renewal-reward processes that are more appropriate to use in this setting.

Figure 4.1: Abraham Wald (1902-1950) (Figure from [HET])



4.4 Renewal-Reward Theorem

In many physical scenarios, it is useful to consider not only the number of


events occurring over time but also the rewards or costs associated
with each event. For example, an insurance company may be interested in
modeling the value of incoming claims rather than just the number of claims
received.

Just like the renewal counting process Nt counts the number of events
through time, we can define a reward process Rt to count the (total) reward
due to the events through time. We define it as follows.
Definition 4.9. Let Nt be a renewal counting process for the inter-arrival times
Xn . Let Rn be a sequence of IID random variables. We call Rn the reward of the
renewal Xn whenever Ri is independent of Xj for j ≠ i.

It is worth noting that, in general, the rewards Rn and the inter-arrival times
Xn may be dependent on each other.

Example 4.4.

• Let Xi denote the duration of the i-th taxi ride. Denote the associated
fare of this ride by Ri , then Xi and Ri are dependent.
• Let Xi denote the inter-arrival time of buses at a given bus stop, and Ri
the number of passengers entering the associated bus. Then again Ri and
Xi are dependent.

Given a sequence of renewals and their associated rewards (Xn , Rn ), it is


often of interest to consider the total cost associated with the renewal process.
There are several different ways to define this cost, depending on the context
and the timing of the rewards.
Definition 4.10. Let (Xn , Rn ) be a renewal process and the associated costs. Let
Nt be the renewal counting process associated with Xn . Then we define the cumulative
reward process {Ct | t ≥ 0} as follows.

• If the reward is paid at the end of the interval, we define

Ct = ∑_{i=1}^{Nt} Ri

• If the reward is paid at the beginning of the interval, we define

Ct = ∑_{i=1}^{Nt +1} Ri

• If the reward is being paid throughout the interval, we define

Ct = ∑_{i=1}^{Nt} Ri + P RNt+1

where P R denotes the partial reward.

We give some examples of the different types of cumulative reward pro-


cesses.

Example 4.5.

• Suppose Xn denotes the duration of the nth taxi ride and Rn the fare of
the associated drive. As the fare is only paid at the end of the drive, the
cumulative reward process is given by
Ct = ∑_{i=1}^{Nt} Ri .

• A touring company organizes multiple boat trips every day. Denote by Xn


the duration of the nth trip, and Ri the total ticket sales for the ith trip. Tickets
for the next trip are sold at the starting location. In this case, the reward
is paid before the renewals and the cumulative reward process is given by

Ct = ∑_{i=1}^{Nt +1} Ri .

• The owner of an internet café bills its users continuously using a credit
system. For every time unit (e.g. a minute), the customer pays 10 cents.
Denote by Xn the duration that a customer uses the service. Assum-
ing that there is only one computer available, the associated cumulative
reward process is given by

Ct = ∑_{i=1}^{Nt} 0.1 · (ti − ti−1 ) + 0.1 · (t − tNt ),

where ti is the time of the i-th renewal.

There is a well-known saying The house always wins, referring to gambling.


This of course depends on what is meant by always: if no one could ever win
in a gamble then there would be no reason to go there.

However, this saying should be seen as a ’long-term’ statement: in the long


run, the house (eg. a casino) tends to make a certain profit per game. This is
reflected by the renewal-reward theorem.
Theorem 4.11. Suppose that (Xn , Rn ) is a renewal process with associated costs. Let
Ct be the cumulative cost process. Suppose that E [X] < ∞ and 0 < E [R] < ∞, then
lim_{t→∞} Ct / t = E [R] / E [X] .

Proof. We can rewrite Ct / t as

Ct / t = ( ∑_{i=1}^{Nt} Ri / Nt ) · ( Nt / t ) .

By the strong law of large numbers, we know that with probability 1,

lim_{t→∞} Nt / t = 1 / E [X] .

Furthermore, by the classical law of large numbers, we know that with proba-
bility 1

lim_{t→∞} ( ∑_{i=1}^{Nt} Ri ) / Nt = E [R] .

Hence, it follows from Slutsky's theorem that

lim_{t→∞} Ct / t = E [R] / E [X] .

Since E [R] is strictly positive, we thus find that the long-run growth rate of Ct is strictly
positive as well. Evidently, the larger E [R], the faster Ct grows on average. On the other
hand, if E [R] is too large then the game is not as attractive for the gambler.

Remark. In one time unit, by the law of large numbers, we have on average Nt / t → 1 / E [X]
games. For each of those games, we have on average an associated cost E [R]. Thus, per
time unit we would expect a cost

E [R] × ( 1 / E [X] ) ,

that is, the average cost per game times the average number of games per time unit.
Theorem 4.11 states that in the long run, the ratio Ct / t indeed tends to this quantity
E [R] / E [X] .
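The renewal-reward theorem is straightforward to check by simulation. The sketch below (Python with numpy; the illustrative taxi-style numbers are arbitrary) uses ride durations X with mean 0.5 and fares R that depend on the duration, and compares Ct /t with E [R] / E [X]:

import numpy as np

rng = np.random.default_rng(seed=12)
n = 200_000
X = rng.exponential(0.5, size=n)           # ride durations, E[X] = 0.5
R = 2.0 + 5.0 * X + rng.normal(0, 0.1, n)  # fares, dependent on duration, E[R] = 4.5
t = X.sum()                                # total elapsed time after n rides
print(R.sum() / t, R.mean() / X.mean())    # both close to E[R]/E[X] = 9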
Chapter 5

Markov Processes

The future depends on what you do today

Mahatma Gandhi

In almost all the stochastic processes we have seen so far, the dependency
between the different states was quite weak. The main reason for this is that al-
lowing stronger dependencies can quickly lead to incredibly complex processes
for which even the most basic results become very difficult to prove.

The easiest case would be total independence, i.e the state of some process
Xn does not depend on any of the previous states:

P (Xn = x | X0 = x0 , X1 = x1 , ..., Xn−1 = xn−1 ) = P (Xn = x) .

In other words, any information regarding the past of the system is irrelevant
for the prediction of the next state. Of course, in many cases, this type of
simplification cannot be justified since complete independence is seldom present
in physical examples.

Instead of assuming complete independence, the next step is to consider


dependencies of one level deeper. This type of dependency is called Markov
dependency. In this chapter, we will consider models that satisfy this property.


Figure 5.1: Andrey Markov (1856-1922) (Figure from [AM])

5.1 Definition

Definition 5.1. Let {Xn | n ≥ 0} be a discrete chain. This process is a discrete time
Markov chain if it has the Markov property, i.e
P (Xn+1 = j | Xn = i, Xn−1 = in−1 , ..., X0 = i0 ) = P (Xn+1 = j | Xn = i)
for all j, i, in−1 , ..., i0 and n ≥ 0.

In other words, the probability distribution of a state only depends on the


previous state. Any additional information on the past is irrelevant.
Notation 5.2. Let Xn be a discrete-time Markov chain. We then denote

P (Xn+1 = j | Xn = i) = pij^{n,n+1} .

This is called a transition probability. If pij^{n,n+1} = pij^{m,m+1} for all n, m, we call the
Markov chain homogeneous.

For the sake of notation, we will often just write pij even when the Markov
chain is not homogeneous.

It is important to note that the homogeneity of the transition probabilities


is not a consequence of the Markov property. It is not difficult to find examples
of discrete-time Markov chains that do not have homogeneous probabilities.
For the remainder of this book, we will assume that our Markov chains have
homogeneous transition probabilities unless stated otherwise.

A graphical representation of a Markov chain can be found in figure 5.2.

Figure 5.2: A graphical representation of a discrete-time Markov chain

An even stronger criterion than homogeneity is that of stationarity.

Definition 5.3. A stochastic process {Xn | n ≥ 0} is said to be stationary if any pair


of stochastic vectors (X0 , ..., Xl ), (Xm , ..., Xm+l ) have the same distribution, i.e for all
l, m and all i0 , ..., il we have

P (X0 = i0 , ..., Xl = il ) = P (Xm = i0 , ..., Xm+l = il ) .

We will see examples of Markov chains with homogeneous transition prob-


abilities that are non-stationary in a later section.

For any given ordered pair of states i, j, we have a corresponding probability


pij . We can store all the probabilities of all states using a matrix, called the
transition matrix.

Definition 5.4. We define the transition matrix P as

P = (pij ) = ( p11  p12  p13  ...  p1n )
             ( p21  p22  p23  ...  p2n )
             ( ...  ...  ...  ...  ... )
             ( pn1  pn2  pn3  ...  pnn )

Thus the entry pij in the i-th row and j-th column gives the transition probability to go
from state i to state j.

In order to make things more readable, we introduce the following notation.


Notation 5.5. We will denote the transition from state i to state j as i → j.

The following property tells us what transition matrices look like, and also
how we can construct new ones quite easily.
Property 5.6. The transition matrix P of a Markov chain is a stochastic matrix,
i.e. a matrix that satisfies

∑_j pij = 1,   for all i = 1, ..., n,

and pij ≥ 0 for all i, j. The converse also holds: any stochastic matrix M gives rise
to a unique Markov chain with homogeneous probabilities.

By construction, the transition matrix P contains all the transition dynamics


of the Markov chain. The only missing piece of information in order to fully
determine the Markov chain dynamics is the initial state. This observation is
the content of the following result.
Proposition 5.7. Let Xn be any discrete-time Markov chain. Then for any integer
k > 0 and any sequence of states i0 , ..., ik we have

P (X0 = i0 , X1 = i1 , ..., Xk = ik ) = P (X0 = i0 ) pi0 i1 pi1 i2 · · · pik−1 ik .

Thus the probability distribution of the chain is fully determined by the initial distribution
P (X0 = i) and P.

Proof. Applying the chain rule of conditional probabilities (property 1.75), we


find
P (X0 = i0 , X1 = i1 , ..., Xk = ik )
= P (Xk = ik | Xk−1 = ik−1 , ..., X0 = i0 ) P (Xk−1 = ik−1 | Xk−2 = ik−2 , ..., X0 = i0 )
...P (X1 = i1 | X0 = i0 ) P (X0 = i0 )
= P (Xk = ik | Xk−1 = ik−1 ) P (Xk−1 = ik−1 | Xk−2 = ik−2 ) · · · P (X1 = i1 | X0 = i0 ) P (X0 = i0 )
= P (X0 = i0 ) pi0 i1 pi1 i2 · · · pik−1 ik .
Hence, we recover the desired result.

Thus, we can decompose the probability of observing a chain


P (i0 → i1 → i2 → ... → ik → ...)
= P (X0 = i0 , X1 = i1 , X2 = i2 , ..., Xk−1 = ik−1 , Xk = ik , ...)
as the product of the probabilities
P (i0 ) · P (i0 → i1 ) · P (i1 → i2 ) · ...P (ik−1 → ik ) · ...
= P (i0 ) · pi0 ,i1 · pi1 ,i2 · ... · pik−1 ,ik · ...

5.1.1 Examples

In this section, we give several examples of discrete-time Markov chains.

Example 5.1. Suppose we count the number of throws since we last rolled a 6
on a six-sided die. This can be modeled using a Markov chain with

pi,i+1 = 5/6 ,   pi,0 = 1/6 .

For any number of throws n, the transition matrix on the states 0, 1, 2, ..., n is given by

          0    1    2   ...   n
     0 ( 1/6  5/6   0   ...   0  )
     1 ( 1/6   0   5/6  ...   0  )
Pn = 2 ( 1/6   0    0   ...   0  )
    ...( ...  ...  ...  ...  ... )
     n ( 1/6   0    0   ...  5/6 )

Graphically, this game can be represented as follows

Figure 5.3: Graphical representation of example 5.1

Notice that the above example is an example of a non-stationary Markov chain


with homogeneous transition probabilities. Indeed, since the game starts at 0,
we know that P (X0 = 2) = 0. However, we have that P (X2 = 2) = (5/6) · (5/6) > 0.

Example 5.2. Let Bn be a sequence of independent random variables with


probability distribution
P (Bn = 1) = p, P (Bn = −1) = q = 1 − p.
Consider then the random walk generated by this sequence of random variables,
Sn = ∑_{i=1}^{n} Bi .

Graphically, this discrete-time Markov chain can be represented as

Figure 5.4: Graphical representation of example 5.2



Alternatively, we can also represent it using a tree diagram.

Figure 5.5: Tree diagram of example 5.2

This is an example of a Markov chain with independent increments. In fact, we


have the following result.
Proposition 5.8. Any process with independent increments is a Markov process.

Proof. Let Xn be a process with independent increments, i.e the random vari-
ables
Xn1 − Xn0 , Xn2 − Xn1 , ..., Xnk − Xnk−1
are independent for any 0 ≤ n0 < n1 < ... < nk . In particular, we have
P (Xn = j | Xn−1 = jn−1 , Xn−2 = jn−2 , ..., X0 = j0 )
= P (Xn − Xn−1 = j − jn−1 | Xn−1 = jn−1 , Xn−2 − Xn−1 = jn−2 − jn−1 , ..., X0 − Xn−1 = j0 − jn−1 )
Since Xn − Xn−1 is independent of Xn−1 − Xn−i for any i ≠ 0, we have that for
any n

P (Xn = j | Xn−1 = jn−1 , Xn−2 = jn−2 , ..., X0 = j0 ) = P (Xn − Xn−1 = j − jn−1 | Xn−1 = jn−1 ) .

From this, we can deduce

P (Xn = j | Xn−1 = jn−1 , Xn−2 = jn−2 , ..., X0 = j0 ) = P (Xn = j | Xn−1 = jn−1 ) ,

and since n was arbitrary this proves the desired result.

However, the converse result does not hold.

Example 5.3. Consider the Gambler’s ruin problem. A casino holds a game
in which one can either win $1 with probability p or lose $1 with probability
q = 1 − p.

Assume that we either play until we have some predetermined budget $N


or until we run out of money. Denote by Xn the budget after n games.

The transition probabilities are, for 0 < i < N ,

pij = p   if j = i + 1,
    = q   if j = i − 1,
    = 0   otherwise,

together with p00 = 1 and pNN = 1 at the boundaries.

In the case where N = 4 and p = 0.4, we have the following transition matrix.

         0    1    2    3    4
    0 (  1    0    0    0    0  )
    1 ( 0.6   0   0.4   0    0  )
P = 2 (  0   0.6   0   0.4   0  )
    3 (  0    0   0.6   0   0.4 )
    4 (  0    0    0    0    1  )

The game can be represented graphically as follows.

Figure 5.6: Graphical representation of the Gamblers ruin



Using the tree diagram, we can also represent the game as follows:

Figure 5.7: Tree diagram of the Gambler’s ruin. Blue transitions have probability
0.4, red transitions have probability 0.6 and green transitions have probability
1
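A Markov chain with a given transition matrix is easy to simulate. A sketch for the gambler's ruin above (Python with numpy; the starting budget and number of simulated paths are arbitrary choices) estimates the probability of ending up ruined:

import numpy as np

rng = np.random.default_rng(seed=13)
P = np.array([[1, 0, 0, 0, 0],
              [0.6, 0, 0.4, 0, 0],
              [0, 0.6, 0, 0.4, 0],
              [0, 0, 0.6, 0, 0.4],
              [0, 0, 0, 0, 1]])

def run_chain(start, n_steps=200):
    state = start
    for _ in range(n_steps):
        state = rng.choice(5, p=P[state])   # draw the next state from row `state` of P
    return state

ruined = sum(run_chain(start=2) == 0 for _ in range(10_000))
print(ruined / 10_000)                      # estimated probability of ruin starting from $2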

In the previous example, once we reached the 0-state or the N −state, we could
never leave. These are examples of absorbing states.
Definition 5.9. Let Xn be a discrete-time Markov chain. Suppose that there is a
state i such that pii = 1, then we call this state i absorbing.

Chains with absorbing states have interesting dynamics which we will cover
in more detail in a later section.

In the remainder of this section, we will use the language developed so far
to introduce and derive the Kelly criterion. This criterion is used to determine
the betting or investment strategy that maximizes the growth of wealth over
long periods.

The value of a portfolio with initial wealth or endowment S0 can be repre-


sented as the random walk
Sn = S0 + ∑_{i=1}^{n} Xi ,

where each Xi represents the outcome of an investment, which can be positive


(if the investment leads to a profit) or negative (if the investment leads to a loss).
Each investment is done using part of the value of the portfolio, and the return
is pro rata this investment.

There are several different investment strategies, each with its benefits and
downsides. For example, the highest possible short-term return is achieved by
investing the whole budget, but this of course carries the greatest risk of ruin1 . On
the other hand, minimizing the risk of ruin is achieved by never investing at all.

Suppose that we have n bets, and each one of them is associated with a
cash flow Bi . A loss creates a negative cash flow, given by −Bi . Then, the total
wealth after the n bets is given by
Sn = S0 + ∑_{i=1}^{n} Xi Bi ,

where Xi is either 1 or −1 depending on the outcome and Bi is the amount


engaged in the i-th bet.

Suppose that according to the investor, the probability to win on the bet is
given by p. Then the expected portfolio value after n bets is given by
E [Sn ] = S0 + ∑_{i=1}^{n} E [Xi Bi ] = S0 + (2p − 1) ∑_{i=1}^{n} E [Bi ] .

Instead of betting the full wealth (Bi = Si−1 ), suppose we bet a fixed fraction of the
current wealth. Denoting this fraction by f , we hence get

Bi = f Si−1 ,   0 < f < 1.

Then if we have nS successes and nF = n − nS failures, we would get a portfolio value of

    Sn = S0 (1 + f )^nS (1 − f )^nF .
1 This
means that the value of the portfolio is 0 at some time n, i.e Sn = 0 for some n. Notice
that in this case, bets can no longer be made and thus the portfolio remains at 0 for all m > n.

The Kelly criterion is obtained by trying to optimize the exponential growth rate Gn (f ), defined by

    Sn = S0 e^{Gn (f ) n} .

We can rewrite this growth rate as

    Gn (f ) = log( (Sn /S0 )^{1/n} ) = (nS /n) log(1 + f ) + (nF /n) log(1 − f ).

Notice that maximizing the expected value of Gn (f ) is equivalent to maximizing


the expected logarithmic wealth, which more heavily punishes ruin compared
to the expected wealth.

Taking the expectation of the equation above, we find that the expected exponential growth rate is given by

    E[Gn (f )] = E[ log( (Sn /S0 )^{1/n} ) ] = p log(1 + f ) + (1 − p) log(1 − f ).

Maximizing this with respect to f , we need to solve

    ∂E[Gn (f )]/∂f = 0.

We get

    ∂E[Gn (f )]/∂f = p/(1 + f ) − (1 − p)/(1 − f )
                  = (p − pf − 1 + p − f + pf ) / ((1 − f )(1 + f ))
                  = (2p − f − 1) / ((1 + f )(1 − f )),

which has as a solution f = 2p − 1. Thus, for the case where the gain is the same as the bet, the optimal fraction (or Kelly fraction) that maximizes the exponential growth rate is f = 2p − 1.
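As a quick numerical sanity check, the following minimal sketch (in Python, assuming numpy is available; the function name is ours, purely for illustration) evaluates the expected growth rate on a grid of fractions f and confirms that the maximizer agrees with f = 2p − 1.

    import numpy as np

    def expected_growth_rate(f, p):
        """Expected exponential growth rate E[G(f)] = p log(1+f) + (1-p) log(1-f)."""
        return p * np.log(1 + f) + (1 - p) * np.log(1 - f)

    p = 0.55                                  # assumed win probability, for illustration
    fs = np.linspace(0.0, 0.99, 10_000)       # candidate betting fractions
    rates = expected_growth_rate(fs, p)

    print("numerical optimum :", fs[np.argmax(rates)])   # ~0.10
    print("Kelly fraction 2p-1:", 2 * p - 1)             # 0.10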

Example 5.4. Suppose that the bet is uneven, i.e. the gain when winning the bet is not the same as the loss when losing, even with the same stake f Si−1 .

Denote by b the gain for every unit that is bet. We assume that in a loss, the
full input of the bet is lost. For example, if the house is offering a 2-to-1 odds
then b = 2, i.e you get twice your initial investment.

Then the portfolio value after n = nF + nS bets is given by

    Sn = S0 (1 + bf )^nS (1 − f )^nF .

As an exercise, show that

    f = ((1 + b)p − 1) / b

maximizes the expected exponential growth rate, using arguments similar to the case of even bets.
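For readers who want a numerical check of this exercise, the short sketch below (Python with numpy; the chosen values of p and b are arbitrary illustrations) compares the grid-search maximizer of p log(1 + bf ) + (1 − p) log(1 − f ) with the formula above.

    import numpy as np

    p, b = 0.4, 2.0                         # assumed win probability and payout per unit bet
    fs = np.linspace(0.0, 0.99, 10_000)
    rates = p * np.log(1 + b * fs) + (1 - p) * np.log(1 - fs)

    print("numerical optimum :", fs[np.argmax(rates)])
    print("((1+b)p - 1)/b    :", ((1 + b) * p - 1) / b)   # = 0.1 for these values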

5.2 Multi-step dynamics

So far, we have only considered the dynamics of single transitions. We extend


our theory to multi-step transitions. We need the following definition.
Definition 5.10. Let Xn be a discrete-time Markov chain. The m-step transition probability pij^(m) is defined as the probability of going from state i to state j in m ≥ 1 steps:

    pij^(m) = P (Xn+m = j | Xn = i) .

Remark. Be careful of the notation. It is not difficult to see that in general pij^(m) ≠ (pij )^m .

5.2.1 Chapman-Kolmogorov Equations

Any multi-step transition is given by a chain of single-step transitions. However,


it is clear that two different chains of single-step transitions might still end up in
the same state. In this section, we will see how to calculate multi-step dynamics.
We start with an easy example.

Example 5.5. Consider the random walk on the integers as in example 5.2. Suppose that

    P (Bn = 1) = 0.5 = P (Bn = −1) .

Let us try to calculate the two-step transition probability p00^(2), the probability of being back at zero after two moves.

The four possible paths with length two are given by (0 → 1 → 2),(0 →
1 → 0), (0 → −1 → 0) and (0 → −1 → −2).

It is easy to see that

    p00^(2) = P ((0 → −1 → 0) ∪ (0 → 1 → 0))
            = P (0 → −1 → 0) + P (0 → 1 → 0).

Using the Markov property, this can be rewritten as

    p00^(2) = p0(−1) p(−1)0 + p01 p10 .

This shows that the two-step dynamics can be obtained by summing over the
relevant one-step dynamics.

One can generalize the above observation to multi-step transitions with more than two steps. Furthermore, instead of splitting a multi-step transition into a chain of one-step transitions, one can split it into multi-step transitions with one step fewer.

This generalization gives rise to the Chapman-Kolmogorov equations, a


recursive way to calculate multi-step transition probabilities.

Theorem 5.11. Let i, j be any two states and let m, n > 0. Then

    pij^(m+n) = Σ_k pik^(m) pkj^(n) .

Intuitively, this just tells us that in any path where we go from i to j in m + n


steps, we first go from i to some state k in m steps and then go from k to j in
the remaining n steps. The Markov property allows us to split these paths into
two independent subpaths.

Proof. Using the law of total probability, we can write

    pij^(m+n) = P (Xm+n = j | X0 = i) = Σ_k P (Xm+n = j, Xm = k | X0 = i) .

Using the definition of conditional probabilities, this can then be further rewritten as

    pij^(m+n) = Σ_k P (Xm+n = j, Xm = k, X0 = i) / P (X0 = i)
              = Σ_k [ P (Xm = k, X0 = i) / P (X0 = i) ] · [ P (Xm+n = j, Xm = k, X0 = i) / P (Xm = k, X0 = i) ]
              = Σ_k P (Xm = k | X0 = i) P (Xm+n = j | Xm = k, X0 = i)
              = Σ_k P (Xm = k | X0 = i) P (Xm+n = j | Xm = k)
              = Σ_k pik^(m) pkj^(n) ,

which shows the desired result.

As we have already pointed out, all the dynamical information of the tran-
sitions of a Markov chain is contained in the transition matrix P. It should
therefore not be surprising that the Chapman-Kolmogorov equations can also
be formulated in terms of the transition matrix. This is shown in the next
proposition.

Proposition 5.12. The matrix of m-step transition probabilities P^(m) = (pij^(m)) is the m-th power of the transition matrix: P^(m) = P^m .

Proof. We will prove this using induction. Notice that if we set n = 1 in the Chapman-Kolmogorov equations, we obtain

    pij^(m+1) = Σ_k pik^(m) pkj .

Furthermore, suppose we have the matrix P^(m) = (pij^(m)) of m-step probabilities. Then the (i, j)-entry of the matrix product P^(m) × P is given by the i-th row of P^(m) times the j-th column of P:

    (P^(m) × P)ij = [pi1^(m), ..., pin^(m)] · [p1j , ..., pnj ]^T = Σ_k pik^(m) pkj = pij^(m+1) ,

from which the result can easily be deduced.

Finally, we show that the k−step transition matrix is stochastic. For this, we
use the following proposition.

Proposition 5.13. Suppose X and Y are two stochastic matrices of the same size.
Then XY is also stochastic.

Proof. We need to show that Σ_j (XY )ij = 1 for all i. For this, notice that

    Σ_j (XY )ij = Σ_j Σ_k xik ykj
                = Σ_k Σ_j xik ykj = Σ_k xik ( Σ_j ykj )
                = Σ_k xik = 1.

Using this proposition, one can easily show the following result.

Proposition 5.14. For all n, P(n) is a stochastic matrix.

Proof. This is an immediate consequence of proposition 5.13 and proposition


5.12 and is left as an exercise for the reader.

Example 5.6. In this example, we will consider the multi-step dynamics for the Gambler's ruin of example 5.3, here with N = 5 and p = 0.4. The transition matrix is given by

         0     1     2     3     4     5
    0 [  1     0     0     0     0     0  ]
    1 [ 0.6    0    0.4    0     0     0  ]
    2 [  0    0.6    0    0.4    0     0  ]
    3 [  0     0    0.6    0    0.4    0  ]
    4 [  0     0     0    0.6    0    0.4 ]
    5 [  0     0     0     0     0     1  ]

Using proposition 5.12, we can find the two-step transition probabilities and
find (check this!)

 1 0 0 0 0 0 
 
 0.6 0.24 0 0.16 0 0 

0.36 0 0.48 0 0.16 0 
 
P(2) =  .
 0 0.36 0 0.48 0 0.16
 0 0 0.36 0 0.24 0.4 
 
0 0 0 0 0 1
 

As an exercise, calculate P^(10) (see footnote 2) and interpret the results. What can you say about the transitions between non-absorbing states?
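The matrix powers above are easy to reproduce on a computer. A minimal sketch (Python with numpy; written for illustration only, using the transition matrix shown in this example) is:

    import numpy as np

    # Transition matrix of the Gambler's ruin used in this example (states 0, ..., 5).
    P = np.array([
        [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.6, 0.0, 0.4, 0.0, 0.0, 0.0],
        [0.0, 0.6, 0.0, 0.4, 0.0, 0.0],
        [0.0, 0.0, 0.6, 0.0, 0.4, 0.0],
        [0.0, 0.0, 0.0, 0.6, 0.0, 0.4],
        [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    ])

    P2 = np.linalg.matrix_power(P, 2)    # two-step transition probabilities
    P10 = np.linalg.matrix_power(P, 10)  # ten-step transition probabilities

    print(np.round(P2, 4))
    print(np.round(P10, 4))   # mass concentrates on the absorbing states 0 and 5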

5.3 Simulating Markov Chains

In this section, we briefly cover the basic steps on how to simulate a Markov
chain. Recall that from proposition 5.7, it follows that it suffices to know the
initial distribution α and the transition matrix P.

We assume that the state space S (Xn ) is countable. Suppose that after
labeling, we can write the state space as

S (Xn ) = {0, 1, 2, 3, 4, ...} .


2 Use a calculator!

We can then define the sequence of real numbers


αk ≡ α(k) = P (X0 = k) .
Using these values, we can sample from the initial distribution as follows:

• Generate a random sample u0 from U0 ∼ U [0, 1].


• Find the value k corresponding to

    u0 ∈ [ Σ_{i=0}^{k−1} αi , Σ_{i=0}^{k} αi ).

• We set X0 = k, where k is the value obtained in the previous step.

To see why this method is valid, observe that

    P ( u0 ∈ [ Σ_{i=0}^{k−1} αi , Σ_{i=0}^{k} αi ) ) = αk = P (X0 = k) .

In order to fully simulate a Markov chain, we also need a method that


generates the transitions i → j according to P, starting from any state i. This
can be achieved as follows:
• Starting from the state i ∈ S (Xn ), we know from property 5.6 that Σ_{j=0}^{∞} pij = 1.
• Generate a random sample un from Un ∼ U [0, 1].
• Find the value j corresponding to

    un ∈ [ Σ_{k=0}^{j−1} pik , Σ_{k=0}^{j} pik ).

• We set Xn = j. In order to generate Xn+1 , we repeat this process starting


from the state j.

As an exercise, check whether the obtained Markov chain has the same tran-
sition probabilities as governed by P and check whether the obtained chains
satisfy the Markov property.
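A minimal sketch of this simulation procedure in Python (using numpy; the function below is our own illustration, not a library routine) could look as follows.

    import numpy as np

    def simulate_chain(alpha, P, n_steps, rng=None):
        """Simulate a Markov chain with initial distribution alpha and transition matrix P."""
        rng = np.random.default_rng() if rng is None else rng
        # Sample X0 by inverting the cumulative initial distribution.
        state = int(np.searchsorted(np.cumsum(alpha), rng.uniform()))
        path = [state]
        for _ in range(n_steps):
            # Sample the next state from row `state` of P in the same way.
            state = int(np.searchsorted(np.cumsum(P[state]), rng.uniform()))
            path.append(state)
        return path

    # Example: the Gambler's ruin with N = 4 and p = 0.4, started at state 2.
    P = np.array([
        [1.0, 0.0, 0.0, 0.0, 0.0],
        [0.6, 0.0, 0.4, 0.0, 0.0],
        [0.0, 0.6, 0.0, 0.4, 0.0],
        [0.0, 0.0, 0.6, 0.0, 0.4],
        [0.0, 0.0, 0.0, 0.0, 1.0],
    ])
    alpha = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
    print(simulate_chain(alpha, P, n_steps=20))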

5.4 Calibration of Markov Chains

We will assume that the transition probabilities are homogeneous. Suppose that
we have observed the evolution of a Markov chain up to time n. Thus, we have
as data the path
Pn = (X0 = i0 , X1 = i1 , ..., Xn = in ).

Using proposition 5.7, one can easily show that the likelihood of this path is given by

    L(Pn ) = P (X0 = i0 ) Π_{k=0}^{n−1} p_{ik , ik+1} .
Denote by nij the number of times the one-step transition i → j is observed in the path Pn . Then, since we assumed homogeneous transition probabilities, we can rewrite the likelihood as

    L(Pn ) = P (X0 = i0 ) Π_{i,j} pij^{nij} .

The maximum likelihood estimator P̃ for the transition probabilities is the transition matrix that maximizes the likelihood of Pn . Since the logarithm is a monotone increasing function, this is equivalent to maximizing the log-likelihood3

    ℓ(Pn ) = log(P (X0 = i0 )) + Σ_{i,j} nij log(pij ).

However, recall that the solution must correspond to a stochastic matrix. We introduce the Lagrange multipliers λi in our constrained log-likelihood function:

    ℓC (Pn ) = ℓ(Pn ) − Σ_i λi ( Σ_j pij − 1 ).
Each state has its own Lagrange multiplier. The maximum likelihood estimator P̃ satisfies

    ∂ℓC (Pn )/∂pij = nij /p̃ij − λi = 0.
3 The reason why one prefers to use the log-likelihood over the likelihood comes from the
fact that the multiplications change to additions after applying the logarithm. This often makes
the maximization problem analytically more tractable.

Hence, we obtain

    p̃ij = nij / λi .

The constraints give rise to

    Σ_j p̃ij − 1 = Σ_j nij /λi − 1 = 0   ∀ i ∈ S (Xn ),

from which it follows that

    λi = Σ_j nij   ∀ i ∈ S (Xn ).

Therefore, the maximum likelihood estimator for the transition probability pij is given by

    p̃ij = nij / Σ_j nij = (number of observed transitions i → j) / (number of transitions starting from i).
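As an illustration of these estimators, the following sketch (Python with numpy; the helper name and the toy path are ours) counts the observed one-step transitions along a path and normalizes each row.

    import numpy as np

    def estimate_transition_matrix(path, n_states):
        """Maximum likelihood estimate p~_ij = n_ij / (number of transitions out of i)."""
        counts = np.zeros((n_states, n_states))
        for i, j in zip(path[:-1], path[1:]):
            counts[i, j] += 1                       # n_ij: observed transitions i -> j
        row_sums = counts.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0               # avoid division by zero for unvisited states
        return counts / row_sums

    path = [0, 1, 1, 2, 1, 0, 0, 1, 2, 2, 1, 1, 0]  # a toy observed path on 3 states
    print(np.round(estimate_transition_matrix(path, 3), 3))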

5.5 Structure of Markov Chains

It is clear from the previous sections that Markov chains have a lot of structure. Due to the possibly large number of states and transitions between them, it can be quite difficult to fully understand this structure. In this section, we will develop tools and concepts that make this a lot easier.

5.5.1 Equivalence Classes

In the following sections, we will consider the more global structure of Markov
chains. For this, we will need the following concepts.

Definition 5.15. Let Xn be a Markov chain. We then define the state space S (Xn ) to be the set of all possible values of Xn .

Example 5.7.

• In example 5.1, the state space S (Xn ) = N.

• In example 5.2, the state space is S (Xn ) = Z.

• In example 5.3, the state space is S (Xn ) = {0, 1, ..., N − 1, N }.

We have put a lot of focus on states i and j that have paths between them.
However, it is just as interesting to consider states with no paths between the
two.

Definition 5.16. Let i, j ∈ S (Xn ) be any two states in the state space. We say i
communicates with state j if there is a non-zero probability of ever reaching j starting
from i:
    pij^(n) > 0 for some 0 ≤ n < ∞.
We say the states i and j intercommunicate if they communicate with each other.

Notation 5.17. If i communicates with j, we will write i ↷ j. If i and j intercom-


municate, we will write i ∼ j

As the notation suggests, intercommunication is an equivalence relation.

Proposition 5.18. Intercommunication is transitive: if i ∼ j and j ∼ k, then i ∼ k

Proof. Since i ∼ j, we know that there exists an n such that pij^(n) > 0. Similarly, we can find an m such that pjk^(m) > 0. From the Chapman-Kolmogorov equations, we find

    pik^(n+m) = Σ_l pil^(n) plk^(m) ≥ pij^(n) pjk^(m) > 0.

Thus, we find that there exists a non-zero probability to go from i to k. This


argument can be reversed to obtain the opposite direction.

It is easy to see that the relation is also reflexive and symmetric. The nice thing about equivalence relations is that they can be used to decompose a set into its equivalence classes. In this case, we can decompose the state space S (Xn ) as follows:

1. Pick a state i ∈ S (Xn )

2. Denote by Ci the set of all states j such that i ∼ j

3. Pick a state k ∈ S (Xn ) \ Ci

4. Denote by Ck the set of all states l such that k ∼ l

5. Pick a state t ∈ S (Xn ) \ (Ci ⊔ Ck )

6. Denote by Ct the set of all states w such that t ∼ w

7. Repeat the procedure until all the states have been assigned to a class.
By construction, this gives a disjoint and exhaustive set of equivalence
classes of the state space S (Xn ).
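This decomposition can also be computed directly from the transition matrix. A small sketch (Python with numpy; written purely for illustration) marks j as reachable from i whenever some power of P has a positive (i, j) entry, and then groups mutually reachable states.

    import numpy as np

    def communicating_classes(P):
        """Return the equivalence classes of the intercommunication relation."""
        n = len(P)
        # reach[i, j] is True if i can reach j in some number of steps (including 0).
        reach = np.eye(n, dtype=bool) | (P > 0)
        for _ in range(n):  # repeated squaring covers all path lengths up to n
            reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
        classes, assigned = [], set()
        for i in range(n):
            if i in assigned:
                continue
            cls = {j for j in range(n) if reach[i, j] and reach[j, i]}
            classes.append(sorted(cls))
            assigned |= cls
        return classes

    # Gambler's ruin with N = 4: classes {0}, {1, 2, 3} and {4}.
    P = np.array([
        [1.0, 0.0, 0.0, 0.0, 0.0],
        [0.6, 0.0, 0.4, 0.0, 0.0],
        [0.0, 0.6, 0.0, 0.4, 0.0],
        [0.0, 0.0, 0.6, 0.0, 0.4],
        [0.0, 0.0, 0.0, 0.0, 1.0],
    ])
    print(communicating_classes(P))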

Example 5.8. In the case of the gambler’s ruin, we have three different
equivalence classes: {0} , {N }, and {1, ..., N − 1}.

Notice that in the previous example, the obtained equivalence classes have a
different dynamical flavor: we have the sets {0} , {N } that we can only enter but
never leave and we have the set {1, ..., N − 1} that we can leave.
Definition 5.19. Let Xn be a Markov chain. A subset A ⊂ S (Xn ) of the state space
is said to be closed (with respect to the Markov chain Xn ) if
pij = 0, ∀ i ∈ A, j ∉ A.
In other words, it is impossible to leave A.

Some examples of closed sets are absorbing states and the state space itself.

Another interesting dynamical property is that of (ir)reducibility.


Definition 5.20. A subset A ⊂ S (Xn ) of the state space is said to be irreducible if
∀i, j ∈ A it follows that i ∼ j.

In other words, an irreducible set consists only of intercommunicating


states.

Even though the state space S (Xn ) is always closed, it is not necessarily
irreducible. We introduce the following terminology.
Definition 5.21. Let Xn be a Markov chain. If the state space S (Xn ) is irreducible
then we call the Markov chain irreducible.

Intuitively, in an irreducible chain, we can go from any state to any other


state in the state space.

Example 5.9. The Markov chains in examples 5.1 and 5.2 are both irreducible.
The Markov chain in example 5.3 is not.

There is a notion stronger than irreducibility, called regularity.



Definition 5.22. Let Xn be a discrete-time Markov chain. Then we say that the
chain is regular if there exists an n0 > 0 such that
    pij^(n0) > 0   ∀ i, j.
In other words, if we start in i and make n0 transitions, we can be at any possible
state in the state space.

To see why an irreducible chain need not be regular, we have the following
example.

Example 5.10. Consider the random walk discussed in example 5.2. This is easily seen to be an irreducible Markov chain. However, notice that if i is even,

    pi,2k+1^(2n) = 0   ∀ n, k > 0

and

    pi,2k^(2n+1) = 0   ∀ n, k > 0.

This implies that our chain cannot be regular.

Just like all other dynamical properties, we can define the notion of chain
regularity using the transition matrix.
Property 5.23. A chain Xn with transition matrix P is regular if and only if there exists an n0 > 0 such that all elements of P^n0 are strictly positive.

Proof. Left as an easy exercise.

Example 5.11. Consider the Markov chain generated by the transition matrix

    P = [ 0.75  0.25   0
           0    0.5   0.5
          0.6   0.4    0  ].

This is a regular Markov chain, since

    P^2 = [ 0.5625  0.3125  0.125
            0.3     0.45    0.25
            0.45    0.35    0.2   ].

If we consider even larger powers of P, we find

    P^8 ≈ [ 0.45  0.37  0.18
            0.45  0.37  0.18
            0.45  0.37  0.18 ].

It seems like the rows of a regular transition matrix converge. We will see that this is a general result for regular chains in a later section.

5.5.2 Hitting and Passage Times

Given a Markov chain Xn and states i, j ∈ S (Xn ) such that i ↷ j, we only know that i can reach j. We could instead be interested in knowing how long it takes to go from i to j, in terms of the number of transitions in the path from i to j.

This of course depends on the path, which is random. Therefore, this


quantity is also random. We define it as follows.
Definition 5.24. Let i, j ∈ S (Xn ), then we define the first-passage time
Tij = min {n ≥ 1 : Xn = j | X0 = i} .
When X0 = j, we call it the first return time or first recurrence time of j
Tjj = Tj = min {n ≥ 1 : Xn = j | X0 = j} .

Notice that if i does not communicate with j, then Tij = ∞. However, even if i ↷ j, it can still happen that Tij = ∞.

Example 5.12. Consider the Gambler’s ruin example discussed earlier. Notice
that if we start at state 1, we have that 1 communicates with 2 (1 ↷ 2). However,
if we have the transition 1 → 0, then (T12 | X1 = 0) = ∞ since 0 is an absorbing
state.

We define the following probabilities.



Definition 5.25. The passage probability fij^(n) is the probability that the chain first visits state j on the nth step, starting from some state i:

    fij^(n) = P (X1 ≠ j, X2 ≠ j, ..., Xn−1 ≠ j, Xn = j | X0 = i) .

The probability fjj that the chain starting in state j ever returns to this state is then, by the law of total probability,

    fjj = Σ_{n=1}^{∞} fjj^(n) = P (Tj < ∞) = 1 − P (Tj = ∞) .

Again, by the law of total probability, we can also define the probability fij that a chain starting in state i ever reaches state j as

    fij = Σ_{n=1}^{∞} fij^(n) = P (Tij < ∞) .

Notice that by definition, for i ≠ j,

    fij > 0 ⇐⇒ i ↷ j.

Using the above definition, we can define the average time of recurrence as
follows.
Definition 5.26. The mean recurrence time of a state j is

    µj = Σ_{n=1}^{∞} n fjj^(n) = E[Tj ].

It represents the average time it takes for a path starting at j to return to j.


Remark. Notice that the mean recurrence time of a state s that communicates with
an absorbing state is infinite.

Example 5.13. In example 5.1, the mean recurrence time of the state 0 is given by

    µ0 = Σ_{n≥1} n f00^(n) = Σ_{n≥1} n · (1/6) · (5/6)^{n−1} .

This can easily be rewritten in the form of an arithmetico-geometric series:

    µ0 = (1/6) · (6/5) · Σ_{n≥1} n (5/6)^n ,

which has as solution4

    µ0 = (1/6) · (6/5) · (5/6)/(1 − 5/6)² = 6.
In other words, the mean recurrence time is six throws.

In the definition of first passage and first return, we deliberately do not count
the initial state X0 . When we are interested in knowing the time it takes for a
state to occur in the chain, we use the hitting time.

Definition 5.27. The hitting time of a state j from a state i is defined as

    Hij = min {n ≥ 0 : Xn = j | X0 = i} .

We can define the hitting time of a subset A ⊂ S (Xn ) from state i as

    Hi,A = min {n ≥ 0 : Xn ∈ A | X0 = i} .

Their expected values are denoted by

    E[Hij ] = Hi,j   and   E[Hi,A ] = Hi,A

(we use the same symbols for the hitting times and their expectations when no confusion can arise). Notice that in particular, we always have that Hj ≡ Hjj = 0.

Using the notation above, we can give a different formulation for concepts
we have already seen.
4 Recall that for an arithmetico-geometric series, we have Σ_{k=0}^{∞} k r^k = r/(1 − r)² for |r| < 1.

Proposition 5.28. A set A is closed if and only if

    P (Hi,A^c = ∞) = 1   ∀ i ∈ A.

Proof. ⟹: Suppose that A is closed and P (Hi,A^c = ∞) < 1 for some i ∈ A. Then there is a chain such that X0 = i and Xn = j ∉ A with n < ∞. Without loss of generality, assume that n is the hitting time of the subset A^c. By definition, we then know that Xn−1 ∈ A. Denote Xn−1 = r; then prj > 0, which contradicts the definition of closed sets. Hence P (Hi,A^c = ∞) = 1.

⟸: Suppose P (Hi,A^c = ∞) = 1. Then for any i ∈ A, j ∉ A we have

    pij = P (Hij = 1) ≤ P (Hi,j < ∞) ≤ P (Hi,A^c < ∞) = 1 − P (Hi,A^c = ∞) = 0.

It follows that A is closed.

We also have the following result for absorbing states.

Proposition 5.29. If a state j is absorbing, then Hj = 0 and Tj = 1.

Proof. It is evident that Hj = 0. For Tj , notice that

    P (Tj = 1) = pjj = 1.

Just like before, we can define the following associated probability.

Definition 5.30. The hitting probability is the probability

    hij = P (Hij < ∞) .

Notice that in particular hjj = 1, and

    i ↷ j ⇐⇒ hij > 0.

The hitting probability hij is the probability that the chain hits the state j
starting from a given state i in a finite amount of time.

Now that we have defined these concepts, the next question is how one can
work with them. We consider an example.

Example 5.14. We use the same setting as in example 5.3, and suppose that
p = q = 0.5. We want to answer the following questions:

• What is the probability of ruin, i.e the probability that we enter the state
0?

• What is the expected hitting time of the absorbing states?

We will answer these questions using a method called first-step analysis. Notice that the one-step transitions satisfy the following:

• h1,0 ≥ P (H1,0 = 1) = q.

• hN−1,N ≥ P (HN−1,N = 1) = p.

• h0,0 = 1, since H0,0 = 0 by proposition 5.29.

• hN,0 = 0.

We first try to answer the question regarding the probability of ruin. We start
from state i, and we use the law of total probability and the Markov property
to obtain

    hi,0 = P (Hi,0 < ∞ | X0 = i)
         = P (Hi,0 < ∞ | X0 = i, X1 = i − 1) P (X1 = i − 1)
           + P (Hi,0 < ∞ | X0 = i, X1 = i + 1) P (X1 = i + 1)
         = P (Hi−1,0 < ∞ | X0 = i − 1) q + P (Hi+1,0 < ∞ | X0 = i + 1) p
         = hi−1,0 q + hi+1,0 p.

Here, we used the fact that the probability of ever hitting 0 from i given that
we transitioned to i − 1 is the same as the probability of ever hitting 0 starting

from i − 1 (check this!). Since p = q = 1/2, we can rewrite this as

    hi,0 = (hi−1,0 + hi+1,0 ) / 2   ∀ i = 1, ..., N − 1.

Rewriting this set of equations gives us the equivalent set of equations

    hi,0 = h0,0 /(i + 1) + (i/(i + 1)) hi+1,0   ∀ i = 1, ..., N − 1.
Notice that we have h0,0 = 1 and hN,0 = 0. Hence, we find that

    hN−1,0 = 1/N + ((N − 1)/N ) hN,0 = 1/N.

Thus, we find that

    hN−2,0 = 1/(N − 1) + ((N − 2)/(N − 1)) · (1/N )
           = (N + N − 2) / ((N − 1)N ) = 2(N − 1) / ((N − 1)N ) = 2/N.
Using the same reasoning as above, one can show (do this!) that

    hi,0 = (N − i)/N.
This solves the first problem. Notice how conditioning on the first step allowed
us to find a system of equations in terms of the hitting probabilities. This
first-step analysis can be used for the next problem as well.

We again condition on the first step. Notice that during this conditioning we increase the hitting times by one. We get

    Hi,0∪N = 1 + q Hi−1,0∪N + p Hi+1,0∪N
           = 1 + (Hi−1,0∪N + Hi+1,0∪N ) / 2   ∀ i = 1, ..., N − 1.

One can show (do this!) that we can rewrite this set of equations as

    Hi,0∪N = i + H0,0∪N /(i + 1) + (i/(i + 1)) Hi+1,0∪N   ∀ i = 1, ..., N − 1.

Using H0,0∪N = 0 and HN,0∪N = 0, we find

    HN−1,0∪N = N − 1.

Thus, we find

    HN−2,0∪N = N − 2 + ((N − 2)/(N − 1)) HN−1,0∪N
             = N − 2 + (N − 2)(N − 1)/(N − 1) = 2(N − 2).

Using the same calculation as above, we can find

    Hi,0∪N = i(N − i).

Remark. Suppose we have a chain Xn with two absorbing states b1 , b2 ∈ S (Xn ), and suppose that i ↷ b1 and i ↷ b2 . Then notice that the expected values Hi,b1 and Hi,b2 are not finite.

Remark. As an exercise, try to show that in the Gambler's ruin problem, the hitting probabilities satisfy hi,N = i/N. We thus find that hi,N + hi,0 = 1, or in other words that the Markov chain always ends up in an absorbing state. Is this surprising?
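For readers who prefer to check such identities numerically, a small sketch (Python with numpy; our own illustration, with an arbitrarily chosen N) sets up the first-step equations for hi,0 with p = q = 1/2 and solves the resulting linear system.

    import numpy as np

    N = 10                                  # target wealth, chosen for illustration
    # Unknowns h_0, ..., h_N with h_i = P(hit 0 | start at i).
    A = np.zeros((N + 1, N + 1))
    b = np.zeros(N + 1)
    A[0, 0], b[0] = 1.0, 1.0                # h_0 = 1 (already ruined)
    A[N, N], b[N] = 1.0, 0.0                # h_N = 0 (absorbed at N)
    for i in range(1, N):
        # h_i - 0.5 h_{i-1} - 0.5 h_{i+1} = 0 (first-step analysis with p = q = 1/2)
        A[i, i], A[i, i - 1], A[i, i + 1] = 1.0, -0.5, -0.5

    h = np.linalg.solve(A, b)
    print(np.round(h, 3))                   # matches (N - i) / N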

The first step analysis used in the previous example can be generalized to
the following result.

Theorem 5.31. The hitting probabilities hij are the minimal non-negative solution of

    hij = Σ_{k∈S(Xn)} pik hkj   for i ≠ j,   with hjj = 1.

The expected hitting times Hi,j are the minimal non-negative solution of

    Hij = 1 + Σ_{k∈S(Xn), k≠j} pik Hkj   for i ≠ j,   with Hjj = 0.

Example 5.15. Consider a rat in a maze with four cells. We label each cell
with an integer from 1 to 4. The maze has an exit which we denote by 0, the
’free’ state. The rat is placed randomly in a cell and then moves throughout the
maze until it finds its way out. We will assume that the rat is ’Markovian’, in
the sense that the transitions between cells chosen by the rat are not influenced
by past choices. We also assume that the rat has no preference when choosing
the next cell, in the sense that at each move the rat has an equal probability to
go to any of the neighboring cells. We denote by Xn the cell visited right after
the n−th move. We then have that S (Xn ) = {0, 1, 2, 3, 4}.

The transition matrix is given by

         0     1     2     3     4
    0 [  1     0     0     0     0  ]
    1 [  0     0    0.5   0.5    0  ]
    2 [  0    0.5    0     0    0.5 ]
    3 [  0    0.5    0     0    0.5 ]
    4 [ 1/3    0    1/3   1/3    0  ]

Graphically, this problem can be represented as follows.

Figure 5.8: Graphical representation of example 5.15



We will calculate the expected escape time for each of the starting positions using theorem 5.31. We obtain the following set of equations:

    H1,0 = 1 + H2,0 /2 + H3,0 /2
    H2,0 = 1 + H1,0 /2 + H4,0 /2
    H3,0 = 1 + H1,0 /2 + H4,0 /2
    H4,0 = 1 + H2,0 /3 + H3,0 /3

Solving these equations gives us the solution

    H1,0 = 13,   H2,0 = 12,   H3,0 = 12,   H4,0 = 9.

Note that even though a rat starting in the fourth cell can free itself in a single move, it will on average need 9 moves to exit the maze!
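The linear system of theorem 5.31 can of course also be solved numerically. A minimal sketch for this example (Python with numpy; written for illustration only):

    import numpy as np

    # Expected escape times H_1, ..., H_4 for the maze; the exit 0 has H_0 = 0.
    # System from theorem 5.31: H_i = 1 + sum over neighboring cells k of p_ik H_k.
    A = np.array([
        [1.0, -0.5, -0.5,  0.0],   # H1 = 1 + H2/2 + H3/2
        [-0.5, 1.0,  0.0, -0.5],   # H2 = 1 + H1/2 + H4/2
        [-0.5, 0.0,  1.0, -0.5],   # H3 = 1 + H1/2 + H4/2
        [0.0, -1/3, -1/3,  1.0],   # H4 = 1 + H2/3 + H3/3
    ])
    b = np.ones(4)

    print(np.linalg.solve(A, b))   # [13. 12. 12.  9.]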

5.5.3 Recurrence and Transience

An interesting dynamical property that certain states have is recurrence: once


a chain has visited a recurrent state i, it will almost surely return to that state.

Definition 5.32. A state j is recurrent (or persistent) if the chain returns to j in a finite number of steps almost surely:

    fjj = P (Tj < ∞) = 1.

If a state is not recurrent, we call it transient.

Remark. It is not true that a recurrent state must be hit by all chains almost surely.
Can you find an example?

We distinguish two types of recurrent states.


Definition 5.33. Let j ∈ S (Xn ) be a recurrent state. Then, if E[Tj ] is finite we call the state j a positive recurrent state. If E[Tj ] is infinite, we call the state j a null recurrent state.

Example 5.16. It can be shown that all states in example 5.2 are null recurrent.
The proof of this lies outside the scope of this book.

Example 5.17. In example 5.3 the only recurrent states are the absorbing
states {0} , {N }. All other states are transient.

Notice that we have the following equivalence:

    a state i is recurrent ⇐⇒ fii = 1 ⇐⇒ Σ_{k=1}^{∞} fii^(k) = 1.

Just like we have done previously, we can use a first-step analysis to calculate the values of fij^(n).

Theorem 5.34. The passage probabilities satisfy the following recurrence relation:

    fij^(n) = pij                          for n = 1,
    fij^(n) = Σ_{k≠j} pik fkj^(n−1)        for n > 1.

Proof. Notice that

    fij^(1) = pij ,

and for all n > 1, we have

    fij^(n) = P (X1 ≠ j, X2 ≠ j, ..., Xn−1 ≠ j, Xn = j | X0 = i)
            = Σ_{k∈S(Xn), k≠j} P (X1 ≠ j, ..., Xn−2 ≠ j, Xn−1 = k, Xn = j | X0 = i) .

Using the Markov property and property 1.75, this yields

    fij^(n) = Σ_{k≠j} P (Xn = j | Xn−1 = k) P (X1 ≠ j, X2 ≠ j, ..., Xn−1 = k | X0 = i) .

From this, the result easily follows.

Define now the vector of passage probabilities

    fj^(n) = [f1j^(n), ..., fjj^(n), ...]^T .

Then the formulas for fij^(n) can be rewritten as

    fj^(n) = P(j) fj^(n−1) ,

where P(j) is the transition matrix with the j-th column set equal to zero, in order to ensure k ≠ j in the sums. We get

    fj^(n) = P(j) · · · P(j) fj^(1) ,

with n − 1 factors of P(j) .

Example 5.18. A famous UK study of occupational mobility across gen-


erations was conducted after World War II. Three occupational levels were
identified.

• Upper level: Executive, managerial, high administrative, and professional

• Middle level: Supervisor, skilled manual

• Unskilled labor

The main interest was to see how transitions between the different levels oc-
curred.

Using surveys, the transition probabilities were estimated to be as follows, denoting the upper, middle, and lower levels by 1, 2, and 3 respectively.

          1     2     3
    1 [ 0.45  0.48  0.07 ]
    2 [ 0.05  0.70  0.25 ]
    3 [ 0.01  0.50  0.49 ]

The following questions were asked:

1. What is the probability for an unskilled person to hold an upper-level


position in exactly 3 years?
2. Is an upper-level position safe? In other words, is the upper-level position
a recurrent state?

The answer to the first question is obtained by calculating f31^(3). Notice that we have for n = 1

    f1^(1) = [p11 , p21 , p31 ]^T = [0.45, 0.05, 0.01]^T .

For n = 2, this becomes

    f1^(2) = P(1) f1^(1) ,   with P(1) = [ 0  0.48  0.07
                                           0  0.70  0.25
                                           0  0.50  0.49 ],

which gives

    f1^(2) = [0.0247, 0.0375, 0.0299]^T .
 

Similarly, we obtain

    f1^(3) = [0.02009, 0.03372, 0.0334]^T .

Thus, we obtain that f31^(3) = 3.34% of the low-skilled workers make a transition to an upper-level position in exactly three years.

For the second question, we need to calculate

    f11 = Σ_{n=1}^{∞} f11^(n) .

This is numerically less straightforward. Using theorem 5.34, it can be obtained by calculating

    f1 = Σ_{n=0}^{∞} P(1)^n f1^(1) .

One way to evaluate this is by truncating the series and evaluating the partial sums. The results of this can be found in figure 5.9.

Figure 5.9: Numerical evaluation of the series in example 5.18

Another way to solve this numerically is by using the fact that if the largest eigenvalue of P(1) is smaller than one in absolute value (check that this is the case), we have

    f1 = Σ_{j=0}^{∞} P(1)^j f1^(1) = (I − P(1) )^{-1} f1^(1) = [1, 1, 1]^T .

For this method, no (direct) approximation needs to be made. We obtain using


both methods that f11 = 1. Notice that we obtained that all states are recurrent.
We will see later that this is no coincidence, but a direct result of theorem 5.55.
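The computations in this example are easy to reproduce. The sketch below (Python with numpy; purely illustrative) builds P(1) by zeroing the first column, iterates fj^(n) = P(1) fj^(n−1), and also evaluates the closed form (I − P(1))^{-1} f1^(1).

    import numpy as np

    P = np.array([
        [0.45, 0.48, 0.07],
        [0.05, 0.70, 0.25],
        [0.01, 0.50, 0.49],
    ])

    P1 = P.copy()
    P1[:, 0] = 0.0                       # P_(1): first column set to zero
    f1 = P[:, 0].copy()                  # f_1^(1) = (p_11, p_21, p_31)^T

    f = f1.copy()
    for n in range(2, 4):                # f_1^(2) and f_1^(3)
        f = P1 @ f
        print(f"f1^({n}) =", np.round(f, 5))

    # f_i1 = P(T_i1 < infinity): sum of the series, via (I - P_(1))^{-1} f_1^(1).
    print("f_1 =", np.linalg.solve(np.eye(3) - P1, f1))   # [1. 1. 1.]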

5.5.4 Strong Markov property and Recurrence revisited

In the chapter on renewal processes, we introduced the concept of stopping


times. These were random variables whose realization T = m depended only
on the values of X0 , ..., Xm .

We now introduce these random variables in the context of Markov chains.


Notice that the definition is very similar to the original definition.

Definition 5.35. A non-negative integer-valued random variable T is called a


stopping time with respect to a Markov chain X if for any n ≥ 0, the event T = n is
completely determined by the values of the process up to time n, i.e (X0 , X1 , ..., Xn ).

In particular, a stopping time is independent of whatever happens after the


stopping time: the event {T = n} does not depend on {Xn+m | m > 0}.

Example 5.19. It is not difficult to show that the following are examples of stopping times.

• Any deterministic time T .

• Recurrence times:

    {Tj = n} ⇐⇒ {X0 = j, X1 ≠ j, ..., Xn−1 ≠ j, Xn = j}.

• Hitting times:

    {Hij = n} ⇐⇒ {X0 ≠ j, X1 ≠ j, ..., Xn−1 ≠ j, Xn = j}.

We now define a random variable that is not a stopping time.



Definition 5.36. Let Xn be a Markov chain and let A ⊂ S (Xn ) be any set of states.
Then the last exit time LA is the time

LA = sup {n ≥ 0 | Xn ∈ A} .

In other words, it is the last time that the chain is at a state in A.

It is not difficult to see that LA is not a stopping time.

The following theorem, often called the Strong Markov Property (SMP),
shows the importance of stopping times.

Theorem 5.37. Let Xn be a Markov chain, and suppose T is a stopping time for Xn .
Then, conditionally on T < ∞ and XT = j, any other information about X0 , ..., XT
is irrelevant for predicting the future and the sequence {XT +n | n ≥ 0} is a Markov
chain that behaves like X started at j.

Proof. It suffices to show that

    P (XT+1 = k | XT = j, T = n) = pjk .

Denote by Vn the set of vectors (x0 , x1 , ..., xn ) such that if X0 = x0 , X1 = x1 , ..., Xn = xn then T = n and XT = j. By definition, we then have

    P (XT+1 = k, XT = j, T = n) = Σ_{x∈Vn} P (X0 = x0 , ..., Xn = xn , Xn+1 = k)
                                = Σ_{x∈Vn} P (Xn+1 = k | Xn = j, ..., X0 = x0 ) P (Xn = xn , ..., X0 = x0 )
                                = Σ_{x∈Vn} pjk P (Xn = xn , ..., X0 = x0 ) = pjk P (T = n, XT = j) .

Therefore, we find that

    pjk = P (XT+1 = k, XT = j, T = n) / P (T = n, XT = j) = P (XT+1 = k | XT = j, T = n) .

Notice that the strong Markov property does not hold when T is not a stopping time. Indeed, suppose that pii > 0 and consider Li , the last exit time of the state i. Then

    P (X_{Li+1} = i | X_{Li} = i, Li = n) = 0,

by definition of the last exit time. Thus,

    0 = P (X_{Li+1} = i | X_{Li} = i, Li = n) ≠ pii > 0,

violating the SMP.

Using the strong Markov property, we can show the following interesting
result.

Property 5.38. The probability that a chain returns at least n times to a given state j is given by P (Tj < ∞)^n .

Proof. This is an immediate corollary from the strong Markov property.

Another interesting result that uses the SMP is the following, which relates
the first-passage probabilities with the multi-step transition probabilities.

Notice that even though it seems like a straightforward and intuitive result,
proving it depends crucially on the Strong Markov Property.

Proposition 5.39. Let Xn be a Markov chain. For any i, j ∈ S (Xn ),

    pij^(n) = δij δn0 + Σ_{k=1}^{n} fij^(k) pjj^(n−k) ,

where δxy is the Kronecker delta5 .

Proof. It is trivial to see that for n = 0, we have

    pij^(0) = δij .

5 The Kronecker delta δxy is a function that equals 0 whenever x ≠ y and equals 1 if x = y.

For n ≥ 1, we condition on the first passage time Tij = k. Using the law of total probability, we get

    pij^(n) = P (Xn = j | X0 = i) = Σ_{k=1}^{∞} P (Xn = j, Tij = k | X0 = i) .

Notice that for k > n it follows by definition of the first passage time that

    P (Xn = j, Tij = k | X0 = i) = 0,

and hence we obtain

    pij^(n) = Σ_{k=1}^{n} P (Xn = j, Tij = k | X0 = i) .

Using the chain rule for conditionals (property 1.75) and the strong Markov property, we obtain

    pij^(n) = Σ_{k=1}^{n} P (Xn = j, Tij = k | X0 = i)
            = Σ_{k=1}^{n} P (Xn = j | Tij = k, X0 = i) P (Tij = k | X0 = i)
            = Σ_{k=1}^{n} P (Xn = j | Xk = j) fij^(k)
            = Σ_{k=1}^{n} pjj^(n−k) fij^(k) ,

where we used that on the event {Tij = k} we have Xk = j. From this, the result easily follows.

We can express the previous result using generating functions (see definition 1.70). For the first passage times Tij , this generating function is given by

    γTij (s) = E[s^{Tij}] = Σ_{n=0}^{∞} P (Tij = n) s^n = Σ_{n=0}^{∞} fij^(n) s^n .

We denote γTij (s) = Fij (s). Similarly, we write the generating function for the multi-step transition probabilities as

    Pij (s) = Σ_{n=0}^{∞} s^n pij^(n) .

Using proposition 5.39, we now show the following result.


Proposition 5.40. Let Pij and Fij be defined as above. Then
Pij (s) = δij + Pjj (s)Fij (s).

Proof. Notice that

    Pij (s) = Σ_{n=0}^{∞} s^n pij^(n) = δij + Σ_{n=1}^{∞} s^n pij^(n) .

Furthermore, notice that

    Pjj (s) Fij (s) = ( Σ_{n=0}^{∞} s^n pjj^(n) ) ( Σ_{n=1}^{∞} s^n fij^(n) )
                    = ( 1 + Σ_{n=1}^{∞} s^n pjj^(n) ) ( Σ_{n=1}^{∞} s^n fij^(n) ) .

Grouping terms of the same order, we find

    Pjj (s) Fij (s) = Σ_{n=1}^{∞} ( fij^(n) + Σ_{k=1}^{n−1} pjj^(n−k) fij^(k) ) s^n
                    = Σ_{n=1}^{∞} ( Σ_{k=1}^{n} pjj^(n−k) fij^(k) ) s^n .

Using proposition 5.39, we find that

    Pjj (s) Fij (s) = Σ_{n=1}^{∞} s^n pij^(n) = Pij (s) − δij ,

and hence

    Pij (s) = δij + Pjj (s) Fij (s).

In particular, for i = j we get that

    Pjj (s) = 1 + Pjj (s) Fjj (s),

or solving for Pjj (s),

    Pjj (s) = 1 / (1 − Fjj (s)).
Recall that a state was recurrent if fjj = 1. In terms of the generating func-
tions, this is equivalent to Fjj (1) = 1. Using proposition 5.40, this leads to the
following alternative condition for recurrence.
Property 5.41. A state j is recurrent if Pjj (1) = ∞, or equivalently Fjj (1) = 1. If
Fjj (1) < 1, then j is transient.

Proof. Suppose Fjj (1) = 1; then Σ_{n=1}^{∞} fjj^(n) = 1. Hence

    P (Tjj < ∞) = 1,

and thus j is recurrent.

We also have the following result.


Proposition 5.42. Suppose we have two states i, j ∈ S (Xn ) such that fij > 0. Then
a state j is recurrent if and only if
Pij (1) = ∞.

Proof. This is an immediate result of proposition 5.40 and is left as an exercise


to the reader.

5.5.5 Expected Returns

If a state is recurrent, then we know by definition that our chain will revisit this
state almost surely in a finite amount of time. Intuitively, this should imply that
we expect to revisit a recurrent state infinitely often. In this section, we will
make this concrete. We will need the following definition.
Definition 5.43. Let Xn be a Markov chain; then we define the random variable

    Nj = Σ_{n=1}^{∞} I(Xn = j) = Σ_{n=1}^{∞} In .

Remark. The random variable Nj essentially counts the number of times the chain
visits a given state j.

We are interested in the expected number of visits to j in our chain, starting at state i. Using the linearity of the expected value, this is given by

    E[Nj | X0 = i] = Σ_{n=1}^{∞} E[In | X0 = i].

Using property 1.65, this equals

    E[Nj | X0 = i] = Σ_{n=1}^{∞} P (Xn = j | X0 = i) = Σ_{n=1}^{∞} pij^(n) = Pij (1).

Using proposition 5.42, we obtain the following result.

Proposition 5.44. A state j is recurrent if and only if the expected number of visits to j is infinite when starting from any state i ↷ j, i.e.

    E[Nj | X0 = i] = ∞   ∀ i ↷ j.

Proof. Left as an exercise for the reader.



An immediate consequence of proposition 5.44 is that the expected number of visits E[Nj | X0 = i] to a transient state j is finite. The next proposition tells us how we can calculate this quantity.

Proposition 5.45. Suppose j is a transient state. Then the expected number of visits to j is given by

    E[Nj | X0 = i] = P (Tij < ∞) / (1 − P (Tj < ∞)).

Furthermore, the random variable Nj | X0 = j is geometrically distributed with mean fjj /(1 − fjj ):

    P (Nj = k | X0 = j) = (1 − fjj ) fjj^k .

Proof. Denote by Tij^(k) the time of the k-th visit to the state j for the chain starting in i. For the first visit k = 1, we have

    P (Tij^(1) < ∞) = fij ,

by definition. Notice that by the strong Markov property, we have

    P (Tij^(k+1) = n + m | Tij^(k) = n) = P (Tjj = m) = fjj^(m) .

Conditioning on the stopping time Tij^(k) = n essentially resets the Markov chain at j. We get for all k > 1


   
    P (Tij^(k) < ∞) = P (Tij^(k) < ∞, Tij^(k−1) < ∞)
                    = Σ_{n=1}^{∞} P (Tij^(k) < ∞, Tij^(k−1) = n)
                    = Σ_{n=1}^{∞} P (Tij^(k) < ∞ | Tij^(k−1) = n) P (Tij^(k−1) = n)
                    = Σ_{n=1}^{∞} Σ_{m=1}^{∞} P (Tij^(k) = n + m | Tij^(k−1) = n) P (Tij^(k−1) = n)
                    = Σ_{n=1}^{∞} Σ_{m=1}^{∞} fjj^(m) P (Tij^(k−1) = n)
                    = Σ_{n=1}^{∞} fjj P (Tij^(k−1) = n) = fjj P (Tij^(k−1) < ∞).

Using a recursive argument, one can find that

    P (Tij^(k) < ∞) = fjj^{k−1} P (Tij^(1) < ∞) = fjj^{k−1} fij .

Therefore, P (Nj ≥ k | X0 = i) = fij fjj^{k−1}, and thus by property 1.66 this yields

    E[Nj | X0 = i] = Σ_{k=1}^{∞} fij fjj^{k−1} = fij / (1 − fjj ),

showing the first property. To show that it is geometrically distributed, notice that if the chain starts at j,

    P (Nj ≥ k | X0 = j) = fjj^k .

Using property 1.26, we can indeed find that

    Nj | X0 = j ∼ geometric(1 − fjj ).

5.5.6 Periodicity

We saw in example 5.10 that some chains exhibit a kind of periodic behavior: a state could only be revisited using an even number of transitions. In this section, we will make this interesting restriction on the dynamics more concrete.

Definition 5.46. Let Xn be a Markov chain, and let i ∈ S (Xn ) be any state. Then, the period d(i) of the state is

    d(i) = gcd { n ≥ 1 | pii^(n) > 0 },

where gcd stands for the greatest common divisor.

A state is called aperiodic if d(i) = 1.

Example 5.20. Consider the example of the rat in the maze, example 5.15.
What are the periods of the different states?

The periodicity of the states in a chain tells us a lot about the dynamics. We
will show that the period is constant on equivalence classes. For this, we need
the following algebraic results.

Proposition 5.47. Let a, b, d ∈ N. Then

• If a and b are divisible by d, then so is a + b.

• If a is divisible by d and a + b is divisible by d, then so is b

Proof. We first show the first statement. Since a and b are divisible by d, we can write a = αd and b = βd. Then a + b = (α + β)d, which is again divisible by d.

For the second statement, since a + b is divisible by d, we can write a + b = γd. Since a is divisible by d, we can write a = αd. Notice that b = (a + b) − a = (γ − α)d, and hence b is also divisible by d.

Proposition 5.48. Let Xn be a Markov chain and suppose i, j ∈ S (Xn ). If i ∼ j,


then d(i) = d(j). We say that periodicity is a solidarity property.

Proof. Since i ∼ j, we know that there must be a finite path p1 from i to j and a finite path p2 from j to i. Let the lengths of these paths be a and b respectively. The concatenation p2 ◦ p1 of these paths leads to a path from i to i with length a + b, and by definition we have that a + b is divisible by d(i).

Then, choose any path q from j to j, and denote its length by c. The concatenation p2 ◦ q ◦ p1 is a path from i to i. Hence, a + b + c is divisible by d(i), and from the previous proposition it follows that c is also divisible by d(i). Since the path q was chosen arbitrarily, we find that d(i) is a common divisor of all n with pjj^(n) > 0 and therefore must also be a divisor of d(j). Since the argument can be repeated with the roles of i and j swapped, it follows that d(i) = d(j).

An immediate result is the following.

Property 5.49. Let Xn be an irreducible Markov chain. Then all states have the
same period.

In example 5.10, we noticed that the states had period 2, and this was used
to argue that the chain was not regular. The following result shows that the
aperiodicity and irreducibility of a chain are equivalent to its regularity. We
omit the proof.

Theorem 5.50. Let Xn be a Markov chain. Then this chain is regular if and only
if it is aperiodic and irreducible.

5.5.7 Canonical Decomposition of Markov Chains

In this section, we consider the canonical decomposition of Markov chains. For


this, we will need some more results on the equivalence classes.

Proposition 5.51. If a state i is recurrent and i ↷ j, then



• i∼j

• State j is recurrent

• fij = fji = 1.

Proof. We start by proving the first statement. For this, denote by m the smallest integer such that pij^(m) > 0, which exists because i ↷ j. Then we can choose a corresponding path

    i → j1 → j2 → ... → jm−1 → j,

where

    pij1 pj1j2 · · · pjm−1 j > 0.

Notice that

    P (Ti = ∞) ≥ pij1 pj1j2 · · · pjm−1 j P (Tji = ∞).

However, since P (Ti = ∞) = 0, this can only hold if P (Tji = ∞) = 0. Hence, j ↷ i.

For the second statement, we will show the converse result: if i is transient and i ∼ j, then j is transient. For this, notice that since i ∼ j, we can find an n and an m such that pij^(m) > 0 and pji^(n) > 0. By the Chapman-Kolmogorov equations, we have for all k ≥ 1

    pii^(m+k+n) ≥ pij^(m) pjj^(k) pji^(n) .

Summing over all k, we get

    Σ_{k=1}^{∞} pjj^(k) ≤ ( Σ_{k=1}^{∞} pii^(m+k+n) ) / ( pij^(m) pji^(n) ) < ∞,

and by property 5.41 this indeed implies that j is transient.


 
For the final property, notice that we have already shown that P Tji = ∞ =
 
0, and therefore the complement P Tji < ∞ = fji = 1. The converse result
holds by using the previous two results.

The contrapositive of the above proposition gives us another way to think about transient states. It says that if we can go from i to j but not back, then i cannot be recurrent.

Proposition 5.52. If i ↷ j and j does not communicate with i, then i is transient.

We now relate the results above with the closedness property of a set.
Proposition 5.53. In any finite closed set, there is at least one recurrent state.

Proof. An easy way to show this is by using the expected-visits characterization of recurrence. Suppose that all states in the closed set A are transient. Then for any state in A, the number of visits is finite. However, this means we have only a finite number of visits spread over a finite number of states. Since a chain is infinitely long, this implies that we need to leave A at some point, which contradicts the fact that A is closed.

An immediate result is the following.


Property 5.54. In any finite Markov chain, there is at least one recurrent state.

Proof. Since the Markov chain is finite, S (Xn ) is a finite and closed set. The
result then follows from proposition 5.53.

Using the previous results, we can show the following theorem. It relates the
property of irreducibility with recurrence in the case of finite sets.
Theorem 5.55. Suppose A is a finite closed and irreducible set. Then all states in
A are recurrent.

Proof. Since A is finite and closed we know by proposition 5.53 that there is
at least one recurrent state which we will denote by r. Since A is irreducible,
we have that i ↷ j for any two states i, j. In particular, any state i in S (Xn )
communicates with r and is thus recurrent by proposition 5.51 which shows the
desired result.

We now introduce the canonical decomposition theorem.



Theorem 5.56. Suppose we have a finite state space S (Xn ), then we can write

S (Xn ) = T ⊔ R1 ⊔ R2 ⊔ ... ⊔ Rk ,

where T is a set of transient states and where each Ri is a closed irreducible set of
recurrent states.

Proof. Define T as the set of all states i ∈ S (Xn ) for which there exists a j such that i ↷ j and j does not communicate with i. Then by proposition 5.52, it follows that i is transient, and thus all states in T are transient.

Consider now any state i ∈ S (Xn ) \ T . Define R1 to be the set

    R1 = {j ∈ S (Xn ) | i ↷ j} .

Notice that if j ∈ R1 then i ∼ j, because i ∉ T . It is easy to see that R1 is closed. By transitivity of the equivalence relation, it is also irreducible. Hence, the set R1 is a finite, closed, and irreducible set and therefore consists of recurrent states by theorem 5.55.

Proceed by choosing s ∈ S (Xn ) \ (T ⊔ R1 ) and defining

R2 = {j ∈ S (Xn ) | s ↷ j} .

Obviously, R1 ∩ R2 = ∅, and we can repeat the same arguments as before.


Repeating this process until we exhaust the state space, we obtain the desired
decomposition. Since the state space is finite this procedure ends in a finite
amount of steps.

Consider now the following relabeling of the states in the transition matrix:

• States in the same set Ri are given consecutive labels and are thus grouped.

• Given an ordering R1 , R2 , ...., we give the states in R1 the labels 1, 2, ...., |R1 |,
the states in R2 the labels |R1 | + 1, ..., |R1 | + |R2 |, ... .

• The states in the transient set T are placed at the end.



Notice that we then obtain the following matrix:

    P = [ P1   0    0    0   ...   0
          0    P2   0    0   ...   0
          0    0    P3   0   ...   0
          ...  ...  ...  ...  ...  ...
          Q1   Q2   Q3   Q4  ...   Qk+1 ],

where Pi is the matrix of size |Ri | × |Ri |. Indeed, we can make the following
observations:

• If s ∈ Ri and t < Rj , then pst = 0. Hence, the off-block diagonals are zero
for rows corresponding to elements in one of the Ri .

• If s ∈ Ri and t ∈ T , then we cannot have pst > 0. However, we can have


pts > 0. This is captured by Qi .

Hence transitions within the Ri are governed by Pi and transitions from T are
governed by Qj .

This relabeling of the states is only possible because of the canonical decom-
position theorem. As one might expect, working with this relabeled transition
matrix makes calculations a lot easier. In the next section, we will use this result
extensively in order to calculate absorption probabilities.

5.6 Absorption dynamics

In the Gambler’s ruin problem (example 5.3), we found that the probability of
ending up in one of the absorbing states was equal to one. In this section, we
focus more on the dynamics underlying the absorption of chains. We start with
a definition.

Definition 5.57. A Markov chain is absorbing if it has the following two properties:

1. There is at least one absorbing state.



2. For any non-absorbing j ∈ S (Xn ), we can find an absorbing state i ∈ S (Xn )


such that j ↷ i

In other words, we can transition from any non-absorbing state to an absorbing state
in a finite number of steps.

Suppose we have a Markov chain that starts in the transient class T . In some
sense, we can think of this process as still being undecided. Indeed, as soon
as the chain enters one of the Ri , we know that it will remain there since these
are closed sets. It is interesting to consider the time it takes for the chain to
transition from T into one of the closed sets Ri . We hence define the following
random variable.
Definition 5.58. Given a Markov chain Xn , we can define the exit time from the transient states T for a chain starting in i ∈ T as

    τi = min {n ≥ 0 : Xn ∉ T | X0 = i ∈ T } .

Exiting the transient states can be due to the chain entering any of the Ri .
We will call this transition an absorption, but this should not be confused with
a transition to absorbing states6 . The associated probabilities are defined as
follows.
Definition 5.59. The (absorption) probability that the chain starting from i ∈ T leaves T because of absorption at state j ∉ T is denoted

    uij = P (Xτi = j | X0 = i) .

In order to calculate uij , we first introduce some notation. We will start by rewriting

    P = [ P1   0    0    0   ...   0
          0    P2   0    0   ...   0
          0    0    P3   0   ...   0
          ...  ...  ...  ...  ...  ...
          Q1   Q2   Q3   Q4  ...   Qk+1 ]

6 Although these transitions are also absorbing!



as " #
P∗ 0
P= .
C Q
Furthermore, we denote (Q)i,j = qij , (P∗ )i,j = pij and (C)i,j = cij . Notice that
the absorption transitions are those governed by the probabilities in C.

In order to calculate these absorption probabilities, we can use the following


result.
Proposition 5.60. The absorption probabilities uij satisfy the system of equations

    uij = pij + Σ_{k∈T} qik ukj .

Intuitively, this comes from the fact that the chain gets absorbed in one
step (with probability pij ), or that the chain first moves to another state k (with
probability qik ) and later gets absorbed in j (with probability ukj ).

Proof. Conditioning on the first step gives us

    uij = Σ_{k∈S(Xn)} P (Xτi = j, X1 = k | X0 = i) .

Using the canonical decomposition theorem, we can rewrite this as a set of sums

    uij = Σ_{k∈T} P (Xτi = j, X1 = k | X0 = i) + Σ_{k∈R1} P (Xτi = j, X1 = k | X0 = i)
          + Σ_{k∈R2} P (Xτi = j, X1 = k | X0 = i) + ....

Suppose without loss of generality that j ∈ R1 . Then by definition of τi we know that for all n ≠ 1

    Σ_{k∈Rn} P (Xτi = j, X1 = k | X0 = i) = 0,

and that for n = 1 (check this!)

    Σ_{k∈R1} P (Xτi = j, X1 = k | X0 = i) = pij .

Hence, we obtain

    uij = Σ_{k∈T} P (Xτi = j, X1 = k | X0 = i) + pij .

Using the law of total probability, we obtain

    Σ_{k∈T} P (Xτi = j, X1 = k | X0 = i)
        = Σ_{k∈T} Σ_{n≥2} P (τi = n, Xn = j, X1 = k | X0 = i)
        = Σ_{k∈T} Σ_{n≥2} P (X2 ∈ T , ..., Xn−1 ∈ T , Xn = j, X1 = k | X0 = i)
        = Σ_{k∈T} Σ_{n≥2} P (X2 ∈ T , ..., Xn−1 ∈ T , Xn = j | X1 = k, X0 = i) P (X1 = k | X0 = i) .

Using the Markov property, this can be written as

    Σ_{k∈T} P (Xτi = j, X1 = k | X0 = i)
        = Σ_{k∈T} Σ_{n≥2} P (X1 ∈ T , ..., Xn−2 ∈ T , Xn−1 = j | X0 = k) pik
        = Σ_{k∈T} Σ_{n≥1} P (τk = n, Xn = j | X0 = k) pik
        = Σ_{k∈T} ukj pik .

This yields the desired result.

Writing the system of equations in terms of matrices, we get from proposi-


tion 5.60
U = C + QU,
where (U)ij = uij .

This has as solution for U, provided that I − Q is invertible,

U = (I − Q)−1 C.

We provide an interpretation of this result. For this, denote by uij^(n) the probability that a chain starting at i has been absorbed in j in n or fewer transitions, and denote by U^(n) the matrix (uij^(n)). It is not difficult to see that lim_{n→∞} U^(n) = U.

Notice that U(1) = C, since the probability that it goes from i straight to j
is exactly pij . For U(2) , this is given by

U(2) = C + QC = (I + Q)C.

For U^(3), notice that

    uij^(3) = uij^(2) + P (X1 ∈ T , X2 ∈ T , X3 = j | X0 = i) .

Thus, we have all the two-step transitions within T and then the transitions from T to j. This gives

    U^(3) = C + QC + Q^(2) C,

where Q^(2) is the two-step transition matrix. We have already seen that this equals Q^2 , and hence we get

    U^(3) = C + QC + Q^2 C = (I + Q + Q^2 ) C.

By repeating the same argument, one can show that

    U^(n) = (I + Q + ... + Q^{n−1} ) C.

In the limit, we find

    U = ( Σ_{n=0}^{∞} Q^n ) C.
One can show that, similarly to the series

    Σ_{n=0}^{∞} z^n = 1/(1 − z) = (1 − z)^{-1} ,

we get for sufficiently well-behaved matrices Q that

    Σ_{n=0}^{∞} Q^n = (I − Q)^{-1} .

In the rest of this section, we will assume without proof7 that the matrices Q we
encounter are sufficiently well-behaved. This leads to the solution

U = (I − Q)−1 C.

Example 5.21. We calculate the probability of absorption in the Gambler’s


ruin (example 5.3). We suppose N = 4 and p = 0.4

Notice that after reordering the states we get

          1     2     3     0     4
    1 [   0    0.4    0    0.6    0  ]
    2 [  0.6    0    0.4    0     0  ]
    3 [   0    0.6    0     0    0.4 ]
    0 [   0     0     0     1     0  ]
    4 [   0     0     0     0     1  ]

and hence

    Q = [  0    0.4    0
          0.6    0    0.4
           0    0.6    0  ],

    C = [ 0.6    0
           0     0
           0    0.4 ].

One can show that

    (I − Q)^{-1} = [ 1.46154   0.769231  0.307692
                     1.15385   1.92308   0.769231
                     0.692308  1.15385   1.46154  ]

and

    U = (I − Q)^{-1} C = [ 0.876923  0.123077
                           0.692308  0.307692
                           0.415385  0.584615 ].

Hence, if you start with 1 unit of money the probability that you end up with
nothing is around 0.87 and even if you start with 3 units, the probability of
ruin is still around 40%!
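These numbers can be reproduced with a few lines of code. The following sketch (Python with numpy; our own illustration) forms Q and C as above and evaluates U = (I − Q)^{-1} C.

    import numpy as np

    # Transient states 1, 2, 3 and absorbing states 0, 4 (N = 4, p = 0.4).
    Q = np.array([
        [0.0, 0.4, 0.0],
        [0.6, 0.0, 0.4],
        [0.0, 0.6, 0.0],
    ])
    C = np.array([
        [0.6, 0.0],
        [0.0, 0.0],
        [0.0, 0.4],
    ])

    U = np.linalg.solve(np.eye(3) - Q, C)      # same as (I - Q)^{-1} C
    print(np.round(U, 6))
    # Row i gives (P(ruin | start at i), P(reach 4 | start at i)) for i = 1, 2, 3.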

7 See the remark at the end of this section for more information.

Knowing the probabilities of absorption is already interesting. But what is


even more interesting is knowing how long we expect this absorption to take.

More generally, suppose we associate with each state a reward

    g : S (Xn ) → R : i ↦ g(i),

and we are interested in the expected total reward accumulated by the chain up until absorption,

    wi = E[ Σ_{n=0}^{τi−1} g(Xn ) | X0 = i ].

By setting g ≡ 1, we get

    wi = E[ Σ_{n=0}^{τi−1} 1 | X0 = i ] = E[τi ],

the expected absorption time. If we instead set g(k) = δkj for a transient state j, then wi is the expected number of visits to j when starting from i.
Proposition 5.61. The vector W = (wi )i of expected rewards is given by

    W = (I − Q)^{-1} G,

where the i-th entry of the column vector G is given by (G)i = g(i).

Proof. We can prove this by conditioning on the first step and adopting a similar strategy as in proposition 5.60 (try this!).

However, we will prove this using a different strategy. Denote by W^(n) the vector whose i-th entry wi^(n) is the expected reward collected by a chain starting in i during its first n steps, counting only time spent in T . Notice that

    W^(1) = G,

since over a single step the only transient state visited is the initial state i; hence wi^(1) = g(i). For W^(2), it is easy to see that we get

    wi^(2) = g(i) + Σ_{k∈T} qik g(k),

or in matrix notation

    W^(2) = G + QG.

Just like we did in the case of absorption probabilities, one can show that

    W^(n) = (I + Q + ... + Q^{n−1} ) G.

It follows that

    lim_{n→∞} W^(n) = W = ( Σ_{n=0}^{∞} Q^n ) G.

Therefore, we indeed obtain

    W = (I − Q)^{-1} G.

Remark. The observant reader might argue that it has not been shown that Σ_{n=0}^{∞} Q^n exists. The reason for this is that proving it requires quite a bit of machinery. For the sake of completeness, we give a brief overview of the argument but omit a lot of the details.

1. Q is a (strictly) sub-stochastic matrix, and such matrices satisfy ρ(Q) < 1, where ρ denotes the spectral radius.

2. The result then follows from Neumann's lemma.
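For a concrete illustration of proposition 5.61, the sketch below (Python with numpy; the matrix is the transient part of the Gambler's ruin from example 5.21) takes g ≡ 1, so that W gives the expected number of steps until absorption.

    import numpy as np

    # Transient part Q of the Gambler's ruin with N = 4 and p = 0.4 (states 1, 2, 3).
    Q = np.array([
        [0.0, 0.4, 0.0],
        [0.6, 0.0, 0.4],
        [0.0, 0.6, 0.0],
    ])
    G = np.ones(3)                             # reward g = 1 for every transient state

    W = np.linalg.solve(np.eye(3) - Q, G)      # expected absorption times E[tau_i]
    print(np.round(W, 4))                      # starting from 1, 2 and 3 respectively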

5.7 Stationarity

We start this section by linking Markov chains to the theory of renewal processes. For this, consider the following result.

Proposition 5.62. Suppose Xn is a Markov chain and suppose j ∈ S (Xn ) is a recurrent state. Then the counting process

    Njj (k) = Σ_{n=1}^{k} I(Xn = j | X0 = j),

which counts the number of times the chain visits j starting from j, is a renewal process.

Proof. This is an immediate result of the Strong Markov property.

In this case, the increments are given by the random variables Tjj .

Using the strong law of renewal processes (theorem 4.5), we find the follow-
ing result.
Theorem 5.63. Let Xn be a Markov chain and suppose that j ∈ S(Xn) is a recurrent state. Then

lim_{t→∞} Njj(t)/t = 1/T̄jj   with probability 1,

where T̄jj is the mean recurrence time for the state j.

Proof. This is an immediate result from theorem 4.5 and proposition 5.62.

The theorem applies to chains that start at j. However, is this result still
true if we start from any i ↷ j?
Theorem 5.64. Suppose i ↷ j and suppose furthermore that j is recurrent. Define the process

Nij(k) = ∑_{n=1}^{k} I(Xn = j | X0 = i).

Then

lim_{t→∞} Nij(t)/t = 1/T̄jj   with probability 1.

In what follows, we will show that some Markov chains have an interesting
property closely related to the above results. It all starts with the question:
Given some state j, what is the probability that a chain starting at i is at j at
time N >> 0? In general, this of course heavily depends on i. However, we
have the following result.
Proposition 5.65. Suppose the transition probabilities of a Markov chain X converge to a limit independent of X0, i.e.

lim_{n→∞} p_ij^(n) = πj   ∀i, j,

where the left-hand side depends on i but the limit πj does not. This limit is a probability distribution on the state space which satisfies

• ∑_j πj = 1

• πP = π

• Suppose that X0 has distribution π, i.e. P(X0 = i) = πi. Then every position of the chain has the distribution π, not only the limit.

Proof. We start by proving the first statement. This statement is easily seen to be equivalent to stating that P^(n) = P^n is stochastic for all n, which we have shown in property 5.14.

For the second part, we have by the Chapman-Kolmogorov equations

p_ij^(n+1) = ∑_k p_ik^(n) p_kj,

which in the limit gives

πj = lim_{n→∞} p_ij^(n+1) = lim_{n→∞} ∑_k p_ik^(n) p_kj = ∑_k πk p_kj = (πP)_j.

Finally, using the law of total probability we find

P(X1 = j) = ∑_i P(X1 = j | X0 = i) P(X0 = i) = ∑_i p_ij P(X0 = i) = ∑_i p_ij πi.

Thus, it follows from the second statement that P (X1 = j) = πj . One can repeat
this argument iteratively to obtain the result for all n.

Intuitively, πj is the proportion of time that the Markov chain spends in


state j once the equilibrium state has been reached.
Remark. In order to understand why we call this distribution the steady state,
suppose that Xn has distribution π. Then the distribution of Xn+1 is given by πP,
and thus
P (Xn+1 = j) = (πP)j = πj = P (Xn = j) .
In other words, the distribution of Xn+1 is exactly the same as Xn .

Up until now, we only considered probability measures. However, there are


more general kinds of measures that measure a volume that does not necessarily
correspond to a probability. An example of this is the Lebesgue measure on R,
which assigns to each interval [a, b] in R the volume µ([a, b]) = b − a.

Just like probability measures, measures are defined on σ −algebras. They


also satisfy the "intuitive" requirements for volumes. They are defined as follows.
Definition 5.66. Let (Ω, F) be a measurable space, i.e. a set Ω and a σ-algebra F defined on Ω. A measure µ is a function

µ : F → R≥0 ∪ {∞} : A ↦ µ(A)

satisfying the conditions

1. For all A ∈ F, we have µ(A) ≥ 0. Thus, sets have non-negative volumes.

2. One has µ(∅) = 0. Thus, the empty set has zero volume.

3. It satisfies the σ-additivity rule: for any sequence A1, A2, ... of pairwise disjoint sets in F, we have

µ( ∪_{n≥1} An ) = ∑_{n≥1} µ(An).

Remark. Notice that it is not difficult to show that a probability measure is a


measure. Indeed, we only need to check that P (∅) = 0 follows from the definition.
However, notice that
P (Ω) = P (Ω ∪ ∅) = P (Ω) + P (∅) ,
from which the condition follows.

Using this generalization, we define the notion of an invariant measure.


Definition 5.67. An invariant measure on a countable state space S (Xn ) is a
non-negative vector ν = (ν1 , ν2 , ...) such that
ν = νP,
with P the transition matrix. If this measure is a probability measure, i.e. ∑_j νj = 1, then we call ν a stationary distribution.

Example 5.22. Suppose we have a Markov chain with transition matrix

    [ 1/2  1/2   0    0  ]
P = [ 1/2  1/2   0    0  ]
    [  0    0   1/2  1/2 ]
    [  0    0   1/2  1/2 ]

Let ν = (ν1, ν2, ν3, ν4), then

νP = ( (ν1+ν2)/2, (ν1+ν2)/2, (ν3+ν4)/2, (ν3+ν4)/2 ).

The invariant measures that are also probability measures are of the form

ν = ( α/2, α/2, (1−α)/2, (1−α)/2 ).

From the above example, we see that stationary distributions need not be unique. In fact, each closed and irreducible equivalence class Ri may "support" its own distribution. This measure is given by π|i, which satisfies π|i(j) = 0 for all j ∉ Ri. In the example above, we have

ν = ( 1/2, 1/2, 0, 0 ),   ν′ = ( 0, 0, 1/2, 1/2 ).

We briefly mention the following uniqueness theorem, but we omit the proof.
Theorem 5.68. Suppose that the Markov chain Xn is regular. Then it admits a
unique stationary distribution.

5.7.1 Finding the stationary distribution

In this section, we provide three ways to obtain the stationary distribution. The
first two ways are using numerical approximations and the third approach uses
results from linear algebra.

The first approach uses the fact that

lim_{n→∞} p_ij^(n) = πj.

Thus we expect that for N >> 0,

p_ij^(N) ≈ πj.

Using this, the first approach calculates a high power P^N of the transition matrix and takes any of its rows, which is going to be an approximation of the stationary distribution.

Example 5.23. Consider the transition matrix

    [ 0.7  0.2  0.1 ]
P = [  0   0.5  0.5 ]
    [  0   0.9  0.1 ]

This has as steady state π = [0, 0.6429, 0.3571]. To check how fast the matrix converges, we consider the Euclidean distance

dn = ‖ P^(n) − Π ‖,

where Π is the matrix whose rows are all equal to (π1, π2, π3), for all n. The convergence is given in figure 5.10.

Figure 5.10: Convergence of dn to 0 as n → ∞.
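As a small numerical check, we can raise P to a high power and read off any row. This is a minimal sketch with numpy; the chosen power and the distances printed below are our own arbitrary choices.

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.0, 0.9, 0.1]])

P_high = np.linalg.matrix_power(P, 50)   # P^50
print(P_high[0])                         # any row approximates pi ≈ [0, 0.6429, 0.3571]

# Convergence of d_n as in figure 5.10 (Frobenius norm of P^n - Pi):
pi = P_high[0]
for n in (1, 5, 10, 20):
    d_n = np.linalg.norm(np.linalg.matrix_power(P, n) - pi)
    print(n, d_n)
```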

A better way that often isn't as computationally expensive is using a Monte Carlo sampling method. This frequentist approach goes as follows.

1. Using the methods covered before, sample a Markov chain iteratively.

2. After N transitions, consider the final state XN.

3. Repeat this M times to obtain M sampled values of XN. Then, estimate

π̂j = (number of samples such that XN = j) / M.

Of course, the quality of this method heavily depends on the size of the state space, the sparseness of the stationary distribution, and the number of simulations M. Techniques such as parallel computing can speed this procedure up considerably.
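The following is a minimal sketch of this procedure for the chain of example 5.23 (numpy; the values of N, M, the random seed and the initial state are arbitrary choices of ours).

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.0, 0.9, 0.1]])

def sample_final_state(P, N, start=0):
    """Run the chain for N transitions and return the final state."""
    state = start
    for _ in range(N):
        state = rng.choice(len(P), p=P[state])
    return state

N, M = 50, 10_000
samples = np.array([sample_final_state(P, N) for _ in range(M)])
pi_hat = np.bincount(samples, minlength=len(P)) / M
print(pi_hat)   # roughly [0, 0.64, 0.36]
```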

The third approach uses results from linear algebra. For this, we will need
the following definition.

Definition 5.69. Let A be a square matrix. An eigenvalue is a number λ such that

Ax = λx for some x ≠ 0.

The corresponding vector x is called an eigenvector with eigenvalue λ.

These eigenvalues are exactly those values for which A − λI is singular.


Hence, they satisfy the relation

det (A − λI) = 0.

Notice that det (A − λI) is a polynomial in λ. This polynomial is called the


characteristic polynomial. The roots are the eigenvalues.

Example 5.24. Consider the matrix

    [ 0    1   0 ]
A = [ 0    0   1 ]
    [ 4  −17   8 ]

The characteristic polynomial is given by (check this!)

det(A − λI) = −λ³ + 8λ² − 17λ + 4 = 0.

This has as roots λ ∈ {2 − √3, 2 + √3, 4}.

To find an eigenvector, we solve the equation

(A − λI)v = 0.

For the eigenvalue λ = 4, this becomes

4v1 − v2 = 0
4v2 − v3 = 0                ⟹   v2 = 4v1,  v3 = 4v2 = 16v1,
−4v1 + 17v2 − 4v3 = 0

which has as solution set v = (α, 4α, 16α). The unique eigenvector with norm 1 (up to sign) is given by

v = [0.0605, 0.2421, 0.9684].

Relating this back to the concept of stationarity, we notice that

π = πP

implies that

π^T = P^T π^T.

Hence, π is an eigenvector of P^T corresponding to the eigenvalue 1.

Example 5.25. We revisit example 5.23. For this, we had

    [ 0.7  0.2  0.1 ]
P = [  0   0.5  0.5 ]
    [  0   0.9  0.1 ]

We find that π is then the solution of

0 = (I − P^T) π^T
      [  0.3    0     0  ] [ π1 ]
    = [ −0.2   0.5  −0.9 ] [ π2 ].
      [ −0.1  −0.5   0.9 ] [ π3 ]

From the first equation, we find that π1 = 0. The second and third equations are equivalent, and we find that the solution set is given by

ν = [0, 1.8α, α].

The stationary distribution is therefore given by

π = [0, 1.8α, α] / (0 + 1.8α + α)
  = [0, 1.8/2.8, 1/2.8]
  = [0, 0.6429, 0.3571],

which is the same as in example 5.23.
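In practice one lets a linear-algebra routine do this work. The following is a minimal sketch with numpy.linalg.eig; the selection and normalization steps are our own choices.

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.0, 0.9, 0.1]])

eigvals, eigvecs = np.linalg.eig(P.T)     # eigen-decomposition of P^T
k = np.argmin(np.abs(eigvals - 1.0))      # index of the eigenvalue closest to 1
v = np.real(eigvecs[:, k])
pi = v / v.sum()                          # normalize so the entries sum to 1
print(pi)                                 # approximately [0, 0.6429, 0.3571]
```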

To finish up this section, we show a special case where the stationary distribu-
tion is very nice.

Definition 5.70. Let Xn be a Markov chain with transition matrix P. We call this matrix doubly stochastic if both the columns and the rows sum up to 1:

∑_j p_ij = ∑_i p_ij = 1.

For Markov processes with doubly stochastic transition matrix P, we find


the following theorem.

Theorem 5.71. Let Xn be a Markov chain whose transition matrix P is doubly stochastic. Suppose that S(Xn) is of cardinality N. Then the uniform distribution

πi = 1/N,   ∀i,

is a stationary distribution.

Proof. We need to show that π = πP when πi = 1/N. Thus, we need to show

πj = 1/N = ∑_{i=1}^N πi p_ij   ∀j = 1, ..., N.

To show this, we use the fact that P is doubly stochastic:

∑_{i=1}^N πi p_ij = (1/N) ∑_{i=1}^N p_ij = 1/N,

which concludes the proof.

5.8 Branching Processes

When we want to model something that has some kind of reproductive dynam-
ics, branching processes are often a good choice. This kind of behavior occurs
in many different domains such as epidemiology, biological processes, and even
nuclear physics.

Reproduction dynamics occur whenever an individual can generate new


individuals. These new individuals can in turn generate another batch of new
individuals and so forth. In this sense, the process branches out just like a
family tree.

We start by introducing the notation. For each generation i, we denote the number of members in that generation by Xi. We also label the members of the generation by an integer 1, 2, ..., Xi. The number of members of the next generation Xi+1 is then equal to the total number of offspring of all members of the previous generation combined. We will denote the number of children of parent l from generation j by Zj,l. Since each member of a given generation has a parent from the previous generation, it follows that

Xn = ∑_{l=1}^{X_{n−1}} Z_{n−1,l}.

Graphically, we can represent this as follows.



Figure 5.11: First three generations of a branching process

In the remainder of this chapter, we will make the following assumptions:

• The branching processes start with a single member in the zeroth gener-
ation: X0 = 1.

• The number of children/offspring Zi,j are IID with finite mean µ for all
i, j. We will denote this random variable by Z.

• We assume that Zi,j is independent of Xn for any i, j and n.

Using these assumptions, we now ask the following questions.

1. What is the expected size of generation n?

2. What is the probability of extinction p? Extinction occurs when a gener-


ation (and therefore all subsequent generations) has 0 members.

3. What is the distribution of the time T until the extinction of the popula-
tion?

We start with the first question. Using the tower law (property 1.73) and our assumptions, we get

E[Xn] = E[ ∑_{i=1}^{X_{n−1}} Z_{n−1,i} ] = E[ E[ ∑_{i=1}^{X_{n−1}} Z_{n−1,i} | X_{n−1} ] ]
      = E[ X_{n−1} E[Z] ] = E[Z] E[X_{n−1}] = µ E[X_{n−1}].

Since E[X0] = 1, one can easily see that

E[Xn] = µ^n,   ∀n ∈ N.

For the second question, we need to calculate


p = P (Population eventually dies out | X0 = 1) .
Since the Z are independent, notice that
p² = P(Population eventually dies out | X0 = 2),

or more generally,

p^j = P(Population eventually dies out | X0 = j).

Indeed, we can simply consider each branch of children from the j parents
as their own independent branching process, see figure 5.12.

Figure 5.12: Branching process with j members in the zeroth generation



Furthermore, suppose we have j children in the first generation and only one in the zeroth generation. Then we can again see this as j independent new branching processes starting from the members of the first generation. Thus,

p = P(Population dies out)
  = ∑_{j=0}^∞ P(Population dies out | X1 = j) P(X1 = j)
  = ∑_{j=0}^∞ p^j P(Z = j).

Intuitively, this corresponds to the fact that for the whole population to die out, either no one lives in the first generation or all the branches from the members of the first generation die out, see figure 5.13. There, we indeed see that if the first generation has j offspring then the only way that the population can go extinct is if the branches of each of the j members go extinct.

Figure 5.13: A branching process with j children in the first generation

We can link this with the generating function of Z, which is given by

γZ(s) = ∑_{j=0}^∞ s^j P(Z = j).

We thus find that p satisfies

p = γZ(p).

Suppose that we are given the distribution of Z, and can therefore find the explicit generating function. We can then find p by looking for the solutions of the equation γZ(x) = x. However, this equation might have more than one solution. The following result tells us that p is not only a solution of this equation but is the smallest solution between 0 and 1.

Proposition 5.72. Assume P(Z = 0) > 0 and P(Z = 0) + P(Z = 1) < 1. Then

1. The extinction probability p is the smallest solution of γZ(x) = x within 0 ≤ x ≤ 1.

2. p = 1 if and only if µ ≤ 1.

Proof. Assume for now that q ≥ 0 is a solution of γZ(x) = x. We will prove by induction that q ≥ P(Xn = 0) for all n ≥ 1, from which it follows that q ≥ lim_{n→∞} P(Xn = 0) = p.

For n = 1, we have

q = γZ(q) = ∑_{j=0}^∞ q^j P(Z = j)
  = P(Z = 0) + q P(Z = 1) + q² P(Z = 2) + ...
  ≥ P(Z = 0) = P(X1 = 0).

Assume now that q ≥ P(Xk = 0) for all k = 1, ..., n. Using the law of total probability, we find

P(Xn+1 = 0) = ∑_{j=0}^∞ P(Xn+1 = 0 | X1 = j) P(Z = j).

It is not difficult to see that P(Xn+1 = 0 | X1 = j) = P(Xn = 0)^j. Using the induction assumption, we obtain

P(Xn+1 = 0) = ∑_{j=0}^∞ P(Xn = 0)^j P(Z = j)
            ≤ ∑_{j=0}^∞ q^j P(Z = j) = q.

Since this holds for all n, we find that

q ≥ lim_{n→∞} P(Xn = 0) = p.

For the second property, we need to prove that p = 1 if and only if µ = E[Z] ≤ 1. Notice that

dγZ(s)/ds |_{s=1} = ∑_{j=0}^∞ j P(Z = j) = E[Z].

We also have

γZ′(s) = ∑_{j=1}^∞ j s^{j−1} P(Z = j) ≥ 0,
γZ″(s) = ∑_{j=2}^∞ j(j − 1) s^{j−2} P(Z = j) ≥ 0.

From the assumptions, we know that P(Z = 0) < 1 and P(Z = 0) + P(Z = 1) < 1. From this, one can easily see that the generating function is both strictly increasing and strictly convex on (0, 1). We now consider graphically the two cases in figure 5.14:

γZ(s) > s for all s ∈ (0, 1)    ⟺   γZ′(1) ≤ 1,
γZ(s) = s for some s ∈ (0, 1)   ⟺   γZ′(1) > 1.

Figure 5.14: The two cases

Since p is the smallest solution of γZ(s) = s on [0, 1], we conclude that p = 1 precisely when γZ′(1) = E[Z] = µ ≤ 1.

Notice that if P (Z = 0) = 0 and there is at least one member in the zeroth


generation, then the probability of extinction is zero. Indeed, then the size of
generation n is at least the size of generation n − 1.

We now consider the third question: the distribution of the time T until extinction. This random variable is defined as follows:

{T = n} = {Xn = 0} \ {Xn−1 = 0},

considered here as events in the σ-algebra. Hence

P(T = n) = P(Xn = 0) − P(Xn−1 = 0) = γn(0) − γn−1(0),

where γn(s) = γ_{Xn}(s) = E[s^{Xn}] is the generating function of the n-th generation.

Note that γn(s) is not the same as γZ(s), but they are related. Their exact relation is the content of the next property.

Property 5.73. Using the setting above, we have the recursive relation

γn+1(s) = γn(γZ(s)).

Proof. Using the Tower Law,

γn+1(s) = E[ E[ s^{Xn+1} | Xn ] ] = ∑_{j=0}^∞ E[ s^{Xn+1} | Xn = j ] P(Xn = j).

Since Xn+1 is the number of children with parents in the n-th generation, i.e. Xn+1 = ∑_{k=1}^{Xn} Z_{n,k}, this can be written as

γn+1(s) = ∑_{j=0}^∞ E[ ∏_{k=1}^{j} s^{Z_{n,k}} | Xn = j ] P(Xn = j).

We assume that each of the families grows independently of the other families. We therefore obtain

γn+1(s) = ∑_{j=0}^∞ ∏_{k=1}^{j} E[ s^{Z_{n,k}} ] P(Xn = j)
        = ∑_{j=0}^∞ (γZ(s))^j P(Xn = j)
        = E[ γZ(s)^{Xn} ] = γn(γZ(s)).

From this, the result easily follows.

Example 5.26. Suppose we have a branching process with the following reproduction probabilities:

P(Z = k) = 1/6  if k = 0,
           1/2  if k = 1,
           0    if k = 2,
           1/3  if k = 3,
           0    if k > 3.


1. Is the extinction probability 1?

2. What is the probability to reach extinction in the third generation?

3. What is the probability that the process ever dies out?

To answer these questions, we note that the generating function of Z is

γZ(s) = 1/6 + (1/2)s + (1/3)s³.

1. Notice that µ = 1/2 + 3 · 1/3 = 3/2 > 1, and hence the extinction probability is not 1.

2. We have P(T = 3) = γZ(γZ(γZ(0))) − γZ(γZ(0)), which can be shown to be around 0.0462.

3. We need to find the smallest root in (0, 1) of

γZ(s) − s = 1/6 − (1/2)s + (1/3)s³ = 0.

Since 1 is a root, this can be written as

γZ(s) − s = (1/6)(s − 1)(2s² + 2s − 1) = 0,

which can easily be solved to find the roots

s = 1,   s = √3/2 − 1/2,   s = −√3/2 − 1/2.

Hence, the probability of extinction is

p = √3/2 − 1/2 ≈ 0.366.
Chapter 6

Hidden Markov Models

Three things cannot be long hidden: the sun,


the moon, and the truth

Buddha

On a weekday morning, you wake up and want to choose your outfit for
the day. Instead of checking the weather application on your phone, you look
outside and see what passersby are wearing. However, this is insufficient infor-
mation: everyone knows at least one person that likes to wear shorts in freezing
temperatures, or no matter how hot it is refuses to wear less than a jacket.

However, on a cold day, we expect to see more warmly-clothed people than


on a hot day. Notice that what we are observing is not what we want to ob-
serve, but a consequence of it: we observe what people are wearing but what
we actually want to know is the temperature today. Furthermore, the unob-
served temperatures (e.g. cold days) tend to be preceded by similar weather
conditions. We will model this as a general mixture model with multiple pop-
ulations together with a transition matrix between populations. By modeling
the between-population transitions using a Markov model, we obtain Hidden
Markov Models.

We first revisit Gaussian mixture models in the setting of classification prob-


lems. Then, we consider Hidden Markov models and how we can calibrate them
in practice.

6.1 Gaussian Mixture Models

In a classification problem, we have a data set consisting of several populations


and a selection of features. The goal of classification is to construct a decision
function that assigns observations to the correct population using the values of
the features.

Example 6.1. Some examples of classification problems are

• Lab results and whether the patient is sick or not. This is a binary
classification problem, and we want to predict whether a patient is sick
or not using the lab results.
• Customer information and whether the customer might churn or not
• Banking and transaction information and whether transactions are fraud-
ulent or not
• E-mail information and whether it is spam or not

Many different algorithms are able to predict the class from the features in
the data. We call these algorithms classifiers. Well-known examples are logistic
regression, K-Nearest Neighbours, random forests, and the Gaussian Mixture
Model. We will focus on the latter.
Remark. Algorithms such as K-Nearest Neighbours are called hard classifiers be-
cause they return a single class for each observation. In contrast, soft classifiers such
as the Gaussian Mixture Model return for each class a (posterior) probability that
represents the probability that the given observation belongs to that class.

From any soft classifier, one can construct a hard classifier by simply taking the
class with the highest assigned probability. We call this the Bayes classifier (of the
soft classifier). It can be interpreted as being the ’best guess’ of the soft classifier.

We will assume that there are two classes c1 and c2 , and a single numerical
feature X which is correlated with the class. Assume that we have a data set
consisting of 10 observations in class c1 and 10 observations in class c2 . We will
first focus on the labeled case in which we know beforehand which class any of
the 20 observations belong to.

In a Gaussian Mixture Model, we will assume that within a given class the
feature X is normally distributed with class-dependent parameters:

X | ci ∼ N (µi , σi2 ).

Given the labeled observations, it is quite easy to estimate the distribution


parameters using the sample mean and sample variance. An example of data
and the estimated densities can be found in figure 6.1.

Figure 6.1: The population curves estimated from the data, with observations
represented by points

From here, building the classifier is quite straightforward. Indeed, suppose we have estimated class-dependent distribution parameters (µ̂i, σ̂i). We can then define the likelihood of any given observed value x by

P(x | ci) = 1/(√(2π) σ̂i) · exp( −(1/2) ((x − µ̂i)/σ̂i)² ).

Soft classification can then be achieved via Bayes' theorem:

P(c1 | x) = P(x | c1) P(c1) / ( P(x | c1) P(c1) + P(x | c2) P(c2) ),
P(c2 | x) = P(x | c2) P(c2) / ( P(x | c1) P(c1) + P(x | c2) P(c2) ).

Here, P(c1) and P(c2) are called the priors and P(ci | x) is called the posterior probability for class ci.

However, in many cases, we do not have labeled data. This could for ex-
ample occur in medical tests, where we are not sure yet how to detect certain
diseases. In this case, we start by using clustering techniques. Intuitively, people with a certain disease will have similar characteristics (assuming that the features are chosen in such a way that they indeed have diagnostic power!) and will cluster together.

In the Gaussian Mixture Model setting, we will still assume that within
the classes the numerical feature X is normally distributed. Given the class
parameters (µi, σi), the clustering algorithm would be quite easy: simply assign
each observation to the class with the highest posterior probability. However,
in order to estimate the class parameters we would need to know the classes for
each of the observations. We hence end up in a chicken-and-egg problem:

• In order to determine the parameters (µj , σj ) we need to know the ele-


ments xi in class cj

• In order to classify an element xi , we need the parameters (µj , σj ).

One way to solve this problem is by using the EM-algorithm (Expectation-Maximization algorithm). This is an iterative algorithm with the following steps:

• Choose random parameters (µj^(0), σj^(0)) for each population j = 1, ..., K

• Using the Bayes classifier introduced above, assign each observation xi to the class with the highest posterior probability; call this label xi^(0)

• Recalculate (µj^(1), σj^(1)) according to the class populations {xi | xi^(0) = j}

• Repeat until convergence

A visualization of the convergence can be found in figure 6.2

Figure 6.2: Convergence of the EM-algorithm

We see that the converged estimate lies relatively close to the true population
curves.

6.2 Hidden Markov Models

Recall the example given in the introduction, where one uses the observable
clothing of passersby to know what the weather is like, which in this case is
unobservable. In this section, we will construct a suitable model for this.

We will use a Markov chain in order to model the weather. For this example,
we will assume that a specific day is either a hot or a cold day. Furthermore,
we assume that a cold day is followed by another cold day 80% of the time
and similarly for hot days. We hence have a Markov chain with two states,
S (Xn ) = {hot,cold}, and a transition matrix A
!
0.8 0.2
A=P= .
0.2 0.8

We furthermore distinguish three types of clothing that the passersby can wear:
a coat, a sweater, and a t-shirt. For each weather state, we can define the
probability distribution b for the types of clothing that we can observe. Assume
for example that

bhot day = [pcoat|hot , psweater|hot , pt-shirt|hot ]


= [0.05, 0.25, 0.7]

and

bcold day = [pcoat|cold , psweater|cold , pt-shirt|cold ]


= [0.7, 0.25, 0.05].

Then we expect that a passerby has a probability of 70% to wear a coat on a


cold day, but only a probability of 5% to wear a coat on a hot day.

We call the bi the emission probabilities. Graphically, the above model can
be represented as in figure 6.3. This is an example of a Hidden Markov Model.

Figure 6.3: The graphical representation of the Markov model for the weather
example

We now consider the more general case. Suppose we have a set of hidden states (these correspond to the weather states in the previous example), denoted Q = q1, q2, ..., qN. The transition matrix between these hidden states is denoted by A, and the initial probability distribution is denoted by π.

We denote the series of T observations (these correspond to the clothing of the passersby in the previous example) as O = o1, o2, ..., oT. Furthermore, for each hidden state q we denote the emission probabilities by Bq.

Using the law of total probability, the likelihood of observing O is given by

P(O) = ∑_Q P(O | Q) × P(Q),

where the sum runs over all possible sequences of hidden states Q; the factor P(O | Q) is controlled by B, and the factor P(Q) is controlled by π and A.

There are three possible scenarios for Hidden Markov Models, depending on
the available data and the application.

1. Likelihood: Given A, B, O, determine the likelihood P (O | A, B) .


2. Decoding: From A, B, O, determine the most likely sequence Q.
3. Learning: From O, determine Q and find the best parameters A, B.

We discuss these in the following sections.

6.2.1 Likelihood: The Forward Algorithm

Suppose we are given the transition matrix A and the emission probabilities B.
From the previous section, we know that the probability of observing a sequence
of T observations O is given by
X
P (O | A, B) = P (O | Q) P (Q) .
Q

If there are N different possible hidden states in Q, then there are N T terms
in this sum. For example, if we have the sequence O = [Sweater, Sweater, Coat,
Sweater] in our weather example, we have a total of 24 = 16 possible combina-
tions. Whilst still manageable, for longer time windows or higher amounts of
hidden states this possible set of sequences quickly becomes too large to handle.

Luckily, we can use the Forward Algorithm which adopts a recursive strategy
to simplify the calculations immensely. We start by introducing the following
notation

• We denote by qt(j) or qt = j the event where at time t, the hidden sequence Q has state value j

• We denote by bj(ot) the probability to observe the value ot at time t, conditionally on the fact that the hidden sequence Q is in state j

• We denote by γt(j) the probability of observing the sequence o1, ..., ot and being in the unobservable state j at time t,

γt(j) = P(o1, ..., ot, qt = j | A, B).

Conditioning on the (t − 1)-th step and using the law of total probability, we get

γt(j) = P(o1, ..., ot, qt = j | A, B)
      = P(o1, ..., ot−1, qt = j | A, B) · P(ot | qt = j, A, B)
      = P(o1, ..., ot−1, qt = j | A, B) · bj(ot)
      = bj(ot) · ∑_i P(o1, ..., ot−1, qt−1 = i, qt = j | A, B)
      = bj(ot) · ∑_i P(o1, ..., ot−1, qt−1 = i | A, B) · P(qt = j | qt−1 = i, o1, ..., ot−1, A, B)
      = bj(ot) · ∑_i P(o1, ..., ot−1, qt−1 = i | A, B) · aij,

which can be further rewritten as

γt(j) = bj(ot) · ∑_{i=1}^N γt−1(i) aij.

Using this, we have the following approach for calculating the likelihood.

1. Calculate for each j = 1, ..., N

γ1(j) = πj bj(o1).

2. Recursively calculate for each j = 1, ..., N and t = 2, ..., T

γt(j) = ( ∑_{i=1}^N γt−1(i) aij ) bj(ot).

3. Finally, calculate

P(O | A, B) = ∑_{i=1}^N P(o1, ..., oT, qT = i | A, B) = ∑_{i=1}^N γT(i).

Notice that the values for γt are propagated from the beginning t = 1 to the end t = T, i.e. in a forward fashion. That is why this algorithm is called the forward algorithm.

To see this algorithm in action, we revisit the weather example.

Example 6.2. Consider once again the weather example. We are interested in
calculating the likelihood of the sequence O = [Sweater, Sweater, Coat].

Here we had as our transition matrix

A = P = [ 0.8  0.2 ]
        [ 0.2  0.8 ]

and emission probabilities

bhot day = [p_coat|hot, p_sweater|hot, p_t-shirt|hot] = [0.05, 0.25, 0.7]

and

bcold day = [p_coat|cold, p_sweater|cold, p_t-shirt|cold] = [0.7, 0.25, 0.05].

We assume that both states are initially equally likely. Then

γ1(cold) = πcold · bcold(sweater) = 1/2 · 0.25 = 0.125,
γ1(hot)  = πhot · bhot(sweater)  = 1/2 · 0.25 = 0.125.

For the time t = 2, we obtain

γ2(cold) = ( γ1(cold) · a_cold,cold + γ1(hot) · a_hot,cold ) · p_sweater|cold
         = (0.125 · 0.8 + 0.125 · 0.2) · 0.25 = 0.03125.

By symmetry, we also have γ2(hot) = 0.03125. Finally, we have

γ3(cold) = ( γ2(cold) · a_cold,cold + γ2(hot) · a_hot,cold ) · p_coat|cold
         = (0.03125 · 0.8 + 0.03125 · 0.2) · 0.7 = 0.021875

and

γ3(hot) = ( γ2(cold) · a_cold,hot + γ2(hot) · a_hot,hot ) · p_coat|hot
        = (0.03125 · 0.2 + 0.03125 · 0.8) · 0.05 = 0.0015625.

Therefore,

P([Sweater, Sweater, Coat] | A, B) = γ3(cold) + γ3(hot) = 0.021875 + 0.0015625 ≈ 0.023.

Thus, given A, B there is around a 2.3% probability of observing the given sequence.
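The forward recursion is only a few lines of code. Here is a minimal sketch for the weather example (numpy; the state and observation encodings are our own), which reproduces the ≈ 0.023 likelihood.

```python
import numpy as np

# States: 0 = cold, 1 = hot.  Observations: 0 = coat, 1 = sweater, 2 = t-shirt.
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])
B = np.array([[0.7, 0.25, 0.05],    # emission probabilities for cold
              [0.05, 0.25, 0.7]])   # emission probabilities for hot
pi = np.array([0.5, 0.5])

def forward_likelihood(obs):
    gamma = pi * B[:, obs[0]]            # gamma_1(j) = pi_j * b_j(o_1)
    for o in obs[1:]:
        gamma = (gamma @ A) * B[:, o]    # gamma_t(j) = (sum_i gamma_{t-1}(i) a_ij) b_j(o_t)
    return gamma.sum()

obs = [1, 1, 0]                          # sweater, sweater, coat
print(forward_likelihood(obs))           # ≈ 0.0234
```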

6.2.2 Decoding: The Viterbi Algorithm

Sometimes we are interested in the underlying sequence Q of hidden states.


For example in our introductory example, we are interested in the hidden state
because we want to dress accordingly.

Another example lies in sound recognition, more specifically speech-to-text.


Here, the observed signals are the waveforms and the underlying hidden state
is the corresponding letter that it represents. Given a chain of waveforms, we
want to decode this back to the spoken message that caused this waveform. An
example of waveforms can be found in figure 6.4.

Figure 6.4: Waveforms of different words (IPA notation), obtained from [WV]

At a given time t, this comes down to finding the hidden state j which

maximizes the probability

vt(j) = max_{q1, ..., qt−1} P(q1, ..., qt−1, o1, ..., ot, qt = j | A, B).

Once again, we will develop a recursive formula in order to calculate this probability vt(j). One can think of vt(j) as the probability of the most probable sequence of hidden states that arrives in the hidden state j at time t, given the data.

Using this intuition, the following recursive relationship makes sense:

vt(j) = bj(ot) max( vt−1(i) aij | i = 1, ..., N ).

We find that each of the possible elements in the sequence Q has a value vt(j) derived from the previous values vt−1(i).

Suppose that we find that i maximizes vt−1 (i) for some t. Then it need
not be the case that vt (j) = bj (ot )vt−1 (i)aij . In other words, the most probable
sequence found at time t − 1 might look completely different from the most
probable sequence found at time t. This is shown in the following example.

Example 6.3. An employer is concerned about the stress levels of their


employees. They distinguish three different levels: low, normal, and elevated.
They found that as deadlines are nearing, people tend to transition from low-
stress levels to normal stress levels and then to elevated stress levels. As soon
as deadlines are met, employees tend to quickly go from elevated stress levels
to low-stress levels. The transition matrix A is given by

               Low   Normal  Elevated
    Low      [ 0.6    0.4      0    ]
A = Normal   [  0     0.75     0.25 ]
    Elevated [ 0.6     0       0.4  ]

Since measuring an employee's stress level is quite difficult, the employer instead records the mood of the employees. To simplify things, they categorize an employee's mood into one of three categories: happy, neutral, or sad. They found the following emission probabilities.

               Happy  Neutral  Sad
    Low      [ 0.7     0.3      0  ]
B = Normal   [  0      1        0  ]
    Elevated [  0      0.4     0.6 ]

Suppose that the employer finds [Neutral, Happy] as the sequence of moods.
Show that the most probable sequence found at t = 1 does not correspond to
the first state in the most probable sequence found at time t = 2.

This example shows that we can only start building the sequence Q after taking
into account all observations up until T . In this sense, the state at time t
determines the state at time t−1, and we need a procedure that goes backwards.
This procedure is obtained using the backpointer.

Definition 6.1. The backpointer is defined as

pt (j) = argmaxi (bj (ot )vt−1 (i)aij | i = 1, ..., N ).

Given the state j at time t, it returns the state at t −1 coming from the most probable
sequence of states.

In the previous example, we find that

p2 (Low) = Elevated.

We now present the Viterbi algorithm, which uses the backpointer in order to
find the most likely sequence of hidden states.

1. Start by defining

v1 (j) = πj bj (o1 )
p1 (j) = 0

for all j = 1, ..., N .

2. Recursively define

vt (j) = max(vt−1 (i)aij bj (ot ) | i = 1, ..., N )


pt (j) = argmaxi (vt−1 (i)aij bj (ot ) | i = 1, ..., N )

for all j = 1, ..., N and t = 2, ...T .



3. Then the most likely state at time T is

qT = argmax_i( vT(i) | i = 1, ..., N ).

The most likely value for time T − 1 is then

qT−1 = pT(qT).

In general, we find

qt−1 = pt(qt).

We now show that the most probable sequence in the previous example is indeed [Elevated, Low].

Example 6.4. We apply the Viterbi algorithm to example 6.3. We assume that the states all have the same initial probability; since this common factor 1/3 does not change any of the maximizations, we drop it below. Then

v1(x) = p_Neutral|x = 0.3  if x = Low,
                      1    if x = Normal,
                      0.4  if x = Elevated.

The second observation is Happy. We find (check this!)

v2(j) = max( v1(i) aij bj(o2) | i = 1, ..., N )

      = max( 0.3 · 0.6 · 0.7, 1 · 0 · 0.7, 0.4 · 0.6 · 0.7 )   for j = Low
        (the terms correspond to i = Low, i = Normal, i = Elevated),
      = 0   for j = Normal,
      = 0   for j = Elevated.

Thus, v2(Low) = 0.168 and p2(Low) = Elevated, as we said earlier.

It should now be obvious why we sometimes call the Viterbi algorithm the
Backward-algorithm.
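A compact sketch of the Viterbi recursion for the stress example (numpy; the state and observation encodings are ours, and we keep the uniform initial distribution explicitly):

```python
import numpy as np

states = ["Low", "Normal", "Elevated"]
# Observations: 0 = Happy, 1 = Neutral, 2 = Sad.
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.75, 0.25],
              [0.6, 0.0, 0.4]])
B = np.array([[0.7, 0.3, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.4, 0.6]])
pi = np.full(3, 1/3)

def viterbi(obs):
    v = pi * B[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        cand = v[:, None] * A * B[None, :, o]   # cand[i, j] = v_{t-1}(i) a_ij b_j(o_t)
        backptr.append(cand.argmax(axis=0))     # best predecessor for each state j
        v = cand.max(axis=0)
    # Backtrack from the most likely final state.
    path = [int(v.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi([1, 0]))   # observations Neutral, Happy -> ['Elevated', 'Low']
```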

6.2.3 Learning: The Forward-Backward Algorithm

In the previous sections, we always assumed that we knew A, B, and π.
This is an assumption that is often not true. We will now show how to estimate
these parameters from a data set.

We make a distinction between two cases: the labeled case and the unlabeled
case. In the labeled case, the data is of the form (O, Q), and in the unlabeled
case, it is just O.

Labeled Markov Models

Suppose three different ice cream vendors measure their sales for three consecutive days. They make a distinction between busy days, normal days, and slow days, labeled 3, 2 and 1 respectively. They also look at the temperature outside, labeled H if it is hot and C if it is cold. Their findings are reported in the following data set:

O1 = (3, 3, 2)   Q1 = (H, H, C)
O2 = (1, 1, 2)   Q2 = (C, C, H)
O3 = (1, 2, 3)   Q3 = (C, H, H)

They want to model this using a Hidden Markov model. How can we use this data to estimate π, A and B?

We can easily estimate the initial distribution using the first element of each Qi, i.e. q1^i:

π = [πH, πC] = [ |{i : q1^i = H}| / 3, |{i : q1^i = C}| / 3 ] = [ 1/3, 2/3 ].

The transition matrix A can be estimated by

A = (aij)ij = ( Number of transitions i → j / Number of transitions starting from i )ij

         H    C
  = H [ 2/3  1/3 ]
    C [ 2/3  1/3 ].

The emission probabilities B can be estimated by

P(o | q) = Number of times we observe the pair (o, q) / Number of times we observe q.

In our case,

P(1 | H) = 0     P(1 | C) = 3/4
P(2 | H) = 2/5   P(2 | C) = 1/4
P(3 | H) = 3/5   P(3 | C) = 0.
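These counting estimates are straightforward to automate. A minimal sketch (plain Python; the data encoding is ours) for the ice cream example:

```python
from collections import Counter

data = [
    ((3, 3, 2), ("H", "H", "C")),
    ((1, 1, 2), ("C", "C", "H")),
    ((1, 2, 3), ("C", "H", "H")),
]

init = Counter(q[0] for _, q in data)
trans = Counter((q[t], q[t + 1]) for _, q in data for t in range(len(q) - 1))
emit = Counter((q[t], o[t]) for o, q in data for t in range(len(o)))

pi = {s: init[s] / len(data) for s in "HC"}
A = {(i, j): trans[(i, j)] / sum(trans[(i, k)] for k in "HC") for i in "HC" for j in "HC"}
B = {(s, o): emit[(s, o)] / sum(emit[(s, k)] for k in (1, 2, 3)) for s in "HC" for o in (1, 2, 3)}

print(pi)   # {'H': 0.33..., 'C': 0.66...}
print(A)    # e.g. A[('H', 'H')] = 2/3
print(B)    # e.g. B[('H', 3)] = 3/5, B[('C', 1)] = 3/4
```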

Unlabeled Markov Models

When the unobservable states are not part of the data set, the estimation pro-
cedure shown above no longer works. We will need to determine not only π, A,
and B, but also Q.

We first introduce the following notation:


γt (j) = P (o1 , ..., ot , qt = j | A, B)
σt (i) = P (ot+1 , ..., oT | qt = i, A, B) .
Here, σt (i) is called the backward probability.

We have already seen in the section on likelihood estimation that the γ can be calculated using the forward algorithm:

γ1(j) = πj × bj(o1)                          (j = 1, ..., N)
γt(j) = ( ∑_{i=1}^N γt−1(i) aij ) bj(ot)     (j = 1, ..., N, t = 2, ..., T).

A similar recursive algorithm exists for the backward probabilities σ :

σT(i) = 1                                     (i = 1, ..., N)
σt(i) = ∑_{j=1}^N σt+1(j) aij bj(ot+1)        (i = 1, ..., N, t = 1, ..., T − 1).

We will use these to calculate an estimator for A, given by

âij = Expected number of transitions i → j / Expected number of transitions starting at i.

In order to simplify things, we will rewrite this in terms of time-t transition probabilities, defined as

ζt(i, j) = P(qt = i, qt+1 = j | O, A, B).

Indeed, check that

âij = ∑_{t=1}^{T−1} ζt(i, j) / ∑_{t=1}^{T−1} ∑_{k=1}^{N} ζt(i, k).

By using the definition of conditional probability, we can also rewrite ζt as

ζt(i, j) = P(qt = i, qt+1 = j, O | A, B) / P(O | A, B).

We will first work out the denominator and then the numerator.

Using the law of total probability on the t-th step, we get

P(O | A, B) = ∑_{j=1}^N P(o1, ..., oT | qt = j, A, B) P(qt = j | A, B).

Using conditional independence, this can be further rewritten as

P(O | A, B) = ∑_{j=1}^N P(o1, ..., ot | qt = j, A, B) P(qt = j | A, B) P(ot+1, ..., oT | qt = j, A, B)
            = ∑_{j=1}^N P(o1, ..., ot, qt = j | A, B) P(ot+1, ..., oT | qt = j, A, B)
            = ∑_{j=1}^N γt(j) σt(j).

This equality is not very surprising: it says that the likelihood of observing O given A, B can be computed by splitting according to the hidden state j at time t and summing over all j.

For the numerator, we find

P(qt = i, qt+1 = j, O | A, B) = P(qt = i, qt+1 = j, o1, ..., oT | A, B)
    = P(qt = i, o1, ..., ot | A, B) P(qt+1 = j | qt = i, A, B)
      × P(ot+1 | qt+1 = j, A, B) P(ot+2, ..., oT | qt+1 = j, A, B)
    = γt(i) aij bj(ot+1) σt+1(j).

Combining the equalities found for the numerator and denominator, we can thus rewrite ζt as

ζt(i, j) = γt(i) aij bj(ot+1) σt+1(j) / ∑_{k=1}^N γt(k) σt(k).

The estimates for A will then be given by

âij = ∑_{t=1}^{T−1} ζt(i, j) / ∑_{t=1}^{T−1} ∑_{k=1}^{N} ζt(i, k).

However, notice that there is a problem in this method: we conditioned on A in the estimation of ζ. This can be solved using an iterative procedure similar to the Expectation-Maximization algorithm introduced earlier.

For B, we will use

b̂j(ok) = Expected number of times we observe the (O, Q) pair (ok, j) / Expected number of times we are in state j.

We introduce

δt(j) = P(qt = j | O, A, B) = P(qt = j, O | A, B) / P(O | A, B).

This can easily be rewritten as

δt(j) = γt(j) σt(j) / ∑_{k=1}^N γt(k) σt(k).

We then obtain the estimate

b̂j(ok) = ∑_{t=1}^T δt(j) I(ot = ok) / ∑_{t=1}^T δt(j).

Once again, δ depends on the unknown B.

Using the calculations above, we can now introduce the forward-backward


algorithm.

1. Specify the number of states in Q

2. Give initial estimates A, B

3. Calculate Q using Viterbi

4. Calculate γt and σt

5. Estimate A and B

6. Repeat steps 3 through 5 until convergence of A and B
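A minimal sketch of one re-estimation pass, based directly on the γ, σ, ζ and δ formulas derived above (numpy; the example matrices and observation sequence at the bottom are arbitrary choices of ours, and iterating the function until A and B stabilize gives the learning loop of the list above):

```python
import numpy as np

def reestimate(A, B, pi, obs):
    """One pass of the update formulas for A and B derived above."""
    N, T = A.shape[0], len(obs)

    gamma = np.zeros((T, N))                   # forward probabilities gamma_t(j)
    gamma[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        gamma[t] = (gamma[t - 1] @ A) * B[:, obs[t]]

    sigma = np.ones((T, N))                    # backward probabilities sigma_t(i)
    for t in range(T - 2, -1, -1):
        sigma[t] = A @ (B[:, obs[t + 1]] * sigma[t + 1])

    delta = gamma * sigma                      # delta_t(j) up to normalization
    delta /= delta.sum(axis=1, keepdims=True)

    zeta = np.zeros((T - 1, N, N))             # zeta_t(i, j)
    for t in range(T - 1):
        num = gamma[t][:, None] * A * B[None, :, obs[t + 1]] * sigma[t + 1][None, :]
        zeta[t] = num / num.sum()

    A_new = zeta.sum(axis=0)
    A_new /= A_new.sum(axis=1, keepdims=True)

    B_new = np.zeros_like(B)
    obs_arr = np.array(obs)
    for k in range(B.shape[1]):
        B_new[:, k] = delta[obs_arr == k].sum(axis=0)
    B_new /= delta.sum(axis=0)[:, None]
    return A_new, B_new

A0 = np.array([[0.8, 0.2], [0.2, 0.8]])
B0 = np.array([[0.7, 0.25, 0.05], [0.05, 0.25, 0.7]])
pi0 = np.array([0.5, 0.5])
obs = [1, 1, 0, 2, 2, 1]                       # a hypothetical observation sequence
A1, B1 = reestimate(A0, B0, pi0, obs)
```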


Chapter 7

Gaussian Processes

With all things being equal, the simplest


explanation tends to be the right one

William of Ockham

In the previous chapter, we dipped our toes into classification problems. We


used concepts from stochastic processes in order to build a classifier that is able
to deal with such problems, namely the Hidden Markov Model.

In this chapter, we introduce a family of stochastic processes that can be


useful for regression problems. We start by making the following observations.

• Suppose Xt is an IID stochastic process. Then for any choice of t1, t2, ..., tN we have

f_{Xt1,...,XtN}(Xt1 = x1, Xt2 = x2, ..., XtN = xN) = ∏_{i=1}^N fX(xi),

where fX is the density function of X1.


• Suppose Xn is a Markov process. Let 0 = t1 < ... < tN. Then we have

f_{Xt1,...,XtN}(Xt1 = x1, ..., XtN = xN) = π(x1) p_{x1 x2}^{(t2−t1)} p_{x2 x3}^{(t3−t2)} ··· p_{xN−1 xN}^{(tN−tN−1)}.


• Suppose Xt is a Poisson process with rate λ. Let 0 = t0 < t1 < ... < tN, then we have that

Xti − Xti−1 := Yi ∼ Poisson(λ(ti − ti−1)).

Since t0 = 0, we have that for all n,

Xtn = ∑_{i=1}^n Yi.

Hence,

f_{Xt1, Xt2, ..., XtN}(x1, x2, ..., xN) = f_{Y1, Y1+Y2, ..., Y1+Y2+...+YN}(x1, x2, ..., xN).

The main takeaway from this observation is that for the processes covered so
far, we are able to obtain the joint density functions for any (finite) subset
of states. The stochastic process we will define in this chapter is an extreme
version of this: not only is the joint probability distribution easy to find but it
is completely determined by just two functions, the mean and the covariance!

7.1 Introduction

We start by defining the multivariate extension of the normal distribution.


Definition 7.1. A stochastic vector X = (X1 , X2 , ..., XN ) is multivariate normally
distributed with mean µ = (µ1 , µ2 , ..., µN ) and covariance matrix Σ ∈ RN ×N if it
has as density function

fX(x1, ..., xN) = 1/√((2π)^N det(Σ)) · exp( −(1/2)(x − µ)^T Σ^{-1} (x − µ) ),

where x = (x1, ..., xN).

Notice that this distribution is fully determined by the covariance matrix


and the mean vector.

The multivariate normal distribution has several interesting properties.



Property 7.2. Let Z = (X, Y ) be a multivariate normally distributed stochastic


vector of dimension n, then the marginals are normally distributed.

Proof. Left as an exercise for the reader.

An extension of this result is the following.

Property 7.3. Suppose Z is a multivariate normally distributed stochastic vector of size n with Z = (Z1, Z2), where Z1 is k-dimensional and Z2 is (n − k)-dimensional. We can write the mean and covariance matrix as

µ = ( µ1 ),   Σ = ( Σ11  Σ12 ),
    ( µ2 )        ( Σ21  Σ22 )

according to Z1 and Z2.

Then Z1 | Z2 = z is multivariate normally distributed with mean

µ̃ = µ1 + Σ12 Σ22^{-1} (z − µ2)

and covariance matrix

Σ̃ = Σ11 − Σ12 Σ22^{-1} Σ21.

Property 7.4. A stochastic vector (X1, ..., XN) is multivariate normally distributed if and only if every linear combination ∑_{i=1}^N ai Xi has a normal distribution.

Proof. This is left as an exercise for the reader.

Notice that it does not hold that for any two normally distributed random
variables X and Y , (X, Y ) is jointly normal.

Example 7.1. Suppose X ∼ N(0, 1) and B ∼ Bernoulli(1/2), independent of X. Define Y = (2B − 1)X; then we have that Y ∼ N(0, 1). Indeed,

P(Y ≤ y) = (1/2) P(X ≤ y) + (1/2) P(X ≥ −y) = P(X ≤ y)

by the symmetry of X. Thus, X and Y are both normally distributed.

If (X, Y) were jointly normal, then by property 7.4, X + Y should be normal as well. However,

P(X + Y = 0) = 1/2

by construction (X + Y = 0 whenever B = 0). Hence, X + Y cannot be normal.

If X, Y are jointly normal, however, we have the following result.

Property 7.5. Let X, Y be jointly normal random variables and suppose X ∼


N (µX , σX2 ) and Y ∼ N (µY , σY2 ). Then X + Y is normal and has as distribution

X + Y ∼ N (µX + µY , σX2 + σY2 + 2Cov(X, Y ))

Proof. The fact that X + Y is normal follows from property 7.4. The rest is a
consequence of property 1.61.

We now define the star of this chapter, Gaussian processes.

Definition 7.6. Let Xt be a stochastic process. We say that Xt is a Gaussian process


if for any N ∈ N and 0 ≤ t1 ≤ t2 ≤ ... ≤ tN , we have that (Xt1 , Xt2 , ...., XtN ) is a
multivariate normally distributed stochastic vector.

We saw that in the case of multivariate normally distributed stochastic vec-


tors, the joint distribution was fully determined by the mean vector and the
covariance matrix. This motivates the following definition.

Definition 7.7. Let Xt be a stochastic process. Then the covariance function K is


defined as
K(t, s) = E [(Xt − E [Xt ])(Xs − E [Xs ])] .

We have the following straightforward, but surprisingly strong result.

Theorem 7.8. Let Xt be a Gaussian process. Let 0 ≤ t1 ≤ t2 ≤ ... ≤ tN. Then (Xt1, Xt2, ..., XtN) is a multivariate normally distributed stochastic vector with mean

µ = ( E[Xt1], E[Xt2], ..., E[XtN] )

and covariance matrix

Σ = (K(ti, tj))_{i,j}.
In particular, the density function of the stochastic vector is completely determined
by the mean and the covariance function.

We now consider some properties of Gaussian processes.

7.2 Stationarity

Recall that a stochastic process is stationary if

(Xt1 , Xt2 , ..., Xtk ) ∼ (Xt1 +s , Xt2 +s , ..., Xtk +s ), (∀t1 , ..., tk ≥ 0, k ∈ N, s ≥ 0)

meaning that they have the same distribution. For general stochastic processes,
it is not straightforward to calculate the joint density function. This makes
checking whether the process is stationary very difficult. That is why one some-
times prefers to use weak stationarity. For this, we only need to be able to
calculate the first and second-order statistics of the process.

Definition 7.9. Let Xt be a stochastic process. We say that this process is weakly
stationary if
E [Xt ] = E [X0 ] (∀t ≥ 0)
and
K(t, t + s) = K(0, s). (∀t, s ≥ 0)

The difference between weak stationarity and (regular) stationarity is that


weak stationarity does not carry any information on the distributions at differ-
ent states outside of the first and second-order statistics. In the next example, we show that weak stationarity does not imply (regular) stationarity.

Example 7.2. Let Xt be the stochastic process with independent Xi defined as follows:

Xi ∼ N(1, 1)      if i is even,
Xi ∼ Poisson(1)   if i is odd.

This process is weakly stationary:

E[Xi] = 1,                        (∀i ∈ N)
K(t, t + s) = K(0, s) = δ_{s0}.   (∀t, s ∈ N)

However, this process is trivially non-stationary.

Hence, the (easy-to-check) weak stationarity does not necessarily imply the
(hard-to-check) regular stationarity. However, as the next theorem shows, this
handy equivalence does hold for Gaussian processes.

Theorem 7.10. A Gaussian process is stationary if and only if it is weakly station-


ary.

Proof. This is an immediate consequence of theorem 7.8.

In the remainder of this chapter, we will show how Gaussian processes can
be used in order to perform regression. Furthermore, in the next chapter, we will
focus on a particular example of a Gaussian process called Brownian motion.

7.3 Gaussian Process Regression

7.3.1 Introduction to Regression

In a general regression setting, one is given a set of features (often called covari-
ates) and target values {(xi , yi ) | i = 1, ..., N } and the goal is to find a function
f : X → Y that best describes/fits the relationship. Here, xi can be both one-
dimensional or multi-dimensional xi = (xi1 , ..., xim ). The quality of the fit can
be measured using a loss function, such as the quadratic loss

L2(f) = ∑_{i=1}^N (yi − f(xi))².

The function that best describes the data (xi , yi ) is then the function f that
minimizes the loss function L(f ). The closer the predicted value f (xi ) lies to
the true value yi , the lower the loss corresponding to that observation xi .

A regression problem can therefore be seen as an optimization problem


over the space of all functions. Because the space of all functions f : X → Y
is massive, one often reduces the space of candidate functions to some smaller
space. It is of course important to check whether the required relation can be
adequately approximated by functions in this smaller space. We now look at
some examples of common choices for smaller function spaces.

Example 7.3. Given a data set (xi, yi), we could be interested in finding the linear function that best represents the relationship of interest. In terms of loss functions, this is the linear function

f(xi) = β0 + ∑_{j=1}^m βj xij

that minimizes the loss L2(f). It can be shown that the least squares estimator (i.e. the estimator that minimizes the squared loss) for the coefficients β is given by

β̂ = (X^T X)^{-1} X^T Y,

where X is the feature matrix X = (xij)ij and Y is the target vector Y = (yj)j.

Since the function space is restricted to linear functions of the covariates,


we call this linear regression.

Figure 7.1: Linear regression applied to a noisy data set
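A minimal sketch of this estimator on synthetic noisy data (numpy; the data-generating coefficients and noise level are our own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)   # noisy linear relationship

X = np.column_stack([np.ones_like(x), x])             # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)           # (X^T X)^{-1} X^T Y
print(beta_hat)                                        # close to [1.0, 2.0]
```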

Example 7.4. An immediate extension of the above is allowing polynomials


up to a given order. This is more flexible and will always have a solution that
fits at least as well as the one coming from linear regression.

Figure 7.2: Polynomial regression with degree 2 applied to a noisy data set

One could ask why one would perform linear regression if polynomial regression
often leads to better fits. The reason for this lies in the fact that in many cases
the data (xi , yi ) has noise on it, and the added flexibility can start to encode
the noise instead of the actual relationship.

In the previous examples, we see that the functions are parameterized using
a fixed set of coefficients. Finding the best fitting function in these reduced
spaces is therefore equivalent to a search within their corresponding coefficient
space. These methods are called parametric methods.

Instead of fixing the model structure ab ovo, we can also use non-parametric
methods. An example of such a method is the k-nearest neighbors regression
algorithm.

Example 7.5. The k-nearest neighbors regression algorithm works as follows:

• Given a point x, we calculate the k nearest points in the data set, N Bk (x)

• The algorithm returns the average target value of these k points:

f(x) = ( ∑_{i: xi ∈ NBk(x)} yi ) / k.

This of course heavily depends on the choice of metric.

Figure 7.3: The K-Nearest Neighbour regressor for a noisy data set

The upshot of using non-parametric methods such as K-nearest neighbors


regression is that we have a more flexible set of solutions. A drawback is that the
resulting function does not have a (straightforward) analytical expression. For
each observation, we need to calculate the k nearest points which can become
cumbersome for large data sets.

Notice that the K-nearest neighbors algorithm also minimizes a certain loss
function (can you see how?). In the next section, we will consider a framework
in which we can fit a model without using a loss function, namely Bayesian
regression.

7.3.2 Bayesian Regression

Instead of minimizing a loss function, one can also use the Bayesian framework
in the setting of model learning. The upshot of this is that we will end up with a

distribution over the space of all possible models instead of just ending up with one model (notice the similarity to hard and soft classifiers). In order to return a single model using this probabilistic approach,
one could then use the model that maximizes this probability distribution.

Suppose the space of all allowed models is denoted by H, which we call


the hypothesis space. We denote by D the observed data. The basic idea is as
follows.

Before even looking at the data, we have prior beliefs about which models
are more or less likely. In general, incredibly complex models are less likely
to represent the true underlying distribution than simpler models. This follows
from Occam’s razor, the principle that the simplest explanation is most often
the correct one. This yields a prior distribution P (h) for h ∈ H.

Given a model h, we can also consider the likelihood P (D | h), which rep-
resents how likely it is that the data is generated by the model h. Models whose
errors are very large tend to have a lower likelihood than those with low errors.

The quantity we are truly interested in is P (h | D), the posterior probability,


which represents the belief that h is the correct model after observing the data.
The higher this probability, the more we believe that this model is indeed the
correct one. By Bayes’ rule, this probability can be obtained via
P (D | h) P (h)
P (h | D) = .
P (D)
The term P (D) is a constant that ensures that the probability on the left-
hand side integrates to 1. The model we are interested in will be the one
with the largest posterior probability, as it takes into account the prior and the
likelihood.
Definition 7.11. Let D be the observed data, H be the hypothesis space, and P (h) be
the prior distribution on H. We then define the maximum a posteriori (MAP) model
as
hMAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D).

Notice that since P (D) is a constant, we can in practice ignore it when we


want to find hMAP .

Example 7.6. In a linear regression model, we write


yi = xiT β + ϵi ,
that is, the response variable is a linear function of the covariates xi with an
error ϵi . Here, one uses the Gauss-Markov assumptions. Under these assump-
tions, one has
ϵi ∼ N (0, σ 2 ), (∀i)
where we thus assume that all errors have the same error variance (homoscedas-
tic errors).

We have already obtained the least squares estimator β̂LS for this specific
problem in the previous section. Let us now consider the MAP estimator under
the assumption that all models are equally likely, i.e P (h) is constant over H.

Since all models are assumed equally likely, maximizing P (h | D) is equiv-


alent to maximizing the likelihood P (D | h). In this setting, the MAP will thus
coincide with the maximum likelihood estimator.

Notice that

P(D | h) = ∏_{(x,y)∈D} P(y | x, h) P(x).

Because of the Gauss-Markov assumptions, we know that (check this!)

P(yi | xi, h) = 1/√(2πσ²) · exp( −(yi − xi^T β)² / (2σ²) )

and hence

P(D | h) = ∏_{(x,y)∈D} 1/√(2πσ²) · exp( −(y − x^T β)² / (2σ²) ) P(x).

Since the logarithm is a monotone function, maximizing P(D | h) is equivalent to maximizing the log-likelihood

log(P(D | h)) = ∑_{(x,y)∈D} ( −log(√(2πσ²)) − (y − x^T β)²/(2σ²) + log P(x) ).

It is not difficult to see that this is equivalent to minimizing

∑_{(x,y)∈D} (y − x^T β)²,

which is exactly the least squares criterion.

Thus, under the assumption that the prior is constant over H, the least
squares estimator coincides with the MAP estimator, which itself coincides with
the maximum likelihood estimator.

Using the Bayesian learning framework, we will now use Gaussian processes
as tools for regression problems.

7.3.3 Gaussian Process Regression

Recall that the dynamics of a Gaussian process are all contained within the
mean function µt and the covariance function K(s, t). Instead of going from
the process to these functions, we can also construct Gaussian processes by
choosing a function µt and a covariance function K(s, t). The mean function
does not really have any restrictions, but the covariance function needs to satisfy
some conditions. For example, we do not want K to be negative for any pair
s, t.

There are a lot of possible choices for K, but some of the more common
ones are

• Constant covariance: K(s, t) = α where α ∈ R≥0 .

• Linear covariance: K(s, t) = st.

• White Gaussian noise: K(s, t) = σ² δst.

• Squared negative exponential: K(s, t) = e^{−|s−t|²}.

After choosing K and µ, we get a completely determined Gaussian process Xt .


A sample path xt of this process can now be seen as a function f

f : R≥0 → R : t 7→ f (t) ≡ xt .

This yields a one-to-one correspondence Ω between the path space P (Xt ) of


the Gaussian process and the function space F defined by

F = {f : R≥0 → R} .

Recall that in the Bayesian learning framework, we needed a distribution de-


fined over a function space. Using the one-to-one correspondence, one can
define such a distribution via P(Ω(ft )).

Schematically, Ω maps the path space P(X_t) onto the function space F, and the distribution P(x_t) on paths is carried over to the distribution P(f_t) := P(Ω(f_t)) on F.

Hence, given any Gaussian process we obtain a distribution on F . We will


use this later in order to use Bayesian learning.

Example 7.7. We will sample a function from the distribution defined by the
Gaussian process with zero mean and squared negative exponential covariance
function. Instead of sampling a full function, we will sample a finite amount of
function values xt and connect these with lines. The upshot of using Gaussian
processes comes from the fact that this sample will have a known distribution.

Start by choosing a set of values 0 ≤ t_1 ≤ t_2 ≤ ... ≤ t_250 ≤ 25. We will choose an equidistant grid, i.e t_i = i/10.

We want to sample from (Xt1 , ..., Xt250 ), which by definition has a multivari-
ate normal distribution. Thus, we sample from a multivariate normal distribu-
tion with mean 0 and covariance matrix
K(t_i, t_j) = e^{−(|i−j|/10)²} = e^{−|t_i − t_j|²}.

The different samples can be visualized in figure 7.4.
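A minimal sketch of how such samples can be drawn (assuming NumPy; the jitter value added for numerical stability is our own implementation choice, not part of the model):

```python
import numpy as np

# Equidistant grid t_i = i/10, i = 1, ..., 250.
t = np.arange(1, 251) / 10.0

# Squared negative exponential kernel K(s, t) = exp(-|s - t|^2).
K = np.exp(-np.subtract.outer(t, t) ** 2)

# Small jitter on the diagonal keeps the covariance matrix numerically
# positive definite (an implementation detail, not part of the model).
K += 1e-10 * np.eye(len(t))

# Draw a few sample paths from the multivariate normal N(0, K).
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean=np.zeros(len(t)), cov=K, size=5)
# Each row of `samples` can be plotted against t to reproduce figure 7.4.
```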



Figure 7.4: Samples from a Gaussian Process prior

Each sample can be seen as a function f : [0, 25] → R.

Example 7.8. We repeat the setting above but now use different kernels.

Figure 7.5: Gaussian prior with constant kernel



Figure 7.6: Gaussian prior with linear kernel

Figure 7.7: Gaussian white noise

We will now use these concepts in order to perform regressions. Suppose


we have a data set D = ((x1 , y1 ), ..., (xN , yN )) = (X, Y ). We want to find a
function f : R≥0 → R that approximates the relationship between X and Y ,
f (xi ) ≈ yi . Choose a subset Z ⊂ R≥0 and a covariance function K. We will
assume that the functions f come from the Gaussian process G(0, K) with zero
mean and covariance function K as in the section above. This implies that

for our candidate functions f , (f (X), f (Z)) has a joint multivariate normal
distribution with zero mean and covariance matrix determined by K.

As we are now interested in finding the functions f that approximate the


relationship, we want that f (X) = Y . Thus, we condition (f (X), f (Z)) on
f (X) = Y , which by property 7.3 yields

f (Z) | D ∼ N (µ̃(Z), K̃(Z)).

Here, we have that

µ̃(Z) = K(Z, X)K(X, X)−1 Y

and

K̃(Z) = K(Z, Z) − K(Z, X)K(X, X)−1 K(X, Z).

This is the posterior distribution on the space of functions f . Notice that both
parameters are completely determined by our data and choice of the subset Z.
Furthermore, we have the following properties

• The predictor µ̃(Z) is linear in the observations Y

• The posterior variance K̃(Z) is lower than the prior variance K(Z, Z). The magnitude of the decrease depends on the distance between X and Z.
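A minimal sketch of these posterior formulas (assuming NumPy; the kernel choice, training data and prediction grid below are illustrative assumptions of ours):

```python
import numpy as np

def sq_exp(a, b):
    """Squared negative exponential kernel, K(s, t) = exp(-|s - t|^2)."""
    return np.exp(-np.subtract.outer(a, b) ** 2)

# Illustrative training data (X, Y) and prediction grid Z.
X = np.array([0.5, 1.5, 3.0, 4.5])
Y = np.sin(X)
Z = np.linspace(0.0, 5.0, 100)

K_xx = sq_exp(X, X) + 1e-10 * np.eye(len(X))   # jitter for numerical stability
K_zx = sq_exp(Z, X)
K_zz = sq_exp(Z, Z)

# Posterior mean and covariance from conditioning f(X) = Y (property 7.3).
mu_post = K_zx @ np.linalg.solve(K_xx, Y)
K_post = K_zz - K_zx @ np.linalg.solve(K_xx, K_zx.T)

# Sampling from N(mu_post, K_post) gives posterior draws as in figure 7.9.
```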

We will now look at an example.

Example 7.9. In this example, we will use Gaussian Process regression in


order to estimate the sinc function using the following data.

Figure 7.8: Training data for the estimation of the sinc function

Using the above approach, we can then find the mean function µ̃ and we
can also sample from the posterior distribution. We used the negative squared
exponential kernel for our Gaussian process. The obtained functions are given
in figure 7.9.

Figure 7.9: Gaussian Process Regression for the Sinc function

We see that the mean function lies very close to the true function. Away
from the training data points, we see that the sampled values vary quite a lot.
On the training data itself, there is no error. Show that this is true in general.
Chapter 8

Brownian Motion

Life requires movement.

Aristotle

In the previous chapter, we introduced the notion of Gaussian processes.


Recall that their defining property was that for any finite set Xt1 , Xt2 , ..., XtN ,
their joint probability distribution is the multivariate normal distribution.

We have seen that these processes were completely determined by their


mean and their covariance function. In this chapter, we will consider the Gaus-
sian processes with covariance function K(s, t) = min(s, t). Whilst this may seem
like a weird choice, it will quickly become clear why these Gaussian processes,
called Brownian motions, are so interesting.

Brownian motions were originally used to model the motion of a (dust)


particle in a medium, such as a gas or a liquid. These paths tend to behave
quite erratically, changing directions all the time due to collisions with other
particles. However, their interesting properties made them applicable outside
of physics as well.


8.1 Definition

As mentioned in the introduction, we will define standard Brownian motion as


a Gaussian process with zero mean and specific covariance function.

Definition 8.1. A Gaussian process Wt is a standard Brownian motion if it satisfies

E [Wt ] = 0, K(t, s) = min(t, s), and W0 = 0 (∀s, t ≥ 0)

and such that Wt is continuous in t.

These are often denoted using W instead of the usual X. In figure 8.1, one
can see some sampled paths of standard Brownian motion.

Figure 8.1: Sampled paths for standard Brownian Motion
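A minimal way to generate such paths (a sketch assuming NumPy; the step size, horizon and number of paths are our own choices) is to sum independent normal increments:

```python
import numpy as np

rng = np.random.default_rng(42)

T, n = 1.0, 1000          # horizon and number of steps (illustrative)
dt = T / n
t = np.linspace(0.0, T, n + 1)

# Independent increments W_{t+dt} - W_t ~ N(0, dt), with W_0 = 0.
increments = rng.normal(0.0, np.sqrt(dt), size=(5, n))
W = np.concatenate([np.zeros((5, 1)), np.cumsum(increments, axis=1)], axis=1)
# Each row of W is one sampled path, as in figure 8.1.
```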

They have the following alternative definition.

Definition 8.2. A stochastic process Wt is a standard Brownian motion if it satis-


fies the following conditions

• It starts at 0: W (0) = 0

• It has independent increments: for any t_1 ≤ t_2 ≤ t_3 ≤ t_4, the increments W_{t_2} − W_{t_1} and W_{t_4} − W_{t_3} are independent random variables.

• It has stationary increments: The distribution of W (t) − W (s) depends only


on t − s

• An increment of the process over a period of [s, s + t], s, t ≥ 0 is normally


distributed with mean 0 and variance t

Wt+s − Ws ∼ N (0, t).

Proof. We show that W_{t+s} − W_s ∼ N(0, t). First, notice that for all t,

Var (Wt ) = Cov(Wt , Wt ) = K(t, t) = t.

Since Wt is a Gaussian process, we know that (Wt+s , Ws ) is jointly normal.


Hence, we can apply property 7.5 to find

Wt+s − Ws ∼ N (0, Var (Wt+s ) + Var (Ws ) + 2Cov(Wt+s , −Ws )).

Notice that

Var (Wt+s ) + Var (Ws ) + 2Cov(Wt+s , −Ws ) = Var (Wt+s ) + Var (Ws ) − 2Cov(Wt+s , Ws )
= t + s + s − 2 min(t + s, s)
= t + 2s − 2s = t.

Hence we indeed find


Wt+s − Ws ∼ N (0, t).

We list some of the properties of standard Brownian motion.


Property 8.3. With probability one, the paths of W are continuous.
Remark. In the previous chapter, we have already mentioned the fact that there
exists a one-to-one correspondence Ω : F → P (W ), where P (W ) was the path
space of W and F = {f : R≥0 → R}. The above property states that

P (Ω(F \ C[0, ∞))) = 0,

where C[0, ∞) ⊂ F is the space of continuous real-valued functions on the interval


[0, ∞).

We also have that standard Brownian motion satisfies the continuous Markov
property. In the case of continuous stochastic processes, the Markov property
can be defined as follows.

Definition 8.4. Suppose that Xt is a stochastic process. We say that Xt satisfies the Markov property if X_t − X_s is independent of {X_q | q ≤ s} for all t ≥ s.

Property 8.5. Standard Brownian motion satisfies the Markov property.

Proof. This is an immediate consequence of the definition and is left as an


exercise for the reader.

By definition, standard Brownian motion has a constant zero mean function


E [Xt ] = 0. One can construct a stochastic process that behaves similarly to
a standard Brownian motion but allows for drift and a slightly more general
variance structure as follows.

Definition 8.6. Let Wt be a standard Brownian motion and let µ be any real
number. Then we define a Brownian motion with drift µ and infinitesimal variance
σ 2 via
Xt = σ Wt + µt.
This stochastic process satisfies

Xt+s − Xs ∼ N (µt, σ 2 t).

For an example of a Brownian motion with positive drift, see figure 8.2

Figure 8.2: Comparison Brownian motion with and without drift



Remark. Alternatively, one can define a Brownian motion with drift µ using Gaus-
sian processes. Show that the Gaussian process with mean function µt = µt and
covariance function K(s, t) = σ 2 min(s, t) corresponds to the process defined above.

Another straightforward result follows from the assumption that Brownian


motions start at 0, i.e W0 = 0.

Property 8.7. Let Wt be a Brownian motion. Then

Wt ∼ N (0, t),

or in other words

f_t(w) = (1/√(2πt)) e^{−w²/(2t)}.

Related to this is the following property, which is a trivial consequence of


the fact that Brownian motions are Gaussian processes. We mention it here
only for the sake of completeness.

Property 8.8. Let Wt be a Brownian motion. For any 0 ≤ t_1 ≤ ... ≤ t_n, we have that (W_{t_1}, ..., W_{t_n}) has a multivariate normal distribution.

Proof. This is a trivial consequence of the definition.

Using property 7.3, we can also fully determine the conditional probabilities.

Property 8.9. Let Wt be a Brownian motion. Then for any s < t, we have

W_s | W_t = x ∼ N( (s/t) x, (s/t)(t − s) ).

Proof. This is an immediate consequence of property 7.3.

Using the fact that Brownian motions are Gaussian processes, it is quite easy
to generate sample paths. Another interesting way in which one can generate
sample paths is by using the continuous-time limit of a symmetric random

walk. Recall that a random walk could be obtained by considering the sum of
a sequence of IID random variables
S_0 = 0,  S_n = ∑_{i=1}^{n} X_i. (n ≥ 1)

We assume that this walk is symmetric (i.e E [X] = 0) and that the underlying
process has unit variance Var (X) = 1.

We can define a continuous time process

W_n(t) = S_⌊nt⌋ / √n,

where ⌊x⌋ is the integer part of x. From the CLT (using E[X] = 0 and Var(X) = 1), we have

S_n / √n →_d N(0, 1).
Notice that for all values of n and t,

(nt − 1)/n < ⌊nt⌋/n ≤ nt/n

and since the square root is monotone, this implies that

√((nt − 1)/n) < √(⌊nt⌋/n) ≤ √(nt/n).

For any value of t, it is easy to see that

lim_{n→∞} √((nt − 1)/n) = lim_{n→∞} √(nt/n) = √t,

and hence by the sandwich theorem

lim_{n→∞} √(⌊nt⌋/n) = √t.

Therefore,

lim_{n→∞} W_n(t) = lim_{n→∞} S_⌊nt⌋ / √n = lim_{n→∞} ( S_⌊nt⌋ / √⌊nt⌋ ) · √(⌊nt⌋/n) →_d √t · N(0, 1) = N(0, t).

Using a similar strategy, one can show that for s < t, we have

W_n(t) − W_n(s) →_d N(0, t − s) as n → ∞.

Hence, this scaled random walk converges in the limit to a Brownian motion.
A graphical representation of this convergence can be found in figure 8.3.

Figure 8.3: Approximating Brownian motion with scaled symmetric random walks
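A quick sketch of this construction (assuming NumPy; the values of n below are our own choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def scaled_walk(n, T=1.0):
    """W_n(t) = S_floor(nt) / sqrt(n), evaluated on the grid t = k/n."""
    steps = rng.choice([-1.0, 1.0], size=int(n * T))   # symmetric, unit-variance steps
    S = np.concatenate([[0.0], np.cumsum(steps)])      # S_0, ..., S_{nT}
    t = np.arange(len(S)) / n
    return t, S / np.sqrt(n)

# As n grows, the paths of W_n look more and more like Brownian motion (figure 8.3).
paths = {n: scaled_walk(n) for n in (10, 100, 10_000)}
```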

In the next result, we list some of the interesting properties that Brownian
motion satisfies.

Property 8.10. A Brownian motion has the following properties.

1. Stationary transition probabilities, i.e

P (Wt+∆t ∈ A | Wt = x) = P (Wt+∆t − Wt ∈ A − x | Wt = x)

2. Strong Markov property, i.e if T is a stopping time with respect to W , then

{W_{T+∆t} − W_T | ∆t ≥ 0}

is a Brownian motion independent of {Ws | 0 ≤ s ≤ T } .


3. Invariance to time-scaling, i.e {√c W_t | t ≥ 0} and {W_{ct} | t ≥ 0} have the same distribution. Thus

{√c W_{t/c} | t ≥ 0}

is a standard Brownian motion.

4. Brownian motion is symmetrical, that is

−W ∼ W .

5. Let Wt be a standard Brownian motion. Then the time-inverted process

X_t = 0 for t = 0,  and X_t = t W_{1/t} for t > 0,

is also a standard Brownian motion.

Proof. Left as an exercise for the reader.

In previous chapters, we have already seen some theorems on the long-term


behavior of stochastic processes. The following theorem gives us the analog for
Brownian motion.
Theorem 8.11. Let Wt be a Brownian motion. Then

lim_{t→∞} W_t / t = 0.

Proof. Using the representation in statement 5 of property 8.10 (with X_t = t W_{1/t}), we find that

lim_{t→∞} W_t / t = lim_{t→∞} X_{1/t} = X(0) = 0.

Remark. Another interesting property of the paths of Brownian motion is that even
though it is continuous, it is nowhere differentiable! To gain intuition on why
Brownian motion is nowhere differentiable, consider the motion of a dust particle
in a liquid. For a small enough time frame, we know that the dust particle will
remain close to its original position, meaning that this is indeed continuous. For it
to be smooth, we need a stronger condition: there exists a small time frame where
the movement of the dust particle is approximately straight. However, the erratic
movement of a dust particle does not satisfy this condition as it can abruptly change
directions. Hence, these paths are not smooth.

8.2 Stochastic Calculus

8.2.1 Motivation

The core idea of ordinary calculus is that if we know how f changes and if we
know the initial value, we can determine the values of f . Indeed, if we know
f'(t) and f(0), then we can find f(t) for all t via

f(t) = ∫_0^t f'(u) du + f(0),

since

∫_0^t f'(u) du = f(t) − f(0).

Of course, this assumes some smoothness and integrability conditions on f and f'.

Suppose now that the change through time of the quantity is not determinis-
tic anymore, but stochastic. This could be due to (measurement) noise or could
be intrinsic to the quantity we are trying to model. The question we want to
answer is: can we still recover the quantity if the changes become stochastic? It
is obvious that the recovered quantity itself will also be random. In fact, since
it changes through time we could model it as a stochastic process. We could
therefore write
Xt+∆t − Xt = Ct,t+∆t ,

where Ct,t+∆t is the random variable that indicates the change from t to t + ∆t
in the quantity Xt . Let us try to rewrite this equality in more familiar terms.

For now, assume that the randomness is due to noise. We can then split
the random variable Ct,t+∆t into a deterministic component (the ’signal’) and a
random component (the ’noise’). We will write the rate of change due to the
signal as µ(t, Xt ) and the change due to the noise as Nt,t+∆t . For small ∆t, we
can then write1
Xt+∆t − Xt = µ(t, Xt )∆t + Nt,t+∆t ,
Let us now focus on the noise component Nt,t+∆t . In most cases, noise tends to
come from a large number of sources. For example, in communication systems,
the signal can be polluted due to noise coming from the sun, cosmic radiation,
and man-made sources. In general, we assume that the noises coming from
these sources in a given time interval [t, t + ∆t] are IID. We also assume that
the mean of the noise is zero, meaning that on average we do not expect any
noise. We can define the noise component Nt,t+∆t as the sum of these IID and
zero-mean random variables

N_{t,t+∆t} = X_{1,t+∆t} + X_{2,t+∆t} + ... + X_{N,t+∆t}.

By the central limit theorem, it follows that we can model this noise term by

N_{t,t+∆t} ∼ N(0, N · σ_X²) = N(0, σ²_{t,t+∆t}).

As ∆t grows, the number of noise terms in the interval [t, t + ∆t] increases. As a result, the variance σ²_{t,t+∆t} of N_{t,t+∆t} grows with ∆t. We will assume that this can be modeled as σ²_{t,t+∆t} = σ²(t, X_t)∆t.

Notice that this implies that for any t, ∆t > 0, we have

N_{t,t+∆t} ∼ N(0, σ²(t, X_t)∆t).

Therefore, we can model this noise term in terms of a standard Brownian mo-
tion using definition 8.6:

N_{t,t+∆t} = σ(t, X_t)(W_{t+∆t} − W_t) ∼ N(0, σ²(t, X_t)∆t).

We can therefore write


1 As the signal component is deterministic, rules from ordinary calculus still apply.

Xt+∆t − Xt = µ(t, Xt )∆t + σ (t, Xt ) (Wt+∆t − Wt ) .


Letting ∆t → 0, this can be written as

dXt = µ(t, Xt )dt + σ (t, Xt )dWt

where
dW_t = lim_{∆t→0} (W_{t+∆t} − W_t)

and similarly

dX_t = lim_{∆t→0} (X_{t+∆t} − X_t).

We now ask ourselves the question: Is it possible to solve this differential equa-
tion? In other words, can we find a stochastic process Xt whose evolution
satisfies the dynamics set by the right-hand side of the equation just like we
were able to do in the case of classical calculus? Since there is now a stochastic
component, this is no longer an ordinary differential equation. Instead, we call
such a differential equation a stochastic differential equation.

Luckily, the answer is yes!2 Even better, the solution looks a lot like what we have in the classical case:

X_{t+s} = X_s + ∫_s^{t+s} µ(X_u, u) du + ∫_s^{t+s} σ(X_u, u) dW_u.

In order to understand this, we first need to understand what we mean by ∫ σ(X_u, u) dW_u.

Remark. Recall that for real functions f and random variables X we have that
f (X) is also a random variable. 3 . Similarly, applying a real function f to a
stochastic process Xt yields a stochastic process. Thus, we will start by looking at an
integral of the form
∫_0^t Y_u dW_u,

where Y_u is a stochastic process.
2 Ifthe answer to this question was no, this section would have been a waste of time for
everyone involved...
3 This actually only holds for so-called measurable functions

8.2.2 Stochastic Integrals

In this section, we will define the notion of stochastic integrals. The way we will
do this heavily resembles the way one would construct the Lebesgue integral:

• Start by defining the integral for well-chosen easy processes.

• Show that these easy processes can approximate a large set of stochastic
processes.

• Use this approximation to extend the definition of the integral to this


larger set of processes by taking a limit.

In the case of the Lebesgue integral, one uses step functions. We will use the
continuous stochastic process analog, namely simple processes.
Definition 8.12. Let Yt be a stochastic process. We say that Yt is a simple process
if we can find a finite set 0 = t0 < t1 < t2 < ... < tN < ∞ = tN +1 and random
variables Z0 , Z1 , ..., ZN such that

Yt = Zj ⇐⇒ tj ≤ t < tj+1 . (j = 0, ..., N )

Still following the idea behind the construction of the Lebesgue integral, we
will define the stochastic integral or Itô integral as follows.
Definition 8.13. Let Yt be a simple process. Denote by Z_0, Z_1, ..., Z_N the associated random variables and the corresponding times by 0 = t_0 < t_1 < ... < t_N. Let W_t be a Brownian motion. Then we define the Itô integral X_t of the simple process Y_t,

X_t = ∫_0^t Y_u dW_u,

as

X_t = ∑_{i=0}^{k−1} Z_i (W_{t_{i+1}} − W_{t_i}) + Z_k (W_t − W_{t_k}). (where t_k ≤ t < t_{k+1})

We show that this definition for the Itô integral on simple processes satisfies
some very desirable properties.

Property 8.14. Let Wt be a Brownian motion and suppose Yt and Vt are simple
processes. Then

• If α, β are constants then αY_t + βV_t is a simple process and furthermore, we have the following linearity property:

∫_0^t (αY_s + βV_s) dW_s = α ∫_0^t Y_s dW_s + β ∫_0^t V_s dW_s.

• For any 0 < u < t, we have

∫_0^t Y_s dW_s = ∫_0^u Y_s dW_s + ∫_u^t Y_s dW_s.

• For all t ≥ 0, the Itô integral X_t of Y_t has as variance

Var(X_t) = ∫_0^t E[Y_s²] ds.

• The Itô integral is continuous almost surely.

Proof. The proof of the first statement is straightforward. For the second state-
ment, notice that if Yt is a simple process, then so are the processes

Yt I(t ≤ u) and Yt I(t ≥ u).

The proof follows then by integrating the equality Yt = Yt I(t ≤ u) + Yt I(t ≥ u)


and using linearity.

For the third statement, notice that

E[X_t] = E[ ∑_{i=0}^{k−1} Z_i (W_{t_{i+1}} − W_{t_i}) + Z_k (W_t − W_{t_k}) ] = 0. (where t_k ≤ t < t_{k+1})

Hence

Var(X_t) = E[X_t²] = E[ ( ∑_{i=0}^{k−1} Z_i (W_{t_{i+1}} − W_{t_i}) + Z_k (W_t − W_{t_k}) )² ].
Using the fact that Brownian motion has independent increments and that E[(W_t − W_s)²] = (t − s) for s < t, one can easily show that this gives

Var(X_t) = ∑_{i=0}^{k−1} E[Z_i²](t_{i+1} − t_i) + E[Z_k²](t − t_k).

Therefore, we have

Var(X_t) = ∑_{i=0}^{k−1} ∫_{t_i}^{t_{i+1}} E[Z_i²] ds + ∫_{t_k}^{t} E[Z_k²] ds.

By linearity and using indicator functions, we get

Var(X_t) = ∫_0^t E[ ( ∑_{i=0}^{k} Z_i I(t_i ≤ s ≤ t_{i+1}) )² ] ds = ∫_0^t E[Y_s²] ds.

The final statement is straightforward.

We now show that these processes can approximate a larger set of stochas-
tic processes. This larger set consists of all uniformly bounded, continuous
stochastic processes.

Proposition 8.15. Let Yt be a process such that

• t 7→ Yt is continuous

• Yt is uniformly bounded by a constant K almost surely, i.e there exists a K


such that
P (|Yt | < K ∀t ≥ 0) = 1.

Then we can find a sequence of simple processes Y_t^{(n)} such that

lim_{n→∞} ∫_0^t E[ |Y_u^{(n)} − Y_u|² ] du = 0

for all t.
Proof. We only give the construction of the sequence. Notice that it suffices to show this for all t separately. We show this for t = 1. The sequence is given by

Y_s^{(n)} = ∑_{i=1}^{n} I( (i−1)/n < s < i/n ) · ( n ∫_{(i−1)/n}^{i/n} Y_u du ).

One can show that this sequence has the desired properties.

Using this result, one can define the integral for the more general case. We
will omit the technical details.
Definition 8.16. Let Yt be a stochastic process satisfying the conditions in proposition 8.15. Let Y_t^{(n)} be the associated sequence from the same proposition. Then we can define the Itô integral of Y_t,

∫_0^s Y_u dW_u,

for all s via

∫_0^s Y_u dW_u = lim_{n→∞} ∫_0^s Y_u^{(n)} dW_u.

One can show that all the properties in property 8.14 are also satisfied by
the Itô integral of the more general class of processes.
Remark. The integral can also be defined for unbounded stochastic processes Yt .

A useful result regarding Itô integrals is the following.


Theorem 8.17. Let Wt be a Brownian motion and suppose f is an integrable and deterministic function. Then

∫_s^t f(u) dW_u ∼ N( 0, ∫_s^t f(u)² du ).

Proof. We partition the interval [s, t] using equidistant jumps

z_{n,k} = s + (k/2^n)(t − s). (k = 0, ..., 2^n)

By definition, we have

∫_s^t f(u) dW_u = lim_{n→∞} ∑_{k=0}^{2^n − 1} f(z_{n,k}) ( W_{z_{n,k+1}} − W_{z_{n,k}} ).

Since we used an equidistant grid, the increments satisfy

W_{z_{n,k+1}} − W_{z_{n,k}} ∼ N( 0, (t − s)/2^n ).

Using the facts that Brownian motion has independent increments and f is deterministic, the above result becomes, in the limit,

∫_s^t f(u) dW_u ∼ lim_{n→∞} N( 0, ∑_{k=0}^{2^n − 1} f(z_{n,k})² (t − s)/2^n ).

We recognize the Riemann sum in the variance, and by the definition of the classical integral

lim_{n→∞} ∑_{k=0}^{2^n − 1} f(z_{n,k})² (t − s)/2^n = ∫_s^t f(u)² du.

One can show that the limit of this sequence of normally distributed random variables indeed converges in distribution:

∫_s^t f(u) dW_u ∼ N( 0, ∫_s^t f(u)² du ),

from which the result follows.
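A quick Monte Carlo sanity check of this theorem (a sketch assuming NumPy; the choice f(u) = u, the grid and the sample sizes are our own):

```python
import numpy as np

rng = np.random.default_rng(3)

s, t, n, n_paths = 0.0, 1.0, 1_000, 5_000
u = np.linspace(s, t, n + 1)
du = (t - s) / n

# Approximate the Wiener integral of f(u) = u by sum f(u_k)(W_{u_{k+1}} - W_{u_k}).
dW = rng.normal(0.0, np.sqrt(du), size=(n_paths, n))
integral = (u[:-1] * dW).sum(axis=1)

print(integral.mean())   # close to 0
print(integral.var())    # close to int_0^1 u^2 du = 1/3
```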

We are now in the same position as when we are in our first calculus class
where we just learned about integrals using Riemann sums. Even though we
now understand a bit more about what is going on, we cannot really calculate
any integrals. In order to do this, we will need some more preparation.

8.2.3 Itô’s lemma and variations

Just like we looked at the construction of the Lebesgue integral for inspira-
tion for the previous section, we will now look at the fundamental theorem of
(ordinary) calculus for inspiration on how to actually solve integrals.

Before we do this, it is important to first look at the differences between


stochastic calculus and regular calculus. For this, we introduce the notion of
variation.
Definition 8.18. Let [T1, T2] be an interval with T1 < T2. Suppose we have an equidistant partition of this interval

t_k = T1 + k∆t,  ∆t = (T2 − T1)/n.
Then the total variation of a stochastic process X_t on the interval [T1, T2] is given by the random variable

V = lim_{∆t→0} ∑_{k=1}^{n} |X_{t_k} − X_{t_{k−1}}|.

Similarly, the quadratic variation of the stochastic process X_t on the interval [T1, T2] is given by the random variable

Q = lim_{∆t→0} ∑_{k=1}^{n} (X_{t_k} − X_{t_{k−1}})².

If the stochastic process is a Brownian motion we write VΠ and QΠ for the total
variation and the quadratic variation respectively.
Remark. In order to relax the notation, we do not refer to the interval bounds T1
and T2 . As the results that follow hold for any choice of T1 and T2 , this should not
cause any confusion.

It is a well-known fact that for continuously differentiable functions f (with integrable derivative), we have

lim_{∆t→0} ∑_{k=1}^{n} |f(t_k) − f(t_{k−1})| = ∫_{T1}^{T2} |df/du| du < ∞.

In stochastic calculus, the analogous result does not hold.



Proposition 8.19. Let Wt be a standard Brownian motion, then its total variation
satisfies
E [VΠ ] = ∞.

Proof. Notice that

V_Π = lim_{∆t→0} ∑_{k=1}^{n} |W_{t_k} − W_{t_{k−1}}| = lim_{∆t→0} ∑_{k=1}^{n} |X_k|,

where X_k ∼ N(0, ∆t). One can show then that

E[|X_k|] = √(2∆t/π).

Thus, we find

E[V_Π] = lim_{n→∞} n E[|X_k|] = lim_{n→∞} n √( (2/π)(T2 − T1)/n ) = lim_{n→∞} √( (2/π)(T2 − T1) n ) = ∞.

Remark. The above property implies that the total variation of a Brownian motion is unbounded over any interval with non-zero Lebesgue measure!

Even though the total variation of Brownian motion is infinite, it is quite


surprising to note that this does not hold for the quadratic variation!

Proposition 8.20. Let Wt be a Brownian motion. Then

E[Q_Π] = T2 − T1.

In particular,

E[dW_t²] = dt

and dW_t² is of order dt, so it cannot be neglected relative to dt.

Proof. Left as an exercise for the reader.



Because of this, the standard chain rule

d(Wt2 ) = 2Wt dWt

no longer holds. Indeed, taking the integral of both sides yields


W_t² = 2 ∫_0^t W_s dW_s.

Taking the expectations, we get on the left-hand side


E[W_t²] = t,

whilst on the right-hand side we get

E[ 2 ∫_0^t W_s dW_s ] = 0.

Hence, using the standard chain rule leads to incorrect results.

However, if we take into account the result from proposition 8.20, we would
get
d(W_t²) = 2W_t dW_t + dW_t².

By integrating both sides and using that the quadratic variation term integrates to t, we get

W_t² = 2 ∫_0^t W_s dW_s + ∫_0^t dW_s² = 2 ∫_0^t W_s dW_s + t,

which yields

∫_0^t W_s dW_s = (1/2)(W_t² − t).
This correction term is precisely what we needed in order to make the expec-
tations of both sides of the equation equal.
Remark. One could wonder what happens with dW_t³ and higher orders. It turns out that they are zero. For example, we have that E[|dW_t|³] is proportional to dt^{3/2}, which is negligible compared to dt.

In general, we have the following result for functions of Brownian motion.



Proposition 8.21. Let f be a twice differentiable function. Then

df(W_t) = (∂f/∂W_t) dW_t + (1/2)(∂²f/∂W_t²) dt.

Recall that in the motivation, we mentioned a kind of stochastic process


which solves an SDE of the form
dXt = µ(t, Xt )dt + σ (t, Xt )dWt .
We define these as Itô processes.
Definition 8.22. An Itô process is a stochastic process defined by the stochastic
differential equation
dXt = µ(t, Xt )dt + σ (t, Xt )dWt .

For these processes, we have a similar result as in proposition 8.21. This


result is often called Itô’s lemma.
Theorem 8.23. Let Xt be an Itô process and suppose that f(x, t) is a twice differentiable function. Then f(X_t, t) is also an Itô process and satisfies the stochastic differential equation

df = ( ∂f/∂t + µ_t ∂f/∂x + (σ_t²/2) ∂²f/∂x² ) dt + σ_t (∂f/∂x) dW_t.

Thus, we get

f(X_t, t) − f(X_s, s) = ∫_s^t σ(X_u, u) (∂f/∂x)(X_u, u) dW_u
  + ∫_s^t ( (∂f/∂t)(X_u, u) + µ(X_u, u)(∂f/∂x)(X_u, u) + (σ²(X_u, u)/2)(∂²f/∂x²)(X_u, u) ) du.

The next example shows how one can use Itô’s lemma in order to solve
SDEs.

Example 8.1. Consider the Itô process defined by


dXt = µXt dt + σ Xt dWt .

We want to find an expression for X_t using Itô's lemma. From the expression of X_t, we can find that

dX_t / X_t = µ dt + σ dW_t.
This heavily resembles the differential of the logarithm log(Xt ), which is an Itô
process by theorem 8.23. From this same theorem, we can find the true SDE
associated with this Itô process, namely

d log(X_t) = ( ∂log(X_t)/∂t + µX_t ∂log(X_t)/∂x + (1/2) σ² X_t² ∂²log(X_t)/∂x² ) dt + σ X_t (∂log(X_t)/∂x) dW_t.

Notice that f(X_t, t) = log(X_t) is not a function of t, only of X_t. Hence, we find that

∂ log(X_t)/∂t = 0,
∂ log(X_t)/∂x = 1/X_t,
∂² log(X_t)/∂x² = −1/X_t².

Thus, the SDE of the Itô process log(X_t) is given by

d log(X_t) = (µ − σ²/2) dt + σ dW_t.
Integrating both sides yields

log(X_t) − log(X_0) = ∫_0^t d log(X_s)
                    = ∫_0^t ( µ − (1/2)σ² ) ds + ∫_0^t σ dW_s
                    = ( µ − (1/2)σ² ) t + σ W_t.

Since X_t = exp(log(X_t)), we find that

X_t = X_0 e^{(µ − (1/2)σ²) t + σ W_t}.
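A short sketch (assuming NumPy; the parameter values are our own choices) comparing the exact solution above with a naive Euler–Maruyama discretization of the SDE:

```python
import numpy as np

rng = np.random.default_rng(11)

mu, sigma, X0 = 0.1, 0.3, 1.0
T, n = 1.0, 1_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])
t = np.linspace(0.0, T, n + 1)

# Exact solution of dX = mu X dt + sigma X dW.
X_exact = X0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

# Euler-Maruyama scheme: X_{k+1} = X_k + mu X_k dt + sigma X_k dW_k.
X_em = np.empty(n + 1)
X_em[0] = X0
for k in range(n):
    X_em[k + 1] = X_em[k] + mu * X_em[k] * dt + sigma * X_em[k] * dW[k]

print(abs(X_exact[-1] - X_em[-1]))   # small for fine grids
```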

Example 8.2. As another example, suppose we have a Brownian motion Wt .


Consider the square of this Brownian motion, Xt = Wt2 . A Brownian motion
can be seen as an Itô process with zero drift and σt ≡ 1. Hence, Xt = Wt2 is
also an Itô process by theorem 8.23 and from this theorem, we obtain (check
this!)

dXt = 2Wt dWt + dWt2 = 2Wt dWt + dt.

Integrating this yields the same result as we have seen before, namely

W_t² = t + 2 ∫_0^t W_s dW_s.
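A minimal numerical check of this identity (a sketch assuming NumPy; the grid size and seed are our own):

```python
import numpy as np

rng = np.random.default_rng(5)

T, n = 1.0, 100_000
dt = T / n
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])

# Left-point (Ito) approximation of the stochastic integral int_0^T W dW.
ito_integral = np.sum(W[:-1] * dW)

print(ito_integral)               # approximately (W_T^2 - T) / 2
print(0.5 * (W[-1] ** 2 - T))     # the closed-form right-hand side
```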

8.3 Reflection Principle and Hitting Times

In this section, we consider the notion of hitting times for Brownian motion.
Recall that we have already seen these in the chapter on Markov processes. We
first define the continuous counterpart.

Definition 8.24. Let Wt be a Brownian motion. The first-passage time (or hitting
time) of Wt for a > 0 is the first time Wt hits the value a:

Ta = inf {t ≥ 0 | Wt = a} .

Figure 8.4: Hitting Time for a

Evidently, the first-passage time is a random variable. Before discussing the


distribution of this random variable, we first consider some useful concepts.

For any hitting time Ta , we can define the reflected path Rt , which is ob-
tained by reflecting the portion of Wt after the hitting time Ta about the line
y = a.

Definition 8.25. Let Wt be a Brownian motion and Ta be the hitting time for a real value a. Then we define the reflected path

R_t = W_t for t < T_a,  and R_t = 2a − W_t for t ≥ T_a.

This reflected path mimics the original path Wt up until the random time
Ta , see figure 8.5.

Figure 8.5: The reflection of a Brownian motion

For times that occur after the hitting time, we have the following property.
Property 8.26. If Ta < t, then
a + Wt − WTa ∼ N (a, t − Ta ).

Proof. This is an immediate consequence of the fact that Wt − WTa ∼ N (0, t −


Ta ).

An interesting corollary of the strong Markov property, property 8.10, is the


reflection principle. Essentially, it states that after hitting a certain state a, the
reflected path and the Brownian motion are equally likely to be in any given
state.
Property 8.27. Let Wt be a Brownian motion and Rt be the reflected path around
a. Then, we have that
P (Wt > a | Ta ≤ t) = P (Rt > a | Ta ≤ t) .
In other words, we have that
P (Wt > a | Ta ≤ t) = P (Wt < a | Ta ≤ t) .
As a consequence, we find that
1
P (Wt ≥ a | Ta ≤ t) = .
2

Proof. Since Ta is evidently a stopping time, the Strong Markov Property for
Brownian motions (property 8.10) gives us that Vt = WTa +t − WTa is a Brownian
motion independent of Ws for s ≤ Ta . Hence, property 8.7 tells us that

Vt ∼ N (0, t) ∀t ≥ 0.

By the symmetry of the normal distribution, we thus find that

P(V_t > 0) = P(V_t < 0) = 1/2.

Hence, we also find that

P(W_{T_a + t} − W_{T_a} > 0) = P(W_{T_a + t} − W_{T_a} < 0) = 1/2.

The result then follows from the stationary transition probability property of Brownian motions, see property 8.10.

We now try to find the distribution of the hitting time, i.e P (Ta ≥ t).
Property 8.28. The distribution for the hitting time Ta is given by

P (Ta ≤ t) = 2P (Wt ≥ a) .

Thus,

P(T_a ≤ t) = 2( 1 − Φ(a/√t) ),

where Φ is the cumulative distribution function of a standard normally distributed random variable.

Proof. Using the law of total probability, we find

P(W_t ≥ a) = P(W_t ≥ a | T_a ≤ t) P(T_a ≤ t) + P(W_t ≥ a | T_a > t) P(T_a > t),

where the second term vanishes: since the paths of W are continuous, W_t cannot exceed a before the hitting time T_a. Using the result from property 8.27, we thus find

P(T_a ≤ t) = 2 P(W_t ≥ a).

The result now easily follows from the fact that W_t ∼ N(0, t).
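A rough Monte Carlo check of this formula (a sketch assuming NumPy; the discretized paths slightly underestimate the hitting probability, and all parameter values are our own):

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

rng = np.random.default_rng(17)

a, T, n, n_paths = 1.0, 2.0, 2_000, 5_000
dt = T / n

# Simulate discretized Brownian paths and record whether they reach level a.
dW = rng.normal(0.0, sqrt(dt), size=(n_paths, n))
W = np.cumsum(dW, axis=1)
hit = (W.max(axis=1) >= a)

print(hit.mean())                                   # Monte Carlo estimate of P(T_a <= T)
print(2 * (1 - NormalDist().cdf(a / sqrt(T))))      # theoretical 2(1 - Phi(a / sqrt(T)))
```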

As an immediate consequence, one can show (do this!) that the density function will be given by

f_{T_a}(t) = ( |a| / √(2πt³) ) e^{−a²/(2t)}. (t ≥ 0)
Examples of this density are given in figure 8.6 for several values of a.

Figure 8.6: The density function for the hitting times

As an incredibly interesting and unexpected result, we find that Brownian


motion hits every non-zero level almost surely in finite time.

Proposition 8.29. Let Wt be a Brownian motion. Then for any a ∈ R, we find


that
P (Ta < ∞) = 1.

Proof. Recall that we have already observed that for any a,

P(T_a < t) = 2( 1 − Φ(a/√t) ).

Taking the limit for t → ∞, we find

lim_{t→∞} P(T_a < t) = 2( 1 − lim_{t→∞} Φ(a/√t) ).

Since Φ is continuous and Φ(0) = 1/2, we find

P(T_a < ∞) = 2(1 − 1/2) = 1.

Since the probability of hitting a state a is one, the next natural question is
to ask what the expected hitting time is. In the following result, we show that
even though the probability to hit the state is 1, the expected hitting time for
any non-zero state is infinite.
Proposition 8.30. The expected hitting time for any nonzero state is infinite, i.e
∀a ∈ R \ {0},
E [Ta ] = ∞.

Proof. Denote by B the set B = {−a, +a} and denote t_B = E[T_B], t_a = E[T_a]. Since a ≠ 0, it holds that t_B ≠ t_a. Furthermore, since a ∈ B, it is obvious that t_B < t_a. Suppose we reach −a first; then the expected hitting time from −a to a can be written as the expected hitting time from −a to 0 (denoted r_0) plus the expected hitting time from 0 to a, which was t_a. Thus4,

t_a = t_B + (1/2)(r_0 + t_a).

Since we have stationary transition probabilities (property 8.10), we find that t_a = r_0. Hence

t_a = t_B + t_a.

Since t_B > 0, the above equality forces t_a = ∞.

We have thus found the distribution and expected value of the hitting times
for a Brownian motion. In the following section, we will consider some other
interesting distributions that are related to the dynamics of a Brownian motion.
4 The equality t_a = t_B + (1/2)(r_0 + t_a) follows from t_a = (1/2) t_B + (1/2)(t_B + t_a + r_0), which is obtained by splitting t_a into the two (equally likely) scenarios where W_t hits a first or W_t hits −a first.

8.4 Related distributions

8.4.1 Maximum of a Brownian Motion

We consider the following stochastic process derived from a Brownian motion.


Definition 8.31. Let Wt be a Brownian motion. Define the stochastic process

Mt = max {Wu | 0 ≤ u ≤ t} ,

then we call Mt the maximum process of Wt .

An example of a maximum process is given in figure 8.7, represented by a


red dotted line.
Remark. Since in our definition, we always set W0 = 0, the maximum process is
non-negative.

Figure 8.7: The maximum process Mt of a Brownian motion Wt

Before we continue, we prove the following technical result.



Proposition 8.32. Let Wt be a Brownian motion and Mt be the associated maxi-


mum process. Then for any a > 0,

Ta ≤ t ⇐⇒ Mt ≥ a.

Proof. =⇒ : Suppose Ta ≤ t. Then WTa ∈ {Wu | 0 ≤ u ≤ t} and therefore Mt ≥


WTa = a.
⇐=: Suppose M_t ≥ a. Since the path u ↦ W_u is continuous and W_0 = 0, the intermediate value theorem implies that W attains every value in the interval [0, M_t] on [0, t]. Since 0 ≤ a ≤ M_t, we thus have a = W_u for some u ∈ [0, t], and therefore T_a ≤ u ≤ t.

The above result allows us to use the results from the section on hitting
times in this case.
Proposition 8.33. The density function of the maximum process is given by

f_{M_t}(z) = √(2/(πt)) e^{−z²/(2t)}. (z ≥ 0)

Proof. Using proposition 8.32, we find

P(M_t ≥ a) = P(T_a ≤ t) = 2( 1 − Φ(a/√t) ).

One can show (do this!) that this implies

P(M_t < a) = 1 − √(2/(πt)) ∫_a^∞ e^{−u²/(2t)} du.

Using the fact that

√(1/(2πt)) ∫_0^a e^{−u²/(2t)} du + √(1/(2πt)) ∫_a^∞ e^{−u²/(2t)} du = 1/2,

we can rewrite this as

P(M_t < a) = √(2/(πt)) ∫_0^a e^{−u²/(2t)} du.

Hence by the fundamental theorem of calculus, this gives

f_{M_t}(z) = √(2/(πt)) e^{−z²/(2t)}. (z ≥ 0)

Figure 8.8: Density of the maximum process

Let us now consider the expected maximum level of Brownian motions. This
can easily be obtained since we have the density for the maximum process.

Property 8.34. For a Brownian motion, the expected value of the maximum is

E[M_t] = √(2t/π).

Proof. This follows from a simple calculation and is left as an exercise for the
reader.
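A simple Monte Carlo check of this value (a sketch assuming NumPy; grid and sample sizes are our own):

```python
import numpy as np

rng = np.random.default_rng(23)

t, n, n_paths = 1.0, 2_000, 5_000
dt = t / n

# Running maximum of discretized Brownian paths on [0, t].
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
W = np.cumsum(dW, axis=1)
M_t = np.maximum(W.max(axis=1), 0.0)     # include W_0 = 0 in the maximum

print(M_t.mean())              # Monte Carlo estimate of E[M_t]
print(np.sqrt(2 * t / np.pi))  # theoretical value sqrt(2t / pi)
```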

We can also easily define the minimum process of a Brownian motion.



Definition 8.35. Let Wt be a Brownian motion, then the minimum process mt is


defined
mt = min {Wu | 0 ≤ u ≤ t} .

Just like the maximum process is non-negative, the minimum process is non-positive. Notice that by symmetry the minimum process m_t has the same distribution as −M_t. We thus find the following result.

Property 8.36. For a Brownian motion, the expected value of the minimum is

E[m_t] = E[−M_t] = −√(2t/π).

We thus find that E[M_t] → ∞ and E[m_t] → −∞ as t → ∞. In order for this to hold, the Brownian motion must cross the zero line infinitely often, and thus by property 8.10 any level is crossed infinitely many times. Thus, any level a ∈ R is recurrent.

8.4.2 Zeroes of Brownian Motion

In the previous section, we have observed that the zero state is expected to be
hit infinitely often. In this section, we are interested in studying the behavior of
these zeroes.

Proposition 8.37. Let Wt be a Brownian motion and let s < u. Then the probability that Wt has no zero in the interval (s, u) is given by

P(Z^C) = (2/π) arcsin( √(s/u) ).

Here, Z denotes the event Z = {W_t has at least one zero in (s, u)}.

Proof. Suppose Wt starts at state a at time s, i.e Ws = a. Then the probability


of hitting 0 in the interval (s, u) is equal to the probability of hitting −a starting
from 0 in a time smaller than u − s:

P (Z | Ws = a) = P (T−a ≤ u − s) ,

which by symmetry (property 8.10) equals P (Ta ≤ u − s) .

We integrate this over all possible states a and obtain

P(Z) = ∫_{−∞}^{∞} P(T_a ≤ u − s) (1/√(2πs)) e^{−a²/(2s)} da
     = ∫_{−∞}^{∞} ∫_0^{u−s} f_{T_a}(y) dy (1/√(2πs)) e^{−a²/(2s)} da
     = ∫_{−∞}^{∞} ∫_0^{u−s} ( |a| / √(2πy³) ) e^{−a²/(2y)} dy (1/√(2πs)) e^{−a²/(2s)} da,   (prop. 8.28)

Changing the order of integration, we get

P(Z) = ∫_0^{u−s} ( 1 / (2π√(sy³)) ) ∫_{−∞}^{∞} |a| e^{−a²/(2y)} e^{−a²/(2s)} da dy.

Notice that the inner integrand is symmetric around 0, and hence we can further rewrite this as

P(Z) = 2 ∫_0^{u−s} ( 1 / (2π√(sy³)) ) ∫_0^{∞} a e^{−a²/(2y) − a²/(2s)} da dy.

Using the substitution v = a², we get

P(Z) = ∫_0^{u−s} ( 1 / (2π√(sy³)) ) ∫_0^{∞} e^{−(v/2)(1/y + 1/s)} dv dy.

We obtain

P(Z) = ∫_0^{u−s} ( 1 / (π√(sy³)) ) · ( sy / (s + y) ) dy,

which can be rewritten as

P(Z) = (√s / π) ∫_0^{u−s} dy / ( (s + y)√y ).

Using the substitution y = s x², we get

P(Z) = (2/π) ∫_0^{√((u−s)/s)} dx / (1 + x²).

This can be rewritten as

P(Z) = (2/π) arctan( √((u − s)/s) ).

Remember the mnemonic SOH-CAH-TOA, see figure 8.4.2. Applying this to a right triangle with legs √s and √(u − s), and hypotenuse √u, we find

P(Z) = (2/π) arccos( √(s/u) ).

Thus,

P(Z) = (2/π)( π/2 − arcsin(√(s/u)) ) = 1 − (2/π) arcsin( √(s/u) ).

Hence,

P(Z^C) = (2/π) arcsin( √(s/u) ).

Figure 8.9: Distribution of the last zero

Using this result, we can find the distribution of the time of the last zero.

Proposition 8.38. Let Lt denote the time of the last zero of a standard Brownian motion W_u in [0, t], i.e

L_t = sup {s < t | W_s = 0}.

Then

P(L_t < s) = (2/π) arcsin( √(s/t) ).

This means that L_t is arcsine-distributed with support (0, t). It has as density

f_{L_t}(s) = 1 / ( π √(s(t − s)) ). (0 < s < t)

Figure 8.10: Density of the last zero

8.4.3 Times of maximum processes

We have already studied the maximum process Mt from a Brownian motion, but
in this section, we study the distribution of the times at which those maximums
are attained for the first time. In other words, we consider the first-passage
times of the maximum process associated with a Brownian motion.

Proposition 8.39. Let Wt be a standard Brownian motion and Mt its associated maximum process. For any t, define the random variable U(t) to be the first time at which the Brownian motion attains its maximum over [0, t], i.e

U(t) = inf {s ≤ t | W_s = M_t}.

This random variable has as density function

f_{U(t)}(u) = 1 / ( π √(u(t − u)) ),

i.e U(t) is arcsine-distributed with support (0, t).



Proof. Start by choosing values 0 ≤ a < y. The set of paths {Mt ≤ y, Ta ≤ u | u < t}
contains all paths of Wt which, at time t, have already hit a but have not ex-
ceeded y. Furthermore, since a ≥ 0 we have that at the first passage time Ta = s,
it holds that Ws = Ms = a. Thus, from property 8.10 we find that

P (Mt ≤ y | Ta = s) = P (Mt−s ≤ y − a) .

Integrating over the hitting time of a, we find that by definition of conditional probabilities

P(M_t ≤ y, T_a ≤ u) = ∫_0^u P(M_t ≤ y | T_a = s) f_{T_a}(s) ds.

By combining the previous results on the distributions of the maximum process and the hitting times, we find

P(M_t ≤ y, T_a ≤ u) = ∫_0^u ∫_0^{y−a} ( a / (π s √(s(t − s))) ) e^{−(1/2)( x²/(t−s) + a²/s )} dx ds.

Using the fundamental theorem of ordinary calculus, this yields the joint density

f(y, u) = ( a / (π u √(u(t − u))) ) e^{−(1/2)( (y−a)²/(t−u) + a²/u )}.

We now try to relate this to the density of the random variable U (t) . Notice that
Mt = y implies that U = Ty (check this!). In the joint density equation above,
we can hence set a = y and Ta = Ty = U , yielding

f_{M_t,U}(y, u) = ( y / (π u √(u(t − u))) ) e^{−y²/(2u)}. (0 < u < t, y > 0)

In order to extract the density function of U , we integrate out the maximum


process and obtain

f_U(u) = ∫_0^∞ ( y / (π u √(u(t − u))) ) e^{−y²/(2u)} dy = 1 / ( π √(u(t − u)) ), (0 < u < t)

proving the required result.



8.5 Jump-Diffusion Processes

In chapter 2, we have seen three different definitions of Poisson processes. With


the machinery that we have introduced so far, we can give another one where
we use the underlying stochastic differential equation. This definition is closely
related to the first definition.
Definition 8.40. A Poisson process Nt is any process that satisfies the stochastic differential equation

dN_t = N_{t+dt} − N_t with dN_t ∈ {0, 1, 2, 3, ...},

P(dN_t = k) = (λ dt)^k e^{−λ dt} / k!.
Notice that in contrast to the previously seen SDEs, the differentials dNt are discrete
random variables as opposed to continuous random variables.

We will use this differential dNt in order to introduce jump-like dynamics to


continuous stochastic processes such as Wt . We can also easily define stochas-
tic processes whose dynamics are completely governed by the jump processes,
together with another stochastic process that represents the jump size.
Definition 8.41. Let Nt be a Poisson process and let Jt be any stochastic process
independent of Nt . Then we can define the process Xt with random jumps via the
SDE
dXt = Jt dNt .

Notice that, for dt small enough, we find that dN_t is roughly Bernoulli:

dN_t ≈_d Bernoulli(λ dt).

This significantly simplifies the first-order dynamics imposed on a process Xt


with random jumps.
Property 8.42. Suppose that Xt is a random jump process with jump size governed by Jt and jump frequency determined by the Poisson process Nt. Then, to first order in dt, we have

P(dX_t ≤ a) ≈ P(J_t ≤ a)(λ dt) + I_{a≤0}(1 − λ dt).



Proof. This is a straightforward result from the observations above.

An example of a stochastic process with random jumps is given in figure


8.11.

Figure 8.11: A stochastic process with random, normally distributed jumps Jt

When no jumps are present, recall that an Itô process can be written as

dXt = µt (Xt , t)dt + σt (Xt , t)dWt .

Processes of this form are also called diffusion processes, because of their ties
to the motion of random particles.

We have seen that a function of an Itô process f(X_t) is, under some mild conditions, again an Itô process with associated stochastic differential equation

df(X_t) = ( ∂f/∂t + µ_t ∂f/∂x + (1/2) σ_t² ∂²f/∂x² ) dt + σ_t (∂f/∂x) dW_t.
We will generalize this to stochastic processes with jump dynamics as well. This
generalization will be defined as jump-diffusion processes.

Definition 8.43. Let Wt be a Brownian motion. Suppose Nt is a Poisson process


and It is a stochastic process. A jump-diffusion process is a process satisfying the
stochastic differential equation

dXt = µt dt + σt dWt + Jt dNt .

In order to make these processes a bit more tractable, we make the following
dependency assumptions.

• dNt and Jt are independent

• dNt and dWt are independent

• µt and σt are functions of t and Xt . They depend on the process up to,


but not including time t. One denotes this often as µt− .

The only jump-diffusion processes which are Itô processes are those with no
jump dynamics5 . As a consequence, we can’t apply Itô’s lemma to jump-
diffusion processes.

However, one can show that for a stochastic function f and a jump-diffusion
process Xt , the associated stochastic process f (Xt ) is still a jump-diffusion pro-
cess. In fact, we have the following generalization of Itô’s lemma.

Proposition 8.44. Let Xt be a jump-diffusion process

dX_t = µ_t dt + σ_t dW_t + J_t dN_t,

and suppose f(X_t, t) is a stochastic function that satisfies some mild conditions. Then

df(X_t) = ( ∂f/∂t + µ_t ∂f/∂x + (1/2) σ_t² ∂²f/∂x² ) dt + σ_t (∂f/∂x) dW_t + ( f(X_t + J_t, t) − f(X_t, t) ) dN_t.

Example 8.3. In his celebrated model, Robert Merton used a jump-diffusion


model to model the behavior of equity prices. With St representing the value
5 In which case it doesn’t really make sense to call them jump-diffusion processes

of the equity, σ representing the equity volatility and µ the expected return, the
stochastic differential equation of St is given by
dS_t / S_{t−} = µ dt + σ dW_t + (J_t − 1) dN_t.
The jumps caused by dNt can be due to unannounced news (such as the start of
a war), or due to scheduled news such as the introduction of monetary policies
affecting the equity of interest.

The jump size Jt represents the recovery value: if it is 0, it means that the
equity became worthless after the jump. If it is 0.9, it means that it dropped
10% in value. Some models allow Jt to be negative as well, allowing for positive
news to be included in the jump-diffusion model.

If we assume that the jump sizes J_t are log-normally distributed, the solution of this SDE is given by

S_t = S_0 e^{(µ − (1/2)σ²)t + σ W_t + ∑_{j=1}^{N_t} Y_j},

where the Y_j are normally distributed. Examples of the paths of this process can be found in figure 8.12.
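A rough simulation sketch of such paths (assuming NumPy; all parameter values are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(29)

mu, sigma, lam = 0.05, 0.2, 0.5       # drift, volatility, jump intensity
m, v = -0.1, 0.15                     # mean and std of the normal jump sizes Y_j
S0, T, n = 100.0, 1.0, 1_000
dt = T / n

t = np.linspace(0.0, T, n + 1)
W = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n))])

# Compound Poisson part: number of jumps per step, then normal jump sizes.
jumps = rng.poisson(lam * dt, size=n)
Y = np.array([rng.normal(m, v, size=k).sum() for k in jumps])
J = np.concatenate([[0.0], np.cumsum(Y)])

S = S0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W + J)
# S is one sample path of the Merton jump-diffusion model (figure 8.12).
```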

Figure 8.12: Some sample paths from the Merton Jump-Diffusion model with
log-normal jumps

8.6 Martingales

A lot of the examples of stochastic processes we have covered in the previous


sections have an interesting property called the martingale property. Processes
with this property can be seen as representing fair games. They are defined as
follows.

Definition 8.45. The stochastic process Xt is a martingale if it satisfies the follow-


ing two properties

1. E [|Xt |] < ∞ for all t

2. For any 0 ≤ t_1 < t_2 < ... < t_n < t, we have

E[ X_t | X_{t_1}, X_{t_2}, ..., X_{t_n} ] = X_{t_n}.

In other words, given all the information regarding a stochastic process up to some
point tn , the best guess we have for any point in the future is the latest observation
Xtn .

We show that some of the processes we have seen before are martingales.

Example 8.4. Consider the random walk S_n generated by the random variables X_i with distribution

P(X_i = 1) = 1/2,  P(X_i = −1) = 1/2.

Then S_n is easily seen to be a martingale.

Example 8.5. A Brownian motion is a martingale. Indeed,

E [Wt | Wu , u ≤ s] = E [Wt − Ws + Ws | Wu , u ≤ s] .
Using the linearity of the expectation operator and the fact that the increment W_t − W_s ∼ N(0, t − s) is independent of {W_u, u ≤ s} (check this!), we have

E[W_t | W_u, u ≤ s] = E[W_t − W_s | W_u, u ≤ s] + E[W_s | W_u, u ≤ s] = 0 + W_s = W_s.

This is only the case for Brownian motions without drift. Indeed, for a Brownian
motion with drift
Xt = µt + σ Wt
we have
E [Xt+s | Xs ] = µt + Xs , Xs .

Figure 8.13: Comparison Brownian motion with and without drift



Example 8.6. The Itô integral of simple processes is a martingale. Indeed, recall that these were of the form

X_t = ∑_{i=0}^{k−1} Z_i (W_{t_{i+1}} − W_{t_i}) + Z_k (W_t − W_{t_k}). (where t_k ≤ t ≤ t_{k+1})

Suppose we condition this on X_s, where t_l ≤ s < t_{l+1} < t_k ≤ t ≤ t_{k+1}. We can write

X_t = [ ∑_{i=0}^{l−1} Z_i (W_{t_{i+1}} − W_{t_i}) + Z_l (W_s − W_{t_l}) ] + Z_l (W_{t_{l+1}} − W_s) + ∑_{i=l+1}^{k−1} Z_i (W_{t_{i+1}} − W_{t_i}) + Z_k (W_t − W_{t_k}),

where the term in square brackets equals X_s. Hence using the above result we get

E[X_t | X_s] = X_s + E[ Z_l (W_{t_{l+1}} − W_s) + ∑_{i=l+1}^{k−1} Z_i (W_{t_{i+1}} − W_{t_i}) + Z_k (W_t − W_{t_k}) | X_s ],

which indeed yields E[X_t | X_s] = X_s.

Example 8.7. Related to the above example, the quadratic process

M_2(t) = 2 ∫_0^t W_u dW_u = W_t² − t

is also a martingale.

Indeed, using the fact that t is not a random variable, we get

E[W_t² − t | W_u, u ≤ s] = E[W_t² | W_u, u ≤ s] − t
  = E[(W_t − W_s + W_s)² | W_u, u ≤ s] − t
  = E[(W_t − W_s)² | W_u, u ≤ s] + W_s² + 2E[(W_t − W_s)W_s | W_u, u ≤ s] − t
  = E[(W_t − W_s)²] + W_s² + 2W_s E[W_t − W_s] − t
  = (t − s) + W_s² − t = W_s² − s.

Example 8.8. Consider the exponential

M_e(t) = e^{a W_t − a²t/2},

for any real a ∈ R.

We claim that this is a martingale. For this, notice that for any s < t, we can decompose (check this!)

M_e(t) = M_e(s) e^{a(W_t − W_s) − a²(t−s)/2}.

Hence,

E[M_e(t) | M_e(s)] = M_e(s) E[ e^{a(W_t − W_s) − a²(t−s)/2} ] = M_e(s) E[e^X],

where X ∼ N( −a²(t−s)/2, a²(t−s) ).

Notice that the second term on the right-hand side is exactly the moment-generating function of X evaluated at 1, Φ_X(1), and using example 1.20 we see that this equals

E[e^X] = Φ_X(1) = e^{−a²(t−s)/2 + a²(t−s)/2} = 1.

Hence,

E[M_e(t) | M_e(s)] = M_e(s).

8.7 Related Processes

8.7.1 Ornstein-Uhlenbeck Process

As we have mentioned already multiple times, Brownian motion is often used


to model the trajectory of a particle suspended in a medium. As we discussed
in remark 8.1, the velocity of paths governed by Brownian motions will never be
defined. One possible way to avoid this phenomenon is to model the velocity
Vt as a stochastic process, and then define the trajectories using the familiar
relation dXt = Vt dt.

When modeling the velocity of particles, it is important to account for fric-


tion. The higher the velocity of a particle, the stronger the friction which slows
down the particle. This phenomenon is called mean reversion, meaning the
process will tend to return to some equilibrium or average. A model that incor-
porates this is the Ornstein-Uhlenbeck process.
Definition 8.46. Let Wt be a Brownian motion. Then the Ornstein-Uhlenbeck
process is defined by
Xt = e−t We2t .

A visualization of some paths of this process can be found in figure 8.14.

Figure 8.14: Sample paths of an Ornstein-Uhlenbeck process
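A small sketch generating such paths directly from the definition X_t = e^{−t} W_{e^{2t}} (assuming NumPy; the time grid is our own choice):

```python
import numpy as np

rng = np.random.default_rng(31)

t = np.linspace(0.0, 3.0, 600)
s = np.exp(2 * t)                       # time-changed clock e^{2t}

# Brownian motion evaluated at the increasing times s_0 < s_1 < ...,
# built from independent increments with variances s_{k+1} - s_k.
increments = rng.normal(0.0, np.sqrt(np.diff(s)), size=len(s) - 1)
W_s = np.concatenate([[rng.normal(0.0, np.sqrt(s[0]))], increments]).cumsum()

X = np.exp(-t) * W_s                    # one Ornstein-Uhlenbeck path
# Repeating this gives sample paths as in figure 8.14; each X_t is N(0, 1).
```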

This process has a couple of nice properties.



Property 8.47. An Ornstein-Uhlenbeck process Xt satisfies

Xt ∼ N (0, 1), ∀t.

Proof. It suffices to show that W_1 =_d X_t. By property 8.10, we know that

√c W_t =_d W_{ct}.

This also implies that

√c W_{e^{2t}} =_d W_{c e^{2t}}.

Taking c = e^{−2t}, we therefore find

W_1 =_d √(e^{−2t}) W_{e^{2t}} = e^{−t} W_{e^{2t}} = X_t,

from which the result follows.

Property 8.48. An Ornstein-Uhlenbeck process is weakly stationary.

Proof. We have

cov(Xt , Xt+s ) = e−2t−s cov (We2t , We2(t+s) ) .

Since Brownian motion is Gaussian with

K(Wt , Ws ) = min(t, s)

it follows that

cov(Xt , Xt+s ) = e−2t−s min(e2t , e2(t+s) ) = e−2t−s e2t = e−s .

The fact that E [Xt ] = E [X0 ] is a trivial consequence of property 8.47.

In fact, Ornstein-Uhlenbeck processes are stationary. We will show this by


invoking theorem 7.10, which requires the following result.

Proposition 8.49. An Ornstein-Uhlenbeck Xt process is a Gaussian process.



Proof. We need to show that for any 0 ≤ t_1, ..., t_N, it holds that (X_{t_1}, ..., X_{t_N}) is jointly normally distributed. By property 7.4, this is equivalent to showing that any non-trivial linear combination is normally distributed, i.e

∑_{i=1}^{N} a_i X_{t_i} ∼ N(µ, σ²)

for some µ, σ ∈ R and any a_1, ..., a_N ∈ R with at least one a_i ≠ 0.

We write

∑_{i=1}^{N} a_i X_{t_i} = ∑_{i=1}^{N} a_i e^{−t_i} W_{e^{2t_i}}.

Define the real numbers α_i = a_i e^{−t_i} for all i = 1, ..., N and define the positive real numbers s_i = e^{2t_i}. Using these, we can rewrite

∑_{i=1}^{N} a_i X_{t_i} = ∑_{i=1}^{N} α_i W_{s_i},

which is normal since Brownian motions are Gaussian processes. Since the choice of a_1, ..., a_N was arbitrary, the same holds for any non-trivial linear combination, proving the result.

Hence, we can extend property 8.48.


Property 8.50. An Ornstein-Uhlenbeck process is stationary.

Proof. This follows from property 8.48, proposition 8.49, and theorem 7.10.

The next result gives the stochastic differential equation of the Ornstein-
Uhlenbeck process.
Proposition 8.51. The stochastic differential equation

dX_t = −X_t dt + √2 dW_t

has as solution

X_t = e^{−t} W_{e^{2t}}.

Proof. Denote by X_t the solution of the SDE dX_t = −X_t dt + √2 dW_t. This is an Itô process with drift µ_t(X_t, t) = −X_t and diffusion coefficient σ_t(X_t, t) = √2.

We start by applying Itô's lemma on the function f(t, X_t) = e^t X_t. This yields

df = ( ∂f/∂t − X_t ∂f/∂x + ∂²f/∂x² ) dt + √2 (∂f/∂x) dW_t
   = ( e^t X_t − e^t X_t + 0 ) dt + √2 e^t dW_t
   = √2 e^t dW_t.

Integrating from −∞ to t gives us

e^t X_t = ∫_{−∞}^{t} √2 e^u dW_u,

and therefore we find

X_t = e^{−t} ∫_{−∞}^{t} √2 e^u dW_u.

We can apply theorem 8.17 since the integrand is a deterministic function. We thus find that the last term has as distribution

∫_{−∞}^{t} √2 e^u dW_u ∼ N( 0, ∫_{−∞}^{t} 2 e^{2u} du ).

It is easy to see that this implies (check this!)

∫_{−∞}^{t} √2 e^u dW_u ∼ N(0, e^{2t}).

From this, one finds that X_t = e^{−t} W_{e^{2t}} is a solution of the SDE.



Example 8.9. In this example, we introduce the Hull-White one-factor model.


This generalization of the Ornstein-Uhlenbeck process is frequently used to
model the short rate rt . The short rate rt represents the interest rate at which a
market participant can borrow (or lend) money for an infinitesimal amount of
time. Thus, borrowing one unit of money for a very short time period between
t and t + ∆t will cost around r(t)∆t.

The short rate is an important quantity in financial mathematics, as the


prices of many financial instruments can be completely determined by it. Even
though future short rates are unknown, one can extract the markets' predictions by looking at the prices of several products. By plotting this 'implied' future short rate for every time t, we obtain a curve called the initial term structure, which we denote by θ_0(t).

When modeling the short rate r(t) using a stochastic process, most traders
want that the predicted model prices of the instruments should, on average,
coincide with the prices on the current market. This means that on average,
our process for rt should coincide with the market implied curve θ0 (t). This
can be forced using mean reversion.

In contrast with the Ornstein-Uhlenbeck process we covered earlier, the mean is no longer zero but now becomes θ_0(t). This means that the SDE changes to

dr_t = ( dθ_0(t)/dt + a · (θ_0(t) − r_t) ) dt + σ dW_t,

with initial condition r_0 = θ_0(0), the current short rate. Here, a represents the rate at which r_t reverts to θ_0(t). Let us show that this indeed implies that E[r_t] = θ_0(t).

For this, apply Itô's lemma on f(r_t, t) = e^{at} r_t. This yields

df = ( a e^{at} r_t + e^{at} [ dθ_0(t)/dt + a · (θ_0(t) − r_t) ] ) dt + σ e^{at} dW_t
   = ( e^{at} dθ_0(t)/dt + a e^{at} θ_0(t) ) dt + σ e^{at} dW_t.

Notice that

( e^{at} dθ_0(t)/dt + a e^{at} θ_0(t) ) dt = d( θ_0(t) e^{at} ),

and hence we obtain after integrating the SDE of f (rt ) from 0 to t


Z t
at at
e rt − r0 = θ0 (t)e − θ0 (0) + eas σ (s)dWs .
0

Solving for rt , we obtain


Z t
rt = r0 e −at
+ θ0 (t) − θ0 (0)e −at
+ ea(s−t) σ (s)dWs .
0

Using the initial condition r0 (t) = θ0 (0), this simplifies to


Z t
rt = θ0 (t) + ea(s−t) σ (s)dWs .
0

Assuming a deterministic function for σ , we can use theorem 8.17 to find the
desired result,
E [rt ] = θ0 (t),

8.7.2 Geometric Brownian Motion

In the section on martingales, we briefly covered the exponential process, defined as M_e(t) = e^{a W_t − a²t/2}. In the section on jump-diffusion processes, we have also seen how Merton modeled equity prices using a similar exponential-type stochastic process.

It turns out that these processes are all related to a special kind of stochas-
tic process, called geometric Brownian motions, which have some interesting
properties that make them incredibly useful. In fact, the Nobel prize-winning
Black-Scholes model crucially uses these stochastic processes. They are defined
as follows.

Definition 8.52. Let Wt be a Brownian motion. A standard geometric Brownian motion X_t is a process given by

X_t = e^{W_t}.

It follows from the calculations in example 8.1 that a standard geometric Brownian motion solves the stochastic differential equation

dX_t = (1/2) X_t dt + X_t dW_t.

We also have the following results.
Property 8.53. The geometric Brownian motion has moments

E[X_t] = e^{t/2},  Var(X_t) = e^{2t} − e^t.

Proof. Notice that if Xt is a geometric Brownian motion, then

log(Xt ) = log(eWt ) = Wt ∼ N (0, t),

from which it follows that Xt ∼ ℓN (0, t). The result then follows from example
1.19.

As a trivial consequence, geometric Brownian motions are not Gaussian.

Figure 8.15: Geometric Brownian Motion

It is not difficult to see that geometric Brownian motion is strictly positive,
as opposed to regular Brownian motion.
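As a quick numerical illustration (an assumed simulation set-up, not taken from the text), the Python sketch below generates geometric Brownian motion values at time t by exponentiating simulated Brownian motion, compares the sample mean and variance with Property 8.53, and confirms positivity.

```python
import numpy as np

t, n_paths = 1.0, 500000
rng = np.random.default_rng(1)

# W_t ~ N(0, t), so X_t = exp(W_t) is geometric Brownian motion at time t.
W_t = np.sqrt(t) * rng.standard_normal(n_paths)
X_t = np.exp(W_t)

print("sample mean:", X_t.mean(), "  theory:", np.exp(t / 2))
print("sample var :", X_t.var(),  "  theory:", np.exp(2 * t) - np.exp(t))
print("all values positive:", bool((X_t > 0).all()))
```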
Remark. By replacing the Brownian motion in the exponent with more general processes
with stationary and independent increments, called Lévy processes, one obtains a wide
family of exponential Lévy models. Lévy processes are intrinsically related to both
Brownian motions and Poisson processes. They will not be covered in this book.

8.8 Optional stopping and First Exits

We have already seen some results on stopping times, such as the Strong Markov
Property. In this section, we introduce the related notion of stopped processes,
and also introduce a new stopping time called the first exit time.

Definition 8.54. Let Xt be a stochastic process and suppose T is a stopping time
for Xt . Then we define the stopped process
$$Z_t = X_{t\wedge T} = X_{\min(t,T)} =
\begin{cases} X_t & t < T\\ X_T & t \geq T \end{cases}.$$

In other words, the stopped process is the process that stops changing as soon as the
stopping time is reached.
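The construction is easy to mimic numerically. The Python sketch below is an illustrative example (the symmetric random walk and the choice of T as the first visit to level 5 are assumptions made for the demo): it freezes each simulated path at its stopping time and checks that the mean of the stopped process stays near E[X0] = 0, which anticipates the next theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, n_steps, level = 20000, 200, 5     # assumed stopping time: first visit to +5

steps = rng.choice([-1, 1], size=(n_paths, n_steps))
X = np.concatenate([np.zeros((n_paths, 1)), steps.cumsum(axis=1)], axis=1)

# First time the level is reached (use the final index if it is never reached).
reached = (X >= level)
T = np.where(reached.any(axis=1), reached.argmax(axis=1), n_steps)

# Stopped process Z_t = X_{min(t, T)}: freeze each path from its stopping time on.
Z = X.copy()
for i in range(n_paths):
    Z[i, T[i]:] = Z[i, T[i]]

print("mean of stopped process at final time:", Z[:, -1].mean())   # close to 0
```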

Figure 8.16: A stopped process Zt .

The next result tells us that a stopped martingale process is also a martin-
gale.

Theorem 8.55. Let Xt be a martingale, and let T be a stopping time for the process
Xt . Then the stopped process Zt = Xt∧T is also a martingale. Thus in particular for
all t,
E [Zt ] = E [Z0 ] = E [X0 ] .

Proof. Since Xt is a martingale, we have that E [|Xt |] < ∞ for all t. Hence, we
have that E [|Zt |] < ∞ for all t as well.

Notice that, using the indicator function, we can split the stopped process
as
Zt = Zs + (Xt − Xs )I(s < T ).
Check that this coincides with Zt for all t ≥ 0. Then, we find that

E [Zt | Xu , u ≤ s] = E [Zs + (Xt − Xs )I(s < T ) | Xu , u ≤ s] .

Since T is a stopping time, we can write this as

E [Zt | Xu , u ≤ s] = Zs + I(s < T )E [Xt − Xs | Xu , u ≤ s] .

Since Xt is a martingale, this implies the desired result, i.e.

E [Zt | Xu , u ≤ s] = Zs .

Unfortunately, this pointwise property does not allow us to conclude that
E [XT ] = E [X0 ]. However, the following result, called the optional stopping
theorem, gives conditions under which this conclusion is allowed.
Theorem 8.56. Let Xt be a martingale defined on (Ω, F , P). Let T be a stopping
time for Xt , which is hence also a random variable defined on (Ω, F , P). Suppose
that at least one of the two following conditions holds:

• The stopping time is bounded almost surely, i.e. there exists a c ≥ 0 such that
P (T ≤ c) = 1.

• The stopping time is almost surely finite (but not necessarily bounded) and the
stopped process is bounded: there exists a positive number K such that for all t ≥ 0,
P (|Xt∧T | ≤ K) = 1.

Then E [XT ] = E [X0 ].

Proof. We give an outline of the proof; see [DW] for more details. Since T is
almost surely finite, we can define the random variable XT on (Ω, F , P) via
$$X_T(\omega) = X_{T(\omega)}(\omega), \qquad \forall \omega \text{ such that } T(\omega) < \infty.$$
This defines XT on a set of probability one, which we denote by H. On the null
set where T is infinite, we can set XT to 0.⁶ Notice that for all ω ∈ H, we have the
pointwise convergence
$$\lim_{t\to\infty} X_{t\wedge T}(\omega) = X_T(\omega).$$
It remains to be shown that E [XT ∧t ] → E [XT ], since the result then follows
from theorem 8.55. Using the additional assumptions, one can construct a
random variable that is both integrable and dominates |XT ∧t | for all t. From
this, the Dominated Convergence Theorem yields the desired result.

We will use this result in order to study first exit times. They are defined as
follows.
Definition 8.57. The first exit time from the interval (b, a) is the first time at
which a Brownian motion Wt hits either a or b, where b < 0 < a:
$$T_{ab} = \inf\{t \geq 0 \mid W_t \notin (b, a)\}.$$
Since Brownian motion has continuous paths that start at 0, this random variable
is almost surely strictly positive.

It is evident that first exit times are stopping times. Just like we did with
hitting times, we now consider the probability that these random variables are
finite.
Property 8.58. The exit time Tab for a finite interval (b, a) satisfies
P (Tab < ∞) = 1.

Proof. It is evident that for the hitting time Ta , we have Tab ≤ Ta . The result
then follows from proposition 8.29.
6 Expectations are integrals, whose values are not influenced by events in null sets.

Consider now the stopped martingale with stopping time Tab . As soon as
the Brownian motion exits the interval (b, a), the stopped process stays at the
boundary value that was hit. Notice that the Brownian motion can leave through
either a or b. What is the probability of each of these scenarios?

Proposition 8.59. Suppose Wt is a standard Brownian motion and let (b, a) be an
interval with b < 0 < a. Then the probability of first exit via a is given by
$$P(T_{ab} = T_a) = \frac{b}{b-a}.$$
Hence, the probability of first exit via b is given by
$$P(T_{ab} = T_b) = \frac{a}{a-b}.$$

Proof. Notice that
$$E\left[W_{T_{ab}}\right] = E\left[a\,I(T_{ab} = T_a) + b\,I(T_{ab} = T_b)\right],$$
which by proposition 1.65 is equal to
$$E\left[W_{T_{ab}}\right] = a\,P(T_{ab} = T_a) + b\,P(T_{ab} = T_b).$$
Since P (Tab = Ta ) + P (Tab = Tb ) = 1, we can rewrite this as
$$E\left[W_{T_{ab}}\right] = b + (a-b)\,P(T_{ab} = T_a).$$
Notice that |Wt∧Tab | ≤ a − b for all t, which together with property 8.58 allows us
to apply theorem 8.56 to find
$$E\left[W_{T_{ab}}\right] = E[W_0] = 0,$$
which implies that
$$P(T_{ab} = T_a) = \frac{b}{b-a} = 1 - \frac{a}{a-b} = 1 - P(T_{ab} = T_b),$$
from which the required result follows.

Recall that the expected hitting time E [Ta ] was infinite for any state a ≠ 0.
We now consider the expected exit time for the interval b < 0 < a. We have the
following result.

Property 8.60. Let Wt be a Brownian motion and let (b, a) be an interval with b < 0 < a.
The expected exit time is given by

E [Tab ] = −ab.

Proof. We will leverage the fact that the quadratic process is a martingale (see
example 8.7):
$$X_t = 2\int_0^t W_s\,dW_s = W_t^2 - t.$$

With Tab being the exit time for Wt , it is easy to see that Tab is a stopping time
for Xt . Therefore, by theorem 8.56 (check why the assumptions hold), it follows
that
$$0 = E[X_0] = E\left[X_{t\wedge T_{ab}}\right] = E\left[W_{T_{ab}\wedge t}^2\right] - E\left[t \wedge T_{ab}\right].$$

Letting t → ∞ and using monotone convergence for the second term and dominated
convergence for the first (which is bounded by (a − b)²), this yields
$$E[T_{ab}] = E\left[W_{T_{ab}}^2\right].$$

As an exercise, check that this yields E [Tab ] = −ab.
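Both results are easy to check by simulation. The Python sketch below is an illustrative experiment (the interval, the step size h and the random-walk approximation of Brownian motion are all assumptions made for the demo): it runs a symmetric random walk with steps ±h over time increments h² until it leaves (b, a), and compares the empirical exit probability and exit time with b/(b − a) and −ab.

```python
import numpy as np

a, b, h = 1.0, -0.5, 0.1              # assumed interval; h divides both a and -b
ka, kb = round(a / h), round(b / h)   # integer boundaries of the embedded walk
dt = h * h                            # time per step of the approximating walk
rng = np.random.default_rng(3)

n_paths = 20000
hit_a, times = 0, np.empty(n_paths)
for i in range(n_paths):
    k, n = 0, 0                       # integer-valued symmetric random walk
    while kb < k < ka:
        k += 1 if rng.random() < 0.5 else -1
        n += 1
    hit_a += (k == ka)
    times[i] = n * dt

print("P(exit at a):", hit_a / n_paths, "  theory:", b / (b - a))   # about 1/3
print("E[T_ab]     :", times.mean(),    "  theory:", -a * b)        # about 0.5
```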

8.9 Hitting and Exit time transforms

In this section, we will derive some distributional information regarding the
exit times, starting from the hitting times. Since working with the densities
directly is cumbersome, we will instead obtain the Laplace transforms of these
exit times, which is much simpler.

We will once again use a different stochastic process. This time, we will use
the exponential process. We have seen in example 8.8 that this process, given by
$$X_t = e^{\alpha W_t - \frac{\alpha^2 t}{2}}, \qquad (\alpha > 0),$$

is a martingale. Denote by Ta the hitting time of Wt and by Tab the exit time
of Wt . Applying theorem 8.56⁷, we find
$$1 = E[X_0] = E\left[X_{T_a}\right] = E\left[e^{\alpha W_{T_a} - \frac{\alpha^2 T_a}{2}}\right],$$

which we can rewrite as
$$E\left[X_{T_a}\right] = E\left[e^{\alpha a - \frac{\alpha^2 T_a}{2}}\right] = e^{\alpha a}\,E\left[e^{-\frac{\alpha^2 T_a}{2}}\right].$$

We write θ = α²/2. Notice that the expectation on the right-hand side is exactly
the Laplace transform of the random variable Ta :
$$L_{T_a}(\theta) = E\left[e^{-\theta T_a}\right] = \int_0^\infty e^{-\theta t}\,f_{T_a}(t)\,dt.$$

Recall that we have already derived the density of hitting times, namely
$$f_{T_a}(t) = \frac{|a|}{\sqrt{2\pi t^3}}\,e^{-\frac{a^2}{2t}}.$$
Using the fact that E[XTa ] = 1 = e^{αa} LTa (θ), we easily find that
$$L_{T_a}(\theta) = \int_0^\infty e^{-\theta t}\,f_{T_a}(t)\,dt = e^{-\alpha a} = e^{-\sqrt{2\theta}\,a}.$$

Notice that in particular
$$L_{T_a}(0) = \int_0^\infty f_{T_a}(t)\,dt = 1,$$
since Ta is a non-negative and almost surely finite random variable.
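As a sanity check on this transform, the short Python sketch below (an illustrative verification; the values of a and θ are arbitrary choices) integrates e^{−θt} f_{Ta}(t) numerically and compares the result with the closed form e^{−√(2θ) a}.

```python
import numpy as np
from scipy.integrate import quad

a, theta = 1.0, 0.7                      # assumed level and transform argument

def f_Ta(t):
    """Density of the hitting time T_a of a standard Brownian motion."""
    return abs(a) / np.sqrt(2 * np.pi * t**3) * np.exp(-a**2 / (2 * t))

numeric, _ = quad(lambda t: np.exp(-theta * t) * f_Ta(t), 0, np.inf)
closed_form = np.exp(-np.sqrt(2 * theta) * a)

print("numerical integral:", numeric)
print("closed form       :", closed_form)   # the two values should agree
```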

We will now do something similar in order to find the Laplace transform of
Tab . For this, notice first of all that
$$L_{T_a}(\theta) = E\Big[e^{-\theta T_a}\big(\underbrace{I(T_a < T_b) + I(T_a \geq T_b)}_{=1}\big)\Big].$$
⁷ In order to see why we are allowed to use this theorem, note that we have already seen that
P (Ta < ∞) = 1. Show that the stopped process can be uniformly dominated by e^{αa}.

Figure 8.17: A Brownian motion for which Tb < Ta

Suppose that Tb < Ta , such as in figure 8.17. Then for the Brownian motion
to hit a, it first needs to move from b to a, which of course is independent
of {Ws | s ≤ Tb }. Applying the Strong Markov property, property 8.10, we can
write, on the event {Tb < Ta },
$$\begin{aligned}
T_a &= T_b + \inf\{s > 0 \mid W_{T_b+s} = a\}\\
    &= T_b + \inf\{s > 0 \mid W_{T_b+s} - W_{T_b} = a - b\}\\
    &= T_b + \tilde{T}_{a-b},
\end{aligned}$$
where the second equality uses WTb = b, and where T̃a−b is the hitting time of the
state a − b for the Brownian motion W̃t = WTb+t − WTb that starts in 0 at time Tb ,
see figure 8.18.

Notice that the times Tb and T̃a−b are independent. Furthermore, since
b < 0 < a, notice that a − b > a.

Figure 8.18: The path W̃t starts in 0 at time Tb

Using the above computations, we can now write (check this!)
$$E\left[e^{-\theta T_a}\,I(T_a \geq T_b)\right] = E\left[I(T_a \geq T_b)\,e^{-\theta(T_b + \tilde{T}_{a-b})}\right].$$
Since T̃a−b is independent of {Ws | s ≤ Tb }, and hence of both Tb and I(Ta ≥ Tb ),
this can be written as
$$\begin{aligned}
E\left[e^{-\theta T_a}\,I(T_a \geq T_b)\right]
&= E\left[I(T_a \geq T_b)\,e^{-\theta T_b}\right]\cdot\underbrace{E\left[e^{-\theta \tilde{T}_{a-b}}\right]}_{L_{\tilde{T}_{a-b}}(\theta)}\\
&= E\left[I(T_a \geq T_b)\,e^{-\theta T_b}\right]\,e^{-\sqrt{2\theta}\,(a-b)}.
\end{aligned}$$
Substituting this in the equation for LTa (θ) yields
$$\begin{aligned}
L_{T_a}(\theta) &= E\left[e^{-\theta T_a}\big(I(T_a < T_b) + I(T_a \geq T_b)\big)\right]\\
&= E\left[e^{-\theta T_a}\,I(T_a < T_b)\right] + E\left[I(T_a \geq T_b)\,e^{-\theta T_b}\right]\,e^{-\sqrt{2\theta}\,(a-b)}.
\end{aligned}$$

Combining the above results, we hence obtain
$$e^{-\sqrt{2\theta}\,a} = L_{T_a}(\theta)
= \underbrace{E\left[e^{-\theta T_a}\,I(T_a < T_b)\right]}_{x_1}
+ \underbrace{E\left[I(T_a \geq T_b)\,e^{-\theta T_b}\right]}_{x_2}\,e^{-\sqrt{2\theta}\,(a-b)}.$$

This whole argument can be repeated for b⁸, in which case we obtain
$$e^{\sqrt{2\theta}\,b} = L_{T_b}(\theta)
= \underbrace{E\left[e^{-\theta T_b}\,I(T_b < T_a)\right]}_{x_2}
+ \underbrace{E\left[I(T_b \geq T_a)\,e^{-\theta T_a}\right]}_{x_1}\,e^{-\sqrt{2\theta}\,(a-b)}.$$

We hence find the system of equations
$$\begin{cases}
e^{-\sqrt{2\theta}\,a} = x_1 + x_2\,e^{-\sqrt{2\theta}\,(a-b)}\\
e^{\sqrt{2\theta}\,b} = x_2 + x_1\,e^{-\sqrt{2\theta}\,(a-b)}
\end{cases}.$$

This has as solution
$$x_1 = \frac{e^{-\sqrt{2\theta}\,b} - e^{\sqrt{2\theta}\,b}}{e^{\sqrt{2\theta}\,(a-b)} - e^{-\sqrt{2\theta}\,(a-b)}}
= \frac{\sinh(-\sqrt{2\theta}\,b)}{\sinh(\sqrt{2\theta}\,(a-b))},
\qquad
x_2 = \frac{e^{\sqrt{2\theta}\,a} - e^{-\sqrt{2\theta}\,a}}{e^{\sqrt{2\theta}\,(a-b)} - e^{-\sqrt{2\theta}\,(a-b)}}
= \frac{\sinh(\sqrt{2\theta}\,a)}{\sinh(\sqrt{2\theta}\,(a-b))}.$$
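This elimination can be delegated to a computer algebra system. The Python sketch below is an illustrative check (the symbol s stands for √(2θ), and the numerical values at the end are arbitrary): it solves the linear system for x1 and x2 and compares the solution with the sinh expressions above.

```python
import sympy as sp

a, b, s = sp.symbols('a b s', real=True)      # s plays the role of sqrt(2*theta)
x1, x2 = sp.symbols('x1 x2')

eq1 = sp.Eq(sp.exp(-s * a), x1 + x2 * sp.exp(-s * (a - b)))
eq2 = sp.Eq(sp.exp(s * b),  x2 + x1 * sp.exp(-s * (a - b)))
sol = sp.solve([eq1, eq2], [x1, x2])

expected_x1 = sp.sinh(-s * b) / sp.sinh(s * (a - b))
expected_x2 = sp.sinh(s * a) / sp.sinh(s * (a - b))

# Numerical spot check at arbitrary values a = 1.3, b = -0.7, s = 2.0.
vals = {a: 1.3, b: -0.7, s: 2.0}
print(float(sol[x1].subs(vals) - expected_x1.subs(vals)))   # should be ~0
print(float(sol[x2].subs(vals) - expected_x2.subs(vals)))   # should be ~0
```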

Notice now that
$$\begin{aligned}
L_{T_{ab}}(\theta) &= E\left[e^{-\theta T_b}\,I(T_b < T_a) + e^{-\theta T_a}\,I(T_a < T_b)\right]\\
&= x_1 + x_2\\
&= \frac{\sinh(-\sqrt{2\theta}\,b) + \sinh(\sqrt{2\theta}\,a)}{\sinh(\sqrt{2\theta}\,(a-b))}.
\end{aligned}$$

In order to find the density of these exit times, one can apply an inverse
Laplace transform.

8 This is left as an exercise.
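To close the loop, the Python sketch below gives an illustrative Monte Carlo check of this transform (the interval, θ, the time step and the number of paths are arbitrary assumptions, and the time discretization introduces a small bias): it estimates E[e^{−θ Tab}] from simulated Brownian paths and compares it with the sinh formula.

```python
import numpy as np

a, b, theta = 1.0, -0.5, 0.8          # assumed interval and transform argument
dt, n_paths = 1e-3, 2000
rng = np.random.default_rng(4)

samples = np.empty(n_paths)
for i in range(n_paths):
    w, t = 0.0, 0.0
    while b < w < a:                   # run the path until it leaves (b, a)
        w += np.sqrt(dt) * rng.standard_normal()
        t += dt
    samples[i] = np.exp(-theta * t)

s = np.sqrt(2 * theta)
closed_form = (np.sinh(-s * b) + np.sinh(s * a)) / np.sinh(s * (a - b))
print("Monte Carlo estimate:", samples.mean())
print("closed form         :", closed_form)   # the estimate should be close
```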



Figure 8.19: The Hyperbolic Sine function


Bibliography

[C1] US Department of Commerce, NOAA (2018, September 6). Climate - Temperature
Graph. National Weather Service. Retrieved August 9, 2022, from
https://www.weather.gov/apx/temperature_graphs

[WV] Mannel, R. (2020, March 12). The Waveforms of Speech. Macquarie University.
Retrieved August 17, 2022, from
https://www.mq.edu.au/about/about-the-university/our-faculties/medicine-and-health-sciences/departments-and-centres/department-of-linguistics/our-research/phonetics-and-phonology/speech/acoustics/speech-waveforms/the-waveforms-of-speech

[HET] The History of Economic Thought, Abraham Wald. Retrieved August 20, 2022,
from https://www.hetwebsite.net/het/profiles/wald.htm

[AM] Andrey Markov, Wikimedia Commons. Retrieved August 20, 2022, from
https://commons.wikimedia.org/wiki/File:Andrei_Markov.jpg

[DW] Williams, D. (1991). Probability with Martingales. Cambridge University Press,
Cambridge.
