
Stochastic Modeling

for Engineering Studies


Ingemar Kaj
Department of Mathematics
Uppsala University
Lecture notes for the course:
Stochastic Modeling, 5 credits, 1MS007
Uppsala University, December 2015
Contents

Chapter 1. Introduction
1.1. Brief review of Probability Theory
1.2. Stochastic Processes
1.3. Simulation of stochastic processes
1.4. The life-length process
1.5. Reliability of systems
1.6. The Poisson process
1.7. Properties of the Poisson process
Solved exercises

Chapter 2. Markov Chain Models, discrete time
2.1. The Markov property
2.2. Stationary distribution and steady-state
2.3. First return and first passage times
Solved exercises

Chapter 3. Continuous time Markov chains
3.1. Birth-and-death processes
3.2. Credit rating models
Solved exercises

Chapter 4. Some non-Markov models
4.1. Renewal models
4.2. Renewal reward processes
4.3. Reliable data transfer
4.4. Time series models
4.5. Autoregressive and moving average processes
4.6. Statistical methods
Solved exercises

Chapter 5. Applications in Biology and Bioinformatics
5.1. The Wright-Fisher model
5.2. Kingman’s coalescent process
5.3. Some models for DNA sequences
5.4. Recovering the genome from fragments
5.5. Pairwise alignment methods
5.6. Hidden Markov models
Solved exercises

Bibliography
Chapter 1

Introduction

Everywhere in science and technology we see random change and patterns of varia-
tion due to chance. Be it
- changes in weather or climate;
- spread of disease;
- traffic congestion;
- noise or shut-down in communication systems;
- quality variation in a manufacturing line;
- the intrinsic randomness in the evolution of all living species,
or, something else. This course gives an introduction to mathematical ideas, tools,
and models, which are widely used to study some of the basic mechanisms of ran-
domness in such systems. The course notes aim at providing a semi-rigorous math-
ematical introduction to a collection of models based on the theory of stochastic
processes and at giving a number of modern examples from various areas of engineer-
ing and natural sciences where these models are put to use.
The selection of material starts at a level corresponding to a standard course
in introductory probability and statistics but does not assume prior knowledge of
stochastic processes. Following a brief review of tools from probability theory, we
introduce basic concepts and ideas of stochastic processes for studying stochastic
models and give introductory examples of random processes. This is followed by a
discussion on how to instruct a computer to simulate random behavior. We discuss
briefly the theoretical background of computer simulation of stochastic processes and
look at some examples of simulation code. Additional example code and simulation
output are scattered throughout the text. Then we present two basic stochastic
models in more detail. First the life-length process, which describes events that
occur after a random waiting time. One may think of the failure time of a mechanical
device or the life-length of a human being as examples. Second, the Poisson process
is the fundamental model for a sequence of events which occur randomly in time and
independently of each other. In the subsequent chapters we cover Markov chains in
discrete and continuous time, renewal processes, and time series models. Variations
of the basic models appear in case studies. In greater detail, we provide theory,
examples and exercises to help


- use probabilistic arguments including conditional distributions and expectations;
- understand Poisson processes and models based on life-length distributions,
and use them to assess risk and error;
- carry out basic mathematical modeling using Markov chains in discrete and
continuous time;
- find probabilities and expected values for finite Markov chains using the prin-
ciple of conditioning on the first jump;
- review and apply Markov chain methods based on stationary and asymptotic
distributions;
- use fundamental models of time series, in particular moving average and au-
toregressive models, and carry out covariance calculations in these cases;
- understand the basic principles of renewal theory and use them for performance
calculations;
- apply suitable stochastic models in a given situation and draw useful conclu-
sions.
To illustrate and visualize methods and tools we use statistical data in some
places and apply descriptive statistics and basic statistical analysis. Computer sim-
ulations are used to help visualize and understand random dynamics and random
evolution over time. Our simulations are based on relatively simple Matlab code
(Copyright 1994-2014, The MathWorks, Inc). Readers are encouraged to try out
their own versions.

1.1. Brief review of Probability Theory


The main challenge of introductory probability theory is to gain an understanding
of random variables and develop some skills in calculus with random variables. In
probability theory, random variables are used to describe quantities with values
determined by chance and hence varying depending on the outcome of a random
experiment. A random variable or, equivalently, a stochastic variable, is a function
which assigns to each possible outcome of the chance experiment the appropriate
value of the quantity. We adopt the standard practice of denoting random variables
with capital letters X, Y , etc. A standard example is tossing a coin and letting
X be a random variable taking values 0 or 1 according to whether the coin lands heads or tails. Or
we may think of an upcoming game of soccer as a chance experiment and define a
random variable Y to take three values, say 1, × or 2, for match win, tie, or loss
for the home team. The next step is to assign probabilities in a consistent manner
to all possible outcomes. The obvious choice for describing a symmetric coin is
P (X = 0) = P (X = 1) = 1/2. For the soccer game example one option might be
to use historical match data and try to estimate “correct” probabilities p1, p× and
p2, such that

P(Y = 1) = p1,   P(Y = ×) = p×,   P(Y = 2) = p2,   p1 + p× + p2 = 1.
Of course, we can rename the values of Y and restrict to integer values in a case like
this.
In general, we consider real-valued random variables and review in list form the
following notions:

Distribution function F (x) = P (X ≤ x).


Quantile: The number xα such that F (xα ) = 1 − α.
Discrete random variable: X is discrete if it has a finite or countable number
of values x1 , x2 , . . . , with probabilities p(x1 ), p(x2 ), . . . where p(x) is the probability
function of X
Continuous random variable: X is continuous with density function f(x) if X
assumes all values in an interval and F(x) = ∫_{−∞}^x f(t) dt for all x. Then

1) f(x) = F′(x) ≥ 0 for all x such that the derivative exists,
2) P(a < X < b) = ∫_a^b f(x) dx,
3) ∫_{−∞}^∞ f(x) dx = 1.

Expected value:

E(X) = µ = Σ_i x_i p(x_i) (X discrete),   E(X) = µ = ∫_{−∞}^∞ x f(x) dx (X continuous).

Variance: V(X) = σ² = E((X − µ)²) = E(X²) − µ².

Standard deviation: D(X) = σ = √V(X).

Here, the expected value E(X) and the variance V (X) of X, when they exist, are
the standard means of measuring “average value” and “average squared variation” of
random variables. Suppose that we also have access to measurement data x1 , . . . , xn ,
which we believe represent “typical, independently chosen, values of X”. Then
x1 , . . . , xn is a sample of observations of size n from X and we may apply statistical
estimators, such as

Sample mean: µ̂ = x̄ = (1/n) Σ_{i=1}^n x_i.

Sample variance: σ̂² = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)².

Next, we collect some standard probability distributions, writing p(k) for probability
function, f (x) for density function, µ for expected value and σ 2 for variance.

Discrete distributions

Binomial: X ∈ Bin(n, p) if p(k) = (n choose k) p^k q^{n−k}, k = 0, 1, 2, . . . , n,
where 0 ≤ p ≤ 1, q = 1 − p; µ = np, σ² = npq.

Geometric: X ∈ Ge(p) if p(k) = p q^k, k = 0, 1, 2, . . . ,
where 0 < p ≤ 1, q = 1 − p; µ = q/p, σ² = q/p².

Poisson: X ∈ Po(µ) if p(k) = (µ^k/k!) e^{−µ}, k = 0, 1, 2, . . . ,
where µ > 0; σ² = µ.

Continuous distributions

Uniform: X ∈ Re(a, b) if f(x) = 1/(b − a), a ≤ x ≤ b;
µ = (a + b)/2, σ² = (b − a)²/12.

Exponential: X ∈ Exp(λ) if f(x) = λe^{−λx}, x ≥ 0, λ > 0;
µ = 1/λ, σ² = 1/λ².

Gamma distribution: X ∈ Γ(n, λ) if f(x) = λ^n x^{n−1} e^{−λx}/(n − 1)!, x ≥ 0, λ > 0;
µ = n/λ, σ² = n/λ².

Normal (Gaussian): X ∈ N(µ, σ²) if f(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}, −∞ < x < ∞, σ > 0,
where µ is the mean and σ² is the variance. For the N(0, 1)-distribution we
write Φ(x) for the distribution function and λα for the α-quantiles.

Independence versus dependence

Two random variables X1 , X2 , are said to be independent if for all x1 , x2 ,


P (X1 ≤ x1 , X2 ≤ x2 ) = P (X1 ≤ x1 )P (X2 ≤ x2 ).
Similarly, the random variables X1 , . . . , Xn are mutually independent if for all
x1 , . . . , xn ,
P (X1 ≤ x1 , . . . , Xn ≤ xn ) = P (X1 ≤ x1 ) · · · P (Xn ≤ xn ).
Covariance: C(X, Y) = E((X − µx)(Y − µy)) = E(XY) − µx · µy.

Correlation coefficient: ρ(X, Y) = C(X, Y)/(D(X) · D(Y)).

Two random variables X and Y are said to be uncorrelated if C(X, Y ) = 0 (or


ρ(X, Y ) = 0). Independent random variables are also uncorrelated.

Sums of random variables with finite mean and variance

The expected value of a sum of random variables is the sum of the expected values:
E(X1 + · · · + Xn ) = E(X1 ) + · · · + E(Xn )
The variance of a sum of random variables is the sum of the covariances of all pairs:

V(Σ_{i=1}^n Xi) = Σ_{i=1}^n Σ_{j=1}^n Cov(Xi, Xj).

In particular, V(X1 + X2) = V(X1) + V(X2) + 2Cov(X1, X2).


The variance of a sum of independent random variables is the sum of the variances:
V (X1 + · · · + Xn ) = V (X1 ) + · · · + V (Xn ).

Conditional distribution

Suppose X and Y are discrete, integer-valued, random variables. The conditional
distribution of X given that Y = n is defined by the probability function
P(X = k | Y = n) = P(X = k, Y = n)/P(Y = n), k = 0, 1, . . . . The relation

P(X = k) = Σ_{n=0}^∞ P(X = k | Y = n) P(Y = n), k = 0, 1, . . . ,

is called the law of total probability.

Law of large numbers

The law of large numbers says that if X1 , X2 , . . . are independent random variables
with the same distribution and finite expected value µ, then the average of a large
number of terms approaches µ, in the sense of the limit

(X1 + · · · + Xn)/n → µ, n → ∞,
where the convergence holds in a strong probabilistic sense called almost sure con-
vergence.

Central limit theorem

The central limit theorem says that if X1 , X2 , . . . are independent random vari-
ables with the same distribution, finite expected value µ and finite variance σ 2 ,
0 < σ 2 < ∞, then the distribution of the centered and normalized partial sum is
asymptotically normal,

(1/(√n σ)) Σ_{k=1}^n (Xk − µ) ⇒ N(0, 1), n → ∞,
where the arrow notation means that the distribution function of the quantity on
the left hand side converges to the distribution function of the standard normal
distribution.

Various

Geometric sums:

Σ_{k=0}^∞ a^k = 1/(1 − a), |a| < 1;   Σ_{k=n}^∞ a^k = a^n/(1 − a), |a| < 1.

Exponential function:

Σ_{k=0}^∞ a^k/k! = e^a;   lim_{n→∞} (1 + x/n)^n = e^x.

Example 1.1 (Minimum temperature). In 1875 the temperature in Uppsala


during the coldest day of the year reached −39.5°C. In 1992 the temperature of the
coldest day was −11.5°C. A new record high of the minimum temperature (since
1840) was set in 2015, when the temperature on December 29 dropped to −11.3°C.
To study the random variation of this quantity let X be a random variable which
gives the lowest temperature of any day in a given future year in Uppsala. It is
reasonable to assume that X is a continuous random variable with negative values.
To get an idea of what type of distribution this X might have we consider historical
data, which is available as a data series x1 , x2 , . . . starting with the year 1840. The
sample mean is x̄ = −22.28°C and the sample standard deviation is σ̂ = 4.497°C.
To visualize the data it is convenient to consider a histogram, see Figure 1. By

Figure 1. Minimum annual temperatures in Uppsala, 1840-2015

normalizing with the number of observations, n = 176, we may interpret the data
as an estimate of the probability distribution of a discrete version Z of X, such
that Z is the minimum temperature rounded off to an integer. The shape of the
corresponding probability function appears to be rather close to a Gaussian curve,

hence suggesting the modeling assumption X ∈ N(µ, σ 2 ) with parameters given by


our point estimates µ = −22.28 and σ = 4.497. This approach is illustrated in
Figure 2, where the height of the filled bars represent the values of the probability
function p(k) = P (Z = k), with k ranging from −39 to −11. The overlaid solid curve
is the density function of a random variable with distribution N(−22.28, 20.2218).

Figure 2. Minimum temperatures 1840-2015, fitted Gaussian density function
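To reproduce a figure like Figure 2, a minimal Matlab sketch might be the following, assuming the annual minimum temperatures have been loaded into a vector temps (the variable name is ours, not part of the notes):

mu = mean(temps); sigma = std(temps);        % point estimates of mu and sigma
[counts, centers] = hist(temps, -39:-11);    % integer-degree bins
bar(centers, counts./length(temps));         % estimated p(k) = P(Z = k)
hold on;
x = linspace(-40, -10);
plot(x, exp(-(x - mu).^2./(2*sigma^2))./(sqrt(2*pi)*sigma));  % fitted density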

Example 1.2 (Telephone traffic). A classical model for telephone switch traffic
assumes that the interarrival times of calls, that is, the duration between the starting
times of any two consecutive calls directed to the switch, are statistically independent
and follow the same exponential distribution. Hence, to describe the incoming calls
let U1 , U2 , . . . be a sequence of independent random variables all with the same
exponential distribution with parameter λ. Then, starting to trace calls at time
t = 0,
T1 = U1 = time of the first call,
T2 = U1 + U2 = time at which the second call arrives,
. . .
Tn = U1 + · · · + Un = time of the nth call.

For each n, the sum Tn is a new random variable with expected value and variance

E(Tn) = E(U1) + · · · + E(Un) = n/λ,   V(Tn) = V(U1) + · · · + V(Un) = n/λ².

It can be shown that Tn has the Γ(n, λ)-distribution.

1.2. Stochastic Processes


As the introductory examples show, the first steps in setting up a stochastic model
might be to identify relevant random variables and study their distributions and
other characteristics, such as expected values and variances. To cover dynamically
changing random quantities we need stochastic processes as a further tool. This

involves studying random functions t ↦ X(t), where t is often a time parameter.


For each fixed t the value X(t) is an ordinary random variable. There exists a large
number of interesting random processes with a range of widely different behavior,
and which are useful in a variety of applications. We are going to study some of the
mathematical models and some of the applications in detail.
We collect some notions formalizing the idea of a stochastic process in the fol-
lowing definition.

Definition 1.3. A stochastic process (or random process) {X(t), t ∈ I} is a col-


lection of random variables indexed by a set I. We distinguish stochastic processes
with
• discrete time: I is a sequence of integers (normally I = {0, 1, . . . } or I =
{. . . , −1, 0, 1 . . . })
• continuous time: I is an interval of the real line (normally I = [0, ∞) or
I = (−∞, ∞))
The set E of possible values for X(t) is called the state space. A stochastic process
has
• discrete states, if E is a discrete set such as E = {0, 1, . . . , r} or E = {0, 1, . . . };
and
• continuous states, if E is an interval of the real line such as E = [0, ∞).
The function t ↦ X(t) is called a path, a trajectory, or a realization of the process.
In discrete time it is convenient to write Xn instead of X(n), n = 0, 1, . . . , so that
the path is n ↦ Xn. In this case the random process {Xn} is also called a time
series.

Any collection of independent random variables trivially forms a stochastic pro-


cess. We encounter this example later in discrete time with the designation white
noise. Typically, however, there is some statistical dependence between the out-
comes X(t1 ) and X(t2 ) of a random process {X(t)} sampled at two time points t1
and t2 . In fact, most of the interesting stochastic processes that are used in theory
and applications fall into one of several categories of processes, each characterized
by a particular dependence structure over time. Such categories that we will make
acquaintance with are stochastic processes with independent increments, stationary
processes, Markov processes, and renewal processes.
As the terminology of discrete and continuous time introduced above already
suggests, stochastic processes often describe the evolution over time of randomly
varying quantities. But sometimes the interpretation of a random process is entirely
different, such as if Xn denotes the nucleotide base (A, G, C or T) at the nth position
in a DNA sequence. Then the path of the process, n → Xn ∈ E, does not map time
points, but rather a spatial location, into the state space E = {A, G, C, T }.
A fundamental example of a discrete time random process is the random walk.

Example 1.4 (Random walk). Suppose that {Zk } is a sequence of independent


and identically distributed (i.i.d.) random variables such that

Zk = +1 with probability p,
Zk = −1 with probability 1 − p.
Then the sequence
Xn = Z1 + · · · + Zn , n ≥ 1, X0 = 0,
is a discrete time stochastic process called a simple random walk. The state space
E in this example is the set of all integers. We observe that the defining relation
can be rewritten as the recursion
(1.1) Xn+1 = Xn + Zn+1 , n ≥ 0,
which says that each new update of the random walk is obtained from the current
value plus an independent contribution of plus or minus one, regardless of the pre-
vious values X0 , . . . , Xn−1 . Figure 3 shows five independent simulated paths of a
random walk, compare Section 1.3.

Figure 3. Simulated trajectories of a simple random walk

In greater generality than the random walk we may form the partial sums
Sn = Σ_{k=1}^n Yk, n ≥ 1,

for a given sequence of random variables {Yk }. We assume that the summation
terms are independent with the same distribution, such that the expected value
µ = E(Yk ) and the variance σ 2 = V (Yk ) exist as finite numbers. By the law of large
numbers the average value Sn /n will be closer and closer to µ the larger n we take.
By the central limit theorem the distribution of the centered and normalized partial
sum is asymptotically normal:
(Sn − E(Sn))/√V(Sn) = (1/(√n σ)) Σ_{k=1}^n (Yk − µ) ⇒ N(0, 1), n → ∞.

To illustrate this construction, let us assume as an example that each Yk is the


square of a unit mean exponential, that is Yk = Zk², where the distribution of Zk is
exponential with parameter 1. Then

µ = E(Yk) = E(Zk²) = V(Zk) + E(Zk)² = 1 + 1 = 2
and so E(Sn) = 2n. Now if we pick an n, say n = 100, and produce a sequence
of typical outcomes of S100 we expect these to fall somewhere around the average
E(S100 ) = 200 with a variation that depends on other properties of the Yk . To check
this numerically, Figure 4 shows three histograms based on 1000 values each for Sn
with n = 100 (upper panel), n = 200 (middle panel) and n = 400 (lower panel).
In agreement with the central limit theorem, these distributions take on a more
and more Gaussian shape the larger values of n we apply. The plots illustrate the
position of a random walker after 100, 200 and 400 steps, respectively, if the steps
are independent and have size Y1 , Y2 , . . . .

Figure 4. Summing independent variables - generalized random walk
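The numerical experiment behind a figure like Figure 4 is easy to repeat. A minimal Matlab sketch under the same assumptions (1000 repetitions, exponential variables generated by inversion as in Section 1.3):

n = 100; reps = 1000;            % cf. the upper panel of Figure 4
Z = -log(rand(n, reps));         % Exp(1) variables via inversion
S = sum(Z.^2);                   % 1000 outcomes of S_n with Y_k = Z_k^2
hist(S, 30);                     % histogram of the outcomes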

Definition 1.5. With a stochastic process {X(t)} we can associate


• the mean value function m(t) = E(X(t));
• the variance function v(t) = V (X(t));
• the (auto)covariance function r(s, t) = C(X(s), X(t)); and
• the (auto)correlation function ρ(s, t) = r(s, t)/√(v(s)v(t)),
if these functions exist finitely.

Example 1.6 (Random walk, mean and variance). For the random walk,
E(Z1) = 1 · p + (−1) · (1 − p) = 2p − 1,   E(Z1²) = 1² · p + (−1)² · (1 − p) = 1,

so

V(Z1) = 1 − (2p − 1)² = 4p(1 − p).

Thus,

m(n) = E(Xn) = nE(Z1) = (2p − 1)n,
v(n) = V(Xn) = nV(Z1) = 4p(1 − p)n,
r(n, m) = C(Xn, Xm) = C(Xn, Xn) + C(Xn, Zn+1 + · · · + Zm)
        = V(Xn) + 0 = 4p(1 − p)n, if m ≥ n,

hence

r(n, m) = 4p(1 − p) min(n, m); and
ρ(n, m) = min(√(n/m), √(m/n)).

The random walk and, more generally, partial sum processes are examples in dis-
crete time of processes with independent increments and with stationary increments,
according to

Definition 1.7. A stochastic process {X(t)} has


• independent increments, if for any choice of time points 0 ≤ t1 ≤ · · · ≤ tk and
k ≥ 1, the successive increments
X(t1 ), X(t2 ) − X(t1 ), . . . , X(tk ) − X(tk−1 )
are independent random variables;
• stationary increments, if the distribution of the increment X(s + t) − X(s) does
not depend on s.

Random processes having both independent and stationary increments are called
Lévy processes. For the random walk these two properties are “built-in”, since if
we take k integers n1 ≤ · · · ≤ nk then the increments are the consecutive sums
Z_1 + · · · + Z_{n_1}, . . . , Z_{n_{k−1}+1} + · · · + Z_{n_k}. These are independent since all of the random
variables Z1, Z2, . . . are independent. Moreover, Xn+m − Xn = Zn+1 + · · · + Zn+m is a
sum of m independent terms all with the same distribution, regardless of the number
n, which means that the increments are stationary.
Another concept of stationarity which is of great importance in engineering sci-
ences is that of weakly stationary stochastic processes. Weak stationarity means
that the mean value and the covariance structure of the process are preserved under
a shift of the time scale. In contrast, strictly stationary random processes preserve
all statistical properties over time not just the first and second order moments. Strict
stationarity is of theoretical importance but often mathematically difficult to use in
applications. The definitions for continuous time are as follows, with straightforward
modifications for discrete time.

Definition 1.8. A stochastic process {X(t)} is said to be


• strictly stationary, if, for any t1 ≤ · · · ≤ tk and k ≥ 1, the distribution of the
time-shifted vector (X(t1 + s), . . . , X(tk + s)), where s is arbitrary, is the same
as the distribution of the vector (X(t1 ), . . . , X(tk ));
• weakly stationary (or wide sense stationary), if
– the mean value function m(t) ≡ m is constant;
– the covariance function r(s, t) = r(t−s) is a function of the time difference
t − s only.

Thus, a weakly stationary random process describes a random entity which fluc-
tuates around a constant average value and is such that the covariance of any two
values depends only on the distance in time between them, and not on “clock time”.
The next example, which is used here to illustrate stationarity, is entirely different
from the random walk. The random wave process has strong dependence over time.
In fact, the complete trajectory of the process is determined by the outcome of only
two random variables.

Figure 5. Simulated paths of a random wave

Example 1.9 (Random wave). A random wave is a continuous time and contin-
uous state stochastic process of the form
X(t) = A cos(t + φ), t ≥ 0,
where A and φ are independent random variables, A > 0 has finite second moment
E(A2 ) < ∞ and φ is uniformly distributed on [0, 2π]. Figure 5 shows five simulated
trajectories of a random wave obtained by letting A have the distribution of the
absolute value of the normal distribution N (0, 1), in which case, by symmetry,
E(A) = 2 ∫_0^∞ x (1/√(2π)) e^{−x²/2} dx = (2/√(2π)) [−e^{−x²/2}]_{x=0}^∞ = √(2/π)

and

E(A²) = ∫_{−∞}^∞ x² (1/√(2π)) e^{−x²/2} dx = 1,

where in the last step we recognize the integral expression for the variance in the
N(0, 1) distribution.

Now let us verify that the random wave process is weakly stationary. By inde-
pendence, E(X(t)) = E(A)E(cos(t + φ)), where

E(cos(t + φ)) = ∫_0^{2π} cos(t + y) (1/(2π)) dy = (1/(2π)) [sin(t + y)]_{y=0}^{y=2π} = 0,

so the mean value function m(t) is constant, m = 0. The covariance function r(s, t)
equals

r(s, t) = E(X(s)X(t)) = (E(A²)/(2π)) ∫_0^{2π} cos(s + y) cos(t + y) dy.

Next, using the trigonometric formula 2 cos(x) cos(y) = cos(x + y) + cos(x − y),

(1/(2π)) ∫_0^{2π} cos(s + y) cos(t + y) dy = (1/(4π)) ∫_0^{2π} (cos(s + t + 2y) + cos(t − s)) dy
= 0 + (1/2) cos(t − s),

and hence the random wave process is weakly stationary with mean m = 0 and
covariance function

r(s, t) = (E(A²)/2) cos(t − s).

Time series data, by which we mean series of successive data point measurements
collected over time, are used in just about all areas of empirical sciences, such as
econometrics, weather forecasting and production engineering. Time series analysis
refers to the vast collection of statistical and mathematical methods used to sort
out meaningful information from the time series data. We return to these topics in
Section 5 and close this section with a few additional examples.
Example 1.10 (EEG). Electroencephalography, EEG for short, measures voltage
fluctuations within the neurons of the brain and hence records the spontaneous
electrical activity of the brain over a period of time, typically a few minutes. A
measurement series comes out as a wave-formed pattern of voltage variation and
may be recorded separately in different frequency bands. Figure 6 shows three one-
second segments of normal EEG-measurements. The upper curve is an example
of an alpha-wave signal in the frequency range 7-14 Hz, the middle curve shows
fluctuations of gamma-waves, meaning that the frequency is approximately 30-100
Hz, and the lower curve is the superposed signal over all frequency bands (Source:
Wikipedia).
EEG measurements reflect highly complex mechanisms of the human brain with
huge variability depending on the activity and health status of the observed person,
and may appear quite “non-stationary” over time. Yet, the theory of weakly station-
ary stochastic processes is a natural modeling approach to EEG-data over shorter
times. Observed deviations from stationarity or tracking of nontypical frequencies
in data might be the basis of tools to detect pathological effects or effect of drugs
on the brain.

Two more advanced examples



Figure 6. Wave patterns of EEG measurements

Example 1.11 (Finance market models). Financial mathematics and financial


engineering have developed and grown at a spectacular rate over the last decades. A
variety of advanced mathematical, statistical and computational methods are used
and many of the most important tools are based on stochastic models.
To touch briefly upon some topics in this area we begin with a set of empir-
ical data. Figure 7 shows the stock price of General Motors in dollars at the
end of each week during 1990-2009, adjusted for dividends and splits (data from
finance.yahoo.com). The question arises whether we can formulate a stochastic
model for the stock price development of GM, or the stock price of a share in
general, by introducing a discrete time stochastic process {Vn } and setting Vn =
stock price end of week n, where n = 0 corresponds to a given starting time. For
this to be a valid model, the trajectories of the process must share the essential
characteristics of empirical data such as above. It is not at all obvious what the
characteristic features in these data are, nor which assumptions on the random pro-
cess would result in paths that resemble the data. A popular approach, however,
considers the relative change in stock price from one week to the next and looks at
the development of the successive ratios. To introduce the model which results from
this approach, let us first carry out the procedure for the empirical data set. The
GM data consists of data points {vn : 0 ≤ n ≤ 1044}. The empirical logreturns of
GM are obtained as the transformed data
yn = ln(vn /vn−1 ), 1 ≤ n ≤ 1044.
Figure 8 shows a time plot of the logreturns as well as a histogram of the logreturn
values over the 20 year period. From this view point it now appears quite reasonable
to model the logreturns with a stationary sequence of random variables. The his-
togram suggests, moreover, that the logreturn values are distributed symmetrically

Figure 7. Weekly stock price GM, 1990-2009

Figure 8. Logreturns data GM, 1990-2009

around a mean close to zero. The curve overlaid the histogram in Figure 8 is the
probability density of the normal distribution with mean zero and standard devia-
tion fitted to that of the sample. In comparison with the normal distribution, the
data exhibits a higher concentration of mass in the tails and close to the mean value,
which is typical for logreturns data in general. Yet in modeling it is a common as-
sumption to apply the zero mean normal distribution. To simplify even further it is
tempting to assume that the successive logreturn values are independent. This leads
to the following model. Let X1 , X2 , . . . be independent and identically distributed
random variables with the standard normal distribution. Let
Yn = logreturn value of the stock in week n = σXn, n ≥ 1.
Then Yn ∈ N (0, σ 2 ). It is natural to take for the constant σ the sample standard
deviation of the logreturn data. Now set
Yn = log(Vn /Vn−1 ).

Solve for Vn to get

Vn = Vn−1 e^{Yn} = Vn−2 e^{Yn−1} e^{Yn} = · · · = V0 exp{σ Σ_{k=1}^n Xk},

which is the resulting stochastic model for the stock returns.
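A path of this model is easily simulated in Matlab; in the sketch below (ours, not from the notes) the numerical values of V0 and σ are assumptions:

V0 = 30; sigma = 0.05; n = 1044;   % hypothetical start price and weekly volatility
X = randn(1, n);                   % i.i.d. N(0,1) logreturn drivers
V = V0.*exp(sigma.*cumsum(X));     % weekly prices V_1, ..., V_n
plot(0:n, [V0 V]);                 % compare with the shape of Figure 7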


One should keep in mind, however, that the Gaussian assumption employed here
misses out on some of the typical features of real financial data, as we indicated in
Figure 8. As for alternative models there is no general agreement currently, but
rather a number of different approaches being proposed. Some focus on short
or long time dependency between the data points purportedly better aligned with
the real dynamics of an efficient market. Other models take the view that prices are
formed by the interaction and interplay between a large number of financial agents
which participate on the market and obey a given set of rules.

Example 1.12 (Brownian motion). A small particle such as a pollen particle in


air or a grain moving around in water performs a continuously changing, erratic mo-
tion in all directions. The botanist Robert Brown in 1828 asked for a mathematical
description of such motions, which later became known as Brownian molecular dif-
fusion. The physicist Jean Perrin did extensive studies of molecular motions under
microscope around 1905:
The trajectories are confused and complicated and change so often and
so rapidly that it is impossible to follow them; . . . Similarly, the apparent
mean speed of a grain during a given time varies in the wildest way in
magnitude and direction, and does not tend to a limit as the time taken
for an observation decreases, . . .
Albert Einstein devised a solution of Brown’s problem in 1905 and Norbert Wiener
in 1923 proved the existence of what is today called Brownian motion. The standard
Brownian motion (in one spatial dimension) is a continuous time stochastic process
{B(t), t ≥ 0} with time index I = [0, ∞) and state space E = (−∞, ∞), such that
- B(0) = 0;
- all increments are independent random variables;
- the increments are stationary;
- for any t, B(t) has the normal distribution N (0, t);
- the paths t → B(t) are continuous functions (in a suitable,
probabilistic sense).
To model the motion of a small particle in three-dimensional space, take three
independent Brownian motions B1 , B2 , B3 and let the position of the particle at time
t have coordinates (B1 (t), B2 (t), B3 (t)).
We can get some understanding for Brownian motion by comparing with random
walks. It turns out that if we run the simple random walk {Xn} in Example 1.4 a large
number of steps and change properly the scale on the y-axis, the resulting random
path is an approximation to Brownian motion. Put more formally, the behavior of
the scaled random process B^{(n)}(t) = X_{[nt]}/√n, t ≥ 0, gets closer to that of true

Brownian motion the larger n we take. Since

B^{(n)}(t) = (1/√n) Σ_{k=1}^{[nt]} Zk = √([nt]/n) (1/√[nt]) Σ_{k=1}^{[nt]} Zk ≈ √t (1/√m) Σ_{k=1}^m Zk,   m = [nt],

this is partly a consequence of the central limit theorem, which implies that as m
tends to infinity the distribution of (1/√m) Σ_{k=1}^m Zk converges to the standard normal
distribution, N(0, 1), and hence the distribution of B^{(n)}(t) converges to the normal
distribution N(0, t). Figure 9 shows a simulation of such approximative Brownian
motion, using three independent, scaled random walks with n = 10,000.

Figure 9. A random walk representing Brownian motion
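A single coordinate of such an approximate Brownian path takes only a few lines of Matlab; the sketch below (ours) assumes the symmetric case p = 1/2:

n = 10000;                       % number of random walk steps
Z = 2.*(rand(1, n) < 0.5) - 1;   % symmetric +/-1 steps
B = cumsum(Z)./sqrt(n);          % B_n(k/n) = X_k/sqrt(n), k = 1, ..., n
plot((1:n)./n, B);               % approximate Brownian path on [0, 1]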

1.3. Simulation of stochastic processes


Computer simulation of stochastic processes is a useful tool that can be applied
with at least two objectives in mind. First of all, in situations where the random
mechanisms of a system are so complicated that no direct methods of mathematical
analysis are available, computer simulations can give insights and help understand
the system behavior. Secondly, to understand random dynamics it is of great help
to be able to visualize even relatively simple stochastic processes. In this section we
introduce the basic underlying ideas. For a tutorial on stochastic simulation using
Matlab, see [9].
The most fundamental methods for generating random numbers involve mechan-
ical devices, such as carefully manufactured dice, roulette wheels, etc. Alternative
techniques for generating true random numbers have been invented more recently
and are based on the randomness in the decay of a radioactive source or in the at-
mospheric noise from a radio. In principle, a computer equipped with a unit able to

sample atmospheric noise could be programmed to act in a non-predictable manner,


and such methods are used for example in cryptography (see e.g. www.random.org).
Standard methods for computer simulations, however, are based on an entirely dif-
ferent principle which we describe next.
The standard technique to introduce randomness into computers is to use a
pseudo-random number generator. This is an algorithm in the form of a few lines
of program code. The input to the program is an arbitrary number, a seed, such as
given by the current clock time, and the output is the first pseudo-random number.
When the procedure is repeated, each time executing the code with the most recent
output as new input, the result is a sequence of pseudo-random numbers. Hence
pseudo-random numbers are entirely deterministic and predictable, since the same
seed will always produce the same sequence, but what matters is that the underlying
algorithm can be designed so that pseudo-random numbers “look like” and behave as
if they were true random numbers. To make these ideas more precise let us assume
that we want to generate pseudo-random numbers representing typical values of a
random variable X with distribution function F (x) = P (X ≤ x). The procedure
usually consists in two steps, namely
1) generating typical values from the uniform Re[0, 1] distribution;
2) transforming the uniform numbers to the distribution F .
Step 1) amounts to producing a sequence of pseudo-random numbers u1 , u2 , . . . in
[0, 1] that resemble as closely as possible true random numbers distributed uniformly
in the unit interval. In practice this means that the ui ’s, given with an accuracy
corresponding to a certain number of decimal digits, should pass any statistical
test designed to distinguish them from independent outcomes of a uniformly dis-
tributed random variable. Many algorithms are known that produce such uniformly
distributed pseudo-random numbers. These are often based on a recursive relation
of the form vn+1 = avn + b (mod c), which gives integers {vn } in the range [0, c − 1].
It then remains to put un = vn /c. In this procedure, the parameters a, b, c are
selected with great care based on principles of mathematical number theory. In
particular, c has to be a prime number. A typical choice is a = 16 807, b = 0,
c = 231 − 1 = 2 147 483 647, which was used as default in Matlab version 4. Current
versions use more advanced, so called Mersenne Twister algorithms.
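As a simple illustration (ours, not from the notes), the classical recursion with a = 16807, b = 0, c = 2^31 − 1 can be coded directly:

a = 16807; c = 2^31 - 1;         % classical parameter choice (b = 0)
v = 1;                           % arbitrary nonzero seed
u = zeros(1, 10);
for i = 1:10
    v = mod(a*v, c);             % v_{n+1} = a*v_n (mod c)
    u(i) = v/c;                  % pseudo-random number in [0, 1)
end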
Step 2) varies depending on the nature of the distribution F . Suppose X takes
integer values k with probability pk = P (X = k), k ≥ 0. In this case the pseudo-
random sequence {ui } is transformed into a sequence {xi } of typical outcomes from
X as follows:
for each i, if p0 + · · · + pk−1 < ui ≤ p0 + · · · + pk let xi = k.
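In Matlab this transformation amounts to counting how many of the partial sums fall below ui; a small sketch with a hypothetical probability function on {0, 1, 2}:

p = [0.2 0.5 0.3];               % hypothetical values p_0, p_1, p_2
u = rand;                        % uniform pseudo-random number
x = sum(cumsum(p) < u);          % the k with p_0+...+p_{k-1} < u <= p_0+...+p_k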
If X is a continuous random variable the standard technique is based on the following
inversion property:
If U ∈ Re[0, 1] and F (x) is a continuous distribution function then the
random variable X = F −1 (U ) has distribution function F .
The proof of this property is contained in the calculation
FX (x) = P (X ≤ x) = P (F −1 (U ) ≤ x) = P (U ≤ F (x)) = F (x).

To apply the inversion property, start with ui and let xi be defined by the relation
F (xi ) = ui . However, it is not always computationally efficient to use inversion
for simulating continuous random variables; in many cases alternative methods are
used.
Example 1.13. In Matlab a sequence u of n uniform random numbers is produced
by the command u=rand(1, n). The sequence of spin variables {Zk } in Example 1.4
is obtained as z=2.*(u<p)-1 and the path of the random walk as cumsum(z). The
graph in Figure 3 is produced by specifying p and the number of simulation steps n
and then evaluating stairs(0:n-1, [0 cumsum(2.*(rand(1,n-1)<p)-1)]);
The most well-known example of a continuous inversion is the logarithmic trans-
formation x = − log(1 − u), which gives exponentially distributed random numbers.
Indeed, let U have the uniform distribution on [0, 1] and put X = − ln(1 − U ). For
any x > 0 we have X ≤ x if and only if U ≤ 1 − e−x , and hence
FX (x) = P (X ≤ x) = P (U ≤ 1 − e−x ) = 1 − e−x .
Moreover, let c > 0 be a constant and put Y = cX. Then

FY(y) = P(Y ≤ y) = P(X ≤ y/c) = 1 − e^{−y/c}, y > 0,

which is the distribution function of an exponential random variable with intensity
1/c. Hence -log(1 - rand(1,n))./lambda; produces a sample of n observations from
Exp(λ).
Finally, we mention the graphs of Figure 5 that are generated by the commands
t=linspace(0,10*pi);
A=abs(randn(1,1));
phi=rand(1,1).*2*pi;
plot(t./(2*pi),A.*cos(t+phi));

1.4. The life-length process


The life-length process is a particularly simple type of stochastic process with only
two states, which is used to distinguish whether a system is functioning or if it is
out of order.
Definition 1.14. Let T be a positive stochastic variable with distribution function
F (t) = P (T ≤ t), t ≥ 0, such that F (0) = 0. The life-length process corresponding
to the life-length distribution F is

X(t) = 1 if t < T,
X(t) = 0 if t ≥ T.

If we interpret T as the life-length of a system then {X(t), t ≥ 0} is a continuous


time stochastic process with state space E = {0, 1} and initial value X(0) = 1,
which signifies that the system works (X(t) = 1) or does not work (X(t) = 0). The
mean value function equals
m(t) = E(X(t)) = 1 · P (T > t) + 0 · P (T ≤ t) = 1 − F (t),
which shows that the life-length process is neither stationary nor weakly stationary.

Figure 10. Life-length process

The key concept for life-length processes is the life-length intensity, which is a
function λ(t) that at any given time t gives the typical rate of system failure at that
time. The idea is to look at a system we know works at time t, find the probability
that it does not work at a later time t + h, and then close in on the approximate
failure time t by letting h tend to zero.
To carry out this calculation of failure intensity, let h > 0 be a small number and
consider the interval [t, t + h]. The probability that a system functioning at time t
fails no later than at time t + h, can be expressed using conditional probabilities as
P (X(t + h) = 0|X(t) = 1) = P (T ≤ t + h|T > t).
By definition of conditional probability we have, moreover,
P(T ≤ t + h | T > t) = P(t < T ≤ t + h)/P(T > t) = (F(t + h) − F(t))/(1 − F(t)).
Now we make the additional assumption that the life-length variable T is a con-
tinuous random variable with a density function f (t). Since the density function
is the derivative of the distribution function, f (t) = F ′ (t), we have the difference
approximation
F (t + h) − F (t) = f (t)h + o(h), h → 0,
where the remainder term o(h) is such that o(h)/h → 0 as h → 0. Thus,
P(system failure in (t, t + h]) = P(X(t + h) = 0 | X(t) = 1) = (f(t)h + o(h))/(1 − F(t)),

which shows that by defining the life-length intensity function as

λ(t) = f(t)/(1 − F(t)),
we have
P (system failure in (t, t + h]) = λ(t)h + o(h), h → 0,
and hence the desired interpretation of the function λ(t) as an infinitesimal intensity
of system failure. The life-length intensity function λ(t) is also called the failure rate,
the hazard function or the instantaneous error function.

The knowledge of λ(t) completely determines the distribution of the life-length


T , and vice versa. Indeed, the calculation
(d/dt) log(1 − F(t)) = −f(t)/(1 − F(t)) = −λ(t)
shows that the intensity function can be found from the life-length distribution
function by differentiation. Conversely, given the intensity function the distribution
function is obtained by the integration
∫_0^t λ(s) ds = [−log(1 − F(s))]_{s=0}^{t} = −log(1 − F(t)).

Combining this with the formula

E(X) = ∫_0^∞ P(X > x) dx,

valid for any nonnegative, continuous random variable X, we can find the expected
life-length by using

E(T) = ∫_0^∞ t f(t) dt = ∫_0^∞ (1 − F(t)) dt.
For clarity these relations are summarized as follows.

Proposition 1.15. The intensity function λ(t) and the life-length distribution
function F(t) are obtained from each other via the relations

λ(t) = −(d/dt) log(1 − F(t)),   F(t) = 1 − exp{−∫_0^t λ(s) ds}.

Moreover,

E(T) = ∫_0^∞ exp{−∫_0^t λ(s) ds} dt.

Definition 1.16. If T is a life-length variable with distribution function F then


the remaining life-length given that the current age is s, is a random variable Ts
defined by its assigned distribution
P (Ts ≤ t) = P (T ≤ s + t|T > s), t ≥ 0.

Since

P(T > s + t | T > s) = P(T > s + t)/P(T > s)
= exp{−∫_0^{s+t} λ(u) du} exp{∫_0^s λ(u) du},
we have immediately

Proposition 1.17. For any s ≥ 0,

P(Ts ≤ t) = 1 − exp{−∫_s^{s+t} λ(u) du}, t ≥ 0.

Example 1.18 (Exponential Life). Suppose T ∈ Exp(λ). Then F (t) = 1 − e−λt ,


f (t) = λe−λt , and hence the life-length intensity
λ(t) = f (t)/(1 − F (t)) ≡ λ,
is independent of t. Furthermore, the remaining life-length Ts at age s has distribu-
tion

P(Ts ≤ t) = 1 − exp{−∫_s^{s+t} λ du} = 1 − e^{−λt},
and is hence independent of the age. These relations manifest the “lack of memory
property” of the exponential distribution.
Example 1.19 (Life Insurance). In life insurance, the random variable T is the
life-length of a newborn and Tn is the remaining life of a person n years of age. By
Proposition 1.17, the probability that a person alive on the nth birthday is going to
die before the (n + 1)th birthday is given by
(1.2)   qn = P(Tn ≤ 1) = 1 − exp{−∫_n^{n+1} λ(u) du}.
Statistical bureaus with the task of collecting life-length data estimate probabilities
such as {qn} and publish the results as mortality tables, typically separately for male
and female populations. Figure 11 shows a plot of estimated qn's, the one-year
death probability at age n, with the upper, blue curve for men and the lower, green,
for females, based on data for Germany. Also slightly visible in this graph is the
effect of infant mortality.

Figure 11. Life mortality data, death probability in the next year (y-axis) as
function of current age (x-axis), upper curve for males, lower curve females

1.5. Reliability of systems


The life-length model is useful and efficient if we deal with a single entity subject
to random failure, such as a circuit board, a lightbulb, a tire going flat, or, human
disease or death. Other realistic systems consist of many different components, each
subject to random failure. Of course, the functionality of such systems depends both
on the status of the nodes and on the system design, that is, how the components are inter-connected.
The simplest example to begin with is the system of two components in series.
Checking the system at a snap-shot point in time, we assume that the components

are independent, each working with probability p and having failed with probability
1 − p. Then, clearly, the function probability of the system will be
Rser(p) = P(2-series system works) = p².

In contrast, connecting the component nodes in parallel,


Rpar (p) = P (2-parallel system works) = 1 − P (system does not work)
= 1 − P (node 1 not working) P (node 2 not working) = 1 − (1 − p)2

We can easily extend these examples to systems of n components functioning


with probabilities p1 , . . . , pn , not necessarily equal. Furthermore, we may track the
life-length of the system rather than just making one snap-shot observation. For this,
we consider a collection of n independent components with life-lengths T1 , . . . , Tn
characterized by life-length intensities λi (t), 1 ≤ i ≤ n. For the two cases where the
components are all connected either in series or in parallel we are interested in the
system time, that is the life-length of the system. These are given respectively by
Tser = min(T1 , . . . , Tn ) and Tpar = max(T1 , . . . , Tn ).
The corresponding system survival functions (or survivor functions) are
Rser (t) = P (series system functions at time t) = P (Tser > t);
Rpar (t) = P (parallel system functions at time t) = P (Tpar > t).

For the series system, by independence


Rser (t) = P (T1 > t, . . . , Tn > t) = R1 (t) · · · · · Rn (t),
where Ri(t) = exp{−∫_0^t λi(s) ds}, 1 ≤ i ≤ n, in view of Proposition 1.15. If we
introduce the function

(1.3)   λser(t) = Σ_{i=1}^n λi(t), t > 0,

it follows that

(1.4)   Rser(t) = exp{−∫_0^t λser(s) ds}.

The conclusion is that a system of n independent components in series behaves


exactly as a single system does, if its failure rate is taken to be the accumulated
sum of component failure rates.
We compare this with the parallel system. The calculation
Rpar (t) = 1 − P (T1 ≤ t, . . . , Tn ≤ t) = 1 − (1 − R1 (t)) · · · · · (1 − Rn (t)),
shows that in this case the distribution of the system time is given in a less explicit
form.
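Both survival functions are nevertheless easy to evaluate numerically. A Matlab sketch (ours, with assumed exponential components of means mu(i)):

mu = [1 2 4]; t = linspace(0, 10, 200);   % assumed component means, time grid
Ri = exp(-(1./mu)'*t);                    % R_i(t) = exp(-t/mu_i), one row per component
Rser = prod(Ri);                          % series system survival function
Rpar = 1 - prod(1 - Ri);                  % parallel system survival function
plot(t, Rser, t, Rpar);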

Example 1.20 (Exponential systems). a) Let Ti , 1 ≤ i ≤ n, denote independent,


exponentially distributed random variables with expected values E(Ti ) = µi , and
assume these are the life times of n components connected in series. Using (1.4),

P(Tser ≤ t) = 1 − Rser(t) = 1 − exp{−t Σ_{i=1}^n 1/µi},

which shows that the system time is itself exponentially distributed with expected
system time

E(Tser) = 1/(1/µ1 + · · · + 1/µn).

b) Restricting to the special case µ1 = · · · = µn = µ, and assuming instead that the
n components are set up in parallel,

Rpar(t) = 1 − (1 − e^{−t/µ})^n

and

E(Tpar) = µ(1 + 1/2 + · · · + 1/n).
Example 1.21 (Competing risks model). The summation formula (1.3) means
that we have n simultaneously active risk factors all with the potential to terminate
life time. For example, suppose the life-length of a mechanical device is subject to
the three major risks of start-up failure during the first phase of operation, random
errors, and failure due to aging. This suggests a function λ(t) whose graph resembles
the “profile of a bathtub”, namely a function that starts from a relatively large value
and decreases over an initial period, then stays more or less constant until aging
mechanisms set in, corresponding to a convex increase of the function. As an
approximation one may take

λ1(t) = a(1 − t/t1)+,   λ2(t) = b,   λ3(t) = c(t − t2)+,

and model the life-length as a series system of three components, with intensity

λ(t) = λ1(t) + λ2(t) + λ3(t),

depending on the five parameters a, b, c, t1, t2.
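Proposition 1.15 then gives the expected life-length by numerical integration. A sketch (ours, with entirely hypothetical parameter values):

a = 0.5; b = 0.05; c = 0.01; t1 = 2; t2 = 20;              % hypothetical parameters
lam = @(t) a.*max(1 - t./t1, 0) + b + c.*max(t - t2, 0);   % bathtub intensity
t = linspace(0, 300, 30001);
Lam = cumtrapz(t, lam(t));                 % integrated intensity int_0^t lambda(s) ds
ET = trapz(t, exp(-Lam));                  % E(T) by Proposition 1.15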
Example 1.22 (Human life time). The life-length intensity for the life time of
human beings fits with the idea of the bathtub profile in Example 1.21, signifying the
three major risks infant mortality, random risk exposure, and aging. An alternative,
and probably better, method is to estimate λ(t) directly using mortality data as in
the previous Example 1.19. By (1.2), we have qn ≈ 1 − e−λn , where λn is an estimate
of λ(t) for t around age n. Hence, λn ≈ − ln(1 − qn ), and the same data as before
gives an estimated life-length intensity (for females) as depicted in Figure 12.
Example 1.23 (k out of n system). Systems in series and parallel are not the
only ones of interest. A system with n independent components is said to be a k out
of n system, if the system works if and only if at least k components work. Formally,
suppose all n components have distribution function F (t). Then
Xn (t) = the number of components which work at time t ∈ Bin(n, 1 − F (t)),


Figure 12. Estimated human life-length intensity, female data

and the distribution of the system life-length Tsys is determined by


P (Tsys > t) = P (Xn (t) ≥ k).
As an example let us consider a 2 out of 3 system with i.i.d. components that
have exponential life-lengths T1 , T2 , T3 with the same intensity λ. To obtain the
system survival function, note that the distribution function of a component is
F (t) = 1 − e−λt and that survival of the system up to time t requires survival of two
or all three components at time t. Hence the system survival function is given by
P (Tsys > t) = P (X3 (t) ≥ 2) = 3(e−λt )2 (1 − e−λt ) + (e−λt )3 = 3e−2λt − 2e−3λt .
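A quick Monte Carlo check of this formula (our sketch, with assumed values of λ and t):

lambda = 1; t = 0.5; reps = 1e5;           % assumed parameters
T = -log(rand(3, reps))./lambda;           % exponential component life-lengths
est = mean(sum(T > t) >= 2);               % fraction of runs with >= 2 survivors
exact = 3*exp(-2*lambda*t) - 2*exp(-3*lambda*t);   % formula above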

1.6. The Poisson process


The Poisson process is a fundamental stochastic process for modeling events that
occur randomly and independently in time. There are several equivalent ways of
introducing the Poisson process and we have chosen here the constructive approach
starting from a family of exponential random variables.

Definition 1.24. Let {Uk, 1 ≤ k < ∞} be a sequence of independent and iden-
tically distributed random variables all having the exponential distribution with
intensity λ > 0, P(Ui ≤ t) = 1 − e^{−λt}, t ≥ 0. Let Tk = Σ_{i=1}^k Ui, k ≥ 1, denote the
corresponding sequence of accumulated sums. The stochastic process {N(t), t ≥ 0}
in continuous time and with nonnegative, integer values that counts the number of
time points {Tk, k ≥ 1} that occur before time t,

N(t) = Σ_{k=1}^∞ 1{Tk ≤ t}, t ≥ 0,

is called the Poisson process with intensity λ.

To help visualize the Poisson process it is useful to consider the random times
Tk to be the time epochs of the occurrences of abstract “Poisson events” and the
random variables Uk to be the corresponding inter-occurrence times of these events.
Then
N (t) = the number of Poisson events in the interval [0, t].
In other words, the Poisson process {N (t)} is a jump process in continuous time
which, starting from N (0) = 0, climbs from each integer value to the next larger value

in such a way that the waiting times on each level are independent and exponential
with mean 1/λ.
The definition given above immediately suggests a method to simulate trajecto-
ries of the Poisson process. Building on Example 1.13, the Matlab commands
interarr = -log(rand(1,n))./lambda;
stairs(cumsum(interarr), 0:n-1);

produce an output trajectory such as those in Figure 13. Here, given that lambda
and n are assigned parameters, rand(1, n) is a vector of n uniform random num-
bers on the interval [0, 1] and thus interarr becomes a vector with values that
represent typical outcomes of the exponential random variables U1 , . . . , Un (c.f. Ex-
ample 1.13). Consequently the vector cumsum(interarr) of the cumulative sums
of interarr give typical values of the times T1 , . . . , Tn of occurrence of the Pois-
son events. Figure 13 illustrates the variation which arise naturally in independent
realizations of Poisson paths.


Figure 13. Simulated trajectories of the Poisson process, λ = 2

1.7. Properties of the Poisson process


There are n or more Poisson events in the interval [0, t] if and only if the nth Poisson
event occurs no later than time t, that is if and only if Tn ≤ t. Hence
(1.5) {N (t) ≥ n} = {Tn ≤ t}.
Equivalently, since the complementary sets are also identical, {N (t) < n} = {Tn >
t}. In particular, for n = 1, {N (t) = 0} = {U1 > t}. This shows
P (N (t) = 0) = P (U1 > t) = e−λt .
To use the full strength of (1.5) we use the following fact from probability theory.
The sum of n independent exponential random variables with intensity λ has the
gamma-distribution Γ(n, λ). In the present context this means that the time Tn of
the nth occurrence of a Poisson event has the density function
fTn(t) = λ^n t^{n−1} e^{−λt}/(n − 1)!, t ≥ 0.
It follows from (1.5) that for any n ≥ 1 we have
P (N (t) = n) = P (N (t) ≥ n) − P (N (t) ≥ n + 1) = P (Tn ≤ t) − P (Tn+1 ≤ t),

and therefore

(d/dt) P(N(t) = n) = fTn(t) − fTn+1(t) = λ^n t^{n−1} e^{−λt}/(n − 1)! − λ^{n+1} t^n e^{−λt}/n!.

On the other hand, the expression on the right side above can be recognized as the
derivative of a well-known function:

(d/dt)(e^{−λt} (λt)^n/n!) = e^{−λt} λ^n t^{n−1}/(n − 1)! − e^{−λt} λ^{n+1} t^n/n!.

Therefore,

P(N(t) = n) = e^{−λt} (λt)^n/n! + C,
where C is a constant of integration. But if we also take into account the initial
condition P (N (0) = n) = 0, it follows that C = 0 and hence for any t the random
variable N (t) is Poisson distributed with expected value E(N (t)) = λt.
With this we have demonstrated one part of the following proposition, which
characterizes the Poisson process as a stochastic process with independent, station-
ary, Poisson-distributed increments.

Proposition 1.25. The Poisson process {N (t), t ≥ 0}, N (0) = 0, is an integer-


valued stochastic process such that
a) the successive increments N (t1 ), N (t2 ) − N (t1 ), . . . , N (tk ) − N (tk−1 ) are inde-
pendent random variables for any choice of 0 ≤ t1 ≤ · · · ≤ tk and k ≥ 1;
and
b) for some λ > 0,
N (t) − N (s) ∈ Po(λ(t − s)), for any s ≤ t.

In the framework of this result observe the natural parameter interpretation


λ = E(N (1)) = expected number of Poisson events per unit of time.
Superposition property. We can now demonstrate the important superposi-
tion property of the Poisson process, namely that if {N1 (t)} and {N2 (t)} are two
independent Poisson processes with the respective intensities λ1 and λ2 , then the
sum M (t) = N1 (t)+N2 (t) defines a new Poisson process, which has intensity λ1 +λ2 .
To see this, we first observe that M (0) = 0 and {M (t)} attains integer values only.
Hence it suffices to verify a) and b) of Proposition 1.25. The increments of {M (t)}
are independent, since we have for any k that M (tk )−M (tk−1 ) = N1 (tk )−N1 (tk−1 )+
N2 (tk ) − N2 (tk−1 ) and the increments on the right hand side are independent, by
assumption, so that a) holds. Moreover, for any s ≤ t,
M (t) − M (s) = N1 (t) − N1 (s) + N2 (t) − N2 (s) ∈ Po(λ1 (t − s) + λ2 (t − s)),
by the additive property of independent Poisson distributed random variables, which
shows b). Hence {M (t)} is a Poisson process with intensity λ1 + λ2 .
By repeating this argument it follows that if N1 , . . . , Nn are independent Poisson
processes with intensities λ1 , . . . , λn , then Mn = N1 + · · · + Nn is also a Poisson
process. The intensity for the superposition process Mn is the sum λ1 +· · ·+λn . This
additivity property is similar to (1.3) discussed previously for lifelength intensities

of series systems. Indeed, if U1k is the time of the first jump of Nk , k = 1, . . . , n, then
the first jump of Mn occurs at time min(U11 , . . . , U1n ). Thus, the inter-occurrence
times of the component processes relate to the superposition process in a similar
way as component life-lengths relate to the series system life-length.
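The superposition property is also easy to check empirically. The following MATLAB sketch merges two simulated Poisson streams; the parameter values lambda1, lambda2 and n are chosen only for illustration, and the mean inter-occurrence time of the merged stream should come out close to 1/(λ1 + λ2).
% Sketch: superposition of two independent Poisson processes
% (parameter values chosen only for illustration)
lambda1 = 2; lambda2 = 3; n = 1000;
T1 = cumsum(-log(rand(1,n))./lambda1);   % event times of N1
T2 = cumsum(-log(rand(1,n))./lambda2);   % event times of N2
Tmax = min(T1(end), T2(end));            % window covered by both streams
T = sort([T1(T1<=Tmax) T2(T2<=Tmax)]);   % merged event times of M = N1 + N2
mean(diff(T))                            % close to 1/(lambda1+lambda2) = 0.2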
Thinning property. The Poisson process is also preserved under an operation
which can be viewed as the opposite of superposition, namely thinning. We start
with a Poisson process N with intensity λ, which jumps at the successive times Tk ,
k ≥ 1, as in Definition 1.24. In addition, let {Jn , n ≥ 1} denote a sequence of i.i.d.
random variables with the possible values 0 or 1 and such that P(Jn = 0) = p and
P(Jn = 1) = 1 − p where p, 0 ≤ p ≤ 1, is a parameter. Similarly as in Definition
1.24, let
M(t) = \sum_{k=1}^{\infty} 1_{\{T_k \le t\}} J_k, \qquad t \ge 0.
Thinking of p as a thinning parameter, so that each time a Poisson event in N occurs
it is independently removed with probability p and kept with probability 1 − p, it is
clear that the random process M , which counts only the remaining Poisson events,
is a thinned version of N . It is a nice and useful feature of Poisson processes that the
thinned process M is also a Poisson process, the intensity of which is now λ(1 − p).
Several proofs of this property can be found in Gut [11].
As an application of this model assume that the arrivals of messages to an e-
mail address inbox is given by a Poisson process, and the sequence of messages after
having installed an e-mail spam filter is a thinned Poisson process that corresponds
to the original one.
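The thinning operation is equally simple to simulate. A minimal MATLAB sketch, again with illustrative parameter values:
% Sketch: thinning of a Poisson process with removal probability p
lambda = 2; p = 0.4; n = 1000;
T = cumsum(-log(rand(1,n))./lambda);   % event times of the original process
keep = (rand(1,n) > p);                % J_k = 1 with probability 1-p
Tkept = T(keep);                       % event times of the thinned process
mean(diff(Tkept))                      % close to 1/(lambda*(1-p))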
To complete the presentation of various aspects of the standard Poisson process
it remains to mention the dynamical approach, which emphasizes the alternative
interpretation of λ as an infinitesimal jump intensity.

Proposition 1.26. The Poisson process {N (t), t ≥ 0}, N (0) = 0, is an integer-


valued stochastic process such that
a) the successive increments N (t1 ), N (t2 ) − N (t1 ), . . . , N (tn ) − N (tn−1 ) are inde-
pendent;
b) as h → 0, P (N (t + h) − N (t) = 1) = λh + o(h); and
c) as h → 0, P (N (t + h) − N (t) ≥ 2) = o(h).

If we use the notation


pn (t) = P (N (t) = n), n = 0, 1, . . . , t ≥ 0,
it follows from Proposition 1.26, that for any n ≥ 1, t ≥ 0 and h > 0,
pn (t + h) = pn (t)(1 − λh − o(h)) + pn−1 (t)(λh + o(h)) + o(h)
and thus
\frac{p_n(t+h) - p_n(t)}{h} = \lambda\bigl(p_{n-1}(t) - p_n(t)\bigr) + \frac{o(h)}{h}.
Taking h → 0 this leads to the series of coupled differential equations
\frac{d}{dt}\, p_n(t) = \lambda\bigl(p_{n-1}(t) - p_n(t)\bigr), \qquad n \ge 1.

Moreover, the case n = 0 yields


\frac{d}{dt}\, p_0(t) = -\lambda\, p_0(t).
Solving these equations we recover again the Poisson distribution
p_n(t) = e^{-\lambda t}\, \frac{(\lambda t)^n}{n!}, \qquad n \ge 0.
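For completeness, here is the short induction argument behind the last step. The case n = 0 gives p_0(t) = e^{-\lambda t}, and the equation for general n may be rewritten with an integrating factor as
\frac{d}{dt}\bigl( e^{\lambda t} p_n(t) \bigr) = \lambda e^{\lambda t} p_{n-1}(t).
Assuming p_{n-1}(t) = e^{-\lambda t} (\lambda t)^{n-1}/(n-1)! and using p_n(0) = 0 for n ≥ 1, integration yields
e^{\lambda t} p_n(t) = \int_0^t \frac{\lambda^n s^{n-1}}{(n-1)!}\, ds = \frac{(\lambda t)^n}{n!},
which is the claimed formula.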
It turns out that either of Proposition 1.25 and Proposition 1.26 is equivalent to
Definition 1.24. For complete proofs and an exposition and comparison of the various
alternatives, see e.g. Gut [11], Ch VII.
Example 1.27 (Telephone network). Returning to the introductory Example 1.2
and rephrasing, it is the Poisson process which is the classical model of telephone
traffic. As early as 1920, the engineer A.K. Erlang of the Danish telephone company
in Copenhagen had developed an exceptionally useful queueing theory for telephone
switches. As the global, wired telephone network expanded during the following decades, these new and accurate methods were used to decide the size of the systems. Simplifying the process, the first step is to set an acceptable risk, say 1%, that a person placing a new call will be blocked because no line is free. Next, the potential number of users in a particular geographical area gives an estimate of the arrival intensity λ.
With this information it is then possible to derive the required number of connections,
outgoing lines, to be installed in each switch.

Figure 14. Incoming calls to health care center during one week (axis labels may
be ignored and merely indicate that data is part of a longer series of measurements)

Example 1.28 (Phone call arrivals). Suppose that phone call inquiries to a local
health care center are placed at random times throughout the opening hours of a
typical work day. Is it realistic to assume that the calls are well described by the
Poisson process? Perhaps not, for several reasons. Is the intensity really constant
throughout the day? Is it the same for each day of the week? What about the
required independence between the Poisson events given that the calls are made
from essentially the same geographical area? To get some insight into these matters
we look again at real data. Figure 14 shows the counting process which results from
plotting the times of all incoming calls placed to a mid-sized health care center in a
town in mid Sweden during one week of 2011. The plot confirms that some aspects
of the data deviate from Poisson behavior. The intensity appears to be larger in the
morning of each day and then to decrease gradually. Also, the number of incoming
calls is larger during Monday and Friday in comparison to the other days of the
week. This suggests the alternative of a time-inhomogeneous Poisson process where we replace the constant intensity λ with a nonnegative function λ(t), t ≥ 0. Then the corresponding counting process {N(t)} will be such that for each t, N(t) has the Poisson distribution with expected value \int_0^t \lambda(s)\, ds. In this case the increments of the process are still independent but no longer stationary.
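A convenient way to simulate a time-inhomogeneous Poisson process is thinning of a homogeneous one: generate events at a constant rate λmax with λ(t) ≤ λmax, and keep an event at time t with probability λ(t)/λmax. A minimal MATLAB sketch, where the intensity function λ(t) = 1 + cos(t) and the bound λmax = 2 are invented purely for illustration:
% Sketch: inhomogeneous Poisson process by thinning of a homogeneous one
lamfun = @(t) 1 + cos(t);   % hypothetical intensity function lambda(t)
lammax = 2;                 % upper bound for lamfun on the window
Tend = 20;
T = cumsum(-log(rand(1,200))./lammax);     % homogeneous events at rate lammax
T = T(T <= Tend);
keep = rand(size(T)) < lamfun(T)./lammax;  % keep event at t w.p. lambda(t)/lammax
Tinh = T(keep);                            % event times of inhomogeneous process
stairs(Tinh, 1:numel(Tinh));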
Example 1.29 (Spatial Poisson process). A spatial point process is a collection
of random points in a spatial region S of the plane or in all of R2 . We write N (A)
for the number of points in a set A ⊂ S, and allow arbitrary (but well-defined) sets
A. Denote by |A| the area of the set. We say that {N (A), A ⊂ S} is a spatial
Poisson point process with intensity λ > 0 if
• for each set A, N (A) has the Poisson distribution with expected value λ|A|;
• for any pair of disjoint sets A and B in S, the random variables N (A) and
N (B) are independent.
Figure 15 shows a simulation of a spatial Poisson process in the unit square of the
plane. Typical areas of application include describing the positions of bacteria in a
cell culture, or the locations of subscribers to a mobile phone system.
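A picture such as Figure 15 can be produced in two steps: draw the total number of points from the Poisson distribution with mean λ|S|, then place that many points uniformly in S. A minimal MATLAB sketch for the unit square (note that poissrnd requires the Statistics Toolbox):
% Sketch: spatial Poisson process on the unit square, |S| = 1
lambda = 200;
N = poissrnd(lambda);          % total number of points, Po(lambda)
x = rand(N,1); y = rand(N,1);  % given N, the points are uniform on the square
plot(x, y, '.'); axis([0 1 0 1]);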


Figure 15. Spatial Poisson process, λ = 200




Figure 16. Annual minimum temperatures in Uppsala, 1840-2015

Solved exercises
1. Figure 16 shows a plot of the annual series of temperatures studied in Example
1.1. Assuming the Gaussian model for the temperature variable X discussed in
this example, find the probability that the temperature of the coldest day next
year
a) falls below −25◦ C;
b) yields a new all-time record low (since 1840).
Solution. The assumption is that X ∈ N(µ, σ 2 ) with parameters µ = −22.28 and
σ = 4.497. Let Y be the centered and scaled random variable Y = (X − µ)/σ.
Clearly,
E(Y) = \frac{E(X) - \mu}{\sigma} = 0, \qquad V(Y) = \frac{V(X - \mu)}{\sigma^2} = \frac{V(X)}{\sigma^2} = 1,
and since the normal distribution is preserved under addition and multiplication
with constants, we have Y ∈ N(0, 1).
a)
P (X ≤ −25) = P (X − µ ≤ −25 − µ) = P (Y ≤ (−25 − µ)/σ)
= P (Y ≤ −2.72/4.497) = P (Y ≤ −0.605) = Φ(−0.605),
where Φ is the distribution function of the standard normal. By symmetry,
Φ(−x) = 1 − Φ(x). Hence P (X ≤ −25) ≈ 1 − 0.73 = 0.27.
b) P (X < −39.5) = P (X ≤ −39.5) = · · · = 1 − Φ(3.841) ≈ 0.00006
2. For the teletraffic model in Example 1.2, find the covariance and the correlation
coefficient between the variables T1 and Tn , n ≥ 1. Same questions for the pair
Tn−1 and Tn , n ≥ 2. Interpretations?
Solution. We have
C(T_1, T_n) = C(U_1, U_1 + \cdots + U_n) = C(U_1, U_1) + \cdots + C(U_1, U_n).
For n ≥ 2, since U_1 and U_n are independent, C(U_1, U_n) = E(U_1 U_n) − E(U_1)E(U_n) = 0. Also, C(U_1, U_1) = V(U_1) = 1/λ^2. Hence C(T_1, T_n) = 1/λ^2 and
\rho(T_1, T_n) = \frac{C(T_1, T_n)}{D(T_1)D(T_n)} = \frac{1/\lambda^2}{\sqrt{1/\lambda^2}\,\sqrt{n/\lambda^2}} = \frac{1}{\sqrt{n}}.

Similarly,
C(T_{n-1}, T_n) = V(T_{n-1}) = \frac{n-1}{\lambda^2}
and
\rho(T_{n-1}, T_n) = \frac{(n-1)/\lambda^2}{\sqrt{(n-1)/\lambda^2}\,\sqrt{n/\lambda^2}} = \sqrt{1 - \frac{1}{n}}.
Hence, for large n, the correlation between T1 and Tn is close to zero and the
correlation between Tn−1 and Tn is close to one. This is the probabilistic way
of saying that, if we know when the first call comes in we still know very little
about the timing of caller one hundred, say, but if we know when the 99th call
is placed then we are in a good position to predict the time of the next call after
that. Of course, covariance and correlation are symmetric in the two variables,
so these statements can also be turned around.
3. With reference to Example 1.11, let Un = Vn /V0 denote the stock price relative
to the initial price over n weeks.
a) Express the probability P (Un ≤ u) in terms of the standard normal disti-
bution function Φ.
b) Based on the GM data suggest an estimate of the parameter σ.
c) Using the estimate of σ from b), what is the probability that the stock price
more than doubles in one year? What is the probability that the stock price
increases tenfold over a period of 20 years?
Solution. Put S_n = \sum_{k=1}^{n} X_k. By the summation property of the normal distribution, the distribution of S_n is N(0, n). Thus, the distribution of S_n/\sqrt{n} is N(0, 1) with distribution function Φ(x). We have
Un = Vn /V0 = eσSn
so
P (Un ≤ u) = P (eσSn ≤ u) = P (Sn ≤ ln(u)/σ).
Hence
P(U_n \le u) = P\Bigl( \frac{S_n}{\sqrt{n}} \le \frac{\ln(u)}{\sigma\sqrt{n}} \Bigr) = \Phi\Bigl( \frac{\ln(u)}{\sigma\sqrt{n}} \Bigr).
For b), a crude visual inspection of the histogram in Figure 8 indicates that 95%
of all log return values fall in the interval ±2σ, if we take σ ≈ 0.05, which is
thus a reasonable point estimate of σ consistent with the Gaussian assumption.
The first probability in c) is
P(U_{50} > 2) = 1 - \Phi\Bigl( \frac{\ln 2}{0.05\sqrt{50}} \Bigr) \approx 0.025
and the second
P(U_{1000} > 10) = 1 - \Phi\Bigl( \frac{\ln 10}{0.05\sqrt{1000}} \Bigr) \approx 0.073.
4. Carry out a simulation of typical values of a lifelength variable T , with lifelength
intensity
\lambda(t) = \frac{1}{2\sqrt{t}}, \qquad t \ge 0.
Solution. By Proposition 1.15,

F(t) = P(T \le t) = 1 - \exp\Bigl\{ -\int_0^t \frac{1}{2\sqrt{u}}\, du \Bigr\} = 1 - e^{-\sqrt{t}}.
By the inversion principle for simulation of continuous random variables, we
seek numbers ti such that F (ti ) = ui , where (ui )1≤i≤n is a given sequence of
uniformly distributed random numbers, that is
ti = (− ln(1 − ui ))2 , i = 1, . . . , n.
Comparing this result to the situation of the generalized random walk in Figure
4, it follows that T has the same distribution as the square of an exponential
random variable with mean one.
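In MATLAB the inversion step may be carried out as follows (a minimal sketch):
% Sketch: simulate n lifelengths by inversion, t = (-log(1-u))^2
n = 1000;
u = rand(1,n);
t = (-log(1-u)).^2;
histogram(t, 50);   % empirical distribution of the simulated lifelengths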
5. Two independent components with constant failure intensitites are connected in
parallel. The expected life-lengths of the two components are c and 1 − c, for
some parameter c with 0 ≤ c ≤ 1. Find the expected life-length of the system as
a function of c, and determine for which values of c the expected life is minimal
and maximal, respectively.
Solution. We have two independent, exponentially distributed life-length vari-
ables T1 and T2 with expected values E(T1 ) = c and E(T2 ) = 1 − c, and wish to
compute E(Tpar ) where Tpar = max(T1 , T2 ). One way to do this is to start with
Rpar (t) = P (Tpar > t) = 1 − P (Tpar ≤ t) = 1 − P (T1 ≤ t)P (T2 ≤ t).
By setting λ1 = 1/c and λ2 = 1/(1 − c) this becomes
Rpar (t) = 1 − (1 − e−λ1 t )(1 − e−λ2 t ) = e−λ1 t + e−λ2 t − e−(λ1 +λ2 )t ,
and so
E(T_{par}) = \int_0^\infty R_{par}(t)\, dt = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2} = 1 - c(1 - c),
which is maximal and equal to 1 for the limiting cases c = 0 and c = 1 and
attains its minimal value 0.75 for c = 0.5.
6. A system consists of three independent components in parallel all with the same
failure intensity λ(t) = 2t (per hour). Find the system failure intensity λpar (t).
Plot the two functions together in the same graph and interpret the result.
Solution. The life-length distribution function for each component is
F(t) = 1 - \exp\Bigl\{ -\int_0^t 2s\, ds \Bigr\} = 1 - e^{-t^2}, \qquad t \ge 0,

and the corresponding density function


f(t) = F'(t) = 2t\, e^{-t^2}.
It follows that the distribution function of Tpar = max{T1 , T2 , T3 } is given by
F_{par}(t) = P(T_{par} \le t) = P(T_1 \le t,\, T_2 \le t,\, T_3 \le t) = F(t)^3 = (1 - e^{-t^2})^3
and the density function by
f_{par}(t) = F_{par}'(t) = 3F(t)^2 f(t) = 3(1 - e^{-t^2})^2\, 2t\, e^{-t^2}.

The failure rate of the system is therefore


\lambda_{par}(t) = \frac{f_{par}(t)}{1 - F_{par}(t)} = \frac{6t\, e^{-t^2} (1 - e^{-t^2})^2}{1 - (1 - e^{-t^2})^3}.
Figure 17 shows plots of λ(t) and λpar (t) in the same graph. The resulting
curves indicate the improvement in reliability obtained by using three redundant
components instead of a single one. After a long time the parallel system will
operate under more or less the same risk of failure as a single unit, since then,
most likely, only one of the three units remains.


Figure 17. Lifelength intensities for single component (blue) and system (red)
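A plot such as the one in Figure 17 may be produced with a few MATLAB lines (a minimal sketch):
% Sketch: component and system lifelength intensities from Exercise 6
t = linspace(0.05, 3, 300);
F = 1 - exp(-t.^2);                                  % component distribution
lampar = 3*F.^2 .* 2.*t.*exp(-t.^2) ./ (1 - F.^3);   % system failure intensity
plot(t, 2*t, 'b', t, lampar, 'r');
xlabel('Time'); ylabel('Lifelength intensity');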

7. Operational disturbances in a power plant occur according to a Poisson process


with intensity 6 disturbances each year. Find the probability of
a) no disturbance in the first quarter of a year;
b) one disturbance in the first half of a year;
c) no disturbance in the first quarter of a year, given one disturbance the first
half of the same year.
Solution. By introducing
X(t) = number of disturbances in the time interval [0, t], t ≥ 0,
we obtain a Poisson process {X(t)} with intensity 6 events per year, that is
X(t) ∈ Po(6t). In particular, X(1/4) ∈ Po(3/2), X(1/2) ∈ Po(3) and X(1/2) −
X(1/4) ∈ Po(3/2). Since the increments of the process are independent it follows
that X(1/4) and X(1/2) − X(1/4) are independent random variables. We obtain
a) P (X(1/4) = 0) = e−3/2 ≈ 0.223
b) P (X(1/2) = 1) = 3e−3 ≈ 0.149
c)
P(X(1/4) = 0 \mid X(1/2) = 1) = \frac{P(X(1/4) = 0,\; X(1/2) = 1)}{P(X(1/2) = 1)}
= \frac{P(X(1/4) = 0,\; X(1/2) - X(1/4) = 1)}{P(X(1/2) = 1)}
= \frac{P(X(1/4) = 0)\, P(X(1/2) - X(1/4) = 1)}{P(X(1/2) = 1)} = \frac{e^{-3/2} \cdot \frac{3}{2} e^{-3/2}}{3 e^{-3}} = \frac{1}{2}
8. Suppose incoming calls to a telephone occur at time points given by a Poisson
process with intensity three calls per hour. Each call lasts exactly ten minutes.

a) At an arbitrary time, what is the probability that the phone is busy?


b) Incoming calls while the phone is busy are considered lost. What is the
average number of lost calls per hour?
Solution. Let
N (t) = number of incoming calls during the time interval (0, t].
By assumption this is a Poisson process with intensity λ = 3 calls per hour.
Changing the time scale to minutes it is a Poisson process with intensity 1/20
calls per minute. For problem a) we are asked to consider an “arbitrary time
point”. Any time greater than ten minutes, t > 10, will do, since this avoids
any influence of the fact that we start counting the phone calls at time t = 0
and the call duration is ten minutes. Now we observe that for such times t the
two events
{phone is busy at t} and {at least one call arrives during (t − 10, t]}
coincide. Thus,
P (phone is busy at t) = 1 − P (no call during (t − 10, t])
= 1 − P (N (t) − N (t − 10) = 0).
Since the increments of the Poisson process are stationary the distribution of
N (t) − N (t − 10) is the same as the distribution of N (10), which is the Poisson
distribution with mean λ/6 = 1/2. Thus, for any t > 10
P (phone is busy at t) = 1 − P (N (10) = 0) = 1 − e−1/2 = 0.3935.
For b), form a thinned stream of incoming calls by independently removing each
Poisson event with probability e−1/2 and keeping it with probability 1 − e−1/2 .
The resulting thinned process
M (t) = number of lost calls up to time t, t ≥ 0,
is a new Poisson process with intensity 3(1 − e−1/2 ) calls per hour. In particular,
the expected number of lost calls per hour is 3(1 − e−1/2 ) ≈ 1.18.
Chapter 2

Markov Chain Models, discrete time

We give a short introduction to Markov chains in discrete time and illustrate the
main concepts and ideas with standard examples. Some of the proofs are indicated
but the emphasis is put on introducing Markov models in a descriptive manner.
To demonstrate the relevance of Markov models in modern applications we discuss
two specific examples in more detail. One example is the mechanism behind the
web search algorithm Google Page Rank. The second example is a Markov chain
description of the rating of corporations or countries based on their credit standing as
done by credit rating agencies. For complete introductions to the subject and more
advanced material the reader is referred to such sources as Brémaud [3], Grimmett
and Stirzaker [10], Resnick [18] or Wolff [21].

2.1. The Markov property


A Markov chain is a mathematical model for quantities which change values ran-
domly by making jumps between a set of possible states. In the simplest case we
have a finite number of states and the chain makes one jump every time point. Write
X0 for the initial state of the chain, then X1 for the state after the first jump, X2
for the value after two jumps, and so on. In this manner we get a discrete time
stochastic process Xn , n ≥ 0, taking values in a state space E, Xn ∈ E, which could
be E = {red, blue, green}, E = {1, 2, . . . , 100}, or some other set. The characteris-
tic feature of a Markov chain is that the jump probabilities at any given time depend only on the current state of the chain and not on previous values that the chain has passed through.
A discrete time Markov chain can be seen as a stochastic version of a determin-
istic recursion of the form xn = g(xn−1 ), n ≥ 1, where g(x) is a given function. It is
clear that the knowledge of any previous values x0 , . . . , xn−2 , in addition to knowing
the value xn−1 , would not give further information about xn . Similarly, relation
(1.1) for the simple random walk shows that the knowledge of any previous steps
{Xk , 0 ≤ k ≤ n − 2} of the walk in addition to knowing Xn−1 has no influence on

37

the outcome of the nth step. This type of dependence is the content of the Markov
property.

Definition 2.1. A discrete time stochastic process {Xn } with discrete state space
E is called a Markov chain if it satisfies the Markov property:
P (Xn = xn |X0 = x0 , . . . , Xn−1 = xn−1 ) = P (Xn = xn |Xn−1 = xn−1 )
for all x0 , . . . , xn ∈ E and n ≥ 1. It is called a time-homogeneous Markov chain if
for each x, y ∈ E
pxy = P (Xn = y|Xn−1 = x) does not depend on n.
In the time-homogeneous case the probabilities {pxy , x, y ∈ E} are called the tran-
sition probabilities of the Markov chain.

In the following we restrict the formal presentation to the time-homogeneous


case, the most important class of Markov chains. We also follow practice of most
introductory presentations of Markov chain theory and focus on the particular cases
E = {0, 1, . . . , r} and E = {0, 1, . . . }. It is quite convenient to collect the transition
probabilities in matrix form and define the transition probability matrix as
 
P = \begin{pmatrix} p_{00} & p_{01} & \cdots \\ p_{10} & p_{11} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}

Example 2.2 (Interpretation of transition matrix). Suppose that the dynam-


ics of a Markov chain X0 , X1 , . . . on the state space {0, 1, 2} is defined by the tran-
sition probability matrix
P = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 1/4 & 1/4 \\ 0 & 2/3 & 1/3 \end{pmatrix},
sometimes written with the states 0, 1, 2 as row and column labels.
The successive values of the chain are updated each discrete time point as follows.
From state 0 the chain jumps with certainty to state 2. From state 1 there are the
three possibilities of a jump to state 0, of remaining in state 1, and of a jump to state
2 with probabilities 1/2, 1/4 and 1/4, respectively. From state 2, the chain jumps to
state 1 with probability 2/3 and remains in state 2 with probability 1/3. The initial
distribution is arbitrary. For example, p(0) = (0 1 0) means that X0 = 1, whereas
p(0) = (1/3 1/3 1/3) corresponds to selecting one of the three states uniformly at
time n = 0.

With every Markov chain can be associated a state transition diagram which is
a graph with one node for each state of the chain and directed edges in the form of
arrows between nodes representing all one-step transitions of the chain that can occur
with positive probability. By labeling each arrow with the corresponding transition
probability the resulting transition diagram contains all relevant information about
the Markov chain.

Example 2.3 (Transition graph). The Markov chain transition matrix


 
P = \begin{pmatrix} 0 & 0.1 & 0.9 \\ 0 & 0.4 & 0.6 \\ 0.3 & 0.5 & 0.2 \end{pmatrix}
corresponds to the state transition diagram
[Transition diagram: nodes 0, 1, 2 with directed edges labeled by the transition probabilities of P.]

To see how the matrix and vector formalism of Markov chains work we introduce
the additional notations:
Absolute (state) probabilities: p_i^{(n)} = P(X_n = i), \quad p^{(n)} = (p_0^{(n)}\; p_1^{(n)}\; \ldots);
m-step transition probabilities: p_{ij}^{(m)} = P(X_m = j \mid X_0 = i), in particular p_{ij}^{(1)} = p_{ij};
m-step transition probability matrix: P^{(m)} = \begin{pmatrix} p_{00}^{(m)} & p_{01}^{(m)} & \cdots \\ p_{10}^{(m)} & p_{11}^{(m)} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, in particular P^{(1)} = P.
If the state space is finite, say E = {0, . . . , r}, the vectors p(n) are row vectors of
length r + 1 and the matrices P(m) are regular matrices of size r + 1 rows and r + 1
columns. In the general case E = {0, 1 . . . } we still think of the absolute probabilities
forming indefinite row vectors and the collection of transition probabilities organized
as infinitely large matrices. It is clear from the definitions that for all such matrices
P(m) the elements on any fixed row must sum to one:
\sum_j p_{ij}^{(m)} = 1, \qquad i = 0, 1, \ldots

Theorem 2.4. [Chapman-Kolmogorov’s relation] The transition probabilities of a


Markov chain satisfy the relations
(i) P(m) = Pm
(ii) P(n+m) = P(n) P(m)
(iii) p(m) = p(0) Pm

Proof. It is clear that (ii) follows from (i). Moreover, (iii) follows from (i) by sum-
ming over the initial distribution p(0) . We demonstrate (i) using proof by induction
over m. The case m = 1 is trivial. Suppose (i) is true for all index values strictly
less than m. Now, to extend (i) to index m we only have to note that
p_{ij}^{(m)} = \sum_k p_{ik}\, p_{kj}^{(m-1)}, \qquad i, j \in E

saying that a move from i to j in exactly m steps is the same as a jump from i to
some state k in the first step, and then a second sequence of jumps from k to j in
exactly m − 1 steps. The above relations written in matrix form correspond to the
operation of matrix multiplication:
P(m) = PP(m−1) ,
hence, by the induction hypothesis, P(m) = PPm−1 = Pm , proving relation (i). 
Example 2.5. With P defined in Example 2.3 we have, for example,
 
P^5 \approx \begin{pmatrix} 0.1163 & 0.3861 & 0.4976 \\ 0.1285 & 0.3969 & 0.4745 \\ 0.1516 & 0.4169 & 0.4315 \end{pmatrix}
By Theorem 2.4, the elements in this matrix give the probability distributions for
the state X5 of the Markov chain after 5 jumps, P (X5 = 0|X0 = 0) ≈ 0.1163, etc.
Furthermore, if the initial distribution of X0 is known, let’s say p(0) = (0.3 0.6 0.1),
then the (absolute) distribution of the chain at time 5 can be read off in the vector
p(5) = p(0) P5 ≈ (0.1272 0.3957 0.4772)
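Computations of this kind are immediate in MATLAB (a minimal sketch):
% Sketch: m-step transition probabilities and absolute distribution
P = [0 0.1 0.9; 0 0.4 0.6; 0.3 0.5 0.2];
P5 = P^5;             % 5-step transition probability matrix
p0 = [0.3 0.6 0.1];   % initial distribution
p5 = p0 * P5          % distribution of X_5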
Example 2.6 (The Ehrenfest model (1907)). Consider a total of r molecules of
a gas distributed in two containers A and B. At time n = 0 there are x molecules
in container A and r − x in container B. Each time point n = 1, 2, . . . one of the r
molecules is chosen randomly and moved to the other container. Let
Xn = # of molecules in container A at time n, n ≥ 0.
It is clear that this defines a Markov chain {Xn } with initial distribution X0 = x
and transition matrix
 
P = \begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & 0 & 0 \\
1/r & 0 & 1 - 1/r & 0 & \cdots & 0 & 0 \\
0 & 2/r & 0 & 1 - 2/r & \cdots & 0 & 0 \\
\vdots & & & \ddots & \ddots & & \vdots \\
0 & 0 & \cdots & 0 & (r-1)/r & 0 & 1/r \\
0 & 0 & \cdots & 0 & 0 & 1 & 0
\end{pmatrix}
Figure 1 shows two simulated paths of the Ehrenfest Markov chain with r = 100
over a time span of 2000 transitions, one with initial condition x = 0, the other with
x = r. To produce such simulated trajectories one can use MATLAB commands,
for example
m=2000;          % number of simulation steps
r=100;           % total number of molecules
x=zeros(1,m+1);  % trajectory; x(k) = number of molecules in container A
x(1)=r;          % initial condition: all molecules in A
k=1;
while k<=m
    z=(rand(1,1)>x(k)/r);  % z=1: molecule moves into A, z=0: out of A
    k=k+1;
    x(k)=x(k-1)+2*z-1;
end
stairs((0:m),x);


Figure 1. Two simulated paths of the Ehrenfest chain

2.2. Stationary distribution and steady-state


What happens in the previous example, and which is illustrated in Figure 1, is that
after an initial period of adaptation where the influence of the initial condition de-
creases, the path of the Markov chain settles into its typical behavior. The chain
appears to stabilize with random fluctuations of a certain magnitude around an av-
erage value. This is typical for many Markov chains and referred to as steady-state
behavior, or a Markov chain in equilibrium. Natural problems are now to find con-
ditions under which a Markov chain has a steady-state and to find the distribution
of the chain in this asymptotic sense. To give partial answers to these questions
we will need the notions of stationary distributions and asymptotic distributions
for Markov chains (some authors prefer alternative terminology such as invariant
distribution and limit distribution).

Definition 2.7. A probability vector π = (π0 π1 . . . ) is called a stationary distri-


bution (invariant distribution) for the Markov chain {Xn } if it has the property
that if π is applied as the initial distribution of the Markov chain, i.e. p(0) = π,
then the distribution of the chain is preserved over time, i.e. p(n) = π for all n ≥ 0.

Definition 2.8. A probability vector π = (π0 π1 . . . ) is an asymptotic distribution


(limit distribution) for the Markov chain {Xn } if
\lim_{n \to \infty} P(X_n = k) = \pi_k, \qquad k \ge 0,

independently of the initial distribution p(0) for X0 .



Asymptotic distributions are always stationary but the converse is not true. The
basic recipe for finding stationary distributions of a Markov chain is to solve a system
of linear equations.

Proposition 2.9. If a Markov chain has an asymptotic distribution then this dis-
tribution is also stationary. A probability distribution π is a stationary distribution
for a Markov chain if and only if it is a solution of the linear system of equations π = πP, that is
\pi_i = \sum_{k=0}^{\infty} \pi_k\, p_{ki}, \quad i \ge 0, \qquad \pi_0 + \pi_1 + \cdots = 1

Example 2.10 (Two-state chain). Consider the two-state Markov chain with
jump probabilities p01 = a and p10 = b, where 0 ≤ a, b ≤ 1. Let π = (b/(a +
b) a/(a + b)) and assume that the Markov chain starts at time n = 0 with this initial
distribution, p^{(0)} = π. By Chapman-Kolmogorov's relation
p^{(1)} = \pi P = \Bigl( \frac{b}{a+b}\;\; \frac{a}{a+b} \Bigr) \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix} = \Bigl( \frac{b}{a+b}\;\; \frac{a}{a+b} \Bigr) = \pi
Similarly, p(2) = π, and so on. Thus, π is a stationary distribution for this Markov
chain. Consider the special case a = b = 1. This describes the trivial chain which
jumps from one state to the other and back again, indefinitely. There is no asymp-
totic distribution in this case. For example, with X0 = 1 the sequence of probabilities
P (X0 = 0), P (X1 = 0), P (X2 = 0),. . . has the form 0 1 0 . . . , which has no limit.
Example 2.11 (Stationary distribution not unique). A Markov chain can have
many stationary distributions. Take, for example, the transition matrix
 
P = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \end{pmatrix},
which defines a Markov chain with states E = {0, 1, 2}. Since
(0\; 1\; 0)\, P = (0\; 1\; 0) \quad and \quad (1/2\; 0\; 1/2)\, P = (1/2\; 0\; 1/2)
we conclude that both vectors (0 1 0) and (1/2 0 1/2) are stationary. But these
are not all. Indeed, if µ and ν are two stationary distributions then any linear
combination (1 − θ)µ + θν, where 0 ≤ θ ≤ 1 is a real number, is another stationary
distribution. Therefore, in this example, all vectors πθ = (θ/2 1−θ θ/2), where the
parameter θ is any number in the interval [0, 1], qualify as stationary distributions.

To obtain Markov chains in equilibrium and understand better their stationary


distributions, we restrict to chains that are genuinely random in the sense that it is

possible in a sequence of jumps to reach any state from any other state. This prop-
erty is called irreducibility. We also need to exclude chains with periodic behavior.

Definition 2.12. A Markov chain is said to be


• irreducible, if P (Xn = j|X0 = i) > 0 for some n ≥ 0, all i, j ∈ E;
• aperiodic, if the greatest common divisor of the set {n : P (Xn = i|X0 = i) > 0}
equals one for each i.

We observe that if a chain is irreducible and one can find a state i with pii > 0,
in other words find a positive element on the diagonal of the transition probability
matrix, then the chain is also aperiodic. The Markov chain in Example 2.11 is not
irreducible. In fact, the chain will never reach state 0 (or state 2) starting from state
1. Example 2.10, case a = b = 1, is not aperiodic. In fact, the greatest common
divisor of the set {n : P (Xn = i|X0 = i) > 0} = {2, 4, 6, . . . } is 2, and the chain is
said to be periodic with period 2.
A full account of limit distributions involves additional theory, not covered here,
for recurrence vs. transience of Markov chains. The following useful criteria, how-
ever, may be rigorously established with the tools we have already introduced. The
condition of the theorem implies that the chain is irreducible. See Fristedt, Jain,
Krylov [8], Theorem 2.33, for an accessible proof.

Theorem 2.13. Consider a Markov chain {Xn } defined on a finite state space.
Assume that there exists an integer m ≥ 1 such that all entries of the matrix Pm
are strictly positive, that is
P (Xm = j|X0 = i) > 0 for all i, j ∈ E.
Then there is a unique stationary distribution which is also asymptotic.

This result can be strengthened as follows. If there is an m ≥ 1, such that all entries
in some column of the matrix Pm are strictly positive, then the conclusion of the
theorem remains valid. The next result restricts to chains which are irreducible and
aperiodic but is more general in the sense that it covers Markov chains with infinite
state space.

Theorem 2.14 (Equilibrium distribution for Markov chains). Consider a Markov


chain {Xn , n ≥ 0} with state space E and suppose that the chain is irreducible and
aperiodic.
1) If the state space E is finite, then a unique stationary distribution π = (π0 . . . πr ),
πj > 0, r = |E| − 1, exists, which is also the limit distribution.
2) If the state space is infinite, E = {0, 1 . . . }, then, if a stationary distribution
π = (π0 π1 . . . ) exists, this distribution is unique, πj > 0 for all j, and π is
also the limit distribution of {Xn }.

Example 2.15 (Urn model). The following is a special case of the so called
Bernoulli-Laplace’s diffusion model. At time 0 there are two urns with two black
balls in urn 1 and two white balls in urn 2. At each time n = 1, 2 . . . , a randomly

chosen ball is removed from each urn and put back into the other. Let Xn be the
number of white balls in urn 1 at time n, n ≥ 0. This is a Markov chain with states
{0, 1, 2} and transition probability matrix
 
P = \begin{pmatrix} 0 & 1 & 0 \\ 1/4 & 1/2 & 1/4 \\ 0 & 1 & 0 \end{pmatrix}
from which it is seen that this Markov chain is aperiodic and irreducible making
Theorem 2.14 applicable. As an alternative we may check that all nine elements of
 
P^2 = \begin{pmatrix} 1/4 & 1/2 & 1/4 \\ 1/8 & 3/4 & 1/8 \\ 1/4 & 1/2 & 1/4 \end{pmatrix}
are positive and apply Theorem 2.13. Or rely on the remark following Theorem
2.13 and just note that all three elements of the middle column of P itself are
positive. In either case it is straightforward to find the limit distribution as the
unique probability solution π = (π1 , π2 , π3 ) with π1 + π2 + π3 = 1 of the stationarity
equation π = πP, that is
\pi_1 = \pi_2/4, \qquad \pi_2 = \pi_1 + \pi_2/2 + \pi_3, \qquad \pi_3 = \pi_2/4,
which yields π = (1/6, 4/6, 1/6).


Example 2.16 (Google Markov chain). The web search engine system Google
involves a particular Markov chain, which is used to decide the order in which
search results are displayed. Let W denote the collection of all web pages that can
be reached by Google. They are stored in a huge database of size n = |W |, of the
magnitude of several billion pages. Regularly Google updates its assignment of so
called PageRank to each of the n pages. The PageRank is a measure of a given
page’s popularity, measured by the degree to which other pages link to the given
one. To each query submitted to the search engine, Google then finds those pages
that match the query and displays the resulting pages in order of their PageRank. In
practice, Google applies additional content based measurements of each web page
before presenting the final list of matching pages. The end result is based on a
weighted score obtained by a combination of popularity and content based ratings.
However, the fundamental principle lies in the self-regulating mechanism of the
popularity index PageRank.
To understand the idea behind PageRank, let G = (gij ) be the connectivity
matrix of the web W . This means that G is an n × n matrix with gij = 1 if there is
a hyperlink from page i to page j and gij = 0 otherwise. The row sums are
c_i = \sum_{j=1}^{n} g_{ij} = the number of outgoing links from page i.

A web page that contain no out links has ci = 0 and is called a dangling node. Now
let H = (hij ) be the n × n matrix with entries

h_{ij} = \begin{cases} g_{ij}/c_i, & \text{if } c_i \ge 1 \\ 1/n, & \text{if } i \text{ is a dangling node.} \end{cases}

Note that the elements in each row of H are nonnegative and sum to one. Hence H
may be viewed as the transition probability matrix of a discrete time Markov chain
with state space W . To visualize this Markov chain, imagine a web surfer that
jumps from one page to another by clicking at each step a link chosen uniformly among all
available out links. In case the web surfer encounters a dangling node it will move
to an arbitrary node chosen uniformly among all possible web pages.
The Google Markov chain is obtained by applying an additional parameter p and introducing the transition probability matrix
P = pH + (1 - p)\, \frac{1}{n}\, \mathbf{1}_n
where 1n is the n×n matrix with all entries equal to one. The corresponding Markov
chain jumps between the web pages in W as follows. With probability p the chain
either follows one of the outgoing links randomly with equal probabilities or, if the
current page has no outbound link, jumps to a randomly chosen page among all in
W . With probability 1 − p the chain moves to a randomly chosen page among all
the n pages, regardless of whether the current node is dangling or not. The mechanism of letting the chain move freely with probability 1 − p is thought to prevent the web
surfer from getting stuck for longer periods of time in isolated and perhaps less
representative regions of the web graph. It is known that Google originally used
p = 0.85.
The Google Markov chain is finite, irreducible and aperiodic. Hence there exists
a unique stationary distribution π = (π1 , . . . , πn ). Now write the probabilities πk
in decreasing order r1 ≥ r2 ≥ · · · ≥ rn , where r1 is the largest of all the πk s, r2 is
the second largest, and so on. Thus, r1 is the stationary probability for the “best”
(in Google’s sense) web page, which is therefore assigned PageRank one. The page
corresponding to the second largest stationary probability gets ranked second, and
so on.

Example 2.17 (Numerical solutions). Suppose P is a d × d transition matrix


for a discrete time Markov chain with d states. We want to find a probability vector
solution π of the system of equations π = πP, which may be written equivalently
π(P − I) = 0, where I is the identity matrix of order d and 0 in this equation is a
row vector of length d. Let Z denote a d × d matrix with all elements in the first
column equal to 1 and all remaining elements the same as those in P − I. Moreover,
let u be the length d vector u = (1, 0 . . . 0). This allows us to consider π as the
solution of πZ = u, which takes into account the constraint π1 + · · · + πd = 1 and
uses the fact that the original, overdetermined, system had d + 1 equations but
only d unknown quantities. It can be shown that under conditions such that the
equilibrium distribution of the Markov chain is well-defined, then Z is invertible.
Thus, to find π we only need to compute the inverse matrix Z−1 and put π = uZ−1 .
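In MATLAB the recipe takes a few lines; a minimal sketch, using the chain from Example 2.20:
% Sketch: stationary distribution via the modified system pi*Z = u
P = [1/2 1/3 1/6; 1/4 1/2 1/4; 2/3 1/6 1/6];
d = size(P,1);
Z = P - eye(d);
Z(:,1) = ones(d,1);      % first column replaced by ones (normalization)
u = [1 zeros(1,d-1)];
ppi = u / Z              % solves ppi*Z = u, giving (27/61, 22/61, 12/61)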

2.3. First return and first passage times


We begin this section with an alternative interpretation of asymptotic distributions
in terms of first return times.

Definition 2.18. For a Markov chain {Xn } the first passage times Tij , i, j ∈ E are defined as
Tij = the time of the first visit in j given X0 = i.
In particular, the first return to a state i after once leaving i is
Tii = the time of the first return to i given X0 = i.
(A transition from i to i counts as a return.) Moreover, a state r ∈ E such that
prr = 1 is called absorbing. In this case,
Tir = absorption time in r given X0 = i
and the Markov chain is absorbed and remains in r from this time onwards.

Theorem 2.19. If a Markov chain in discrete time {Xn , n ≥ 0} has a limit distri-
bution {πk }, then
E(T_{ii}) = \frac{1}{\pi_i}.

Example 2.20 (Mean return time). A Markov chain has transition probability
matrix
P = \begin{pmatrix} 1/2 & 1/3 & 1/6 \\ 1/4 & 1/2 & 1/4 \\ 2/3 & 1/6 & 1/6 \end{pmatrix}
To find the mean recurrence times for the three states we observe by using Theorem
2.13 or Theorem 2.14 that this finite Markov chain has an asymptotic distribution
such that for any i = 1, 2, 3,
P (Xn = j|X0 = i) → πj , j = 1, 2, 3,
where π = (27/61, 22/61, 12/61) is the unique solution of π = πP with π1 +π2 +π3 =
1. By Theorem 2.19 the limits can be expressed in terms of the mean recurrence
times, πj = 1/νj , where νj = E(Tjj ) and Tjj is the recurrence time to state j. Hence
ν1 = 61/27 ≈ 2.26, ν2 = 61/22 ≈ 2.77, ν3 = 61/12 ≈ 5.08.

The principle of conditioning on the first event. While Theorem 2.19 is of


theoretical interest and sometimes useful for calculations, a more flexible method is
based on probabilistic conditioning. We consider a Markov chain {Xn } with finite
state space E = {0, 1 . . . , r} and transition probabilities {pij }, and introduce the
expected passage times
mij = E(Tij ), i, j ∈ E.
To find these expected values we use the simple but powerful idea of conditioning
on the possible outcomes of the first jump of the Markov chain. It follows that the
quantities {mij } must satisfy the equations
mij = pi0 (1 + m0j ) + · · · + pij · 1 + · · · + pir (1 + mrj ).
Indeed, if the first jump results in a visit to k ≠ j, then in addition to the time step
just used up the expected remaining time until the first visit in j equals mkj . On
the other hand, with probability pij the first visit to j occurs already after one time

unit. The resulting system of equations is linear and hence it is relatively easy to
find the expected first passage times.
Example 2.21 (Expected time to reach a particular state). It is instructive
to continue the previous Example 2.20 with three states E = {1, 2, 3}. If we fix
j = 1 the conditioning principle yields the three equations
m_{11} = p_{11} \cdot 1 + p_{12}(1 + m_{21}) + p_{13}(1 + m_{31}) = 1 + \tfrac{1}{3} m_{21} + \tfrac{1}{6} m_{31}
m_{21} = p_{21} \cdot 1 + p_{22}(1 + m_{21}) + p_{23}(1 + m_{31}) = 1 + \tfrac{1}{2} m_{21} + \tfrac{1}{4} m_{31}
m_{31} = p_{31} \cdot 1 + p_{32}(1 + m_{21}) + p_{33}(1 + m_{31}) = 1 + \tfrac{1}{6} m_{21} + \tfrac{1}{6} m_{31}

The equations are readily solved giving m11 = 61/27, m21 = 26/9, m31 = 16/9,
where obviously the numerical value of m11 = ν1 must coincide with the one obtained
in Example 2.20 using a completely different argument. The expected recurrence
times mij for j = 2 and j = 3 are obtained analogously.
Example 2.22 (Mean absorption time). Suppose E = {0, 1, 2} and
 
P = \begin{pmatrix} 1/2 & 1/3 & 1/6 \\ 1/4 & 1/2 & 1/4 \\ 0 & 0 & 1 \end{pmatrix}
and so, state 2 is absorbing. Put
m0 = E(T02 ) = mean absorption time starting in 0
m1 = E(T12 ) = mean absorption time starting in 1
By conditioning on the first jump,
m_0 = \tfrac{1}{2}(1 + m_0) + \tfrac{1}{3}(1 + m_1) + \tfrac{1}{6} \cdot 1 = 1 + \tfrac{1}{2} m_0 + \tfrac{1}{3} m_1
m_1 = \tfrac{1}{4}(1 + m_0) + \tfrac{1}{2}(1 + m_1) + \tfrac{1}{4} \cdot 1 = 1 + \tfrac{1}{4} m_0 + \tfrac{1}{2} m_1,
from which we obtain the solution m0 = 5, m1 = 9/2.
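Linear systems of this kind are conveniently solved numerically. A minimal MATLAB sketch for Example 2.22, writing the equations as (I − A)m = 1, where A contains the transition probabilities among the non-absorbing states:
% Sketch: mean absorption times for Example 2.22
A = [1/2 1/3; 1/4 1/2];       % transitions among non-absorbing states 0 and 1
m = (eye(2) - A) \ ones(2,1)  % solves (I - A)m = 1, giving m = (5, 4.5)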

Solved exercises
1. All nonzero transition probabilities of a Markov chain with states E = {0, 1, 2, 3}
are indicated by directed edges in the following transition graph, where some
but not all probabilities are written out. Find the transition probability matrix.
[Transition graph: nodes 0, 1, 2, 3; the probabilities written out are p01 = 0.1, p10 = 0.3, p11 = 0.4, p20 = 0.2, p22 = 0.3 and p32 = 0.4.]

Solution. In the graph we may directly read off the matrix elements
 
P = \begin{pmatrix} 0 & 0.1 & \ast & 0 \\ 0.3 & 0.4 & 0 & \ast \\ 0.2 & 0 & 0.3 & \ast \\ 0 & \ast & 0.4 & 0 \end{pmatrix}
Then add the remaining elements such that each row sum equals one, to get
P = \begin{pmatrix} 0 & 0.1 & 0.9 & 0 \\ 0.3 & 0.4 & 0 & 0.3 \\ 0.2 & 0 & 0.3 & 0.5 \\ 0 & 0.6 & 0.4 & 0 \end{pmatrix}
2. A Markov chain with state space E = {0, 1, . . . , 6} has the transition matrix
 
P = \begin{pmatrix}
0 & 1/2 & 0 & 1/3 & 0 & 0 & 1/6 \\
3/4 & 0 & 0 & 0 & 1/4 & 0 & 0 \\
0 & 0 & 0 & 2/3 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1/4 & 0 & 0 & 3/4 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}
Draw a state transition diagram.
Solution. Guided by the position of nonzero elements in P place the nodes of
the graph to help visualize the transitions, for example as done in Figure 2.


Figure 2. State transition diagram

3. A Markov chain {Xn , n ≥ 0} with state space E = {1, 2, 3, 4} has transition


matrix  
P = \begin{pmatrix} 0.3 & 0 & 0 & 0.7 \\ 0.2 & 0 & 0.8 & 0 \\ 0 & 0.6 & 0 & 0.4 \\ 1 & 0 & 0 & 0 \end{pmatrix}
Find the probabilities P (X5 = 1|X3 = 1) and P (X5 = 1|X3 = 3).

Solution. The relation p(5) = p(3) P2 , with


 
P^2 = \begin{pmatrix} 0.79 & 0 & 0 & 0.21 \\ 0.06 & 0.48 & 0 & 0.46 \\ 0.52 & 0 & 0.48 & 0 \\ 0.3 & 0 & 0 & 0.7 \end{pmatrix}
yields P (X5 = 1|X3 = 1) = 0.79 and P (X5 = 1|X3 = 3) = 0.52.
4. On a road where traffic consists of cars and trucks, 3 out of every 4 trucks are
followed by a car, while only 1 out of every 5 cars is followed by a truck. What
fraction of vehicles on the road are trucks?
Solution. Thinking of unidirectional road traffic as a sequence of ordered vehicles,
either cars or trucks, we have a description of the average frequencies of a change
of the next vehicle in line from car to truck or from truck to car. The frequencies
are naturally interpreted as transition probabilities of a Markov chain {Xn } with
state space E = {car, truck} and with Xn being the type of the nth vehicle in
line starting from an initial vehicle at time n = 0. The transition probability
matrix becomes  
P = \begin{pmatrix} 4/5 & 1/5 \\ 3/4 & 1/4 \end{pmatrix}
By the theory of equilibrium distributions for finite Markov chains, this chain
has a unique steady-state distribution π = (πcar πtruck ) that solves π = πP, which
gives πcar = 1 − πtruck = 15/19. Hence the fraction of trucks on the road is 4/19,
or approximately 21%.
5. A load indicator is measured once per minute and the load is classified as either
normal or high. At high load a control system starts with the purpose of reducing
the load. The sequence of classifications can be described as a time discrete
Markov chain. The probability that a normal measurement is followed by a
high is 0.10 and the probability that a high is followed by a normal is 0.95.
a) Find the probability that high load is registered during an arbitrary mea-
surement.
b) Find the expected number of minutes during an hour where a high value is observed and the control system is effective, i.e., the next measurement is normal.
Solution. We recognize a time discrete Markov chain, (Xn )n≥0 , with two states
N and H and transition probability matrix
 
P = \begin{pmatrix} 0.90 & 0.10 \\ 0.95 & 0.05 \end{pmatrix},
which is clearly irreducible.
The probability required for a), that high load is registered during an arbi-
trary measurement, is obtained from the stationary distribution of the Markov
chain, π, which is given by the solution of the balance equations π = πP with
\sum_i \pi_i = 1. The solution is π = (19/21, 2/21) ≈ (0.905, 0.095), and the probability for a high value is therefore π_H ≈ 0.095.
To answer b) we note that the probability that a high value is followed by a
normal is given by P (Xn = H, Xn+1 = N ) = P (Xn+1 = N |Xn = H)P (Xn = H),

which in equilibrium gives P (Xn+1 = N |Xn = H)πH = 0.95 · 0.095. During one
hour one would expect approximately 60 · 0.95 · 0.095 ≈ 5.4 such measurements.
6. Let
P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}
be the transition probability matrix for a Markov chain. The simulation in
Figure 3 shows a typical path of the process.
a) Is the chain irreducible? Is it aperiodic?
b) Find all stationary distributions.
c) Consider the chain restricted to the even time points 0, 2, 4, . . . . Find the
corresponding transition matrix. Is this new chain irreducible? Is it aperi-
odic? Find all stationary distributions.


Figure 3. The Markov chain in Exercise 6

Solution. Since it is possible to visit the states one by one and return to the
initial state following the cycle 1 → 2 → 3 → 4 → 1, the Markov chain is
irreducible. Returns to state {1} can only occur at times 2, 4, 6, . . . , similarly
for the other states, hence the chain is periodic with period 2. Because of this
Theorem 2.14 is not applicable but we can still find the stationary distributions by
solving the equation π = πP. In this case there is only one solution, which is
thus the unique stationary distribution π = (1/3, 1/3, 1/6, 1/6).
For the even numbered dynamics of c), the transition matrix becomes
 
Q = P^2 = \begin{pmatrix} 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}
This chain is not irreducible since {1, 3} and {2, 4} form irreducible sub-chains with transition matrix R = \begin{pmatrix} 1/2 & 1/2 \\ 1 & 0 \end{pmatrix}. The new Markov chain is aperiodic
since states 1 and 2 can be revisited in only one step. For R the vector (2/3, 1/3)
is a stationary distribution, and from this it follows that the stationary distribu-
tions for Q are of the form (2c/3, 2(1 − c)/3, c/3, (1 − c)/3), where 0 ≤ c ≤ 1.

7. A simple model for DNA is that the sequence is a string of symbols, {Xn }, where
each new nucleotide is independently chosen from {A, C, G, T } with probabilities
0.25, 0.30, 0.15, 0.30, respectively. Suppose that there is an enzyme that breaks
up the sequence as soon as the “word” AC appears in the sequence. To study
the effect of the enzyme, let {Yn } be a process with the three states {A, AC, B},
so that Yn = A if Xn = A, Yn = AC if Xn−1 = A and Xn = C, and Yn = B
otherwise.
(a) Motivate that {Yn } is a Markov chain, and determine its transition matrix.
(b) Compute the asymptotic distribution for {Yn }.
(c) What is the average length of an AC-fragment (in equilibrium)?
Solution. a) Each new symbol Xn+1 is independent of the previous ones X1 , . . . , Xn
in the sequence, and hence {Xn } is, trivially, a Markov chain. The transition
probability matrix is
 
pA pC pT pG
 pA pC pT pG 
Pold = 
 pA pC pT pG 

pA pC pT pG
where pA = 0.25, pC = 0.30, pT = 0.30, pG = 0.15. Next, we consider the
sequence Y1 , Y2 , . . . . If, for some n, Yn equals A then Yn+1 will be again A with
probability pA , AC with probability pC , and B with the remaining probability
pT + pG . Similarly, If Yn is AC or B we obtain the distribution of Yn+1 , without
using any knowledge of Y1 , . . . Yn−1 . Hence {Yn } is a Markov chain, and the
transition matrix is
   
P_{new} = \begin{pmatrix} p_A & p_C & p_T + p_G \\ p_A & 0 & 1 - p_A \\ p_A & 0 & 1 - p_A \end{pmatrix} = \begin{pmatrix} 0.25 & 0.30 & 0.45 \\ 0.25 & 0 & 0.75 \\ 0.25 & 0 & 0.75 \end{pmatrix}
b) Since {Yn } is irreducible and aperiodic the asymptotic distribution is given
by the stationary distribution π = (1/4, 3/40, 27/40), which is obtained as the
solution to π = πPnew .
c) The expected number of nucleotides from one instance of AC to the next
is the “expected return time”, given by the ratio 1/π2 = 40/3 ≈ 13.3 (Theorem
2.19).
8. The following paintball duelling problem has been given as an assignment within
the course Programming for Engineering students at UU, to be studied as a simu-
lation exercise. Here, we find exact solutions by using Markov chain techniques.
Agnes, Beata and Cecilia decides to settle a dispute by performing a paintball
shooting tournament. Cecilia is an excellent shooter who always hits her target.
Beata hits on average every second shot. Least skilled Agnes is known to have a
hitting probability of 0.3. Thus, it is agreed that the firing order will be Agnes
first, then Beata and last Cecilia. Each person is allowed to fire one shot and is
free to choose at whom among any remaining competitor. Any person hit by a
shot is out. Last person remaining is declared the winner. We assume that each
shot is independent of any previous shot.
a) A reasonable strategy for maximizing the chance of winning is to always
aim at the best remaining shooter, since that person is likely to be your

greatest threat. Assuming this strategy find the winning probabilities for
Agnes, Beata and Cecilia, respectively.
b) Agnes might consider the following alternative strategy. As long as both
Beata and Cecilia remain then Agnes fires in the air, deliberately missing
the target, with the motivation that none of Beata or Cecilia would aim
their first shot at her. Suppose Agnes adopts the alternative strategy with
everything else as before. Does this increase her chances to win? Find the
winning probabilities in this case.
Solution for a).
Agnes aims her first shot at Cecilia. If she misses, Beata is also going to
shoot at Cecilia. If Beata misses, she will be hit in the next round by Cecilia,
after which Agnes would take a shot at Cecilia. Thus, the only way for Cecilia
to win the competition is that Agnes misses her first shot, that Beata misses her
first shot, and that Agnes misses her second shot. Since these three events are
independent, the winning probability for Cecilia is 0.7 · 0.5 · 0.7 = 0.245.
Computing Agnes and Beatas winning probabilites is more involved. We may
represent the entire game by using a Markov chain with 10 states as follows:
ABC BAC CAB AB BA AC CA A B C
Here, the three states A, B and C represent Agnes, Beata or Cecilia winning
the game. A two-letter symbol such as BA means that Agnes and Beata remain
and the first person listed, Beata in this case, is about to shoot at Agnes. Three
letters such as BAC indicate that all three remain with the first listed, Beata,
to aim at the third listed, Cecilia, with the second listed, Agnes, safe for the
moment. By going through the various cases which may occur, we obtain the
transition graph of Figure 4. The initial state is ABC, and the states A, B and

Figure 4. Transition graph for paintball duel Markov chain

C are absorbing. We recognize Cecilia's winning probability by multiplying all transition probabilities along the branch in the tree leading from ABC to C. However, there are several paths leading from ABC to A or B, which is why we now apply the principle of conditioning on the first event.
Let qx denote the probability of Beata winning the game, conditional on
the event that the Markov chain is in state x. Thus, the winning probability for
Solved exercises 53

Beata is qABC . Now, considering the outcome of the first shot of the tournament,
qABC = 0.3 qBA + 0.7 qBAC .
Moreover, starting from BA we have qBA = 0.5 + 0.5qAB and starting from AB
yields qAB = 0.3 · 0 + 0.7qBA . The last two equations together imply qBA = 10/13
and qAB = 7/13. Furthermore, starting from the state BAC the only way for
Beata to win the game is to pass through AB. Hence qBAC = 0.5qAB = 7/26.
Therefore,
P(\text{Beata winning}) = \frac{3}{10} \cdot \frac{10}{13} + \frac{7}{10} \cdot \frac{7}{26} = \frac{109}{260} \approx 0.4192
It follows that the probability for Agnes to win is 1 − 109/260 − 49/200 = 873/2600 ≈ 0.3358.
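The exact values may be checked against a Monte Carlo simulation, in the spirit of the original programming assignment. A minimal MATLAB sketch of part a), where each shooter aims at the best remaining opponent:
% Sketch: Monte Carlo check of part a)
nsim = 1e5;
wins = zeros(1,3);             % win counts for [Agnes Beata Cecilia]
hit = [0.3 0.5 1.0];           % hitting probabilities
for s = 1:nsim
    alive = [true true true];
    k = 1;                     % shooter: 1 = Agnes, 2 = Beata, 3 = Cecilia
    while sum(alive) > 1
        if alive(k)
            target = find(alive & ((1:3) ~= k), 1, 'last');  % best remaining
            if rand < hit(k)
                alive(target) = false;
            end
        end
        k = mod(k,3) + 1;      % next shooter in the firing order
    end
    winner = find(alive);
    wins(winner) = wins(winner) + 1;
end
wins/nsim                      % compare with (0.3358, 0.4192, 0.2450)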
Chapter 3

Continuous time Markov chains

Much of the flexibility and versatility of Markov chain modeling come from the fact
that, just as we have useful theory and methods for chains which jump at discrete,
prespecified, time points, there is a parallel theory of Markov chains which carry
out the jumps at random time points on a continuous time scale. A Markov chain
{X(t), t ≥ 0} in continuous time is defined by a collection of transition intensities
{qij , i, j ∈ E}, such that
(3.1) P (X(t + h) = j|X(t) = i) = qij h + o(h), h → 0, i ≠ j.
Since these probabilities do not depend on t we are again considering the time
homogeneous case (c.f. discrete time). By comparison with the property of the
Poisson process in Proposition 1.26 b), it follows from (3.1) that the waiting time for
the chain to jump from state i to state j is an exponential random variable with
intensity qij . In fact, it is a consequence of the Markov property in continuous time
that all holding times between successive jumps must be exponentially distributed
with state dependent parameters. Moreover, the waiting time until the chain leaves
a state i is also exponential. To see this, observe that
P (jump occurs during (t, t + h]|X(t) = i) = (qi0 + qi1 + . . . )h + o(h), h → 0,
and denote
q_i = \sum_{j \ne i} q_{ij}.

If the sum qi converges then


qij = intensity for a jump from i to j
qi = intensity for a jump from i
and we have the following interpretation. If the chain is in state i at time t, X(t) = i,
it will remain in state i during an Exp(qi ) distributed holding time. At the end
of this exponential time of expected length 1/qi the waiting period elapses and the
Markov chain jumps to another state j, randomly chosen according to the conditional


probability distribution
P (chain jumps to j | chain leaves i) = qij /qi , j ≠ i.
The structure of the continuous time Markov chain is indicated in Figure 1.


Figure 1. Markov chain in continuous time

Definition 3.1. The infinitesimal generator of the continuous time Markov chain
{X(t)} is the matrix
 
Q = \begin{pmatrix}
-q_0 & q_{01} & q_{02} & \cdots \\
q_{10} & -q_1 & q_{12} & \cdots \\
q_{20} & q_{21} & -q_2 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
where it should be noted that by definition of the diagonal elements all row-sums are equal to zero, \sum_j q_{ij} = 0, i = 0, 1, \ldots.
Example 3.2 (Poisson process is Markov). It follows from Proposition 1.26
and the previous definition that the standard Poisson process is a continuous time
Markov chain with infinitesimal generator
 
Q = \begin{pmatrix}
-\lambda & \lambda & 0 & \cdots \\
0 & -\lambda & \lambda & \cdots \\
0 & 0 & -\lambda & \cdots \\
\vdots & \vdots & \ddots & \ddots
\end{pmatrix}

The definitions 2.7 and 2.8 given earlier of stationary and asymptotic distribu-
tions for discrete time Markov chains, have straightforward counterparts for Markov
chains in continuous time. Thus, if P (X(0) = k) = πk , k ∈ E, and {πk } is a station-
ary distribution for a Markov chain {X(t), t ≥ 0}, then P (X(t) = k) = πk , k ∈ E,
for all t ≥ 0. Moreover, if P (X(t) = k) → πk , k ∈ E, as t → ∞ regardless of the
distribution of X(0), then {πk } is said to be an asymptotic distribution. Again, as-
ymptotic distributions are always stationary. It is sometimes convenient to describe

an asymptotic distribution π by saying that the Markov process has a steady state
X∞ with πk = P (X∞ = k), k ∈ E.

Theorem 3.3 (Asymptotic distribution for finite Markov chains). Any


Markov chain in continuous time with finite state space E = {0, 1, . . . , r}, which is
irreducible, i.e. each state can be visited with positive probability starting from any
other state, has a unique stationary and asymptotic distribution π = (π0 . . . πr )
obtained as the solution of the system of equations πQ = 0, with the property
π0 + · · · + πr = 1.

Example 3.4 (Spin-flip Markov chain). Consider the two states E = {+, −}
and think of the plus state as spin-up and the minus state as spin-down. The
spin-flip Markov chain is defined by the infinitesimal generator matrix
 
Q = \begin{pmatrix} -\lambda & \lambda \\ \mu & -\mu \end{pmatrix}
This is a continuous time Markov chain {X(t), t ≥ 0} on E with the following
dynamical behavior. If the chain is spin-down at a fixed point in time then it will
stay there for an exponentially distributed random time with expected value 1/λ
and then flip to spin-up. If the chain is in state + at a fixed time it remains up for
an exponentially distributed random time with expected value 1/µ and then flips
to state −. Because of the memoryless property of the exponential distribution, the
above description is true regardless of which “fixed time” we choose. In conclusion
the trajectory of the chain consists of consecutive cycles where each cycle is composed
of a spin-down period and a subsequent spin-up. Applying Theorem 3.3 to this
example we rename the states E = {0, 1} and seek a solution π = (π_0, π_1) with π_0 + π_1 = 1 of the matrix equation πQ = 0, that is

$$-\lambda\pi_0 + \mu\pi_1 = 0, \qquad \lambda\pi_0 - \mu\pi_1 = 0, \qquad \pi_0 + \pi_1 = 1.$$
It follows that the two-state Markov chain does have a unique stationary distribution π_0 = µ/(λ + µ), π_1 = λ/(λ + µ). For an arbitrary initial distribution of X(0), the two-state chain has the asymptotic property

$$\lim_{t\to\infty} P(X(t) = 0) = \pi_0 = \frac{\mu}{\lambda+\mu}, \qquad \lim_{t\to\infty} P(X(t) = 1) = \pi_1 = \frac{\lambda}{\lambda+\mu}.$$
The conditioning technique of first-step analysis discussed in Section 2.3 for
discrete time Markov chains applies to continuous time systems as well. We give an
example.
Example 3.5 (Condition on first jump, continuous time). Consider the
Markov chain with states E = {1, 2, 3} and infinitesimal generator matrix

$$Q = \begin{pmatrix} -\lambda & \lambda & 0 \\ 2\mu & -(2\mu+\lambda) & \lambda \\ 0 & 3\mu & -3\mu \end{pmatrix}.$$

Let

$$v_{ij} = \text{expected time to reach } j \text{ starting from } i.$$
To demonstrate the method, let's fix j = 3 and focus on the expected times v_{13} and v_{23}. For each initial state, 1 and 2, by conditioning on the outcome of the first jump, we obtain the equations

$$v_{13} = \frac{1}{\lambda} + v_{23}, \qquad v_{23} = \frac{1}{2\mu+\lambda} + \frac{\lambda}{2\mu+\lambda}\cdot 0 + \frac{2\mu}{2\mu+\lambda}\, v_{13}$$

and hence

$$v_{13} = 2(\lambda+\mu)/\lambda^2, \qquad v_{23} = (\lambda+2\mu)/\lambda^2.$$
Similarly for other values of j.
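Since these first-jump equations are linear in the unknowns v_ij, they can also be solved mechanically: collecting the equations for all starting states i ≠ j gives the system Q̃v = −1, where Q̃ is the generator with the row and column of the target state deleted. A Python sketch of this, added here for illustration with assumed numeric values for λ and µ:

    import numpy as np

    lam, mu = 1.0, 2.0   # assumed parameter values for the example
    Q = np.array([[-lam, lam, 0.0],
                  [2*mu, -(2*mu + lam), lam],
                  [0.0, 3*mu, -3*mu]])

    # Expected times to reach state 3 (index 2): delete the target row and
    # column and solve Q_sub v = -1 over the remaining states.
    Q_sub = Q[:2, :2]
    v = np.linalg.solve(Q_sub, -np.ones(2))
    print(v)                                          # v_13 and v_23
    print(2*(lam + mu)/lam**2, (lam + 2*mu)/lam**2)   # closed forms from the text

Both printouts agree, as they should.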

For non-finite state space it is more complicated to give conditions under which
there exists an asymptotic distribution. We restrict this discussion to a particu-
lar class of processes of great importance for applications, namely birth-and-death
processes.

3.1. Birth-and-death processes


A continuous time Markov chain with states {0, 1, 2, . . . } and such that the only
transitions are either one step up (birth) or one step down (death) is called a birth-
and-death process. The conventional choice of notation is to introduce a collection
of birth intensities {λn } and death intensities {µn } by putting
$$\lambda_n = q_{n,n+1}, \quad n \ge 0, \qquad \mu_n = q_{n,n-1}, \quad n \ge 1.$$

The generator matrix becomes

$$Q = \begin{pmatrix} -\lambda_0 & \lambda_0 & 0 & 0 & \cdots \\ \mu_1 & -(\lambda_1+\mu_1) & \lambda_1 & 0 & \cdots \\ 0 & \mu_2 & -(\lambda_2+\mu_2) & \lambda_2 & \cdots \\ \vdots & \vdots & \vdots & \ddots & \ddots \end{pmatrix}$$
The birth-and-death process is irreducible on E = {0, 1, . . . } if the birth and
death intensities satisfy
λn > 0, n ≥ 0, µn > 0, n ≥ 1,
since this is enough to have a positive probability of reaching any state, regardless of
the current position. For finite state space, E = {0, . . . , r}, the chain is irreducible
if λn > 0, 0 ≤ n ≤ r − 1, and µn > 0, 1 ≤ n ≤ r.
A great advantage of birth-and-death processes compared to more general Markov
chains is that their stationary distributions, if they exist, can be written down in
a rather explicit form in terms of the birth and death intensities. To derive these
relations we note that the linear system of equations 0 = πQ simplifies in this case
to

$$0 = \mu_1\pi_1 - \lambda_0\pi_0$$
$$0 = \lambda_{n-1}\pi_{n-1} + \mu_{n+1}\pi_{n+1} - (\lambda_n+\mu_n)\pi_n, \quad n \ge 1.$$

By iterating the last of the two relations and at the final step using the first we obtain for arbitrary n ≥ 1,

$$\mu_{n+1}\pi_{n+1} - \lambda_n\pi_n = \mu_n\pi_n - \lambda_{n-1}\pi_{n-1} = \cdots = \mu_1\pi_1 - \lambda_0\pi_0 = 0.$$
We have thus proved, by induction, the balance equations for birth-and-death processes in the form

$$\mu_n\pi_n = \lambda_{n-1}\pi_{n-1}, \quad n \ge 1. \tag{3.2}$$

Solving in terms of π_0, this gives

$$\pi_n = \frac{\lambda_{n-1}}{\mu_n}\pi_{n-1} = \cdots = \frac{\lambda_{n-1}\cdots\lambda_0}{\mu_n\cdots\mu_1}\pi_0, \quad n \ge 1.$$

Then it remains to find π_0 such that these relations are consistent with the required normalization

$$\pi_0 + \sum_{k=1}^{\infty}\pi_k = \Big(1 + \sum_{k=1}^{\infty}\frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k}\Big)\pi_0 = 1.$$

Based on the above motivating derivations we can now state the central result for stationary distributions of birth-and-death processes. This result gives necessary and sufficient conditions for an asymptotic distribution to exist and, if it exists, an explicit product formula for the limiting equilibrium probabilities.
First, however, it is important to point out that in the case of an irreducible birth-and-death process on a finite state space E = {0, 1, . . . , r}, we have

$$\pi_n = \frac{\lambda_{n-1}\cdots\lambda_0}{\mu_n\cdots\mu_1}\pi_0, \quad 1 \le n \le r,$$

the normalization becomes $\sum_{k=0}^{r}\pi_k = 1$, and we can deduce already from Theorem 3.3 that these relations define a stationary and asymptotic distribution.

Theorem 3.6. Assume that an irreducible birth-and-death process {X(t), t ≥ 0} with states E = {0, 1, . . . } is governed by parameters {λ_n}, n ≥ 0, and {µ_n}, n ≥ 1, such that

$$\sum_{k=1}^{\infty}\frac{\mu_1\cdots\mu_k}{\lambda_1\cdots\lambda_k} = \infty \quad\text{and}\quad \sum_{k=1}^{\infty}\frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k} < \infty. \tag{3.3}$$

Then a unique stationary distribution exists,

$$\pi_k = \frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k}\pi_0, \qquad \pi_0 = \Big(1 + \sum_{k=1}^{\infty}\frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k}\Big)^{-1}, \tag{3.4}$$

which is also the limit distribution

$$\lim_{t\to\infty} P(X(t) = k \mid X(0)) = P(X_\infty = k) = \pi_k.$$

We give two examples. In the first example the state space is finite and the
theory is covered by Theorem 3.3. The second example has countable state space
and therefore requires Theorem 3.6.
Example 3.7 (Lost customers). Cars arrive at a village gas station according to
a Poisson process with an average arrival rate of 30 cars per hour. The gas station
is equipped with two pumps. On average a customer spends three minutes at one
of the pumps. The service times are assumed to be exponential and independent of
each other and of the customer arrival patterns. At the station there is only room for
one car at each of the pumps plus one more waiting. If all three spots are occupied
[Figure 2. Number of cars at a gas station: a simulated trajectory of X(t) over 2.5 hours.]

at the time of arrival of a fourth potential customer, the new car instead continues
to another gas station. We are interested in the proportion of potential customers
leaving the station due to the lack of waiting areas.
Let X(t) ∈ {0, 1, 2, 3}, t ≥ 0, denote the number of cars at the gas station, mod-
eled as an irreducible birth-and-death process with infinitesimal generator matrix

$$Q = \begin{pmatrix} -\lambda & \lambda & 0 & 0 \\ \mu & -(\lambda+\mu) & \lambda & 0 \\ 0 & 2\mu & -(\lambda+2\mu) & \lambda \\ 0 & 0 & 2\mu & -2\mu \end{pmatrix}.$$

The given information says that λ is 30 per hour and µ is 20 per hour. Figure 2 shows
the typical variation of X(t) over a period of 2.5 hours. It is natural to describe the
current number of cars at the station using the stationary distribution of the birth-
and-death process. According to the above discussion the asymptotic probabilities
satisfy

$$\pi_1 = \frac{\lambda}{\mu}\pi_0, \qquad \pi_2 = \frac{\lambda^2}{2\mu^2}\pi_0, \qquad \pi_3 = \frac{\lambda^3}{4\mu^3}\pi_0.$$
After normalization and insertion of the numeric values of the parameters, this gives the steady state solution π = (32, 48, 36, 27)/143. In particular, E(# cars) = 201/143 ≈ 1.406. Moreover, the probability of losing a customer becomes π_3 ≈ 0.189, which corresponds to a loss rate of λπ_3 ≈ 5.7 customers per hour.
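The numbers in this example are easy to reproduce with the product formula for the stationary distribution. A short Python check, added here as an illustration:

    import numpy as np

    lam, mu = 30.0, 20.0                  # arrival rate and service rate per pump
    birth = [lam, lam, lam]               # lambda_0, lambda_1, lambda_2
    death = [mu, 2*mu, 2*mu]              # mu_1, mu_2, mu_3

    # Product form pi_n = (lambda_0 ... lambda_{n-1}) / (mu_1 ... mu_n) * pi_0
    w = np.ones(4)
    for n in range(1, 4):
        w[n] = w[n-1] * birth[n-1] / death[n-1]
    pi = w / w.sum()

    print(pi)                                         # (32, 48, 36, 27)/143
    print("mean number of cars:", np.arange(4) @ pi)  # 201/143
    print("lost customers/hour:", lam * pi[3])        # about 5.7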
Example 3.8 (The service model M/M/1). The M/M/1 model is the birth
and death process {X(t), t ≥ 0} in continuous time with constant jump intensities
$$\lambda_n = \lambda, \quad n \ge 0, \qquad \mu_n = \mu, \quad n \ge 1.$$
We apply Theorem 3.6. Since

$$\sum_{k=1}^{\infty}(\mu/\lambda)^k = \infty \quad\text{and}\quad \sum_{k=1}^{\infty}(\lambda/\mu)^k < \infty \quad \text{if } \lambda < \mu,$$

condition (3.3) is fulfilled under the assumption λ/µ < 1. It is customary to introduce a new parameter ρ, called the traffic intensity, by setting

$$\rho = \lambda/\mu$$

and conclude, in the sub-critical regime ρ < 1, that (3.4) simplifies to

$$\pi_k = \rho^k \Big/ \sum_{k=0}^{\infty}\rho^k = \rho^k(1-\rho).$$

Hence if ρ < 1 the M/M/1 model has an asymptotic steady state given by the Ge(1 − ρ) distribution. We illustrate this fundamental example in Figure 3 which shows simulated traces of the process for the three cases with traffic intensity ρ equal to 0.9, 1.0 and 1.1, respectively.

[Figure 3. Sample paths of M/M/1, ρ = 0.9, 1.0, 1.1.]

The interpretation of the model is that of a Poisson input stream to a station


where each arrival must be processed one at a time by a single service unit. The
time between two arrivals is exponential with mean 1/λ and the service time is
exponential with mean 1/µ. Arrivals that encounter a busy server form a queue
and wait in line for their turn. The Markov process {X(t)} counts the number of
units currently in the system, either in the queue or in the service station. We have seen that in the case ρ < 1, there is a steady state X_∞ governed by the geometric distribution, in particular $E(X_\infty) = \sum_k k\pi_k = \rho/(1-\rho)$.
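A simulation such as the one behind Figure 3 only needs the jump mechanism of the chain: from state x the process jumps up with intensity λ and, if x > 0, down with intensity µ. The following minimal Python sketch is an added illustration with assumed parameter values; it also checks the long-run fraction of idle time against π_0 = 1 − ρ.

    import numpy as np

    def mm1_path(lam, mu, t_max, rng=np.random.default_rng(1)):
        """Simulate the M/M/1 queue length process up to time t_max."""
        t, x = 0.0, 0
        times, states = [0.0], [0]
        while t < t_max:
            rate = lam + (mu if x > 0 else 0.0)   # total jump intensity in state x
            t += rng.exponential(1.0 / rate)
            # up-jump with probability lam/rate, otherwise down-jump
            x += 1 if rng.random() < lam / rate else -1
            times.append(t)
            states.append(x)
        return np.array(times), np.array(states)

    times, states = mm1_path(0.9, 1.0, 50_000.0)       # rho = 0.9
    durations = np.diff(times)
    frac_idle = durations[states[:-1] == 0].sum() / times[-1]
    print(frac_idle)                                    # close to 1 - rho = 0.1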
3.2. Credit rating models


Credit rating is an evaluation of the ability of a corporation or state or city govern-
ment to fulfil its financial obligations. Credit ratings are carried out by specialized
credit agencies and used by investors and authorities to assess debtors and predict
credit risk. The agencies publish their opinions on creditworthiness using rating
scales, such as one consisting of the eight credit classes
E = {AAA, AA, A, BBB, BB, B, CCC, D},
ranging from AAA, which is the premium rating of strongest capacity to meet all
financial commitments, down to CCC, being vulnerable with only speculative in-
vestment value, and finally D for payment default.
Sovereign credit ratings published by the leading credit agencies have achieved
widespread acceptance as expert opinion of a particular country’s credit standing.
Standard & Poor’s current rating (January 2015) lists 127 countries and is summa-
rized in Figure 4, which shows the number of countries currently rated in each class
and leaving the remaining countries “Not Rated” corresponding to class D. Changes
may occur at any time, such as Finland losing AAA-status in October 2014. The
visualized data suggest that the rating is described by a probability distribution over
the rating classes and that stochastic modeling could be used to understand some
aspects of credit rating. To test this idea we will use the following table, published
[Figure 4. Number of countries in each rating class, Standard & Poor's sovereign ratings, January 2015.]

by Standard & Poors, Sept. 2011, which uses credit rating data 1990-2010 for a
large number of global corporations and shows average one-year rates for transitions
between the credit classes.
AAA AA A BBB BB B CCC D
AAA 0.8791 0.0808 0.0054 0.0005 0.0008 0.0003 0.0005 0
AA 0.0057 0.8645 0.0819 0.0053 0.0006 0.0008 0.0002 0.0002
A 0.0004 0.0190 0.8730 0.0537 0.0038 0.0017 0.0002 0.0008
BBB 0.0001 0.0013 0.0371 0.8456 0.0399 0.0066 0.0015 0.0025
BB 0.0002 0.0004 0.0017 0.0522 0.7568 0.0733 0.0076 0.0095
B 0 0.0004 0.0014 0.0023 0.0549 0.7318 0.0449 0.0472
CCC 0 0 0.0019 0.0028 0.0083 0.1300 0.4380 0.2743
For example, close to 88% of all corporations rated AAA on January 1 of a given year managed to keep their top rating one year later. Of the A-rated companies, 2% were upgraded to AA and 5% downgraded to BBB, and so on.
The credit rate data resembles closely the way we interpret transitions of Markov
chains, except that the probabilities in each row of the table do not sum to one. This
is because the agency for some reason was unable to estimate new ratings for the
remaining percentage of corporations in each class. We will take the simplest ap-
proach in dealing with this issue and essentially ignore the missing data. To proceed
we now implement the Markov property by assuming that the history of a corpo-
ration does not affect its future prospect, only the current standing. For example,
an A-corporation recently downgraded from AA is an equally risky investment as a
corporation graded A for a longer time period.
Now we follow the approach in [16], which applies a continuous time Markov chain with the objective of estimating corporate lifespan based on the credit rating
data. The corresponding infinitesimal generator matrix is
 
$$Q = \begin{pmatrix}
-0.0883 & 0.0835 & 0.0056 & 0.0005 & 0.0008 & 0.0003 & 0.0005 & 0 \\
0.0059 & -0.0947 & 0.0854 & 0.0055 & 0.0006 & 0.0008 & 0.0002 & 0.0002 \\
0.0004 & 0.0199 & -0.0796 & 0.0564 & 0.0040 & 0.0018 & 0.0002 & 0.0008 \\
0.0001 & 0.0014 & 0.0397 & -0.0890 & 0.0427 & 0.0071 & 0.0016 & 0.0027 \\
0.0002 & 0.0004 & 0.0019 & 0.0579 & -0.1449 & 0.0813 & 0.0084 & 0.0105 \\
0 & 0.0005 & 0.0016 & 0.0026 & 0.0622 & -0.1511 & 0.0509 & 0.0535 \\
0 & 0 & 0.0022 & 0.0033 & 0.0097 & 0.1520 & -0.4173 & 0.3207 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}$$

where the last row of zeros represents the absorbing default state D. Following the
method in Example 3.5 we can now find the expected values $m_i = E(T_{iD})$, the expected time to default given a corporation belongs to class i. The numerical
calculation is done in [16] with the resulting mean corporate lifespan ranging from
105.5 years for an AAA-company to about 15 years for CCC.
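The linear-system version of this computation is short enough to write out. In the Python sketch below, added for illustration, the first-jump equations $\sum_j q_{ij}m_j = -1$, i ≠ D, are solved with the generator above restricted to the seven non-default classes; up to rounding in the published matrix, the solution should come out close to the lifespans quoted from [16].

    import numpy as np

    # Generator from the text; state order AAA, AA, A, BBB, BB, B, CCC, D
    Q = np.array([
        [-0.0883, 0.0835, 0.0056, 0.0005, 0.0008, 0.0003, 0.0005, 0.0],
        [0.0059, -0.0947, 0.0854, 0.0055, 0.0006, 0.0008, 0.0002, 0.0002],
        [0.0004, 0.0199, -0.0796, 0.0564, 0.0040, 0.0018, 0.0002, 0.0008],
        [0.0001, 0.0014, 0.0397, -0.0890, 0.0427, 0.0071, 0.0016, 0.0027],
        [0.0002, 0.0004, 0.0019, 0.0579, -0.1449, 0.0813, 0.0084, 0.0105],
        [0.0, 0.0005, 0.0016, 0.0026, 0.0622, -0.1511, 0.0509, 0.0535],
        [0.0, 0.0, 0.0022, 0.0033, 0.0097, 0.1520, -0.4173, 0.3207],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

    # Expected times to absorption in D: solve over the transient states
    m = np.linalg.solve(Q[:7, :7], -np.ones(7))
    print(np.round(m, 1))   # mean corporate lifespan per rating class, in years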
Since the data reflect annually updated corporate ratings it is also natural to use
the discrete time model. Now we take the view that defaulted corporations have a
chance to regain creditworthiness, with a probability of 5% say. For simplicity, we
suppose this happens by a restructuring procedure leading to an immediate AAA
rating. By renormalizing the diagonal elements we obtain the probability transition
matrix
 
$$P = \begin{pmatrix}
0.9087 & 0.0835 & 0.0056 & 0.0005 & 0.0008 & 0.0003 & 0.0005 & 0 \\
0.0059 & 0.9013 & 0.0854 & 0.0055 & 0.0006 & 0.0008 & 0.0002 & 0.0002 \\
0.0004 & 0.0199 & 0.9164 & 0.0564 & 0.0040 & 0.0018 & 0.0002 & 0.0008 \\
0.0001 & 0.0014 & 0.0397 & 0.9048 & 0.0427 & 0.0071 & 0.0016 & 0.0027 \\
0.0002 & 0.0004 & 0.0019 & 0.0579 & 0.8393 & 0.0813 & 0.0084 & 0.0105 \\
0 & 0.0005 & 0.0016 & 0.0026 & 0.0622 & 0.8289 & 0.0509 & 0.0535 \\
0 & 0 & 0.0022 & 0.0033 & 0.0097 & 0.1520 & 0.5121 & 0.3207 \\
0.0500 & 0 & 0 & 0 & 0 & 0 & 0 & 0.9500
\end{pmatrix}$$

This defines a finite, irreducible and aperiodic Markov chain. By Theorem 2.14
there exists a unique asymptotic distribution π which solves π = πP . The solution
is displayed graphically in Figure 5.
[Figure 5. Steady-state distribution for global corporate rating, shown over the classes AAA-D.]
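The solution can be computed by replacing one of the redundant balance equations in π = πP with the normalization condition. A Python sketch, added for illustration; the rows of the printed matrix are first renormalized to absorb rounding, so the computed distribution is an approximation of the one displayed in Figure 5.

    import numpy as np

    # One-year transition matrix from the text; order AAA, AA, A, BBB, BB, B, CCC, D
    P = np.array([
        [0.9087, 0.0835, 0.0056, 0.0005, 0.0008, 0.0003, 0.0005, 0.0],
        [0.0059, 0.9013, 0.0854, 0.0055, 0.0006, 0.0008, 0.0002, 0.0002],
        [0.0004, 0.0199, 0.9164, 0.0564, 0.0040, 0.0018, 0.0002, 0.0008],
        [0.0001, 0.0014, 0.0397, 0.9048, 0.0427, 0.0071, 0.0016, 0.0027],
        [0.0002, 0.0004, 0.0019, 0.0579, 0.8393, 0.0813, 0.0084, 0.0105],
        [0.0, 0.0005, 0.0016, 0.0026, 0.0622, 0.8289, 0.0509, 0.0535],
        [0.0, 0.0, 0.0022, 0.0033, 0.0097, 0.1520, 0.5121, 0.3207],
        [0.0500, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9500]])
    P = P / P.sum(axis=1, keepdims=True)   # make each row sum exactly to one

    # Solve pi (P - I) = 0 with the last balance equation replaced
    # by the normalization sum(pi) = 1.
    n = len(P)
    A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
    b = np.zeros(n)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    print(np.round(pi, 4))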

Solved exercises
1. Find the asymptotic distribution of the birth-death process defined by the birth
and death intensities
λn = 2, 0 ≤ n ≤ 3, µn = 1, 1 ≤ n ≤ 4.
Determine the expected value and the variance of the asymptotic distribution.
Solution. We have a finite Markov chain with states {0, 1, 2, 3, 4}, which is irre-
ducible since all λn and µn are strictly positive. By Theorem 3.3, there exists a
unique stationary distribution π = (π0 , . . . , π4 ) which is also asymptotic. Here,
the probability vector π solves the system of equations πQ = 0 and we know for finite birth-death chains that the solution has the product form

$$\pi_k = \frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k}\pi_0 = 2^k\pi_0, \quad k = 1, 2, 3, 4.$$

By normalizing the solution, using the criterion π_0 + · · · + π_4 = 1, we obtain

$$\pi = \frac{(1, 2, 4, 8, 16)}{1+2+4+8+16} = \Big(\frac{1}{31}, \frac{2}{31}, \frac{4}{31}, \frac{8}{31}, \frac{16}{31}\Big).$$

The corresponding expected value is

$$m = \sum_{k=0}^{4} k\pi_k = \frac{0+2+8+24+64}{31} = 98/31 \approx 3.16$$

and the variance

$$\sigma^2 = \sum_{k=0}^{4} k^2\pi_k - m^2 = \frac{0+2+16+72+256}{31} - m^2 = \frac{1122}{31^2} \approx 1.17.$$
2. A continuous time Markov chain X defined on the state space E = {0, 1, 2, 3}
starts in state 0 and has the generator matrix
 
$$Q = \begin{pmatrix} -5 & 4 & 0 & 1 \\ 0 & -5 & 2 & 3 \\ 0 & 2 & -5 & 3 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$
Find the expected time to absorption in state 3.

Solution. As in Definition 2.18, we consider the absorption times

$$T_{i3} = \text{time to first visit in } 3, \text{ given } X(0) = i, \quad i = 0, 1, 2.$$
The task is to compute E(T03 ). In order to apply the principle of conditioning
on the first jump, we denote mi = E(Ti3 ). Starting in 0, first of all the process
remains in state 0 during a random time which is exponentially distributed.
The intensity parameter is q0 = 5 and hence the expected time in 0 is 1/5 time
units. Then the first jump occurs. With probability q01 /q0 = 4/5 the process
jumps to state 1, in which case the remaining time to absorption is m1 , and
with probability q03 /q0 = 1/5 it jumps to state 3, in which case we have reached
absorption. Summing up,

$$m_0 = \frac{1}{5} + \frac{4}{5}m_1 + \frac{1}{5}\cdot 0.$$
Similarly, by repeating the same arguments, but starting from 1 and 2 respectively,

$$m_1 = \frac{1}{5} + \frac{2}{5}m_2 + \frac{3}{5}\cdot 0$$
$$m_2 = \frac{1}{5} + \frac{2}{5}m_1 + \frac{3}{5}\cdot 0.$$
Simplifying, we have the three equations
5m0 = 1 + 4m1 , 5m1 = 1 + 2m2 , 5m2 = 1 + 2m1 .
By symmetry, m1 = m2 , and hence m1 = m2 = 1/3, and therefore m0 = 7/15.
3. The continuous time birth-death process {X(t)} defined by the birth and death
intensities
$$\lambda_n = \lambda, \quad n \ge 0, \qquad \mu_n = n\mu, \quad n \ge 1,$$
is known as the M/M/∞ service model. The service system interpretation of
this model is that a Poisson stream of customers arrive at a service unit and each
customer is given an exponential service time immediately upon arrival. The
birth parameter λ is the intensity of the Poisson process which counts customer
arrivals. The death parameter µ is the service intensity, in the sense that each
customer leaves after spending an exponential service time of mean 1/µ in the
system. Then the varying number of currently served customers in the system
is given by the Markov process {X(t)}. Show that the process has a limit
distribution for any values of the parameters λ and µ, and find the equilibrium
distribution.
Solution. We use Theorem 3.6. For the special choice of parameters in this model the first condition of the criterion (3.3) is satisfied, since

$$\sum_{k=1}^{\infty}\frac{\mu_1\cdots\mu_k}{\lambda_1\cdots\lambda_k} = \sum_{k=1}^{\infty}(\mu/\lambda)^k\, k! = \infty.$$

The second condition in (3.3) is also satisfied, since

$$\sum_{k=1}^{\infty}\frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k} = \sum_{k=1}^{\infty}\frac{(\lambda/\mu)^k}{k!} = \exp\{\lambda/\mu\} - 1 < \infty.$$
It follows that there exists a unique distribution {π_k}, stationary and asymptotic, given by (3.4) as

$$\pi_k = \frac{(\lambda/\mu)^k}{k!}\pi_0, \qquad \pi_0 = \exp\{-\lambda/\mu\},$$

which we recognize as the Poisson distribution with mean value λ/µ.
4. A birth-death process has the jump intensities λk = λ(N − k) and µk = µN k.
Here λ and µ are positive parameters, N is a positive integer, and k varies
between 0 and N . Show that there is an asymptotic distribution, which is
given by a particular standard distribution. What happens to the asymptotic
distribution as N → ∞?
Solution. The balance equations for this particular birth-death process give the result

$$\pi_k = \frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k}\pi_0 = \binom{N}{k}(\lambda/\mu N)^k\,\pi_0, \quad 0 \le k \le N.$$

Normalization as a probability distribution shows that $\pi_0 = 1/(1+\lambda/\mu N)^N$, hence

$$\pi_k = \binom{N}{k}\Big(\frac{\lambda/\mu N}{1+\lambda/\mu N}\Big)^k\Big(\frac{1}{1+\lambda/\mu N}\Big)^{N-k}.$$

We recognize the stationary distribution as a Bin(N, p) distribution with p = (λ/µN)/(1 + λ/µN). Since the process has finite state space and is irreducible, this distribution is also asymptotic.
A general result in probability theory says that a binomial distribution Bin(n, a/n) converges in distribution to the Poisson distribution Po(a) as n → ∞. In this case we obtain the Poisson distribution with mean λ/µ in the limit, since

$$N\cdot\frac{\lambda/\mu N}{1+\lambda/\mu N} = \frac{\lambda}{\mu+\lambda/N} \to \lambda/\mu.$$
Chapter 4

Some non-Markov models

The Markovian paradigm, that given the past up to now the future only depends on the present, is mathematically tractable and widely applicable. Nevertheless, the
Markov property singles out a particular class of models and excludes other natural
random mechanisms. In this direction we will discuss two types of non-Markov
extensions, renewal processes and stationary time series models.

4.1. Renewal models


The standard (delayed) renewal model is the following. Compare Figure 1.

Definition 4.1. Consider a sequence of independent, non-negative random vari-


ables U1 , U2 , . . . where U1 has distribution function F1 (t) = P (U1 ≤ t), t ≥ 0, and
U2 , U3 , . . . are identically distributed with common distribution function F (t) =
P (Ui ≤ t), t ≥ 0, i ≥ 2. Assume
$$E(U_1) < \infty, \qquad \nu = E(U_2) < \infty.$$

Let $T_n = \sum_{i=1}^{n} U_i$, n ≥ 1, T_0 = 0, denote the partial sums, and suppose that renewal events occur on the real line at times T_1, T_2, . . . . The renewal process associated with the i.i.d. sequence {U_i} is the counting process

$$N(t) = \sum_{n=1}^{\infty} \mathbf{1}_{\{T_n \le t\}}, \quad t \ge 0.$$

The special choice F1 (t) = F (t) = 1 − e−λt in Definition 4.1 is the case where all
variables U1 , U2 , . . . are exponentially distributed with expected value 1/λ. Then,
by Definition 1.24, N (t) is the Poisson process with intensity λ. This is the only
choice of F which makes the renewal process a Markov process, and hence we can
think of renewal processes as non-Markov generalizations of the Poisson counting
process.

[Figure 1. The standard renewal model: renewal events at times T_1, T_2, . . . separated by the inter-renewal times U_1, U_2, . . . , counted by N(t).]

Figure 1 is helpful in order to make two useful observations about the quantities
Tn and N (t) in Definition 4.1, namely

$$\{N(t) \ge n\} = \{T_n \le t\} \tag{4.1}$$

and

$$T_{N(t)} \le t < T_{N(t)+1}.$$

The latter ordering property shows that

$$\frac{1}{N(t)}\sum_{i=1}^{N(t)} U_i \;\le\; \frac{t}{N(t)} \;\le\; \frac{1}{N(t)+1}\sum_{i=1}^{N(t)+1} U_i \cdot \frac{N(t)+1}{N(t)} \quad \text{on the set } \{N(t) > 0\}. \tag{4.2}$$
The strong law of large numbers gives us $\frac{1}{n}\sum_{k=1}^{n} U_k \to \nu$ as n → ∞, in the sense of almost sure convergence. As t → ∞, the number of renewals also tends to infinity. Actually N(t) → ∞ passing through every integer value one by one. Thus

$$\frac{1}{N(t)}\sum_{k=1}^{N(t)} U_k \to \nu, \quad t \to \infty.$$

But then it follows from (4.2) that t/N (t) is asymptotically squeezed in between two
quantities, below and above, both converging to ν. This shows that N (t)/t → 1/ν
as t → ∞. More advanced methods, see e.g. [10], [21], lead to the corresponding
property for the mean number of renewals. This type of result is known as the
elementary renewal theorem.
Theorem 4.2 (The elementary renewal theorem). In the asymptotic limit as t tends to infinity, the number of renewals in a renewal process with mean inter-renewal time ν grows linearly with t and proportionally to 1/ν, in the sense

$$\frac{1}{t}N(t) \to \frac{1}{\nu} \quad \text{almost surely, as } t \to \infty. \tag{4.3}$$

Moreover,

$$\frac{1}{t}E(N(t)) \to \frac{1}{\nu} \quad \text{as } t \to \infty.$$

The stationary renewal process is obtained by choosing for F_1 the equilibrium distribution F_{eq}(t) associated with F(t), defined by

$$F_{eq}(t) = \frac{1}{\nu}\int_0^t (1 - F(s))\, ds. \tag{4.4}$$
With this choice of distribution for the time to the first renewal it can be shown that
E(N (t)) = t/ν, so that the asymptotic relation in the elementary renewal theorem
is in fact an identity for any fixed t > 0.
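The elementary renewal theorem is easy to observe numerically. The sketch below is an added illustration; the uniform inter-renewal distribution is an arbitrary choice with ν = 1.

    import numpy as np

    rng = np.random.default_rng(0)
    t = 100_000.0

    # Inter-renewal times uniform on [0, 2], so nu = E(U) = 1
    U = rng.uniform(0.0, 2.0, size=150_000)
    T = np.cumsum(U)                            # renewal times T_1, T_2, ...
    N_t = np.searchsorted(T, t, side="right")   # number of renewals T_n <= t
    print(N_t / t)                              # close to 1/nu = 1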

4.2. Renewal reward processes


The renewal reward theorem is an extension of the renewal theorem that may be
used to study the long time asymptotics in various models. The idea is to associate
to the renewal cycles not only their lengths Ui , i ≥ 1, but also a further sequence
of random variables, Ri , i ≥ 1, where Ri represents the reward accumulated during
cycle i. The total reward up to time t is given by
$$R(t) = \sum_{i=1}^{N(t)} R_i + \text{partial reward from interval } (T_{N(t)}, t].$$

The aim is to find the asymptotic mean reward in the sense of a time average, i.e.
the limit as t → ∞ of
$$\frac{R(t)}{t} = \frac{N(t)}{t}\cdot\frac{1}{N(t)}\sum_{i=1}^{N(t)} R_i + \text{fraction of partial reward}.$$
It is clear from this relation what to expect. The renewal theorem shows that
N (t)/t → 1/ν, and so it should follow from the strong law of large numbers that
R(t)/t → E(R)/ν as t → ∞. The typical assumptions which are imposed on the
rewards in order to guarantee the expected behavior are that for each j, Rj may
depend on U_j but is independent of all U_i, i ≠ j, and that {R_i} is an independent
sequence, identically distributed except possibly R1 which is allowed to have a differ-
ent distribution. Moreover, it is assumed that E|Ri | < ∞ for any i. It can be shown
that under these assumptions the details of assigning rewards to renewal events does
not affect the end result. It does not matter whether a reward is counted at the
beginning or at the end of a renewal interval, or if it is gradually allocated contin-
uously over time. In either case the partial rewards vanish asymptotically, and the
renewal reward theorem states that the time averaged total reward converges to the
cycle averaged reward.
Theorem 4.3 (The renewal-reward theorem). For a renewal reward process as above with mean inter-renewal time ν it holds that

$$\frac{R(t)}{t} \to \frac{E(R)}{\nu} \ \text{ almost surely} \qquad\text{and}\qquad \frac{E(R(t))}{t} \to \frac{E(R)}{\nu}, \quad \text{as } t \to \infty,$$

where E(R) is the common expected value of the rewards R_i, i ≥ 2.

For proofs and more general versions related to regenerative processes, see e.g.
Wolff [21].
Example 4.4 (Markov two-state process). Consider the two-state Markov pro-
cess in Example 3.4. Denote the successive on-period durations by S1 , S2 , . . . and
the off-period durations by T1 , T2 , . . . , so that the on and the off periods form i.i.d.
sequences with exponential distributions of parameter µ and λ, respectively. Put
Ui = Si + Ti , i ≥ 2 and define U1 to be S1 + T1 if X(0) = 1 and T1 otherwise. Let
N (t) be the renewal process associated with the sequence {Ui } and define rewards
Ri = Si for each renewal cycle i. Then N (t) counts the number of on-periods up
to time t and the corresponding renewal reward process R(t) gives the total dura-
tion of on-periods up to time t. By the renewal reward theorem the ratio R(t)/t,
which is the fraction of time that the two-state Markov chain spends in the on-state,
converges to the ratio of expected reward to expected cycle length, that is
$$\frac{R(t)}{t} \to \frac{E(R)}{E(S)+E(T)} = \frac{1/\mu}{1/\mu + 1/\lambda} = \frac{\lambda}{\lambda+\mu}.$$
This shows that the limit of the time average studied in this example is the same as
the asymptotic probability π1 in Example 3.4.
Example 4.5 (On-off process). The Markov property of the on-off process in the previous example is not used for the application of the renewal reward theorem. In fact, we may take arbitrary distributions with finite expected values
E(S) and E(T ) respectively, to model any system which goes through consecutive
on and off periods which satisfy the independence criteria of the renewal model.
By the renewal reward theorem we have the intuitively appealing result that the
asymptotic availability, meaning the asymptotic fraction of time during which the
system is on, is given by the ratio E(S)/(E(S) + E(T )).

4.3. Reliable data transfer


In this application we investigate the throughput of protocols for reliable data trans-
fer of the type Go-Back-N (GBN). The aim is to obtain under idealized conditions
expressions for the effective throughput as a function of the packet loss probability.
To solve this task we apply the renewal-reward theorem.
For the exact principles of reliable data transfer see the computer networks litera-
ture, e.g. Schwartz [19] and Kurose and Ross [15], Ch. 3.4. In this presentation it is
assumed that fixed size packets are transmitted from a sender to a receiver, subject
to constant delay times. Each packet delivered to the receiver is acknowledged by
the return of an ACK packet in the opposite direction and the arrival of the ACK at
the sender marks the end of a successful transmission round. A time unit is selected
by letting tR = 1 be the round-trip-time. In addition a time-out clock with expiry
time T0 is used to handle the loss of an ACK; according to certain rules expiry of
the clock at the sender results in retransmission of one or several unacknowledged
packets. Suppose that with probability p, either the packet or its ACK is lost during
transmission, hence unacknowledged at the sender. Suppose also that packet losses
are independent of each other.
We begin with the Go-Back-1 protocol. The sender transmits a single packet on
the channel. Then waits a round-trip-time for the corresponding ACK to arrive. If
the packet is successfully delivered the return of the ACK packet marks the start of
the next round. If the packet is lost so that no ACK packet arrives in the expected
time period, then the time-out clock is activated and the packet is retransmitted
with an additional delay of T0 round-trip-times.
The time periods between consecutive packet loss events will now form cycles
where each cycle consists of a random number of rounds. Let K denote the number
of such rounds in a cycle,

$$K = \#\text{ rounds until a packet loss}.$$

Clearly, K has the positive geometric distribution (first time distribution) $P(K = k) = (1-p)^{k-1}p$, $k \ge 1$. The random length of such a cycle is K + T_0 rounds and
during this time K − 1 packets are successfully delivered. Let Nt denote the number
of cycles up to time t. Then {Nt , t ≥ 0} is the renewal process associated with the
sequence of inter-renewal times (Ui ), Ui = Ki + T0 , i ≥ 1. The mean inter-renewal
time is therefore

$$\nu = E(U_1) = E(K_1) + T_0 = \frac{1}{p} + T_0.$$
To each cycle i we further associate the reward R_i = K_i − 1 with mean reward E(R_i) = (1 − p)/p. The throughput over time [0, t] can be written

$$\text{Throughput}(t) = \frac{1}{t}\sum_{i=1}^{N_t} R_i,$$

and so, by the renewal-reward theorem,

$$\text{Throughput}_{GB1} = \lim_{t\to\infty}\frac{1}{t}\sum_{i=1}^{N_t} R_i = \frac{E(R_1)}{\nu} = \frac{1-p}{1+pT_0},$$
measured in packets per round-trip-time.
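The formula is easily checked by Monte Carlo over renewal cycles. In the sketch below, added for illustration with assumed values of p and T_0, each cycle has length K + T_0 rounds and reward K − 1 delivered packets, and the long-run throughput is the ratio of accumulated reward to accumulated cycle length.

    import numpy as np

    rng = np.random.default_rng(0)
    p, T0 = 0.1, 4.0   # assumed loss probability and time-out, in round-trip-times

    print("formula:", (1 - p) / (1 + p * T0))

    # K has the positive geometric distribution P(K = k) = (1-p)^(k-1) p
    K = rng.geometric(p, size=200_000)
    print("simulated:", (K - 1).sum() / (K + T0).sum())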


We discuss briefly the extension of this model to the more general Go-Back-N
protocol. It is assumed that the packet size is small compared to the round-trip-time
and a parameter N is introduced, which gives the maximum number of packets that
the sender is allowed to send without waiting for acknowledgments. A successful
round consists of transmitting N packets, each packet subject to loss independently
and with the same probability p. In case one or several of the N packets are lost all
the N packets in the same round have to be retransmitted. The throughput can be
calculated analogously as for the case GB1, again using the renewal reward theorem
as the main tool. We leave the calculations in this case as an exercise.
4.4. Time series models


A filter operates on an input sequence {x(t), t ∈ I} and generates a new output
sequence {y(t), t ∈ I}. Filtering is the process of letting the input pass through
the filter and determining the output. We will study so called linear, time invariant
filters (linear filters for short), for which the output is a weighted linear sum of input
elements, represented as

X
y(n) = h(k)x(n − k) (discrete time)
k=−∞
Z ∞
y(t) = h(s)x(t − s) ds (continuous time).
−∞

Here {h(n)} and {h(t)} are called the impulse response functions of the linear filters.
A typical situation is that the input signal {x(n)} is a sequence of independent
random variables and the output {y(n)} hence a sequence of dependent random
variables, where the nature of the dependence structure varies with the choice of
impulse response. Another typical situation is that the input is a signal subject to
disturbing noise and the purpose of letting it through the filter is to obtain a less
noisy output. In fact, filters can be designed so that they change the nature of the
signal in a particular direction.

Definition 4.6. A linear filter is said to be

• stable, if the impulse response function is such that

$$\sum_{k=-\infty}^{\infty} |h(k)| < \infty \ \text{(discrete time)}, \qquad \int_{-\infty}^{\infty} |h(t)|\, dt < \infty \ \text{(cont. time)},$$

• causal, if h(t) = 0, t < 0 (discrete or continuous time), so that

$$y(n) = \sum_{k=0}^{\infty} h(k)x(n-k) \quad \text{(discrete time)}$$
$$y(t) = \int_0^{\infty} h(s)x(t-s)\, ds \quad \text{(continuous time)}.$$

The stability condition is a natural restriction put on the impulse response func-
tion which makes the output sequence well-defined and allows one to start building up
a mathematical theory for filters. Causality is a natural assumption in particular if
the filter describes evolution over time. In such a case, a filter is causal if the output
depends only on the past and the present of the input but not on future values of
the input.
The most important time series models for engineering applications are those
for which the input sequence is a weakly stationary process. If a weakly stationary
stochastic process is applied as input and fed through a stable, causal, linear filter,
then the output has equally tractable properties. Recalling Definition 1.8, the key
property of a weakly stationary process is that the autocovariance function r(s, t), of
two variables, only depends on the time difference |t − s|. It is therefore convenient
to simplify notation and denote the auto-covariance function by

$$r(t) = r(s, s+t) = C(X(s), X(s+t)), \quad t \ge 0.$$

In particular, at t = 0 we obtain the variance

$$r(0) = v(s) = V(X(s)), \quad s \ge 0.$$

Consequently, the variance of a weakly stationary process is a fixed positive number that stays the same for all times s. Similarly, the autocorrelation function is

$$\rho(t) = \frac{C(X(s), X(s+t))}{\sqrt{V(X(s))}\sqrt{V(X(s+t))}} = r(t)/r(0).$$

Theorem 4.7. Assume that {X(t)} is a weakly stationary process with mean value function m_X and covariance function r_X. Let h be the impulse response function of a stable, causal, linear filter. Then, in discrete time, the output sequence $Y_n = \sum_{k=0}^{\infty} h(k)X_{n-k}$ from the filter is weakly stationary with mean value and covariance given by

$$m_Y = m_X\sum_{k=0}^{\infty} h(k)$$
$$r_Y(\tau) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} h(i)h(j)r_X(\tau+i-j).$$

In continuous time, the output $Y(t) = \int_0^{\infty} h(s)X(t-s)\, ds$ is weakly stationary with

$$m_Y = m_X\int_0^{\infty} h(t)\, dt$$
$$r_Y(\tau) = \int_0^{\infty}\int_0^{\infty} h(s)h(t)r_X(\tau+t-s)\, ds\, dt.$$

A stochastic process {X(t)} is said to be Gaussian if all its finite-dimensional distributions are multivariate normal; in particular, for each t, X(t) has a normal distribution. Similarly in discrete time. When we recall from basic probability
theory that the normal distribution is preserved under summation of a finite number
of random variables, it comes as no surprise that linear filters also preserve the
Gaussian property. Indeed, the following result is shown using the corresponding
properties for infinite series and integrals of random processes, see e.g. Hoel et al.
[13].

Theorem 4.8. If the input to a stable, linear filter is a weakly stationary Gaussian
process with mean value mX and covariance rX , then the output from the filter is
again a weakly stationary Gaussian process, with mean and covariance given by
mY and rY in Theorem 4.7.

Example 4.9. The discrete time filter with impulse response function $h(k) = a^k$, k ≥ 0, and h(k) = 0, k < 0, is causal and stable if |a| < 1, since then $\sum_{k=0}^{\infty}|a|^k = 1/(1-|a|) < \infty$. This filter applied to an input signal {X_n} yields the output signal {Y_n}, given by

$$Y_n = X_n + aX_{n-1} + a^2X_{n-2} + \dots \tag{4.5}$$
The effect of this filter is illustrated in Figure 2 for the particular case when the
input {Xn } is a sequence of independent Gaussian random variables with mean 0
and variance 1. The top graph is the input noise, the middle graph is the output for
a = 0.5 and the lower graph is the output in the case when the parameter is set to
a = 0.9. For small values of a the output sequence is more or less the same sequence
as the input, but with increasing values of a the strength of the dependence between
the values of the signal at different time points increases. In the lower graph this
is visible in that some of the fluctuations over short time scales are reduced. The
filter is called a low pass filter since such high frequency fluctuations are dampened.

[Figure 2. Input (top) and output of the linear filter (4.5), a = 0.5 (middle) and a = 0.9 (bottom).]

We apply Theorem 4.7 to this model. The input sequence {X_n} has m_X = 0 (mean zero), r_X(0) = 1 (unit variance) and r_X(k) = 0, k ≠ 0 (independence).


Consequently, using the theorem, m_Y = 0 and

$$r_Y(\tau) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} h(i)h(j)r_X(\tau+i-j) = \sum_{i=0}^{\infty} h(i)h(\tau+i)\,r_X(0) = r_X(0)\sum_{i=0}^{\infty} a^{2i+\tau}\,\mathbf{1}_{\{\tau+i \ge 0\}} = \frac{a^{|\tau|}}{1-a^2}, \quad \tau = 0, \pm 1, \dots$$

Moreover, using now Theorem 4.8 it follows that the output sequence from the filter is Gaussian. Hence for each n, the random variables Y_n all have the normal distribution N(0, (1 − a²)^{-1}) and the covariance of any two elements Y_n and Y_m is given by $r_Y(m-n) = a^{|m-n|}/(1-a^2)$.
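These formulas can be verified empirically. The following Python sketch, added for illustration, generates Gaussian white noise and applies the filter (4.5) in the equivalent recursive form Y_n = X_n + aY_{n−1} (this rewriting is derived in Example 4.14 below), then compares sample moments with r_Y(0) = 1/(1 − a²) and r_Y(1) = a/(1 − a²).

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200_000)   # Gaussian white noise, mean 0, variance 1

    a = 0.9
    y = np.empty_like(x)
    y[0] = x[0]
    for n in range(1, len(x)):         # y[n] = x[n] + a*y[n-1], equivalent to (4.5)
        y[n] = x[n] + a * y[n-1]

    print(y.var(), 1 / (1 - a**2))                    # estimate of r_Y(0)
    print(np.mean(y[:-1] * y[1:]), a / (1 - a**2))    # estimate of r_Y(1)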

4.5. Autoregressive and moving average processes


This section is devoted to a more detailed study of linear filters in discrete time for
which the input is independent, or at least uncorrelated. In this case, when we have

$$E(X_n) = 0, \qquad C(X_n, X_m) = \begin{cases} \sigma^2 & \text{if } n = m \\ 0 & \text{if } n \neq m, \end{cases}$$
the input sequence {Xn , −∞ < n < ∞} (or sometimes {Xn , n ≥ 0}) is called white
noise. If, in addition, the sequence is Gaussian we have Gaussian white noise.

Definition 4.10. Let {Xn } denote white noise with variance parameter σ 2 .
• A moving average process of order q (MA(q) for short) is defined as a finite
linear combination
Yn = c0 Xn + c1 Xn−1 + · · · + cq Xn−q , c0 = 1,
of the white noise input sequence.
• An autoregressive process of order p (AR(p) for short) is a sequence {Yn }
defined by the recursion
(4.6) Yn = −a1 Yn−1 − · · · − ap Yn−p + Xn ,
where it is assumed that p initial values are specified. One may take, for
example, Y−1 = · · · = Y−p = 0 and white noise {Xn , n ≥ 0} in order to
generate {Yn , n ≥ 0}.

The following result is a consequence of Theorem 4.7.

Theorem 4.11. The MA(q) process is always stable and weakly stationary with m_Y = 0 and

$$r_Y(\tau) = \begin{cases} \sigma^2\sum_{j=0}^{q} c_j c_{j-\tau} & \text{if } |\tau| \le q \\ 0 & \text{otherwise,} \end{cases}$$

with the convention that c_k = 0 for indices k outside {0, . . . , q}.

Proof. By assumption we have m_X = 0, r_X(0) = σ², and r_X(k) = 0 for k ≠ 0. Moreover, Y_n is the output of a causal, linear filter with impulse response function h(k) = c_k for k = 0, . . . , q and h(k) = 0 for k > q. Hence by Theorem 4.7, m_Y = 0 and

$$r_Y(\tau) = \sum_{i=0}^{q}\sum_{j=0}^{q} c_i c_j \sigma^2\,\mathbf{1}_{\{\tau+i-j=0\}} = \sigma^2\sum_{j=0}^{q} c_j \sum_{i=0}^{q} c_i\,\mathbf{1}_{\{i=j-\tau\}}.$$

For any τ such that |τ| ≥ q + 1 there are no indices j with 0 ≤ j − τ ≤ q, hence in this case r_Y(τ) = 0. For |τ| = q there is one such index, namely j = q, so that r_Y(±q) = σ²c_q, and so on for other values of τ. In particular, r_Y(0) = σ²(1 + c_1² + · · · + c_q²). The given formula is one way to summarize this result. Since r_Y(τ) is symmetric, the expression $\sigma^2\sum_{j=0}^{q} c_j c_{j+\tau}$, |τ| ≤ q, provides an alternative. □
Next we turn to the study of AR(p) filters. In order for an autoregressive filter
to be stable and hence generate a well-defined autoregressive sequence of order p,
it is required that the coefficients a1 , . . . , ap are chosen appropriately. To make this
precise, let z ∈ C be the complex variable in the complex plane and let A(z) be the generating polynomial

$$A(z) = 1 + a_1 z + a_2 z^2 + \cdots + a_p z^p.$$

The equation

$$z^p A(z^{-1}) = z^p + a_1 z^{p-1} + \cdots + a_p = 0, \tag{4.7}$$
called the characteristic equation of the AR(p) model, is known to have exactly p
solutions, or roots, z1 , . . . , zp in C.

Theorem 4.12. The AR(p) filter is stable if the coefficients a_1, . . . , a_p are such that all p roots z_1, . . . , z_p of the characteristic equation (4.7) are located inside of the unit circle in the complex plane, that is |z_k| < 1, 1 ≤ k ≤ p. Equivalently, all roots z'_1, . . . , z'_p of the equation A(z') = 0 must be outside of the unit circle, that is |z'_k| > 1. In this case the AR(p) process {Y_n} is weakly stationary with m_Y = 0. The covariance function r = r_Y is obtained as the solution to the so called Yule-Walker equations

$$\begin{cases} r(0) + a_1 r(1) + \cdots + a_p r(p) = \sigma^2 \\ r(1) + a_1 r(0) + \cdots + a_p r(1-p) = 0 \\ r(2) + a_1 r(1) + \cdots + a_p r(2-p) = 0 \\ \quad\vdots \\ r(k) + a_1 r(k-1) + \cdots + a_p r(k-p) = 0, \quad k \ge 1 \\ \quad\vdots \end{cases}$$

Proof. The proof of the stability criterion goes beyond the scope of these notes. Therefore we assume that the AR(p) filter is stable and verify the other claims of the theorem.
Take expected values of both sides of (4.6). This gives

$$m_Y + a_1 m_Y + \cdots + a_p m_Y = m_X = 0,$$

in other words m_Y A(1) = 0. Since any solution z of z^p A(1/z) = 0 has |z| < 1, we cannot have A(1) = 0, hence m_Y = 0.
To show that r satisfies the Yule-Walker equations we consider the covariance of Y_{n-k} and X_n. By the defining relation for the AR(p) filter,

$$C(Y_{n-k}, X_n) = C(Y_{n-k}, Y_n + a_1 Y_{n-1} + \cdots + a_p Y_{n-p}),$$

where it is seen from the weak stationarity that the right hand side equals

$$r(k) + a_1 r(k-1) + \cdots + a_p r(k-p).$$

But since Y_{n-k} is independent of X_n for any k ≥ 1, we have C(Y_{n-k}, X_n) = 0 for such k, and hence the left hand side vanishes. Hence

$$r(k) + a_1 r(k-1) + \cdots + a_p r(k-p) = 0, \quad k \ge 1.$$

It remains to verify the first of the Yule-Walker equations, for the case k = 0. To do so we rewrite the covariance C(Y_n, X_n) in two different ways. First, by the causality,

$$C(Y_n, X_n) = C(-a_1 Y_{n-1} - \cdots - a_p Y_{n-p} + X_n, X_n) = C(X_n, X_n) = \sigma^2.$$

Second,

$$C(Y_n, X_n) = C(Y_n, Y_n + a_1 Y_{n-1} + \cdots + a_p Y_{n-p}) = r(0) + a_1 r(-1) + \cdots + a_p r(-p)$$

and so

$$r(0) + a_1 r(-1) + \cdots + a_p r(-p) = r(0) + a_1 r(1) + \cdots + a_p r(p) = \sigma^2. \qquad \Box$$
Example 4.13. We want to compare two stationary processes, one AR(1)-process {Y_n} and one MA(2)-process {Z_n}, defined by the filter relations

$$Y_n + 0.8Y_{n-1} = X_n$$

and

$$Z_n = X_n + 0.4X_{n-1} + 1.2X_{n-2},$$

respectively, where {X_n} is an input signal.

[Figure 3. Autoregressive and moving average processes: one simulated realization each, panels a) and b).]

The simulations a) and b) of Figure 3 show one realization each of the two time series, where {X_n} is simulated white noise of mean zero and variance one. Which simulation is the AR(1) process, and which is the MA(2) process?
Example 4.14 (First-order autoregressive model). In each step of its evolution


the AR(1) process picks up a fraction of the previous value and adds an uncorrelated
noise component, Yn = −a1 Yn−1 + Xn . The generating polynomial is A(z) = 1 + a1 z,
and the characteristic equation z + a1 = 0. Of course, the root z1 = −a1 of the
characteristic equation has |z1 | < 1 if and only if |a1 | < 1, which is therefore the
stability condition. The same conclusion follows by observing that A(z ′ ) = 0 gives
|z ′ | = | − 1/a1 | > 1 whenever |a1 | < 1.
The covariance function r, and thus the correlation function ρ, are obtained from the Yule-Walker equations as

$$r(\tau) = \sigma^2\frac{(-a_1)^{|\tau|}}{1-a_1^2}, \qquad \rho(\tau) = r(\tau)/r(0) = (-a_1)^{|\tau|}, \quad \tau = 0, \pm 1, \dots$$

Recall the linear filter in Example 4.9 with impulse response $h(k) = a^k$ for k ≥ 0. This process can also be written

$$Y_n = \sum_{k=0}^{\infty} a^k X_{n-k} = X_n + a\sum_{k=1}^{\infty} a^{k-1} X_{n-k} = X_n + a\sum_{j=0}^{\infty} a^j X_{n-1-j},$$

where we change index in the sum from k to j = k − 1. Hence

$$Y_n = X_n + aY_{n-1},$$
and so, in fact, the sequence {Yn } which was defined as an “infinite MA process”
also turns out to be an AR(1) process! This explains that the calculations of the co-
variance function for the AR(1) model in this example and for the model in Example
4.9 give the same result.
Example 4.15 (Second-order auto-regressive model). The AR(2) process uses
in each step information from the two most recent values and again adds an uncor-
related noise component, Yn = −a1 Yn−1 − a2 Yn−2 + Xn . The generating polynomial
is $A(z) = 1 + a_1 z + a_2 z^2$ and the characteristic equation $z^2 + a_1 z + a_2 = 0$. By investigating the corresponding roots it is seen that the AR(2) process is stable for all (a_1, a_2) such that

$$|a_2| < 1 \quad\text{and}\quad |1 + a_2| > |a_1|. \tag{4.8}$$
The Yule-Walker equations take the form

$$\begin{cases} r(0) + a_1 r(1) + a_2 r(2) = \sigma^2 \\ r(1) + a_1 r(0) + a_2 r(-1) = 0 \\ r(2) + a_1 r(1) + a_2 r(0) = 0 \\ \quad\vdots \\ r(k) + a_1 r(k-1) + a_2 r(k-2) = 0, \quad k \ge 1 \\ \quad\vdots \end{cases}$$

and from this follows

$$r(0) = V(Y_n) = \sigma^2\frac{1+a_2}{(1-a_2)\big((1+a_2)^2 - a_1^2\big)} \tag{4.9}$$
and recursively other values of r(τ ).
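For p = 2 the first three Yule-Walker equations form a linear system in the unknowns r(0), r(1), r(2), using the symmetry r(−k) = r(k). A Python sketch of the solver, added for illustration and checked against the closed form (4.9):

    import numpy as np

    def ar2_covariances(a1, a2, sigma2):
        """Solve the first three Yule-Walker equations for r(0), r(1), r(2)."""
        A = np.array([[1.0, a1, a2],
                      [a1, 1.0 + a2, 0.0],
                      [a2, a1, 1.0]])
        return np.linalg.solve(A, np.array([sigma2, 0.0, 0.0]))

    a1, a2 = -1.42, 0.73
    r0, r1, r2 = ar2_covariances(a1, a2, 1.0)
    print(r0, (1 + a2) / ((1 - a2) * ((1 + a2)**2 - a1**2)))   # both give r(0)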
Example 4.16 (Sunspot numbers). The system of equations required to determine the covariance function of an autoregressive process is named after G. U. Yule and G. T. Walker, who used time series analysis to study the so called Wolfer
sunspot numbers in 1927. These are measurements of the annual solar surface ac-
tivity, which have become well-known due to the very distinct cyclic pattern with
a peak of maximum activity approximately every eleventh year. The updated se-
quence of Wolfer numbers is shown in Figure 4 (available partly in MATLAB as the
data file sunspot.dat). The observed mean value in the relevant measurement unit

200

180

160

140

120

100

80

60

40

20

0
1700 1750 1800 1850 1900 1950 2000

Figure 4. Wolfer sunspot numbers 1700 - 2008

is 49.75 and the standard deviation 40.45.


The question posed by Yule and Walker, which we address now without going
into details, is whether it is possible to use the simple recursive principle of autore-
gressive time series to generate data with a similar statistical behavior as those of
sunspot numbers. Put differently, can we choose coefficients a1 , . . . , ap and σ 2 such
that the corresponding AR(p) process would serve as an explanatory stochastic
model for solar activity, featuring 11 year cycles and generally the shape of observed
data? To answer such questions properly one should apply spectral analysis of time
series. This refers to using the mathematical techniques of Fourier analysis to de-
compose the sequences into wave-formed components and analyze how the choice of
parameters affects the dominating frequencies. To avoid the mathematical machin-
ery of spectral theory in this presentation (see e.g. Söderström [20]), we consider
instead a given AR(2) type process of the form

$$Y_n - 1.42Y_{n-1} + 0.73Y_{n-2} = 15.5 + X_n,$$
where as usual {X_n} is white noise with variance σ². To get a feel for the subject we merely ask if there is a value of σ² which gives the resulting time series a reasonable degree of resemblance to sunspot numbers. To begin with stationarity, the first condition is that the mean value m_Y must satisfy

$$m_Y - 1.42m_Y + 0.73m_Y = 15.5 + 0,$$

hence m_Y = 15.5/0.31 = 50, which is the approximate mean value of the observed
sequence.
The zeros of the generating polynomial are obtained by solving the equation $1 - 1.42z + 0.73z^2 = 0$. If we change to u = 1/z this is the characteristic equation $u^2 - 1.42u + 0.73 = 0$, which has the solution $u = 0.71 \pm \sqrt{0.71^2 - 0.73}$. Thus u = 0.71 ± 0.4753i, and so both solutions have $|u| = \sqrt{0.71^2 + 0.4753^2} = 0.8544 < 1$. This shows that for this choice of parameters the AR(2) model is stable.
We have found in (4.9) the relation of the white noise variance σ² to V(Y_n) = r(0), a_1 and a_2. If we now estimate r(0) by r(0)* = 40.5540² and insert a_1 = −1.42 and a_2 = 0.73, this relation yields

$$(\sigma^2)^* = r(0)^*\,\frac{1-a_2}{1+a_2}\big((1+a_2)^2 - a_1^2\big) = 40.5540^2 \cdot \frac{0.27}{1.73}\,(1.73^2 - 1.42^2) = 250.6439.$$
It is simple to simulate realizations of {Yn }, and other autoregressive or mov-
ing average processes, using MATLAB. The command y = filter(1, a, x); re-
turns the input data in the vector x filtered as an AR(p) process using the vector
a = [1 a1 . . . ap ].
A typical output sequence y is shown in Figure 5, where the estimated value (σ²)* of σ² is used to create the input of normally distributed random numbers x.

Of course an obvious drawback of this method is that the filtered output sequence
may attain negative values, whereas sunspot numbers do not.
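The same simulation can be done without MATLAB. A Python sketch, added for illustration; the recursion is simply the AR(2) relation above, solved forward from the stationary mean:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = np.sqrt(250.6439)      # the estimated white noise standard deviation
    n = 309                        # the years 1700-2008

    y = np.full(n, 50.0)           # start at the stationary mean m_Y = 50
    for t in range(2, n):
        y[t] = 1.42*y[t-1] - 0.73*y[t-2] + 15.5 + sigma * rng.standard_normal()

    print(y.mean(), y.std())       # compare with 49.75 and 40.45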

4.6. Statistical methods


In this section we discuss briefly optimal linear prediction for random sequences
{Yn }. For simplicity we assume that the sequence is weakly stationary with mY = 0.
To study sequences with non-zero mean m one can often simply add the constant
afterwords and draw appropriate conclusions for Yn + m. Assume Y1 , . . . , Yn have
been observed and suppose we are interested in Yn+k for some k ≥ 1. A common
approach is to estimate Yn+k with a linear combination
Xn
Ybn+k = aj Y j
j=1

of the known variables, and try to choose the coefficients a1 , . . . an such that the
resulting mean squared error (variance of the prediction error)
 Xn 2
b 2
E(Yn+k − Yn+k ) = E Yn+k − aj Y j
j=1
is minimized.

[Figure 5. Simulation of AR(2) process designed to resemble sunspot numbers.]

In a dynamical setting one would like to find a_1, . . . , a_n such that


for each m the sum $\sum_{i=1}^{n} a_i Y_{m-n+i}$ is the optimal prediction of Y_{m+k} based on the n previous values Y_{m−n+1}, . . . , Y_m. The resulting sequence $\hat{Y}_{m+k}$, m ≥ 1, is then itself the output of a linear filter with input {Y_m} and impulse response function h(m) = a_{n−m}, 0 ≤ m ≤ n − 1. Thus, the estimation technique is often called linear prediction filtering.
It turns out that the best choice of weights a_1, . . . , a_n is obtained when the prediction error $Y_{n+k} - \sum_{j=1}^{n} a_j Y_j$ is uncorrelated with each of the observations in the following sense.
Theorem 4.17 (Projection theorem for linear space). Suppose that a finite number of variables Y_1, . . . , Y_n has been observed and we want to predict Y_{n+k} with a linear combination

$$\hat{Y}_{n+k} = \sum_{j=1}^{n} a_j Y_j.$$

The optimal choice of coefficients a_1, . . . , a_n, which makes the mean squared error minimal, is obtained as the solution of the normal equations

$$\begin{cases} C\big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j,\, Y_1\big) = 0 \\ C\big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j,\, Y_2\big) = 0 \\ \quad\vdots \\ C\big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j,\, Y_n\big) = 0. \end{cases}$$

Proof. Assume a_1, . . . , a_n satisfy the normal equations and that b_1, . . . , b_n is an arbitrary collection of coefficients. First write

$$V\Big(Y_{n+k} - \sum_{j=1}^{n} b_j Y_j\Big) = V\Big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j + \sum_{j=1}^{n} (a_j - b_j)Y_j\Big).$$

The right hand side is

$$V\Big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Big) + V\Big(\sum_{j=1}^{n} (a_j - b_j)Y_j\Big) + 2C\Big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j,\ \sum_{i=1}^{n} (a_i - b_i)Y_i\Big).$$

In this expression the covariance term is zero because of the normal equations. Indeed,

$$C\Big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j,\ \sum_{i=1}^{n} (a_i - b_i)Y_i\Big) = E\Big[\Big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Big)\sum_{i=1}^{n} (a_i - b_i)Y_i\Big] = \sum_{i=1}^{n} (a_i - b_i)\,E\Big[\Big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Big)Y_i\Big] = 0,$$

and so we have

$$V\Big(Y_{n+k} - \sum_{j=1}^{n} b_j Y_j\Big) = V\Big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Big) + V\Big(\sum_{j=1}^{n} (a_j - b_j)Y_j\Big) \ge V\Big(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Big),$$

with equality when b_j = a_j. Hence a_1, . . . , a_n gives the optimal predictor. □


Example 4.18 (Prediction of future output). Consider a time series {Yn } which
is a moving average process of order 2 defined by the filter relation
Yn = Xn + 0.7Xn−1 + 0.5Xn−2 ,
where {Xn } is Gaussian white noise with mean zero and variance σ 2 = 2. Let us
suppose that the series has been observed at the times n = 1 and n = 2 with
observed measurement results y1 = 2.463 and y2 = 0.436. Given these data we wish
to predict a value y_3 of Y_3 at time n = 3. The method of linear prediction implies putting $\hat{Y}_3 = aY_1 + bY_2$ for the choice of constants a and b such that the normal equations

$$C(Y_3 - aY_1 - bY_2, Y_1) = r_Y(2) - a\, r_Y(0) - b\, r_Y(1) = 0$$
$$C(Y_3 - aY_1 - bY_2, Y_2) = r_Y(1) - a\, r_Y(1) - b\, r_Y(0) = 0$$

are satisfied. The predicted behavior of Y_3 is then obtained as the linear combination ŷ_3 = ay_1 + by_2.
The covariance function of the MA(2) process Y is given by r_Y(0) = V(Y_n) = σ²(1 + 0.7² + 0.5²) = 3.48, r_Y(±1) = σ²(0.7 + 0.7·0.5) = 2.10 and r_Y(±2) = σ²·0.5 = 1, in addition to r_Y(k) = 0 for |k| ≥ 3. Solving the normal equations yields a = −0.12077 and b = 0.67633, consequently ŷ_3 = 2.463a + 0.436b ≈ −0.0026.
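In matrix form the two normal equations read R (a, b)^T = (r_Y(2), r_Y(1))^T, where R is built from r_Y(0) and r_Y(1), so they are solved in one line. A Python sketch, added for illustration:

    import numpy as np

    r0, r1, r2 = 3.48, 2.10, 1.0       # covariances of the MA(2) process above
    y1, y2 = 2.463, 0.436              # observed values

    R = np.array([[r0, r1],
                  [r1, r0]])
    a, b = np.linalg.solve(R, np.array([r2, r1]))
    print(a, b)                        # -0.12077 and 0.67633
    print(a*y1 + b*y2)                 # predicted value of Y_3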

Solved exercises
1. A delivery truck repeatedly drives back and forth between Stockholm and Upp-
sala, located a distance of 70 kilometers apart. On each trip (either Stockholm
to Uppsala or the reverse trip) traffic disturbances varies so that the driver is
equally likely either to see clear roads or congested roads. When the roads are
congested, the average speed for the trip is 60 km/h. When the roads are clear,
the average speed is 80 km/h. Over many hours of travel, what is the average
speed of the truck?
Solution. Some reflection shows that the naive answer 70 km/h is wrong, and the
renewal theorem helps formalize the correct calculation. During each cycle the
truck accumulates a “reward” of 70 km driving distance. The time to complete
this task is random with mean value ν = 0.5 · 70/60 + 0.5 · 70/80 = 49/48
hours and thus, by the renewal reward theorem, the average speed is the ratio
48 · 70/49 ≈ 68.6 km/h.
2. Recall the M/M/1 service system discussed in Example 3.8. Suppose that the
Poisson arrival process has intensity 2 per hour and the service times are expo-
nentially distributed with mean 20 minutes, independent of each other and of the
arrival process. In this system there is unlimited queuing space and customers
who arrive when the server is busy always wait in line.
Seen from the single server’s perspective, time can be divided into busy pe-
riods of actively serving a customer, and vacant periods during which there are
no customers in the system. Consider the system in equilibrium.
(a) What proportion of time is the server active?
(b) Determine the expected length of a vacant period.
(c) Determine the expected length of a busy period.
Hint: It is useful to note that the time points when customers arrive to an
empty system are renewals!
Solution. Figure 6 shows a simulated trace of this M/M/1 process with X(0) = 0.
The traffic intensity is ρ = 2/3. Since ρ < 1 we know from Example 3.8 that there
exists an asymptotic distribution, which is given by the geometric distribution
$$\pi_k = \rho^k(1-\rho), \quad k = 0, 1, \dots$$
[Figure 6. Vacant periods and busy periods of the queueing system M/M/1.]

a) Using the steady-state distribution π, it follows that the probability to


find the system empty at an arbitrary time is π0 = 1 − ρ = 1/3. Hence the
probability to find the server busy is 1 − π_0 = ρ = 2/3. Over long time we may interpret this by saying that the server is busy 2/3 of the time. Indeed, the
simulated piece of a trajectory in Figure 6 appears to spend about a third of its
time in 0 and the rest in some non-zero state.
b) A vacant period V in this model is the same as the waiting time in
state zero of the continuous time Markov chain X(t). By definition this is
an exponential time with intensity λ = 2, V ∈ Exp(2), hence expected value
E(V ) = 1/2.
c) We observe that each time when a customer arrives to an empty system,
in other words, each time when we have a jump from 0 to 1, is a renewal event
of a renewal process. One cycle in this renewal process consist of a busy period
(during which X(t) ≥ 1) immediately followed by a vacant period (X(t) = 0).
The length of cycle i is a random variable
Ui = Bi + Vi ,
where
Bi = length of busy period during cycle i
Vi = length of vacant period during cycle i
Let the reward of cycle i be Ri = Vi . By the renewal-reward theorem 4.3, the
total reward R(t) = total vacant time up to t satisfies
$$\frac{1}{t}R(t) \to \frac{E(R)}{E(U)} = \frac{E(V)}{E(B)+E(V)}.$$
Using a) this limit is π0 = 1/3. Using b) the limit is 1/(2E(B) + 1). Combining
these we obtain that the expected length of a busy period is E(B) = 1.
3. The TCP protocol for Internet packet traffic is designed to have the property
that the longer a transmission session is in operation uninterrupted by losses of
single packets, the more traffic is sent per time unit. As a session experiences
a packet loss due to congestion on the digital pathways or other reasons, the
transmission capacity is tuned down. As a model for TCP suppose that during
a time interval of length U seconds the protocol transmits R = U²/2 Mbytes.
The system runs during time U1 with such a result R1 , then U2 seconds with
transmission result R2 , and so on, where U1 , U2 , . . . are independent random
variables all uniformly distributed on the interval [0, 1]. Find the asymptotic
throughput of the protocol, measured in Mbytes per second.
Solution. Let U_1, U_2, . . . form cycles, times between renewals, of a renewal process and associate with cycle i of length U_i a reward R_i = U_i²/2. By the renewal-reward theorem, the total workload of data packets R(t) transmitted over time t behaves as R(t)/t → E(R)/E(U). Here, $E(U) = \int_0^1 u\, du = 1/2$ seconds and E(R) = E(U²)/2, where $E(U^2) = \int_0^1 u^2\, du = 1/3$. Thus, E(R) = 1/6 and the asymptotic capacity is obtained as the cycle average 1/3 Mbytes per second.
4. Assume that {Yn } is wide sense stationary with covariance function given by
rY (0) = 3, rY (k) = 2 for |k| = 1 and rY (k) = 0 for |k| ≥ 2. For m ≥ 1 find the
optimal linear filter for the prediction $\hat{Y}_{m+1}$ of $Y_{m+1}$ based on the two previous values Y_{m−1}, Y_m.
Solution. With k = 1 and n = 2 the normal equations of Theorem 4.17 attain the form

$$C(Y_{n+1} - a_1 Y_{n-1} - a_2 Y_n, Y_{n-1}) = 0$$
$$C(Y_{n+1} - a_1 Y_{n-1} - a_2 Y_n, Y_n) = 0,$$

which is the same as

$$C(Y_{n+1}, Y_{n-1}) - a_1 C(Y_{n-1}, Y_{n-1}) - a_2 C(Y_n, Y_{n-1}) = 0$$
$$C(Y_{n+1}, Y_n) - a_1 C(Y_{n-1}, Y_n) - a_2 C(Y_n, Y_n) = 0.$$

In terms of the covariance function r_Y this is the system of equations

$$r_Y(2) - a_1 r_Y(0) - a_2 r_Y(1) = -3a_1 - 2a_2 = 0$$
$$r_Y(1) - a_1 r_Y(1) - a_2 r_Y(0) = 2 - 2a_1 - 3a_2 = 0$$

and thus we find the solution a_1 = −4/5, a_2 = 6/5. Hence the optimal linear prediction filter is given by $\hat{Y}_{n+1} = -4Y_{n-1}/5 + 6Y_n/5$.
Chapter 5

Applications in Biology and Bioinformatics

This chapter gives an introduction to stochastic models in population genetics and


evolutionary biology and to probabilistic methods used in DNA sequence analysis.
For further reading we recommend Durrett [6], Ewens and Grant [7] and Renshaw
[17].

5.1. The Wright-Fisher model


In a population of fixed size N we assume that each of the N individuals is repre-
sented by a single genetic locus (haploid individuals) which can exist in either of two
allelic types, called A1 and A2 . The Wright-Fisher model describes the evolution
from one generation to the next of the allelic types under a simple reproduction
mechanism. The state variable is

Xn = the number of individuals of type A1 in generation n, n ≥ 0.

Hence in generation n, Xn individuals are of type A1 and N − Xn are of type A2.
Given an initial distribution for X0 that describes the proportions of A1 and A2 in
the initial generation, the dynamics of the allelic types is determined by the following
sampling scheme: Each new generation n is generated from the previous one n − 1
using N independent draws, one for each individual. In each draw the allelic type
of that individual is obtained by selecting randomly and uniformly, with probability
1/N for each choice, one of the individuals in generation n − 1 to be its parent, and
by letting the type of the parent be directly inherited to the new individual, the
offspring. To visualize the procedure we may think of a population of annual plants
each with either red or blue flowers. The number of plants is preserved from one
summer to the next. To obtain the mix of colors for the upcoming summer “pick”
randomly a plant from the previous summer and note if it is red or blue. Repeat
independently until a whole new population has formed.

From the above construction it is clear that {Xn } is a discrete time Markov
chain. The transition probabilities pij = P(Xn = j | Xn−1 = i) are given by

pij = \binom{N}{j} pi^j (1 − pi)^{N−j},  pi = i/N,  0 ≤ i, j ≤ N.
Indeed, if Xn−1 = i then in each of the N independent draws the probability is
pi = i/N to pick an A1 parent and the probability is 1 − pi to pick a parent of type
A2 . The number of A1 offspring is therefore a random variable with the Bin(N, pi )
distribution, explaining the form of the transitionP probabilities. Based on this we
can continue by deriving the mean value E(Xn ) = rj=0 jP (Xn = j). Namely, since
the law of total probability shows that
N
X N
X
P (Xn = j) = P (Xn = j|Xn−1 = i)P (Xn−1 = i) = pij P (Xn−1 = i),
i=0 i=0
we have
N
X N
X N
X N
X
E(Xn ) = j pij P (Xn−1 = i) = P (Xn−1 = i) jpij .
j=0 i=0 i=0 j=0
PN
But for each fixed i, the sum j=0 jpij is the mean value N · i/N = i of a random
variable with the binomial distribution Bin(N, i/N ). Hence
N
X
(5.1) E(Xn ) = P (Xn−1 = i) i = E(Xn−1 ).
i=0

In the same way E(Xn−1 ) = E(Xn−2 ), and so on, which shows that the expected
value of the Wright-Fisher Markov chain is preserved over time, E(Xn ) = E(X0 ),
n ≥ 0, and determined by the expected value E(X0 ) in the initial distribution.
As an alternative the above can be derived using conditional expected values.
Namely, we have

E(Xn | Xn−1 = i) = Σ_{j=0}^{N} j P(Xn = j | Xn−1 = i) = Σ_{j=0}^{N} j pij = i,

which is a so called martingale property of {Xn}. Conclude that

E(Xn) = Σ_{i=0}^{N} E(Xn | Xn−1 = i) P(Xn−1 = i) = Σ_{i=0}^{N} i P(Xn−1 = i) = E(Xn−1),

which is again (5.1).
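The binomial sampling mechanism makes the chain easy to simulate. A minimal MATLAB sketch (population size, initial state and number of generations are arbitrary illustration values):

N = 100; ngen = 200;             % population size and number of generations
X = zeros(1, ngen);
X(1) = 50;                       % initial number of A1 alleles
for n = 2:ngen
    p = X(n-1)/N;                % probability of drawing an A1 parent
    X(n) = sum(rand(1, N) < p);  % Bin(N, p) draw, one uniform per individual
end
plot(0:ngen-1, X)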
Fixation of genes. Both of the states {0} and {N } are absorbing. If the Markov
chain gets absorbed in state {0} then allele A1 is extinct and allele A2 has been fix-
ated, whereas absorption in {N } corresponds to fixation of A1 . As for any irreducible
finite Markov chain with absorbing states, absorption is in fact certain to take place.
This effect, which forces the population into a more homogeneous state (all A1 ’s or
all A2 ’s) is called random genetic drift. There are two natural questions to be asked:
a) How likely is an allele to be lost due to genetic drift?
b) What is the expected time for an allele to fix?
To discuss a) suppose we start with i alleles of type A1, that is X0 = i. Let

ri = the probability of fixation in A1, given X0 = i.

Clearly r0 = 0, rN = 1, and by the principle of conditioning on the first event (see Section 2.3),

(5.2)  ri = Σ_{j=0}^{N} pij rj,  1 ≤ i ≤ N.

Because of the relation Σ_{j=0}^{N} j pij = i used above (which said that if X ∈ Bin(N, i/N) then E(X) = i), we have

i/N = Σ_{j=0}^{N} pij (j/N),  1 ≤ i ≤ N.
Hence ri = i/N is a solution of (5.2), and in fact the unique solution consistent with
rN = 1, which hence provides the answer to the first question a).
To discuss b), note that the time to fixation in A1 is the absorption time TiN and
the time to fixation in A2 is the absorption time Ti0 (see Section 2.3). The minimum
of these is the ultimate fixation time
Ti = the first time n for which Xn = 0 or Xn = N , given X0 = i.
Again by the principle of conditioning on the first event, it is seen that the collection
of expected values
mi = E(Ti ) = expected time to fixation in either A1 or A2 , given X0 = i,
can be obtained as the unique solution of the system of equations

(5.3)  mi = pi0 · 1 + piN · 1 + Σ_{j=1}^{N−1} pij (1 + mj) = 1 + Σ_{j=1}^{N−1} pij mj,  1 ≤ i ≤ N − 1
(m0 = mN = 0). Unfortunately, equation (5.3) becomes complicated even for mod-
erate size N . An approximation formula is known, valid for large N :
mi ≈ −2(i log(i/N ) + (N − i) log(1 − i/N )).
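For moderate population sizes the system (5.3) can instead be solved exactly as a linear system, writing it as (I − P)m = 1 over the interior states 1, . . . , N − 1. A MATLAB sketch for N = 20 (an arbitrary choice), which also compares with the approximation formula:

N = 20;
j = 0:N;
logC = gammaln(N+1) - gammaln(j+1) - gammaln(N-j+1);  % log binomial coefficients
P = zeros(N+1);
P(1, 1) = 1; P(N+1, N+1) = 1;       % absorbing states 0 and N
for i = 1:N-1
    p = i/N;                        % Bin(N, i/N) transition row
    P(i+1, :) = exp(logC + j*log(p) + (N-j)*log(1-p));
end
m = (eye(N-1) - P(2:N, 2:N)) \ ones(N-1, 1);  % m_1, ..., m_{N-1}
i = (1:N-1)';
mApprox = -2*(i.*log(i/N) + (N-i).*log(1-i/N));
[m mApprox]                         % exact values against the approximation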
Effect of mutations. Mutation of genes, which change the allelic type of the
individuals in between reproduction events, cause genetic variability and thus have
the opposite effect compared to genetic drift. The balance of genetic drift and mu-
tation is an important aspect of population genetics for which some understanding
can be gained from stochastic models.
We add to the model mutation probabilities u12 and u21 :
uij = probability that due to mutation an allele shifts from Ai to Aj in a given
generation.
The Wright-Fisher model with mutation is the Markov chain with transition proba-
bilities
pij = \binom{N}{j} pi^j (1 − pi)^{N−j},  pi = (i/N)(1 − u12) + (1 − i/N) u21,  0 ≤ i, j ≤ N.
Mutation changes the nature of the process in a very significant way. One allele
can never remain fixed forever. Sooner or later there will be a mutation event that
prevents alleles to be lost from the population. Since now all transition probabilities
are positive, pij > 0, the Markov chain is irreducible and aperiodic, and possesses a
steady state X∞ with a stationary distribution π. The stationary probabilities are
complicated, but we can rather easily find the expected value E(X∞ ). Indeed, the
analog of (5.1) becomes

(5.4)  E(Xn) = E(Xn−1)(1 − u12 − u21) + N u21,

since now E(Xn | Xn−1 = i) = N pi = i(1 − u12 − u21) + N u21. As n → ∞, (5.4) yields the limit

E(X∞) = N u21/(u12 + u21).
Probability of identity by descent. In a Wright-Fisher population of size N,
let us assume that the genes (individuals) are subject to mutation with a small
probability u of changing type in any given generation, being the same for both
alleles (u12 = u21 = u). Consider a pair of two individuals. We know that each
of the two genes inherited its type from a randomly chosen parent in the previous
generation. With probability 1/N they have the same parent and thus a common
ancestor one generation back, and with probability 1−1/N they have distinct parents
one generation back. If the parents are distinct we may continue backwards and ask
if the grandparents are the same, providing a common ancestor for the pair two
generations back. Eventually a common ancestor will be found, and the required
number of generations backwards that have to be traced is the random variable
T^{(2)}_{MRCA} = time to the most recent common ancestor (for a sample of size 2).
The two genes are said to coalesce as they find their common ancestor, and the time
to the most recent ancestor is also called the coalescence time.
We say that the two genes are identical by descent (IBD) if during the time
backwards to their most recent common ancestor, no mutation takes place. To find
(approximately) the probability of identity by descent,
P(IBD) = P(none of the genes is subject to mutation during T^{(2)}_{MRCA} generations),
we give two different demonstrations. The first demonstration highlights that during
each generation there are two competing forces acting on the pair of genes. In any
given generation, as the ancestry of the pair is traced backwards, the genes may
coalesce or at least one of the genes may suffer a mutation. The genes are IBD
if coalescence occurs before mutation. We have observed already that coalescence
has probability 1/N in each generation. The probability of at least one mutation is
1 − (1 − u)2 = 2u − u2 ≈ 2u. By the principle of conditioning on the first event we
get
P(IBD) = 1/N + (1 − 1/N)(1 − 2u) P(IBD).

Hence

P(IBD) = 1/(1 + 2u(N − 1)).
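The recursion can be checked by simulating the generation-by-generation competition between coalescence and mutation; a MATLAB sketch where N = 500 and u = 0.001 are arbitrary illustration values:

N = 500; u = 0.001; M = 1e5;
ibd = 0;
for k = 1:M
    while true
        if rand < 1/N              % the pair coalesces this generation
            ibd = ibd + 1; break
        elseif rand < 2*u          % a mutation strikes first
            break
        end
    end
end
ibd/M                              % compare with 1/(1 + 2*u*(N-1)) = 0.5005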
An alternative derivation of the probability of identity by descent is based on the observation that T^{(2)} = T^{(2)}_{MRCA} has a geometric distribution, in particular

P(T^{(2)} > n) = (1 − 1/N)^n,  n ≥ 1.

Since

P(T^{(2)}/N > t) ≈ P(T^{(2)} > [N t]) = (1 − 1/N)^{[N t]} ≈ e^{−t},

it follows for large population size N that T^{(2)}/N has approximately an exponential distribution with mean one. The time τ backwards until the first mutation takes
place has the geometric distribution
P(τ = k) = (1 − 2u)^{k−1} 2u,  k ≥ 1.
Since the parameter u is small, τ is approximately exponential with parameter 2u.
Hence conditioning on τ ,
P(IBD) = P(T^{(2)} ≤ τ) = 1 − ∫_0^∞ P(T^{(2)} ≥ t) 2u e^{−2ut} dt.

Here

P(T^{(2)} ≥ t) = P(T^{(2)}/N ≥ t/N) ≈ e^{−t/N}

and so

P(IBD) = 1 − ∫_0^∞ e^{−t/N} 2u e^{−2ut} dt = 1 − 2u/(1/N + 2u) = 1/(1 + 2N u).
Hence the two methods lead to very similar expressions for the identity by descent
probability.
The probability of IBD can be seen as a measure of the genetic diversity in the
population. A high value means that the genetic material is close to fixed at one
allele, while a population with a low probability of identity by descent is genetically more diversified.
The Moran model. This is a continuous time version of the Wright-Fisher model.
Under Wright-Fisher dynamics the state of each individual of the population is
updated each discrete generation step. In comparison, the Moran model is suitable
for populations in which only one individual changes type at a time. We
consider again a population of N individuals each of type A1 or type A2 . Each
individual carries an exponential clock with rate one. When a clock rings this
individual randomly selects one individual in the population (including itself). The
chosen individual is subject to mutation (an A1 mutates to A2 with probability
u12 and an A2 mutates to A1 with probability u21 ) and then replaces the original
individual for which the clock rang.
Let X(t) be the number of individuals of type A1 at time t. The Moran model
is the birth and death process with states E = {0, 1, . . . , N} and intensities

λi = (N − i) pi,  µi = i (1 − pi),  pi = (i/N)(1 − u12) + (1 − i/N) u21.
We need to check that these birth and death jump rates correspond to the given
model dynamics. To go from i to i + 1 individuals of type A1 first of all the clock
92 5. Applications in Biology and Bioinformatics

must ring for a type A2 individual, which occurs at rate N − i, that is, for a fraction 1 − i/N of the clock rings. To be sure that this A2 is replaced by an A1, we may either choose an A1, probability i/N, and avoid mutation, probability 1 − u12, or choose an A2 and have a mutation to A1.
Similarly for the death rates.
Figure 1 shows three simulated trajectories of the Moran model in the case of no
mutation, u12 = u21 = 0, and population size N = 200. One population starts with

200

180

160

140
number of A1 alleles

120

100

80

60

40

20

0
0 20 40 60 80 100 120 140 160 180
time

Figure 1. Three paths of a Moran model

X0 = 50 and is fixed at A2 around time t = 167, one population starts symmetrically


with X0 = 100 but is fixed in the state containing only A2 alleles already before time
t = 50, and the third simulation run, with initial distribution X0 = 150, fixes in
A1 after approximately t = 32 time units. MATLAB simulation of continuous time
Markov chains such as in this example can be carried out along the following lines.
function y = moransim(init, nmb)
% Simulate nmb steps of the embedded jump chain of the Moran model
% without mutation (u12 = u21 = 0) and plot the path in continuous time.
% init is the initial number of A1 alleles, 0 <= init <= N.
N = 200;
y = zeros(1, nmb);
y(1) = init;
k = 0:N;                          % possible states
lambda = k.*(1 - k./N);           % birth rates, lambda_i = (N-i)p_i = i(1-i/N)
mu = k.*(1 - k./N);               % death rates, equal here (no mutation)
q = lambda + mu;                  % total jump rates
x = init;
for i = 2:nmb
    if (x == 0 || x == N)         % states 0 and N are absorbing
        y(i) = x;
    elseif (lambda(x+1)/q(x+1) > rand)
        y(i) = x + 1;
        x = y(i);
    else
        y(i) = x - 1;
        x = y(i);
    end
end
qy = q(y+1);                      % jump rate in each visited state
qy(qy == 0) = 1;                  % unit rate at absorption, for plotting
z = cumsum(-log(rand(1, nmb))./qy);
stairs(z, y);
axis([0 z(nmb) 0 N]);
In the case u12 > 0, u21 > 0, there is a unique steady-state distribution, which
is obtained in the usual way from Theorem 3.6.
5.2. Kingman’s coalescent process
The ancestry of a population is contained in its genealogical tree, which in the simple
case of haploid individuals studied here, keeps track over time of who is the parent
of each new individual. With this information at hand we may trace the ancestry of
a sample or the ancestry of the whole population backwards in time until the most
recent ancestor is identified. For the discrete time model and for a sample of size
two we already studied this situation and introduced the random time T^{(2)}_{MRCA}. More
generally, consider a sample of size k ≥ 2 in generation n. For the moment we are
not interested in their types, only in whether some of them have the same parents. Define
T^{(k)}_{MRCA} = the number of generations in the past until the most recent common ancestor in a sample of size k.
Figure 2 shows an example of an ancestral tree for the case k = 9 and N = 12. In
the generation at time n (bottom part of figure) we have chosen a sample of size
k. The arrows associated with each individual in the sample are pointing from an
offspring in generation n to a parent in generation n − 1. Proceeding in this way the
ancestry of the k-sample can be traced backwards in time (upwards in the figure).
In generation n − 1 there are 7 distinct ancestors remaining, in generation n − 2
there are 6, and so on. Counting 11 generations in the past we are down to a single
common ancestor, hence for this realization we have T^{(9)}_{MRCA} = 11.
[Figure 2. The genealogical structure in a population of size N = 12: arrows point from each individual in generation n (bottom) to its parent in generation n − 1, and so on upwards.]
Returning to the general case with a sample of size k in a population of size N we make the following observation: When each of k individuals in the sample selects a parent in the previous generation, then the probability of having no coalescence events at all (no pair, triple, etc, selecting the same parent) is given by

(1 − 1/N)(1 − 2/N) · · · (1 − (k − 1)/N) ≈ 1 − \binom{k}{2} (1/N).
Here, terms of order 1/N^2, etc, are ignored in the approximation. The nature of the
approximation is to ignore the event of having more than one pair coalescing in the
same generation and also the event of three or more individuals choosing the same
parent. This means that, for large N ,
pk = P(exactly one coalescence in the k-sample) ≈ \binom{k}{2} (1/N).

The formula can be understood as giving the total probability for coalescence resulting from the \binom{k}{2} distinct pairs that can be formed out of the k individuals, and each pair having the same parent with probability 1/N.
The interesting observation now is that we can decompose T^{(k)}_{MRCA} into the sum

T^{(k)}_{MRCA} = Σ_{j=2}^{k} Vj,  Vj = T^{(j)}_{MRCA} − T^{(j−1)}_{MRCA},  j ≥ 2,  T^{(1)}_{MRCA} = 0,
where Vj has the geometric distribution P(Vj = i) = (1 − pj)^{i−1} pj, i ≥ 1. In particular,

E(T^{(k)}_{MRCA}) = Σ_{j=2}^{k} E(Vj) = Σ_{j=2}^{k} 1/pj,
which yields the approximation
(5.5)  E(T^{(k)}_{MRCA}) ≈ Σ_{j=2}^{k} N/\binom{j}{2} = N Σ_{j=2}^{k} 2/(j(j − 1)) = 2N (1 − 1/k).
Hence for any reasonably large sample of individuals taken in a population of size N ,
the expected number of generations that has to be traced backwards in order to find
a common ancestor for the sample is approximately 2N . More exactly, with k = N ,
considering the ancestry of the whole population, we find E(T^{(N)}_{MRCA}) = 2(N − 1).
Note also that in the light of (5.5), the tree depicted in Figure 2 appears to be
nontypical. Indeed, since E(V2 ) = 1/p2 = N , typical ancestral trees are stretched
out toward the root. On average about half of the length of the tree corresponds to
the waiting time for two ancestors to coalesce into a single, common ancestor.
The by now familiar technique to approximate geometric distributions with ex-
ponential ones, reveals the following structure: If we measure time in units of N generations (so that time t corresponds to generation [N t]), then the time during which there are j distinct ancestors in the sample is approximately exponential with parameter \binom{j}{2}. This leads us to the coalescent,
which was introduced by Kingman in 1982. The coalescent is a random tree that
allows one to characterize ancestral relationships between genes in a sample when
the population size is reasonably large.
The probabilistic structure of Kingman’s coalescent is quite simple. If we start
with a sample of k individuals, then after a random time Ṽk, which is exponentially distributed with parameter \binom{k}{2}, two randomly chosen ancestral lineages coalesce, leaving k − 1 distinct lineages. The lineages continue coalescing in this way until we reach a single common ancestor for the sample. We thus obtain a sequence Ṽk, . . . , Ṽ2 of intercoalescence times that are independent and exponentially distributed with expected values E(Ṽj) = 1/\binom{j}{2}. The time to reach the most recent ancestor is the sum T̃^{(k)}_{MRCA} = Ṽk + · · · + Ṽ2 with mean 2(1 − 1/k) (compare (5.5)). If we let X(t) denote the number of lineages at time t, the standard coalescent process may also be viewed as the Markov pure death process in continuous time, {X(t), t ≥ 0}, with states E = {1, . . . , k}, initial value X(0) = k, and death intensities µi = \binom{i}{2}, 2 ≤ i ≤ k. The absorption time of {X(t)} is the random time T̃^{(k)}_{MRCA} with mean value 2(1 − 1/k) ≈ 2.
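The pure death process description translates directly into simulation code; a MATLAB sketch estimating the mean time to the most recent common ancestor for sample size k = 20 (an arbitrary choice):

k = 20; M = 1e4;                   % sample size and number of replicates
T = zeros(1, M);
for m = 1:M
    for j = k:-1:2
        T(m) = T(m) - log(rand)/nchoosek(j, 2);  % Exp holding time, j lineages
    end
end
mean(T)                            % close to 2*(1 - 1/k) = 1.9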
5.3. Some models for DNA sequences
The concept of the human genome coded as an ordered sequence of nucleotides rests
on the property that at any given site along the genome there is a predominant
nucleotide, shared by most individuals in the population. Over the time span of
many generations the nucleotides are subject to mutations and random variations
during reproduction. As a result the predominant nucleotide at a site may eventually
get replaced by another, an event referred to as a nucleotide substitution. The new
nucleotide will be the typical one at that site for say 100,000 generations until
another substitution takes place, and so on.
To construct a simple model of this evolution we begin with a simplified scenario
where the nucleotides are only classified as purines (A and G) or pyrimidines (C and
T). A site substitution results if a purine at that site is replaced by a pyrimidine, or
vice versa. Consider a population of size N and suppose that at a given site in each
generation and independent of earlier substitutions
P(a purine changes to a pyrimidine) = β/N
P(a pyrimidine changes to a purine) = γ/N,
where β, γ > 0 are substitution parameters. Letting X(n) be 0 if the nucleotide
at the site in generation n is a purine and 1 otherwise, this results in a two-state
Markov chain {X(n), n ≥ 0} with transition probability matrix
P =
  [ 1 − β/N      β/N    ]
  [   γ/N     1 − γ/N   ].
Now we consider the evolution over a time scale of approximately N t generations.
Using the new measure of time indexed by t ≥ 0, the time-changed process
Y^{(N)}(t) = X([N t]),  t ≥ 0,

captures the dynamics on the evolutionary time scale which is natural for nucleotide substitutions. Moreover, if we take h = 1/N and for some t denote n = [N t], then

P(Y^{(N)}(t + h) = 1 | Y^{(N)}(t) = 0) = P(X([N(t + h)]) = 1 | X([N t]) = 0)
  = P(X(n + 1) = 1 | X(n) = 0)
  = β/N = βh,
and similarly for a substitution of a pyrimidine to a purine. Hence we have motivated
that in the approximation of large populations, N → ∞, a reasonable model for
nucleotide substitutions is the continuous time Markov chain {Y (t), t ≥ 0} with
infinitesimal generator
Q =
  [ −β    β ]
  [  γ   −γ ].
According to Theorem 3.3, the stationary distribution for the fractions of purines
and pyrimidines in steady state is given by the solution π = (π0 , π1 ) of the system
of equations πQ = 0. The equations are −βπ0 + γπ1 = 0 and βπ0 − γπ1 = 0 and
thus

π = ( γ/(β + γ), β/(β + γ) ).
Returning to the original situation where we distinguish all four nucleotides, A,
G, C, and T, the corresponding model is a continuous time Markov chain with
state space E = {A, G, C, T } specified by an appropriate set of substitution intensi-
ties. Two typical models are the Jukes-Cantor model having infinitesimal generator
matrix

Q =
  [ −3α    α     α     α  ]
  [   α   −3α    α     α  ]
  [   α    α    −3α    α  ]
  [   α    α     α    −3α ]
and the (generalized) Kimura model defined by
Q =
  [ −α − 2β      α         β          β     ]
  [    α      −α − 2β      β          β     ]
  [    γ         γ      −α − 2γ       α     ]
  [    γ         γ         α       −α − 2γ  ]
where α, β, γ are the substitution intensities. It is left as an exercise to show that
the stationary distribution for the Jukes-Cantor model is
π = (1/4, 1/4, 1/4, 1/4)
and for the Kimura model
π = ( γ/(2(β + γ)), γ/(2(β + γ)), β/(2(β + γ)), β/(2(β + γ)) ).
According to Durrett [6], the frequencies of nucleotides A, G, C, T found empirically in strands of human mitochondrial DNA are (0.247, 0.302, 0.139, 0.313). This indicates that neither the Jukes-Cantor nor the three-parameter Kimura model is versatile enough to account for observed human DNA sequences.
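Both stationary distributions are quickly confirmed numerically by checking that πQ = 0. A MATLAB sketch for the Kimura model, with arbitrary substitution intensities:

alpha = 0.1; beta = 0.2; gamma = 0.3;   % arbitrary intensities
Q = [-alpha-2*beta  alpha          beta           beta;
      alpha        -alpha-2*beta   beta           beta;
      gamma         gamma         -alpha-2*gamma  alpha;
      gamma         gamma          alpha         -alpha-2*gamma];
p = [gamma gamma beta beta]/(2*(beta+gamma));
p*Q                                     % equals (0 0 0 0) up to rounding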
Number of segregating sites. Consider a large population of N individuals,
each represented by a sequence of L nucleotide sites. As an approximation we
suppose Wright-Fisher reproduction from one generation to the next and that sub-
stitutions occur independently at each site according to the Jukes-Cantor model.
Hence 3α is a per site mutation rate and 3αL becomes a locus mutation rate, both
measured on an approximate time scale of N t generations. Assuming that L is
large compared to N this can be thought of as an infinite alleles model, in the sense
that each substitution is likely to result in a new sequence not represented earlier
in the population. This is simply because the total number of possible nucleotide
combinations is 4^L, which is immensely large for typical values of L.
The probability of identity by descent can be found in this model in just the
same way as in the simplest Wright-Fisher case. Recall that a pair of individuals
is identical by descent if coalescence occurs before the first substitution in one of
the two sequences. The time T (2) until coalescence is approximately exponential
with mean value 1 and the time τ , say, until the first substitution takes place is
approximately exponential with intensity 2 · 3αL, again using rescaled time N t.
Hence
P(IBD) = P(T^{(2)} ≤ τ) ≈ 1 − ∫_0^∞ e^{−t} 6αL e^{−6αLt} dt = 1/(1 + 6αL).
The drawback of the infinite alleles assumption is that two sequences are consid-
ered distinct regardless of the degree to which they differ. In contrast, the infinite
sites model provides a refined measure of how close two or several sequences are to
each other. Again sequences of length L are subject to mutations with rate u per
site (so u = 3α in the Jukes-Cantor model). As in the infinite alleles model we are
ignoring the possibility that a substitution at a site at a later time is followed by a
reversed substitution at the same site, but now we keep record of the number of sites
where two sequences differ. More generally, in a sample consisting of n sequences
we put
Sn = the number of sites where at least two sequences differ.
This is the number of segregating sites in a sample of size n.
To analyze this quantity we refer to the complete coalescence process, starting
with n loci and working backwards in time marking successive coalescence events.
At each event the number of lineages shrinks by one until eventually the common
ancestor has been traced. It was found earlier that the time Vj during which there are
j lineages in the coalescent is approximately exponential with mean E(Vj) = 1/\binom{j}{2}.
Using the random variables {Vj }, the total length of all branches in the coalescence
tree (or the total time in the tree) can be expressed as the sum
(5.6)  Ttot = Σ_{j=2}^{n} j Vj
so the expected total time in the tree is
E(Ttot) = Σ_{j=2}^{n} j E(Vj) = Σ_{j=2}^{n} 2/(j − 1) = 2 (1 + 1/2 + · · · + 1/(n − 1)).
Now we combine the above observations with the property of the Jukes-Cantor
model that the successive substitution events at a given site form a Poisson process
with intensity u = 3α. The totality of substitutions in a given sequence is therefore a
Poisson process with intensity uL (recall that sites were supposed to be independent).
Thus each lineage in the coalescent is subject to substitution events with intensity uL.
To visualize, think of the coalescent as given and let independently Poisson events
occur with constant intensity along all branches of the coalescent tree with each
event marking one nucleotide substitution. The total number of such substitutions
in the tree is given by N (Ttot ), where {N (t)} is a Poisson process with intensity
uL and Ttot as in (5.6). But this number must also be the same as the number of
segregating sites. Since N (t) and Ttot are independent, we have
E(Sn) = 3αL E(Ttot) = 6αL (1 + 1/2 + · · · + 1/(n − 1)).
In principle the above theory can be used as a basis for statistical estimation of the
parameter α. If a count Sn∗ is available from data measurements, then
α∗ = Sn∗ / ( 6L (1 + 1/2 + · · · + 1/(n − 1)) )

is a point estimate of α.
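For instance, with a hypothetical count of Sn∗ = 25 segregating sites observed in a sample of n = 10 sequences of length L = 1000 (made-up numbers, for illustration only):

n = 10; L = 1000; Sobs = 25;            % hypothetical data
alphaHat = Sobs/(6*L*sum(1./(1:n-1)))   % approximately 1.5e-3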
Patterns in DNA sequences. We close this section with an application to DNA
sequences of the theory of recurrence times in discrete Markov chains, Theorem 2.19.
In this section so far we have considered Markov models for the time evolution of
DNA sequences. In this example we change view point and apply a Markov chain
model to the (spatial) evolution of the sequence as a string of letters as it is read
site by site. Hence {Xn } will be a Markov chain with state space {A, G, C, T } and
the “time” index n is a site sequence number. The transition probability matrix is
P =
  [ pA  pG  pC  pT ]
  [ pA  pG  pC  pT ]
  [ pA  pG  pC  pT ]
  [ pA  pG  pC  pT ]
so that in each step the probabilities are pA , pG , pC , pT that the next letter is A, G, C
or T . It is easily seen that the stationary distribution is also given by these same
probabilities.
Now think of a restriction enzyme reading along the DNA and cutting every
time it sees the “word” AA. We want to find the average size of the DNA frag-
ments formed in this manner. In other words, what is the typical number of steps
required between two occurrences of the combination AA? To simplify the analysis
we construct another Markov chain {Yn }, based on {Xn }, as follows. Let the state
of the Y -sequence be A if X sees the first letter of the word, let it be AA if X has
just observed the word AA in its last two steps, or let Y take the value B if X
sees anything else than A or AA. Ordering the state space as E = {A, B, AA}, the
transition probability matrix of {Yn } becomes
P =
  [  0    1 − pA    pA ]
  [ pA    1 − pA     0 ]
  [  0    1 − pA    pA ]

(from state AA another letter A keeps the chain in AA, since the last two letters are then still AA).
Now we can find the stationary distribution
π = ( pA(1 − pA), 1 − pA, pA^2 )
and so, by Theorem 2.19
the average length of an AA-fragment is 1/π3 = 1/pA^2.
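A direct simulation of an i.i.d. letter sequence confirms this mean spacing; a MATLAB sketch with the arbitrary choice pA = 0.25:

pA = 0.25;
s = rand(1, 1e6) < pA;                  % site j carries an A with probability pA
hits = find(s(1:end-1) & s(2:end));     % sites where the word AA is completed
mean(diff(hits))                        % close to 1/pA^2 = 16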
5.4. Recovering the genome from fragments
Because of technical limitations it is not possible to sequence very long pieces of
DNA all at once. Typically the sequenced DNA strings are restricted in length
to about 500 base pairs. The technique known as shotgun sequencing amounts to
assembling such fragments so that they cover much longer pieces of the genome, of
size 100,000 bases, say. We discuss a simple probabilistic model for this method of
recovery of DNA from overlapping fragments, for details and references see Ewens
and Grant [7], Ch 5.
Assume there are N fragments each of length L bases taken from various parts of
copies of a DNA chain of length G bases, and that L is much smaller than G. In order
to study shotgun sequencing we apply the stochastic model that the fragments are
picked randomly and independently over the full length of the DNA. More exactly,
we make a continuous approximation and assume that the left-end of each fragment
is uniformly distributed in (0, G) and that the positions of different fragments are
independent of each other. The result of selecting N fragments in this way is that
a part of the DNA will be covered by a collection of overlapping fragments, contigs,
separated by intervals of DNA that each of the fragments missed. The bases in
between contigs will remain unsequenced. For 0 ≤ x ≤ G, let
X(x, x + h) = number of fragments starting at a location in (x, x + h).

If we now take 0 ≤ x ≤ G − h, in order to avoid the effect of the boundary for
x close to G, then the random variable X(x, x + h) has the Binomial distribution
Bin(N, h/G). We will assume that N is relatively large and h is relatively small,
and apply the Poisson approximation of the Binomial distribution. The conclusion
is that the distribution of X(x, x + h) is the same for any x and approximately a
Po(N h/G) distribution. In particular, X(x, x+L), the number of fragments starting
in an interval of length L, is approximately distributed as a Poisson random variable
with expected value given by the parameter
a = N L/G = (total length of fragments)/(total length of DNA),
which we call the coverage of the fragments.
Based on this observation we can immediately find the mean proportion of the
DNA string that is covered by contigs. In fact, this mean proportion is the same
as the probability that a randomly chosen point in (0, G) is covered by at least one
fragment. But if we select a random point in (0, G − L) and call it y, then the
number of fragments covering y is the random variable X(L) = X(y − L, y). Clearly,
X(L) has the same distribution as X(y, y + L), which we know has an approximate
Poisson(a) distribution. Hence the desired probability is
P(X(L) ≥ 1) = 1 − P(X(L) = 0) ≈ 1 − e^{−a}.

Next we find the mean number of contigs. Since each contig has a unique rightmost fragment, we have

number of contigs ∈ Bin(N, q),

where

q = P(a given fragment is the rightmost member of a contig).

Now,

q = P(no other fragment has its leftmost point on the given fragment) = e^{−a}

and so

mean number of contigs = N e^{−a} = N e^{−N L/G}.
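Plugging in numbers of the magnitudes mentioned at the start of this section gives a feel for these formulas; a MATLAB sketch with G = 100,000 bases, L = 500 and N = 1000 fragments (arbitrary illustration values):

G = 1e5; L = 500; N = 1000;    % genome length, fragment length, fragments
a = N*L/G                      % coverage, here a = 5
1 - exp(-a)                    % mean proportion covered, about 0.993
N*exp(-a)                      % mean number of contigs, about 6.7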
5.5. Pairwise alignment methods
Consider two sequences of a common origin. This could be nucleotide sequences rep-
resenting DNA or amino acid sequences representing proteins. At the time in the
past of the most recent common ancestor the sequences were identical but over time
they have been exposed to evolutionary forces. Substitutions may have changed
the residue at some sites causing mismatches of the sequences. In addition, other
mechanisms of insertions or deletions, indels for short, may also have changed the ap-
pearance of the sequences as they are observed today. Pairwise alignment techniques
aim at finding the most likely explanation of substitutions and indels to account for
the patterns of differences found in such polymorph forms of the sequences.
For example, observing two DNA strings of the forms
aagcttaa and aacctta
it is reasonable to believe that a substitution between g and c has taken place at
site three and an indel of residue a has caused the further difference between the
sequences.
The same techniques are applicable even if we disregard the ancestry of the
sequences and just want to search for similarity in terms of the number of matching
residues of two sequences. This is natural in situations such as comparing a newly
sequenced genome with data base material.
We give a first example of a pairwise alignment scoring method by expanding
on the above example of DNA substitutions. Suppose matches receive a score of
+2, mismatches a score of +1, and gaps caused by inserting a blank space instead
of a character in one of the sequences receive a penalty −1. The total score of an
alignment is obtained by summing the individual scores of each pair. In the above
example we find that the score of the alignment
a a g c t t a a
a a c c t t − a
is 2 + 2 + 1 + 2 + 2 + 2 − 1 + 2 = 12. The alternative alignment
a a g c t t a a
a a − c c t t a
has total score 2 + 2 − 1 + 2 + 1 + 2 + 1 + 2 = 11 and should therefore be considered
somewhat less likely under the given scoring scheme. In this way we can go through
all possible alignments and select the one with maximum score (the one above with
score 12). It is clear already from this simple example that the choice of scores for
match, mismatch and gap will affect the result greatly. Moreover, the naive approach
of listing all possible alignments and selecting the one (or those) with maximal score
will only work for very short sequences. For sequences of length 1000, say, the sheer
number of possibilities makes it computationally not feasible to try to examine all
alignments separately.
The scoring systems normally applied to amino acid sequences are based on
empirical findings and biological knowledge of which transitions between the 20
different amino acids that are more or less likely. The resulting scores are collected
in a 20 × 20 (symmetric) matrix with one row and one column representing each
A R N D C Q E G H I L K M F P S T W Y V
A 4 −1 −2 −2 0 −1 −1 0 −2 −1 −1 −1 −1 −2 −1 1 0 −3 −2 0
R −1 5 0 −2 −3 1 0 −2 0 −3 −2 2 −1 −3 −2 −1 −1 −3 −2 −3
N −2 0 6 1 −3 0 0 0 1 −3 −3 0 −2 −3 −2 1 0 −4 −2 −3
D −2 −2 1 6 −3 0 2 −1 −1 −3 −4 −1 −3 −3 −1 0 −1 −4 −3 −3
C 0 −3 −3 −3 9 −3 −4 −3 −3 −1 −1 −3 −1 −2 −3 −1 −1 −2 −2 −1
Q −1 1 0 0 −3 5 2 −2 0 −3 −2 1 0 −3 −1 0 −1 −2 −1 −2
E −1 0 0 2 −4 2 5 −2 0 −3 −3 1 −2 −3 −1 0 −1 −3 −2 −2
G 0 −2 0 −1 −3 −2 −2 6 −2 −4 −4 −2 −3 −3 −2 0 −2 −2 −3 −3
H −2 0 1 −1 −3 0 0 −2 8 −3 −3 −1 −2 −1 −2 −1 −2 −2 2 −3
I −1 −3 −3 −3 −1 −3 −3 −4 −3 4 2 −3 1 0 −3 −2 −1 −3 −1 3
L −1 −2 −3 −4 −1 −2 −3 −4 −3 2 4 −2 2 0 −3 −2 −1 −2 −1 1
K −1 2 0 −1 −3 1 1 −2 −1 −3 −2 5 −1 −3 −1 0 −1 −3 −2 −2
M −1 −1 −2 −3 −1 0 −2 −3 −2 1 2 −1 5 0 −2 −1 −1 −1 −1 1
F −2 −3 −3 −3 −2 −3 −3 −3 −1 0 0 −3 0 6 −4 −2 −2 1 3 −1
P −1 −2 −2 −1 −3 −1 −1 −2 −2 −3 −3 −1 −2 −4 7 −1 −1 −4 −3 −2
S 1 −1 1 0 −1 0 0 0 −1 −2 −2 0 −1 −2 −1 4 1 −3 −2 −2
T 0 −1 0 −1 −1 −1 −1 −2 −2 −1 −1 −1 −1 −2 −1 1 5 −2 −2 0
W −3 −3 −4 −4 −2 −2 −3 −2 −2 −3 −2 −3 −1 1 −4 −3 −2 11 2 −3
Y −2 −2 −2 −3 2 −1 −2 −3 2 −1 −1 −2 −1 3 −3 −2 −2 2 7 −1
V 0 −3 −3 −3 1 −2 −2 −3 −3 3 1 −2 1 −1 −2 −2 0 −3 −1 4
Table 1. BLOSUM62 Substitution matrix for amino acids
amino acid. Each diagonal entry in the matrix gives the score for a matching of
the corresponding amino acid. The non-diagonal entries list the scores for each
possible mismatch of a character with one of the 19 other characters. One commonly
used substitution matrix or scoring matrix is called BLOSUM62. This matrix is
reproduced in Table 1. The scores in BLOSUM matrices are obtained by rounding
off to integers certain log-likelihood ratios of estimated substitution rates found in
empirical sequence data.
Before using the scoring matrix we need to include the gap penalty in the model.
The simplest assumption is that of a linear gap penalty of the form −dg to be
subtracted from the score for each gap of length g in between two pairs of amino
acids in the alignment, so that d is the cost of inserting a blank character. Typical
values are d = 4 or d = 8.
As an example we apply the BLOSUM62 substitution matrix with a gap penalty
of d = 4 to see that the score of the alignment
R D I S L V − − − K N A G I
R N I − L V S D A K N V G I
adds up to a total of
5 + 1 + 4 − 4 + 4 + 4 − 3 × 4 + 5 + 6 + 0 + 6 + 4 = 23.
Next we discuss the so called dynamic programming approach to pairwise align-
ments, using what is known as a Needleman-Wunsch algorithm. Given two sequences
of amino acids of lengths ℓ1 and ℓ2 , ℓ1 ≤ ℓ2 , we want to find an alignment with at
least ℓ2 − ℓ1 inserted gaps such that the total score calculated from a given substi-
tution matrix is maximized.
The algorithm can be performed using some auxiliary matrices as we discuss
next. We begin by letting the two sequences to be aligned define the rows and the
columns in an ℓ2 × ℓ1 matrix and fill the new matrix with the scores s(i, j) taken from the substitution matrix, resulting from pairing the amino acid of each row i with the amino acid corresponding to each column j. In the example case of
starting with the sequences QGLK and QGKLLK, and using BLOSUM62, we find
Q G L K
Q 5 −2 −2 1
G −2 6 −4 −2
K 1 −2 −2 5
L −2 −4 4 −2
L −2 −4 4 −2
K 1 −2 −2 5
We denote the elements in this matrix
B(i, j), 1 ≤ i ≤ ℓ2 , 1 ≤ j ≤ ℓ1 .
Next we need to initialize the algorithm by noting that each time we make an indel
in one of the sequences the score will be reduced by the amount d. In the example,
using d = 4, we insert this as follows:
Q G L K
0 −4 −8 −12 −16
Q −4 5 −2 −2 1
G −8 −2 6 −4 −2
K −12 1 −2 −2 5
L −16 −2 −4 4 −2
L −20 −2 −4 4 −2
K −24 1 −2 −2 5
We include these extra elements in the matrix by extending the indexing as
B(i, j), 0 ≤ i ≤ ℓ2 , 0 ≤ j ≤ ℓ1 .
Now we are going to search for an optimal alignment by changing the entries
{B(i, j)} systematically, starting with B(1, 1) in the upper left corner and then
moving towards the lower right corner. The update rule is to let in each step B(i, j)
be the maximum of three numbers known from the previous step, namely
(5.7) B(i, j) = max{B(i − 1, j − 1) + s(i, j), B(i − 1, j) − d, B(i, j − 1) − d}
If we fill out in this manner the first row, for amino acid Q, in the example this
gives us the modified matrix
Q G L K
0 −4 −8 −12 −16
Q −4 5 1 −3 −7
G −8 −2 6 −4 −2
K −12 1 −2 −2 5
L −16 −2 −4 4 −2
L −20 −2 −4 4 −2
K −24 1 −2 −2 5
The resulting modified score matrix after updating all elements in the same way is
Q G L K
0 → −4 → −8 → −12 → −16
↓ ց
Q −4 5 → 1 → −3 → −7
↓ ↓ ց
G −8 1 11 → 7 → 3
↓ ↓ ↓ ց ց
K −12 −3 7 9 12
↓ ↓ ↓ ց ↓
L −16 −7 3 11 8
↓ ↓ ↓ ց ↓ ց
L −20 −11 −1 7 9
↓ ↓ ↓ ↓ ց
K −24 −15 −5 3 12
In this final matrix we have also as customary indicated by arrows which of the
three choices in (5.7) that gave rise to the new entry. In this way one can trace
the algorithm backwards and read off the optimum alignment(s) that led to the
maximum score. In the example the maximum score is 12 and the two different
paths of arrows leading to the lower right cell show that both of the alignments
Q G − L − K Q G − − L K
Q G K L L K Q G K L L K
are optimal with total score 12.
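The update rule (5.7) takes only a few lines of code. The following MATLAB sketch (the function name and argument conventions are our own) fills in the score matrix B for two strings, a numeric substitution matrix S indexed by a symbol alphabet, and a linear gap cost d:

function B = nwscore(s1, s2, S, alphabet, d)
% Needleman-Wunsch score matrix computed with the update rule (5.7);
% s1 labels the columns and s2 the rows, as in the example above.
l1 = length(s1); l2 = length(s2);
B = zeros(l2+1, l1+1);
B(1, :) = -d*(0:l1);                  % initialization: leading gaps
B(:, 1) = -d*(0:l2)';
for i = 1:l2
    for j = 1:l1
        s = S(alphabet == s2(i), alphabet == s1(j));
        B(i+1, j+1) = max([B(i, j) + s, ...    % pair s2(i) with s1(j)
                           B(i, j+1) - d, ...  % s2(i) against a gap
                           B(i+1, j) - d]);    % s1(j) against a gap
    end
end

For the DNA scores used at the beginning of this section (match +2, mismatch +1, gap penalty 1) the call nwscore('AAGCTTAA', 'AACCTTA', ones(4)+eye(4), 'ACGT', 1) produces a matrix whose lower right entry B(end, end) equals 12, the maximal score found by hand earlier.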
5.6. Hidden Markov models
A natural step after having discussed pairwise alignment methods is to turn to
multiple alignments. One feature of multiple alignments is that gaps tend to line
up with each other, leaving blocks of ungapped regions without indels in any of
the sequences. It is interesting for example to try to verify that a new sequence
belongs to a known family of data base sequences by using such statistical features
of the family in the search. One approach to modeling multiple sequences of protein
families is based on hidden Markov models.
A hidden Markov model (HMM) has two main components. First of all an under-
lying discrete time, irreducible Markov chain on a finite state space. The additional
aspect is that each time the Markov chain visits one of its states another random
mechanism sets in and emits a symbol chosen from a finite alphabet. The probabil-
ity distribution used for selecting symbols typically depend on the underlying state.
The output from the model is the sequence of emitted symbols, whereas the transi-
tions of the Markov chain are “hidden” from an observer. Conceptually the Markov
chain jumps around on a skeleton of states or nodes. In one node the emission of
a certain symbol may be much more likely than the emission of the same symbol
from another node. Hence the sequence of emitted symbols from the HMM carries
indirectly information about the hidden part of the dynamics.
Example 5.1. In this standard example the Markov chain switches between two
states {A, B}, described in the usual way by a 2 × 2 transition probability matrix P.
The alphabet is {H, T } for Head and Tail of a coin flip, and the emission probabilities
are represented by either a fair coin with fifty-fifty chance for H or T in state A, or
a biased coin with probabilities p and q for H or T in state B. In the case p > 1/2,
an output sequence of the form
HHT T T T HT HHHT T T HHHHHHHHHHHHT HHHHHHHHHT T HHT T H
would suggest that the Markov chain started in A and remained there for approxi-
mately 15 time steps, then visited B for another 20 steps or so, and then returned to
A. Of course, such inference on the behavior of the Markov chain can only be stated
in a statistical sense. Possibly we are just observing a fair coin with some unusually
long sequences of successive heads. To get a sense of what is likely and not likely
in this example we apply a piece of classical probability theory. De Moivre studied
patterns in independent sequences as early as 1738, see Blom, Holst, Sandell [2].
He showed among many other things that if we let N be the number of coin flips
until for the first time either r heads or r tails come up in sequence, then
E(N) = 2 + 2^2 + 2^3 + · · · + 2^r.
To get r = 12 heads as in the above example, one would thus expect to do on the
average 8 190 coin flips.
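Generating output from such a model takes only a few lines; a MATLAB sketch in which the transition matrix and the bias p = 0.8 of the second coin are arbitrary illustration values:

P = [0.95 0.05; 0.10 0.90];   % hidden transitions on the states {A, B}
pH = [0.5 0.8];               % P(H): fair coin in A, biased coin in B
n = 50; s = 1;                % start the hidden chain in state A
out = blanks(n); symbols = 'HT';
for t = 1:n
    out(t) = symbols(2 - (rand < pH(s)));  % emit H or T
    s = 1 + (rand > P(s, 1));              % hidden state transition
end
disp(out)                     % only the emitted symbols are observed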
Example 5.2. It is a common feature of DNA strings that separate regions differ in
the composition of nucleotides. In one segment perhaps A and T are most common,
in another segment all four nucleotides appear to be equally frequent, and in a third
segment of the DNA chain C and G occur most regularly. A hidden Markov model
provides a framework for modeling such patterns in observed data. For this example
we may assume that there exists an underlying Markov chain with three states
{1, 2, 3}, which represents the current type of segment as we move along the DNA
chain. As long as the Markov chain is in state 1, successive nucleotides are generated
with the emission probabilities p1 (A) = p1 (T ) = 0.3 and p1 (C) = p1 (G) = 0.2.
During periods when the Markov chain visits state 2 the emission probabilities
change into p2 (A) = p2 (T ) = p2 (C) = p2 (G) = 0.25, and during visits of state 3
into p3 (A) = p3 (T ) = 0.1 and p3 (C) = p3 (G) = 0.4. The Markov chain is hidden
in the sense that we are unable to observe the state transitions, only the resulting
sequence of emitted nucleotides.
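Given an observed string, the most likely hidden segmentation in a fully specified model of this kind can be computed with the Viterbi algorithm discussed at the end of this section. A MATLAB sketch, in which the transition matrix and the test string are arbitrary choices:

P = [0.98 0.01 0.01; 0.01 0.98 0.01; 0.01 0.01 0.98];  % hypothetical
E = [0.3  0.2  0.2  0.3;       % emission probabilities for A, C, G, T
     0.25 0.25 0.25 0.25;      % in the hidden states 1, 2 and 3
     0.1  0.4  0.4  0.1];
seq = 'ATTAGCGCGGCATT';        % an arbitrary observed string
obs = arrayfun(@(c) find('ACGT' == c), seq);
n = length(obs); K = 3;
V = zeros(K, n); B = zeros(K, n);
V(:, 1) = log(1/3) + log(E(:, obs(1)));      % uniform initial state
for t = 2:n
    for k = 1:K
        [v, B(k, t)] = max(V(:, t-1) + log(P(:, k)));
        V(k, t) = v + log(E(k, obs(t)));
    end
end
st = zeros(1, n);
[~, st(n)] = max(V(:, n));                   % backtrack the best path
for t = n-1:-1:1, st(t) = B(st(t+1), t+1); end
st                                           % most likely hidden states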
Returning to the application of HMM to protein families, we fix a length param-
eter ℓ and take the set
E = {m0 , . . . , mℓ } ∪ {i0 , . . . , iℓ−1 } ∪ {d1 , . . . , dℓ−1 }.
as state space of the Markov chain. These are match states, insert states and delete
states. The chain always starts in initial state m0 . The index numbers correspond
to a location number along the sequence, counting insertions and deletions. The
alphabet consists of the 20 amino acids plus one extra symbol † for delete. If the
Markov chain is in state mj the HMM emits a symbol for one of the amino acids
according to a given probability distribution
pj (a) = P (observing amino acid a in position j).
Similarly, if the Markov chain is in the insert state ij a symbol is emitted representing
the insertion of an amino acid and this time the symbol is drawn using a probability
distribution qj (a). Often one would simply assume that insertions are uniform, that
is qj (a) = 1/20 for each symbol a. Finally, in each state dj the output is always †.
To complete the description of the hidden Markov model it remains to specify the
transition probabilities of the underlying Markov chain. The following is a typical
choice. Assume transitions from match states are given by
P(mj → ij) = ǫ,  P(mj → dj+1) = δ,  P(mj → mj+1) = 1 − ǫ − δ,  0 ≤ j ≤ ℓ − 2,
and
P (mℓ−1 → iℓ−1 ) = ǫ P (mℓ−1 → mℓ ) = 1 − ǫ.
For the insert states put
P (ij → ij ) = ǫ P (ij → mj+1 ) = 1 − ǫ, 0 ≤ j ≤ ℓ − 1.
Finally, the possible transitions from delete states are given by
P(dj → dj+1) = δ,  P(dj → mj+1) = 1 − δ,  1 ≤ j ≤ ℓ − 2,  and P(dℓ−1 → mℓ) = 1.
The parameters of a HMM are the emission probability distributions pj and qj
and the insertion and deletion probabilities ǫ and δ. In terms of the parameters of
the model it is now possible to write down the probability of any particular sequence
generated by the HMM. For example, let us take ℓ = 5 and assume that the Markov
chain while successively visiting the states m0 m1 m2 i2 i2 m3 d4 m5 (after starting in m0 )
emits the sequence of symbols a1 a2 a3 a4 a5 †a6 . The corresponding probability is
P(m0 m1 m2 i2 i2 m3 d4 m5 = a1 a2 a3 a4 a5 † a6)
  = (1 − ǫ − δ)^2 ǫ^2 (1 − ǫ) δ · (1/20^2) · p1(a1) p2(a2) p3(a5) p5(a6).
Example 5.3. We assume that the parameters of the model have been fixed. Each
output sequence is the result of a corresponding trajectory of the underlying Markov
chain. Hence the path m0 m1 m2 i2 i2 m3 d4 m5 may have produced the output symbols
GQHHA†A and the path m0 m1 d2 m3 m4 m5 the symbols G†AGA. The alignment
induced in this way is found by aligning positions that were generated by the same
match state:
m0  m1  m2  i2  i2  m3  d4  m5
     G   Q   H   H   A   †   A

     G   †   A   G   A
m0  m1  d2  m3  m4  m5
This leads to the alignment
G Q H H A − A
G − − − A G A
A typical problem in HMM theory is to start with a given family of protein
sequences and try to select the parameters of the model so that output sequences
of the same length with high probability are “close” to the given ones, meaning
that most of the symbols match. The known family of sequences is supposed to
be correctly aligned and is therefore used as “training” data for the HMM in the
procedure of actually fitting the parameters. While a particular choice of parameters
may lead to output sequences that are all very similar, another choice, however, may
result in highly varying outputs. It is not surprising, therefore, that such parameter
estimation algorithms discussed in the HMM research literature are quite complex.
They are based on statistical methods that have been devised for analyzing an
observed string of DNA or protein with the goal of estimating where along the
chain it is most likely that hidden transitions occur. In practice, given an output
sequence from the hidden Markov model, the most likely path to have produced the
output is found by using the Viterbi algorithm. This technique yields the particular
state sequence which has the highest conditional probability to have occurred given
the observed symbol output. Using this information for a number of sequences the
multiple alignment can be found. See Durbin et al. [5] for a detailed account of
these ideas, and also Ewens and Grant [7] for further examples and discussion.
Although we have referred mainly to protein sequences in the previous discussion
the same arguments are applicable to DNA strings. We give an example of multiple
alignments.
Example 5.4. Suppose that we have properly selected parameters of a HMM of
length ℓ = 5 to match the following given family of DNA sequences:
G C G A G
G C G G G
G − G G G
G T G G G
Having observed the new sequence GGAAG we are faced with the problem of de-
termining which of all possible alignments is the most likely one. The alignment
GGAAG has probability
P(m1 m2 m3 m4 m5 = GGAAG) = (1 − ǫ − δ)^4 (1 − ǫ) p1(G) p2(G) p3(A) p4(A) p5(G).
The alternative alignment G†GA(insertA)G, which corresponds to breaking up the
given family between the last two bases, gives
P(m1 d2 m3 m4 i4 m5 = G†GAAG)
  = (1 − ǫ − δ)^2 δ (1 − δ) ǫ (1 − ǫ) · (1/4) · p1(G) p3(G) p4(A) p5(G),
and so on. Select the alignment which has the largest probability. In practice more
refined methods may be used at this point, such as maximizing properly chosen
likelihood ratios rather than plain probabilities.
Solved exercises
1. Solution.
2. Solution.
3. Solution.
4. Solution.
Bibliography
[1] C. Bertsekas, R. Gallager, Data networks, 2nd Ed, Prentice Hall, Englewood Cliffs
NJ, 1992.
[2] G. Blom, L. Holst, D. Sandell, Problems and snapshots from the world of probability,
Springer-Verlag, New York, 1994.
[3] P. Brémaud, Markov chains; Gibbs fields, Monte Carlo simulation, and queues,
Springer-Verlag, New York, 1999.
[4] M. Denny, S. Gaines, Chance in biology; using probability to explore nature, Prince-
ton University Press, Princeton 2002.
[5] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological sequence analysis, proba-
bilistic models of proteins and nucleic acids, Cambridge University Press, Cambridge,
1998.
[6] R. Durrett, Probability models for DNA sequence evolution, Springer-Verlag, New
York, 2002.
[7] W.J. Ewens, G.R. Grant, Statistical methods in bioinformatics, an introduction,
Springer-Verlag, New York, 2001.
[8] B. Fristedt, N. Jain, N. Krylov, Filtering and Prediction: A Primer. AMS Student
Mathematical Library, Vol 38. American Mathematical Society 2007.
[9] R. Gaigalas, I. Kaj, Stochastic simulation using MATLAB, tutorial and code available
at http://www.math.uu.se/research/telecom/software, last updated Dec 2005.
[10] G.R. Grimmett, D.R. Stirzaker, Probability and random processes, 2nd Ed, Oxford
Science Publications, Oxford, 1992.
[11] A. Gut, An intermediate course in probability, Springer, New York, 1995.
[12] P.G. Harrison, N.M. Patel, Performance modelling of communication networks and
computer architectures, Addison-Wesley, Reading MA, 1993.
[13] P.G. Hoel, S.C. Port and C.J. Stone, Introduction to stochastic processes, Houghton
Mifflin Co., Boston 1972.
[14] I. Kaj, Stochastic modeling in broadband communications systems, SIAM Mono-
graphs in Mathematical Modeling and Computation 8, SIAM Philadelphia PA, 2002.
[15] J.F. Kurose, K.W. Ross, Computer networking, a top-down approach featuring the
Internet, Addison-Wesley, Boston MA, 2001.
[16] O. Machek, J. Hnilica, A stochastic model of corporate lifespan based on corporate
credit ratings, Int. J. Eng. Bus. Manag. 5:45 (2013).
[17] E. Renshaw, Modelling biological populations in space and time, Cambridge Univer-
sity Press, Cambridge, 1991.
[18] S.I. Resnick, Adventures in stochastic processes, Birkhäuser, Boston, 1992.
[19] M. Schwartz, Telecommunication networks: protocols, modeling and analysis,
Addison-Wesley, Reading MA, 1987.
[20] T. Söderström, Discrete-time stochastic systems, 2nd Ed., Springer-Verlag, London,
2002.
[21] R. Wolff, Stochastic modeling and the theory of queues, Prentice-Hall, Englewood
Cliffs NJ, 1989.