Ikaj Stochmod Lectnotes
Chapter 1. Introduction 1
1.1. Brief review of Probability Theory 2
1.2. Stochastic Processes 7
1.3. Simulation of stochastic processes 17
1.4. The life-length process 19
1.5. Reliability of systems 22
1.6. The Poisson process 25
1.7. Properties of the Poisson process 26
Solved exercises 31
Chapter 2. Markov Chain Models, discrete time 37
2.1. The Markov property 37
2.2. Stationary distribution and steady-state 41
2.3. First return and first passage times 45
Solved exercises 47
Chapter 3. Continuous time Markov chains 55
3.1. Birth-and-death processes 58
3.2. Credit rating models 62
Solved exercises 64
Chapter 4. Some non-Markov models 67
4.1. Renewal models 67
4.2. Renewal reward processes 69
4.3. Reliable data transfer 70
4.4. Time series models 72
4.5. Autoregressive and moving average processes 75
4.6. Statistical methods 80
Solved exercises 83
Introduction
Everywhere in science and technology we see random change and patterns of varia-
tion due to chance. Be it
- changes in weather or climate;
- spread of disease;
- traffic congestion;
- noise or shut-down in communication systems;
- quality variation in a manufacturing line;
- the intrinsic randomness in the evolution of all living species,
or, something else. This course gives an introduction to mathematical ideas, tools,
and models, which are widely used to study some of the basic mechanisms of ran-
domness in such systems. The course notes aim at providing a semi-rigorous math-
ematical introduction to a collection of models based on the theory of stochastic
processes and to give a number of modern examples from various areas of engineer-
ing and natural sciences where these models come to use.
The selection of material starts at a level corresponding to a standard course
in introductory probability and statistics but does not assume prior knowledge of
stochastic processes. Following a brief review of tools from probability theory, we
introduce basic concepts and ideas of stochastic processes for studying stochastic
models and give introductory examples of random processes. This is followed by a
discussion on how to instruct a computer to simulate random behavior. We discuss
briefly the theoretical background of computer simulation of stochastic processes and
look at some examples of simulation code. Additional example code and simulation
output are scattered throughout the text. Then we present two basic stochastic
models in more detail. First the life-length process, which describes events that
occur after a random waiting time. One may think of the failure time of a mechanical
device or the life-length of a human being as examples. Second, the Poisson process
is the fundamental model for a sequence of events which occur randomly in time and
independently of each other. In the subsequent chapters we cover Markov chains in
discrete and continuous time, renewal processes, and time series models. Variations
of the basic models appear in case studies. Throughout, we provide theory, examples and exercises to help the reader work with these models.
Here, the expected value E(X) and the variance V (X) of X, when they exist, are
the standard means of measuring “average value” and “average squared variation” of
random variables. Suppose that we also have access to measurement data x1 , . . . , xn ,
which we believe represent “typical, independently chosen, values of X”. Then
x1 , . . . , xn is a sample of observations of size n from X and we may apply statistical
estimators, such as
Sample mean: $\hat{\mu} = \bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$.

Sample variance: $\hat{\sigma}^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.
Next, we collect some standard probability distributions, writing p(k) for probability
function, f (x) for density function, µ for expected value and σ 2 for variance.
Discrete distributions

Binomial: $X \in \mathrm{Bin}(n, p)$ if $p(k) = \binom{n}{k} p^k q^{n-k}$, $k = 0, 1, 2, \dots, n$;
$0 \le p \le 1$, $q = 1 - p$, $\mu = np$, $\sigma^2 = npq$.

Geometric: $X \in \mathrm{Ge}(p)$ if $p(k) = p q^k$, $k = 0, 1, 2, \dots$;
$0 < p \le 1$, $q = 1 - p$, $\mu = \dfrac{q}{p}$, $\sigma^2 = \dfrac{q}{p^2}$.

Poisson: $X \in \mathrm{Po}(\mu)$ if $p(k) = \dfrac{\mu^k}{k!} e^{-\mu}$, $k = 0, 1, 2, \dots$; $\mu > 0$, $\sigma^2 = \mu$.

Continuous distributions

Uniform: $X \in \mathrm{Re}(a, b)$ if $f(x) = \dfrac{1}{b-a}$, $a \le x \le b$; $\mu = \dfrac{a+b}{2}$, $\sigma^2 = \dfrac{(b-a)^2}{12}$.

Exponential: $X \in \mathrm{Exp}(\lambda)$ if $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$, $\lambda > 0$; $\mu = \dfrac{1}{\lambda}$, $\sigma^2 = \dfrac{1}{\lambda^2}$.

Gamma distribution: $X \in \Gamma(n, \lambda)$ if $f(x) = \dfrac{\lambda^n x^{n-1}}{(n-1)!} e^{-\lambda x}$, $x \ge 0$, $\lambda > 0$; $\mu = \dfrac{n}{\lambda}$, $\sigma^2 = \dfrac{n}{\lambda^2}$.

Normal (Gaussian): $X \in \mathrm{N}(\mu, \sigma^2)$ if $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)}$, $-\infty < x < \infty$, $\sigma > 0$,

where $\mu$ is the mean and $\sigma^2$ is the variance. For the $\mathrm{N}(0, 1)$-distribution we write $\Phi(x)$ for the distribution function and $\lambda_\alpha$ for the $\alpha$-quantiles.

Correlation coefficient: $\rho(X, Y) = \dfrac{C(X, Y)}{D(X) \cdot D(Y)}$.
The expected value of a sum of random variables is the sum of the expected values:
$E(X_1 + \cdots + X_n) = E(X_1) + \cdots + E(X_n).$

The variance of a sum of random variables is the sum of the covariances of all pairs:

$V\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}(X_i, X_j).$
The law of large numbers says that if X1 , X2 , . . . are independent random variables
with the same distribution and finite expected value $\mu$, then the average of a large number of terms converges to $\mu$, in the sense of the limit

$\frac{X_1 + \cdots + X_n}{n} \to \mu, \quad n \to \infty,$
where the convergence holds in a strong probabilistic sense called almost sure con-
vergence.
The central limit theorem says that if X1 , X2 , . . . are independent random vari-
ables with the same distribution, finite expected value µ and finite variance σ 2 ,
0 < σ 2 < ∞, then the distribution of the centered and normalized partial sum is
asymptotically normal,
$\frac{1}{\sqrt{n}\,\sigma}\sum_{k=1}^{n}(X_k - \mu) \Rightarrow N(0, 1), \quad n \to \infty,$
where the arrow notation means that the distribution function of the quantity on
the left hand side converges to the distribution function of the standard normal
distribution.
Various
Geometric sums:

$\sum_{k=0}^{\infty} a^k = \frac{1}{1-a}, \quad |a| < 1, \qquad \sum_{k=n}^{\infty} a^k = \frac{a^n}{1-a}, \quad |a| < 1.$

Exponential function:

$\sum_{k=0}^{\infty} \frac{a^k}{k!} = e^a, \qquad \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n = e^x.$
(Figure: histogram of the annual minimum temperature observations, in °C.)
normalizing with the number of observations, n = 176, we may interpret the data
as an estimate of the probability distribution of a discrete version Z of X, such
that Z is the minimum temperature rounded off to an integer. The shape of the
corresponding probability function appears to be rather close to a Gaussian curve,
(Figure: estimated probability function of the rounded minimum temperature Z, in °C.)
Example 1.2 (Telephone traffic). A classical model for telephone switch traffic
assumes that the interarrival times of calls, that is, the duration between the starting
times of any two consecutive calls directed to the switch, are statistically independent
and follow the same exponential distribution. Hence, to describe the incoming calls
let U1 , U2 , . . . be a sequence of independent random variables all with the same
exponential distribution with parameter λ. Then, starting to trace calls at time
t = 0,
$T_1 = U_1 = $ time of the first call
$T_2 = U_1 + U_2 = $ time at which the second call arrives
$\vdots$
$T_n = U_1 + \cdots + U_n = $ time of the $n$th call

(Figure: timeline showing the interarrival times $U_1, U_2, \dots$ and the arrival time $T_n$.)

For each $n$, the sum $T_n$ is a new random variable with expected value and variance

$E(T_n) = E(U_1) + \cdots + E(U_n) = \frac{n}{\lambda}, \qquad V(T_n) = V(U_1) + \cdots + V(U_n) = \frac{n}{\lambda^2}.$
It can be shown that $T_n$ has the $\Gamma(n, \lambda)$-distribution.
(Figure: simulated trajectory of a random walk over 50 time steps.)
In greater generality than the random walk we may form the partial sums
$S_n = \sum_{k=1}^{n} Y_k, \quad n \ge 1,$
for a given sequence of random variables {Yk }. We assume that the summation
terms are independent with the same distribution, such that the expected value
µ = E(Yk ) and the variance σ 2 = V (Yk ) exist as finite numbers. By the law of large
numbers the average value Sn /n will be closer and closer to µ the larger n we take.
By the central limit theorem the distribution of the centered and normalized partial
sum is asymptotically normal:
$\frac{S_n - E(S_n)}{\sqrt{V(S_n)}} = \frac{1}{\sqrt{n}\,\sigma}\sum_{k=1}^{n}(Y_k - \mu) \Rightarrow N(0, 1), \quad n \to \infty.$
(Figure: three independent simulated partial sum trajectories over 1200 steps.)
Example 1.6 (Random walk, mean and variance). For the random walk,
$E(Z_1) = 1 \cdot p + (-1) \cdot (1 - p) = 2p - 1, \qquad E(Z_1^2) = 1^2 \cdot p + (-1)^2 \cdot (1 - p) = 1,$
so

$V(Z_1) = E(Z_1^2) - E(Z_1)^2 = 1 - (2p - 1)^2 = 4p(1 - p).$

Thus, for the random walk $X_n = Z_1 + \cdots + Z_n$,

$E(X_n) = n(2p - 1),$

hence

$V(X_n) = 4np(1 - p).$
The random walk and, more generally, partial sum processes are examples in dis-
crete time of processes with independent increments and with stationary increments.
Random processes having both independent and stationary increments are called
Lévy processes. For the random walk these two properties are “built-in”, since if
we take $k$ integers $n_1 \le \cdots \le n_k$ then the increments are the consecutive sums $Z_1 + \cdots + Z_{n_1}, \dots, Z_{n_{k-1}+1} + \cdots + Z_{n_k}$. These are independent since all of the random variables $Z_1, Z_2, \dots$ are independent. Moreover, $X_{n+m} - X_n = Z_{n+1} + \cdots + Z_{n+m}$ is a sum of $m$ independent terms all with the same distribution, regardless of the number $n$, which means that the increments are stationary.
Another concept of stationarity which is of great importance in engineering sci-
ences is that of weakly stationary stochastic processes. Weak stationarity means
that the mean value and the covariance structure of the process are preserved under
a shift of the time scale. In contrast, strictly stationary random processes preserve
all statistical properties over time, not just the first and second order moments. Strict
stationarity is of theoretical importance but often mathematically difficult to use in
applications. The definitions for continuous time are as follows, with straightforward
modifications for discrete time.
Thus, a weakly stationary random process describes a random entity which fluc-
tuates around a constant average value and is such that the covariance of any two
values depends only on the distance in time between them, and not on “clock time”.
The next example, which is used here to illustrate stationarity, is entirely different
from the random walk. The random wave process has strong dependence over time.
In fact, the complete trajectory of the process is determined by the outcome of only
two random variables.
(Figure 5: five simulated trajectories of the random wave process.)
Example 1.9 (Random wave). A random wave is a continuous time and contin-
uous state stochastic process of the form
X(t) = A cos(t + φ), t ≥ 0,
where A and φ are independent random variables, A > 0 has finite second moment
E(A2 ) < ∞ and φ is uniformly distributed on [0, 2π]. Figure 5 shows five simulated
trajectories of a random wave obtained by letting A have the distribution of the
absolute value of the normal distribution N (0, 1), in which case, by symmetry,
$E(A) = 2\int_0^{\infty} x\,\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx = \frac{2}{\sqrt{2\pi}}\Big[-e^{-x^2/2}\Big]_{x=0}^{\infty} = \sqrt{2/\pi}$

and

$E(A^2) = \int_{-\infty}^{\infty} x^2\,\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx = 1,$
where in the last step we recognize the integral expression for the variance in the
N(0, 1) distribution.
Now let us verify that the random wave process is weakly stationary. By inde-
pendence, E(X(t)) = E(A)E(cos(t + φ)), where
$E(\cos(t + \phi)) = \int_0^{2\pi} \cos(t + y)\frac{1}{2\pi}\,dy = \frac{1}{2\pi}\Big[\sin(t + y)\Big]_{y=0}^{y=2\pi} = 0,$
so the mean value function m(t) is constant m = 0. The covariance function r(s, t)
equals
$r(s, t) = E(X(s)X(t)) = \frac{E(A^2)}{2\pi}\int_0^{2\pi} \cos(s + y)\cos(t + y)\,dy.$

Next, using the trigonometric formula $2\cos(x)\cos(y) = \cos(x + y) + \cos(x - y)$,

$\frac{1}{2\pi}\int_0^{2\pi} \cos(s + y)\cos(t + y)\,dy = \frac{1}{4\pi}\int_0^{2\pi}\big(\cos(s + t + 2y) + \cos(t - s)\big)\,dy = 0 + \frac{1}{2}\cos(t - s),$

and hence the random wave process is weakly stationary with mean $m = 0$ and covariance function

$r(s, t) = \frac{E(A^2)}{2}\cos(t - s).$
Time series data, by which we mean series of successive data point measurements
collected over time, are used in just about all areas of empirical sciences, such as
econometrics, weather forecasting and production engineering. Time series analysis
refers to the vast collection of statistical and mathematical methods used to sort
out meaningful information from the time series data. We return to these topics in
Section 5 and close this section with a few additional examples.
Example 1.10 (EEG). Electroencephalography, EEG for short, measures voltage
fluctuations within the neurons of the brain and hence records the spontaneous
electrical activity of the brain over a period of time, typically a few minutes. A
measurement series comes out as a wave-formed pattern of voltage variation and
may be recorded separately in different frequency bands. Figure 6 shows three one-
second segments of normal EEG-measurements. The upper curve is an example
of an alpha-wave signal in the frequency range 7-14 Hz, the middle curve shows
fluctuations of gamma-waves, meaning that the frequency is approximately 30-100
Hz, and the lower curve is the superposed signal over all frequency bands (Source:
Wikipedia).
EEG measurements reflect highly complex mechanisms of the human brain with
huge variability depending on the activity and health status of the observed person,
and may appear quite “non-stationary” over time. Yet, the theory of weakly station-
ary stochastic processes is a natural modeling approach to EEG-data over shorter
times. Observed deviations from stationarity or tracking of nontypical frequencies
in data might be the basis of tools to detect pathological effects or effects of drugs
on the brain.
(Figure: weekly stock price series, 1990 to 2010.)
(Figure 8: weekly logreturns 1990 to 2010, and histogram of the logreturns with fitted normal density.)
The logreturn data fluctuate around a mean close to zero. The curve overlaid on the histogram in Figure 8 is the
probability density of the normal distribution with mean zero and standard devia-
tion fitted to that of the sample. In comparison with the normal distribution, the
data exhibits a higher concentration of mass in the tails and close to the mean value,
which is typical for logreturns data in general. Yet in modeling it is a common as-
sumption to apply the zero mean normal distribution. To simplify even further it is
tempting to assume that the successive logreturn values are independent. This leads
to the following model. Let X1 , X2 , . . . be independent and identically distributed
random variables with the standard normal distribution. Let
$Y_n = \text{logreturn value of the stock in week } n = \sigma X_n, \quad n \ge 1.$

Then $Y_n \in \mathrm{N}(0, \sigma^2)$. It is natural to take for the constant $\sigma$ the sample standard deviation of the logreturn data. Now set

$Y_n = \log(V_n/V_{n-1}).$
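As a hedged illustration (a sketch under assumed parameter values, not part of the original data analysis), the resulting price model $V_n = V_0 e^{\sigma S_n}$, $S_n = X_1 + \cdots + X_n$, is easy to simulate in Matlab; V0, sigma and nweeks below are illustrative choices.

V0 = 100; sigma = 0.05; nweeks = 200;   % illustrative parameters
X = randn(1, nweeks);                   % independent N(0,1) variables X_n
S = cumsum(X);                          % partial sums S_n = X_1 + ... + X_n
V = V0 * exp(sigma * S);                % price path V_n = V_0*exp(sigma*S_n)
plot(0:nweeks, [V0 V]);
xlabel('week'); ylabel('price');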
To apply the inversion property, start with ui and let xi be defined by the relation
F (xi ) = ui . However, it is not always computationally efficient to use inversion
for simulating continuous random variables; in many cases alternative methods are used instead.
Example 1.13. In Matlab a sequence u of n uniform random numbers is produced
by the command u=rand(1, n). The sequence of spin variables {Zk } in Example 1.4
is obtained as z=2.*(u<p)-1 and the path of the random walk as cumsum(z). The
graph in Figure 3 is produced by specifying p and the number of simulation steps n
and then evaluating stairs(0:n-1,[0 cumsum(2.*(rand(1,n-1)<p)-1)]);.
The most well-known example of a continuous inversion is the logarithmic trans-
formation x = − log(1 − u), which gives exponentially distributed random numbers.
Indeed, let U have the uniform distribution on [0, 1] and put X = − ln(1 − U ). For
any x > 0 we have X ≤ x if and only if U ≤ 1 − e−x , and hence
FX (x) = P (X ≤ x) = P (U ≤ 1 − e−x ) = 1 − e−x .
Moreover, let c be a constant and put Y = cX. Then
$F_Y(y) = P(Y \le y) = P(X \le y/c) = 1 - e^{-y/c}, \quad y > 0,$
which is the distribution function of an exponential random variable with inten-
sity $1/c$. Hence -log(1-rand(1,n))./lambda produces a sample of n observations from
Exp(λ).
Finally, we mention the graphs of Figure 5 that are generated by the commands
t=linspace(0,10*pi);
A=abs(randn(1,1));
phi=rand(1,1).*2*pi;
plot(t./(2*pi),A.*cos(t+phi));
(Figure: the life-length process $X(t)$, which equals 1 while the system works and drops to 0 at the random failure time $T$.)
The key concept for life-length processes is the life-length intensity, which is a
function λ(t) that at any given time t gives the typical rate of system failure at that
time. The idea is to look at a system we know works at time t, find the probability
that it does not work at a later time t + h, and then close in on the approximate
failure time t by letting h tend to zero.
To carry out this calculation of failure intensity, let h > 0 be a small number and
consider the interval [t, t + h]. The probability that a system functioning at time t
fails no later than at time t + h, can be expressed using conditional probabilities as
P (X(t + h) = 0|X(t) = 1) = P (T ≤ t + h|T > t).
By definition of conditional probability we have, moreover,
$P(T \le t + h \mid T > t) = \frac{P(t < T \le t + h)}{P(T > t)} = \frac{F(t + h) - F(t)}{1 - F(t)}.$
Now we make the additional assumption that the life-length variable T is a con-
tinuous random variable with a density function f (t). Since the density function
is the derivative of the distribution function, f (t) = F ′ (t), we have the difference
approximation
F (t + h) − F (t) = f (t)h + o(h), h → 0,
where the remainder term o(h) is such that o(h)/h → 0 as h → 0. Thus,
$P(\text{system failure in } (t, t + h]) = P(X(t + h) = 0 \mid X(t) = 1) = \frac{f(t)h + o(h)}{1 - F(t)},$
which shows that by defining the life-length intensity function as
$\lambda(t) = \frac{f(t)}{1 - F(t)},$
we have
P (system failure in (t, t + h]) = λ(t)h + o(h), h → 0,
and hence the desired interpretation of the function λ(t) as an infinitesimal intensity
of system failure. The life-length intensity function λ(t) is also called the failure rate,
the hazard function or the instantaneous error function.
Proposition 1.15. The intensity function $\lambda(t)$ and the life-length distribution function $F(t)$ are obtained from each other via the relations

$\lambda(t) = -\frac{d}{dt}\log(1 - F(t)), \qquad F(t) = 1 - \exp\left\{-\int_0^t \lambda(s)\,ds\right\}.$

Moreover,

$E(T) = \int_0^{\infty} \exp\left\{-\int_0^t \lambda(s)\,ds\right\} dt.$
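As a quick numerical check of Proposition 1.15 (a sketch, not part of the original notes), one may integrate a given intensity and compare with the known distribution function; here $\lambda(t) = 2t$ is an illustrative choice for which $F(t) = 1 - e^{-t^2}$ exactly.

t = linspace(0, 3, 1000);
lam = 2 * t;                         % intensity function lambda(t) = 2t
F = 1 - exp(-cumtrapz(t, lam));      % F(t) = 1 - exp(-int_0^t lambda(s) ds)
Fexact = 1 - exp(-t.^2);
max(abs(F - Fexact))                 % small discretization error only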
Since

$P(T > s + t \mid T > s) = \frac{P(T > s + t)}{P(T > s)} = \exp\left\{-\int_0^{s+t} \lambda(u)\,du\right\} \exp\left\{\int_0^{s} \lambda(u)\,du\right\},$

we have immediately

$P(T > s + t \mid T > s) = \exp\left\{-\int_s^{s+t} \lambda(u)\,du\right\}.$
Figure 11. Life mortality data: death probability within the next year (y-axis) as a function of current age (x-axis); upper curve for males, lower curve for females.
Consider two components connected in series, and suppose the components are independent, each working with probability $p$ and having failed with probability $1 - p$. Then, clearly, the functioning probability of the system will be

$R_{\text{ser}}(p) = P(\text{2-series system works}) = p^2.$
it follows that

(1.4) $\quad R_{\text{ser}}(t) = \exp\left\{-\int_0^t \lambda_{\text{ser}}(s)\,ds\right\}.$
To help visualize the Poisson process it is useful to consider the random times
Tk to be the time epochs of the occurrences of abstract “Poisson events” and the
random variables Uk to be the corresponding inter-occurrence times of these events.
Then
N (t) = the number of Poisson events in the interval [0, t].
In other words, the Poisson process {N (t)} is a jump process in continuous time
which, starting from N (0) = 0, climbs from each integer value to the next larger value
in such a way that the waiting times on each level are independent and exponential
with mean 1/λ.
The definition given above immediately suggests a method to simulate trajecto-
ries of the Poisson process. Building on Example 1.13, the Matlab commands
interarr = -log(rand(1,n))./lambda;
stairs(cumsum(interarr), 0:n-1);
produce an output trajectory such as those in Figure 13. Here, given that lambda
and n are assigned parameters, rand(1, n) is a vector of n uniform random num-
bers on the interval [0, 1] and thus interarr becomes a vector with values that
represent typical outcomes of the exponential random variables U1 , . . . , Un (c.f. Ex-
ample 1.13). Consequently the vector cumsum(interarr) of the cumulative sums
of interarr gives typical values of the times $T_1, \dots, T_n$ of occurrence of the Poisson events. Figure 13 illustrates the variation which arises naturally in independent
realizations of Poisson paths.
(Figure 13: independent simulated trajectories of the Poisson process.)
and therefore

$\frac{d}{dt} P(N(t) = n) = f_{T_n}(t) - f_{T_{n+1}}(t) = \frac{\lambda^n t^{n-1} e^{-\lambda t}}{(n-1)!} - \frac{\lambda^{n+1} t^n e^{-\lambda t}}{n!}.$
On the other hand, the expression on the right side above can be recognized as the
derivative of a well-known function:
$\frac{d}{dt}\left(e^{-\lambda t}\frac{(\lambda t)^n}{n!}\right) = e^{-\lambda t}\frac{\lambda^n t^{n-1}}{(n-1)!} - e^{-\lambda t}\frac{\lambda^{n+1} t^n}{n!}.$
Therefore,
$P(N(t) = n) = e^{-\lambda t}\frac{(\lambda t)^n}{n!} + C,$
where C is a constant of integration. But if we also take into account the initial
condition P (N (0) = n) = 0, it follows that C = 0 and hence for any t the random
variable N (t) is Poisson distributed with expected value E(N (t)) = λt.
With this we have demonstrated one part of the following proposition, which
characterizes the Poisson process as a stochastic process with independent, station-
ary, Poisson-distributed increments.
of series systems. Indeed, if $U_1^k$ is the time of the first jump of $N_k$, $k = 1, \dots, n$, then the first jump of $M_n$ occurs at time $\min(U_1^1, \dots, U_1^n)$. Thus, the inter-occurrence
times of the component processes relate to the superposition process in a similar
way as component life-lengths relate to the series system life-length.
Thinning property. The Poisson process is also preserved under an operation
which can be viewed as the opposite of superposition, namely thinning. We start
with a Poisson process N with intensity λ, which jumps at the successive times Tk ,
k ≥ 1, as in Definition 1.24. In addition, let {Jn , n ≥ 1} denote a sequence of i.i.d.
random variables with the possible values 0 or 1 and such that P(Jn = 0) = p and
P(Jn = 1) = 1 − p where p, 0 ≤ p ≤ 1, is a parameter. Similarly as in Definition
1.24, let
$M(t) = \sum_{k=1}^{\infty} 1_{\{T_k \le t\}} J_k, \quad t \ge 0.$
Thinking of p as a thinning parameter, so that each time a Poisson event in N occurs
it is independently removed with probability p and kept with probability 1 − p, it is
clear that the random process M , which counts only the remaining Poisson events,
is a thinned version of N . It is a nice and useful feature of Poisson processes that the
thinned process M is also a Poisson process, the intensity of which is now λ(1 − p).
Several proofs of this property can be found in Gut [11].
As an application of this model, assume that the arrivals of messages to an e-mail address inbox are given by a Poisson process, and the sequence of messages after
having installed an e-mail spam filter is a thinned Poisson process that corresponds
to the original one.
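A hedged Matlab sketch of the thinning operation (illustrative parameters, not a model fitted to real spam data): simulate the event times of $N$, remove each event independently with probability $p$, and check that the empirical rate of the remaining events is close to $\lambda(1 - p)$.

lambda = 2; p = 0.3; n = 1e5;            % illustrative parameters
T = cumsum(-log(rand(1, n)) / lambda);   % event times T_k of the original process
keep = rand(1, n) > p;                   % J_k = 1 with probability 1 - p
Tthin = T(keep);                         % event times of the thinned process M
numel(Tthin) / Tthin(end)                % empirical rate, close to lambda*(1-p) = 1.4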
To complete the presentation of various aspects of the standard Poisson process
it remains to mention the dynamical approach, which emphasizes the alternative
interpretation of λ as an infinitesimal jump intensity.
Figure 14. Incoming calls to health care center during one week (axis labels may
be ignored and merely indicate that data is part of a longer series of measurements)
Example 1.28 (Phone call arrivals). Suppose that phone call inquiries to a local
health care center are placed at random times throughout the opening hours of a
typical work day. Is it realistic to assume that the calls are well described by the
Poisson process? Perhaps not, for several reasons. Is the intensity really constant
throughout the day? Is it the same for each day of the week? What about the
required independence between the Poisson events given that the calls are made
from essentially the same geographical area? To get some insight into these matters
we look again at real data. Figure 14 shows the counting process which results from
plotting the times of all incoming calls placed to a mid-sized health care center in a
town in mid Sweden during one week of 2011. The plot confirms that some aspects
of the data deviate from Poisson behavior. The intensity appears to be larger in the
morning of each day and then gradually decreases. Also, the number of incoming
calls is larger during Monday and Friday in comparison to the other days of the
week. This suggests the alternative of a time inhomogeneous Poisson process where we replace the constant intensity $\lambda$ with a nonnegative function $\lambda(t)$, $t \ge 0$. Then the corresponding counting process $\{N(t)\}$ will be such that for each $t$, $N(t)$ has the Poisson distribution with expected value $\int_0^t \lambda(s)\,ds$. In this case the process no longer has stationary increments, although the increments remain independent.
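One common way to simulate such a process is itself a thinning argument: generate a homogeneous Poisson process with rate $\lambda_{\max} \ge \lambda(t)$ and keep an event at time $t$ with probability $\lambda(t)/\lambda_{\max}$. A minimal Matlab sketch, with an intensity profile that is an illustrative assumption rather than one estimated from the call data:

lambda = @(t) 5 + 4*cos(2*pi*t/24);          % hypothetical daily intensity profile
lamMax = 9; Tend = 24*7;                     % bound on lambda(t); one week horizon
T = cumsum(-log(rand(1, ceil(2*lamMax*Tend))) / lamMax);
T = T(T <= Tend);                            % homogeneous event times on [0, Tend]
T = T(rand(size(T)) < lambda(T)/lamMax);     % keep event at t with prob lambda(t)/lamMax
stairs(T, 1:numel(T));                       % trajectory of the counting process N(t)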
Example 1.29 (Spatial Poisson process). A spatial point process is a collection
of random points in a spatial region S of the plane or in all of R2 . We write N (A)
for the number of points in a set A ⊂ S, and allow arbitrary (but well-defined) sets
A. Denote by |A| the area of the set. We say that {N (A), A ⊂ S} is a spatial
Poisson point process with intensity λ > 0 if
• for each set A, N (A) has the Poisson distribution with expected value λ|A|;
• for any pair of disjoint sets A and B in S, the random variables N (A) and
N (B) are independent.
Figure 15 shows a simulation of a spatial Poisson process in the unit square of the
plane. Typical areas of application include describing the positions of bacteria in a
cell culture, or the locations of subscribers to a mobile phone system.
(Figure 15: simulated spatial Poisson process in the unit square.)
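A simulation such as that in Figure 15 takes only a few Matlab commands; the sketch below (with an illustrative intensity) uses the fact that the number of points in the unit square is Poisson with mean $\lambda$ and that, given the count, the points are independently uniformly distributed.

lambda = 100;                              % illustrative intensity
% Poisson(lambda) count, generated as the number of arrivals of a
% rate-lambda Poisson process in [0,1], avoiding toolbox functions:
N = sum(cumsum(-log(rand(1, 10*lambda))/lambda) <= 1);
xy = rand(N, 2);                           % uniform point locations
plot(xy(:,1), xy(:,2), '.');
axis([0 1 0 1]); axis square;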
(Figure 16: annual minimum temperature series, 1840 to 2020.)
Solved exercises
1. Figure 16 shows a plot of the annual series of temperatures studied in Example
1.1. Assuming the Gaussian model for the temperature variable X discussed in
this example, find the probability that the temperature of the coldest day next
year
a) falls below −25◦ C;
b) yields a new all-time record low (since 1840).
Solution. The assumption is that X ∈ N(µ, σ 2 ) with parameters µ = −22.28 and
σ = 4.497. Let Y be the centered and scaled random variable Y = (X − µ)/σ.
Clearly,
$E(Y) = \frac{E(X) - \mu}{\sigma} = 0, \qquad V(Y) = \frac{V(X - \mu)}{\sigma^2} = \frac{V(X)}{\sigma^2} = 1,$
and since the normal distribution is preserved under addition and multiplication
with constants, we have Y ∈ N(0, 1).
a)
$P(X \le -25) = P(X - \mu \le -25 - \mu) = P(Y \le (-25 - \mu)/\sigma)$
$= P(Y \le -2.72/4.497) = P(Y \le -0.605) = \Phi(-0.605),$
where Φ is the distribution function of the standard normal. By symmetry,
Φ(−x) = 1 − Φ(x). Hence P (X ≤ −25) ≈ 1 − 0.73 = 0.27.
b) P (X < −39.5) = P (X ≤ −39.5) = · · · = 1 − Φ(3.841) ≈ 0.00006
2. For the teletraffic model in Example 1.2, find the covariance and the correlation
coefficient between the variables T1 and Tn , n ≥ 1. Same questions for the pair
Tn−1 and Tn , n ≥ 2. Interpretations?
Solution. We have
$C(T_1, T_n) = C(U_1, U_1 + \cdots + U_n) = C(U_1, U_1) + \cdots + C(U_1, U_n).$

For $n \ge 2$, since $U_1$ and $U_n$ are independent, $C(U_1, U_n) = E(U_1 U_n) - E(U_1)E(U_n) = 0$. Also, $C(U_1, U_1) = V(U_1) = 1/\lambda^2$. Hence $C(T_1, T_n) = 1/\lambda^2$ and

$\rho(T_1, T_n) = \frac{C(T_1, T_n)}{D(T_1)D(T_n)} = \frac{1/\lambda^2}{(1/\lambda)(\sqrt{n}/\lambda)} = \frac{1}{\sqrt{n}}.$
Similarly,

$C(T_{n-1}, T_n) = V(T_{n-1}) = \frac{n-1}{\lambda^2}$

and

$\rho(T_{n-1}, T_n) = \frac{(n-1)/\lambda^2}{\sqrt{(n-1)/\lambda^2}\,\sqrt{n/\lambda^2}} = \sqrt{1 - \frac{1}{n}}.$
Hence, for large n, the correlation between T1 and Tn is close to zero and the
correlation between Tn−1 and Tn is close to one. This is the probabilistic way
of saying that, if we know when the first call comes in we still know very little
about the timing of caller one hundred, say, but if we know when the 99th call
is placed then we are in a good position to predict the time of the next call after
that. Of course, covariance and correlation are symmetric in the two variables,
so these statements can also be turned around.
3. With reference to Example 1.11, let Un = Vn /V0 denote the stock price relative
to the initial price over n weeks.
a) Express the probability P (Un ≤ u) in terms of the standard normal distribution function Φ.
b) Based on the GM data suggest an estimate of the parameter σ.
c) Using the estimate of σ from b), what is the probability that the stock price
more than doubles in one year? What is the probability that the stock price
increases tenfold over a period of 20 years?
Solution. Put $S_n = \sum_{k=1}^{n} X_k$. By the summation property of the normal distribution, the distribution of $S_n$ is $N(0, n)$. Thus, the distribution of $S_n/\sqrt{n}$ is $N(0, 1)$ with distribution function $\Phi(x)$. We have

$U_n = V_n/V_0 = e^{\sigma S_n}$

so

$P(U_n \le u) = P(e^{\sigma S_n} \le u) = P(S_n \le \ln(u)/\sigma).$

Hence

$P(U_n \le u) = P\left(\frac{S_n}{\sqrt{n}} \le \frac{\ln(u)}{\sigma\sqrt{n}}\right) = \Phi\left(\frac{\ln(u)}{\sigma\sqrt{n}}\right).$
For b), a crude visual inspection of the histogram in Figure 8 indicates that 95%
of all log return values fall in the interval ±2σ, if we take σ ≈ 0.05, which is
thus a reasonable point estimate of σ compliant with the Gaussian assumption.
The first probability in c) is

$P(U_{50} > 2) = 1 - \Phi\left(\frac{\ln 2}{0.05\sqrt{50}}\right) \approx 0.025$

and the second

$P(U_{1000} > 10) = 1 - \Phi\left(\frac{\ln 10}{0.05\sqrt{1000}}\right) \approx 0.073.$
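The two numerical values are easy to reproduce in Matlab; the sketch below uses $\Phi(x) = \frac{1}{2}\,\mathrm{erfc}(-x/\sqrt{2})$ so that no statistics toolbox is needed.

Phi = @(x) 0.5 * erfc(-x / sqrt(2));          % standard normal distribution function
sigma = 0.05;
1 - Phi(log(2) / (sigma * sqrt(50)))          % approx 0.025
1 - Phi(log(10) / (sigma * sqrt(1000)))       % approx 0.073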
4. Carry out a simulation of typical values of a lifelength variable T , with lifelength
intensity
$\lambda(t) = \frac{1}{2\sqrt{t}}, \quad t \ge 0.$
Solution. By Proposition 1.15,
$F(t) = P(T \le t) = 1 - \exp\left\{-\int_0^t \frac{1}{2\sqrt{u}}\,du\right\} = 1 - e^{-\sqrt{t}}.$
By the inversion principle for simulation of continuous random variables, we
seek numbers ti such that F (ti ) = ui , where (ui )1≤i≤n is a given sequence of
uniformly distributed random numbers, that is
ti = (− ln(1 − ui ))2 , i = 1, . . . , n.
Comparing this result to the situation of the generalized random walk in Figure
4, it follows that T has the same distribution as the square of an exponential
random variable with mean one.
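A minimal Matlab sketch of this inversion (the sample size is an illustrative choice):

u = rand(1, 10000);
t = (-log(1 - u)).^2;     % t_i = (-ln(1 - u_i))^2
mean(t)                   % close to E(T) = E(X^2) = 2 for X in Exp(1)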
5. Two independent components with constant failure intensities are connected in
parallel. The expected life-lengths of the two components are c and 1 − c, for
some parameter c with 0 ≤ c ≤ 1. Find the expected life-length of the system as
a function of c, and determine for which values of c the expected life is minimal
and maximal, respectively.
Solution. We have two independent, exponentially distributed life-length vari-
ables T1 and T2 with expected values E(T1 ) = c and E(T2 ) = 1 − c, and wish to
compute E(Tpar ) where Tpar = max(T1 , T2 ). One way to do this is to start with
Rpar (t) = P (Tpar > t) = 1 − P (Tpar ≤ t) = 1 − P (T1 ≤ t)P (T2 ≤ t).
By setting λ1 = 1/c and λ2 = 1/(1 − c) this becomes
$R_{\text{par}}(t) = 1 - (1 - e^{-\lambda_1 t})(1 - e^{-\lambda_2 t}) = e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1 + \lambda_2)t},$
and so
$E(T_{\text{par}}) = \int_0^{\infty} R_{\text{par}}(t)\,dt = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2} = 1 - c(1 - c),$
which is maximal and equal to 1 for the limiting cases c = 0 and c = 1 and
attains its minimal value 0.75 for c = 0.5.
6. A system consists of three independent components in parallel all with the same
failure intensity λ(t) = 2t (per hour). Find the system failure intensity λpar (t).
Plot the two functions together in the same graph and interpret the result.
Solution. The life-length distribution function for each component is
$F(t) = 1 - \exp\left\{-\int_0^t 2s\,ds\right\} = 1 - e^{-t^2}, \quad t \ge 0,$
Figure 17. Lifelength intensities for single component (blue) and system (red)
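The system intensity follows since the parallel system fails only when all three components have failed, so that $F_{\text{par}}(t) = F(t)^3$ and $\lambda_{\text{par}}(t) = f_{\text{par}}(t)/(1 - F_{\text{par}}(t))$ with $f_{\text{par}}(t) = 3F(t)^2 f(t)$. A Matlab sketch producing a plot in the spirit of Figure 17:

t = linspace(0.01, 3, 500);
F = 1 - exp(-t.^2);                    % component life-length distribution
f = 2 .* t .* exp(-t.^2);              % component density
Fpar = F.^3;                           % all three components must fail
fpar = 3 * F.^2 .* f;
plot(t, 2*t, t, fpar ./ (1 - Fpar));
legend('component', 'system'); xlabel('Time'); ylabel('Lifelength intensity');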
Chapter 2. Markov Chain Models, discrete time

We give a short introduction to Markov chains in discrete time and illustrate the
main concepts and ideas with standard examples. Some of the proofs are indicated
but the emphasis is put on introducing Markov models in a descriptive manner.
To demonstrate the relevance of Markov models in modern applications we discuss
two specific examples in more detail. One example is the mechanism behind the
web search algorithm Google Page Rank. The second example is a Markov chain
description of the rating of corporations or countries based on their credit standing as
done by credit rating agencies. For complete introductions to the subject and more
advanced material the reader is referred to such sources as Brémaud [3], Grimmett
and Stirzaker [10], Resnick [18] or Wolff [21].
In a Markov model the distribution of the next value of the process depends on the past only through the outcome of the most recent step. This type of dependence is the content of the Markov property.
Definition 2.1. A discrete time stochastic process {Xn } with discrete state space
E is called a Markov chain if it satisfies the Markov property:
P (Xn = xn |X0 = x0 , . . . , Xn−1 = xn−1 ) = P (Xn = xn |Xn−1 = xn−1 )
for all x0 , . . . , xn ∈ E and n ≥ 1. It is called a time-homogeneous Markov chain if
for each x, y ∈ E
pxy = P (Xn = y|Xn−1 = x) does not depend on n.
In the time-homogeneous case the probabilities {pxy , x, y ∈ E} are called the tran-
sition probabilities of the Markov chain.
With every Markov chain can be associated a state transition diagram which is
a graph with one node for each state of the chain and directed edges in the form of
arrows between nodes representing all one-step transitions of the chain that can occur
with positive probability. By labeling each arrow with the corresponding transition
probability the resulting transition diagram contains all relevant information about
the Markov chain.
(Figure: state transition diagram of a Markov chain with states 0, 1, 2.)
To see how the matrix and vector formalism of Markov chains work we introduce
the additional notations:
Absolute (state) probabilities: $p_i^{(n)} = P(X_n = i)$, $\quad p^{(n)} = (p_0^{(n)}\ p_1^{(n)}\ \dots)$;

$m$-step transition probabilities: $p_{ij}^{(m)} = P(X_m = j \mid X_0 = i)$, in particular $p_{ij}^{(1)} = p_{ij}$;

$m$-step transition probability matrix: $\mathbf{P}^{(m)} = \begin{pmatrix} p_{00}^{(m)} & p_{01}^{(m)} & \dots \\ p_{10}^{(m)} & p_{11}^{(m)} & \dots \\ \vdots & \vdots & \ddots \end{pmatrix}$, in particular $\mathbf{P}^{(1)} = \mathbf{P}$.
If the state space is finite, say $E = \{0, \dots, r\}$, the vectors $p^{(n)}$ are row vectors of length $r + 1$ and the matrices $\mathbf{P}^{(m)}$ are square matrices with $r + 1$ rows and $r + 1$ columns. In the general case $E = \{0, 1, \dots\}$ we still think of the absolute probabilities as forming infinite row vectors and of the collection of transition probabilities as organized in infinitely large matrices. It is clear from the definitions that for all such matrices $\mathbf{P}^{(m)}$ the elements on any fixed row must sum to one:

$\sum_j p_{ij}^{(m)} = 1, \quad i = 0, 1, \dots$
Proof. It is clear that (ii) follows from (i). Moreover, (iii) follows from (i) by sum-
ming over the initial distribution p(0) . We demonstrate (i) using proof by induction
over m. The case m = 1 is trivial. Suppose (i) is true for all index values strictly
less than m. Now, to extend (i) to index m we only have to note that
$p_{ij}^{(m)} = \sum_k p_{ik}\,p_{kj}^{(m-1)}, \quad i, j \in E,$
saying that a move from i to j in exactly m steps is the same as a jump from i to
some state k in the first step, and then a second sequence of jumps from k to j in
exactly m − 1 steps. The above relations written in matrix form correspond to the
operation of matrix multiplication:
$\mathbf{P}^{(m)} = \mathbf{P}\,\mathbf{P}^{(m-1)},$

hence, by the induction hypothesis, $\mathbf{P}^{(m)} = \mathbf{P}\,\mathbf{P}^{m-1} = \mathbf{P}^m$, proving relation (i).
Example 2.5. With P defined in Example 2.3 we have, for example,
$\mathbf{P}^5 \approx \begin{pmatrix} 0.1163 & 0.3861 & 0.4976 \\ 0.1285 & 0.3969 & 0.4745 \\ 0.1516 & 0.4169 & 0.4315 \end{pmatrix}$
By Theorem 2.4, the elements in this matrix give the probability distributions for
the state X5 of the Markov chain after 5 jumps, P (X5 = 0|X0 = 0) ≈ 0.1163, etc.
Furthermore, if the initial distribution of X0 is known, let’s say p(0) = (0.3 0.6 0.1),
then the (absolute) distribution of the chain at time 5 can be read off in the vector
$p^{(5)} = p^{(0)}\mathbf{P}^5 \approx (0.1272\ \ 0.3957\ \ 0.4772).$
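Such computations are conveniently done in Matlab with matrix powers; in the sketch below the three-state matrix P is a hypothetical example standing in for the matrix of Example 2.3.

P = [0.2 0.3 0.5; 0.1 0.6 0.3; 0.4 0.4 0.2];   % hypothetical; rows sum to one
P5 = P^5;                % five-step transition probabilities P^(5)
p0 = [0.3 0.6 0.1];      % initial distribution p^(0)
p5 = p0 * P5             % absolute distribution of X_5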
Example 2.6 (The Ehrenfest model (1907)). Consider a total of r molecules of
a gas distributed in two containers A and B. At time n = 0 there are x molecules
in container A and r − x in container B. Each time point n = 1, 2, . . . one of the r
molecules is chosen randomly and moved to the other container. Let
Xn = # of molecules in container A at time n, n ≥ 0.
It is clear that this defines a Markov chain {Xn } with initial distribution X0 = x
and transition matrix
$\mathbf{P} = \begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & 0 & 0 \\
1/r & 0 & 1 - 1/r & 0 & \cdots & 0 & 0 \\
0 & 2/r & 0 & 1 - 2/r & \cdots & 0 & 0 \\
\vdots & & & \ddots & & & \vdots \\
0 & 0 & \cdots & & (r-1)/r & 0 & 1/r \\
0 & 0 & \cdots & & & 1 & 0
\end{pmatrix}$
Figure 1 shows two simulated paths of the Ehrenfest Markov chain with r = 100
over a time span of 2000 transitions, one with initial condition x = 0, the other with
x = r. To produce such simulated trajectories one can use MATLAB commands,
for example
m=2000;          % number of simulation steps
r=100;           % total number of molecules
x=zeros(1,m+1);  % preallocate the trajectory
x(1)=r;          % initial condition: all molecules in container A
k=1;
while k<=m
    z=(rand(1,1)>x(k)/r);   % z=1: the chosen molecule is in B and moves to A
    k=k+1;
    x(k)=x(k-1)+2*z-1;      % container A gains or loses one molecule
end
stairs((0:m),x);
(Figure 1: two simulated Ehrenfest paths with r = 100 over 2000 transitions, started from x = 0 and x = r.)
Asymptotic distributions are always stationary but the converse is not true. The
basic recipe for finding stationary distributions of a Markov chain is to solve a system
of linear equations.
Proposition 2.9. If a Markov chain has an asymptotic distribution then this distribution is also stationary. A probability distribution $\pi$ is a stationary distribution for a Markov chain if and only if it is a solution of the linear system of equations $\pi = \pi\mathbf{P}$, that is,

$\pi_i = \sum_{k=0}^{\infty} \pi_k p_{ki}, \quad i \ge 0, \qquad \pi_0 + \pi_1 + \cdots = 1.$
Example 2.10 (Two-state chain). Consider the two-state Markov chain with
jump probabilities p01 = a and p10 = b, where 0 ≤ a, b ≤ 1. Let π = (b/(a +
b) a/(a + b)) and assume that the Markov chain starts at time n = 0 with this initial
distribution, $p^{(0)} = \pi$. By the Chapman-Kolmogorov relation,

$p^{(1)} = \pi\mathbf{P} = \left(\frac{b}{a+b}\ \ \frac{a}{a+b}\right)\begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix} = \left(\frac{b}{a+b}\ \ \frac{a}{a+b}\right) = \pi.$
Similarly, p(2) = π, and so on. Thus, π is a stationary distribution for this Markov
chain. Consider the special case a = b = 1. This describes the trivial chain which
jumps from one state to the other and back again, indefinitely. There is no asymp-
totic distribution in this case. For example, with X0 = 1 the sequence of probabilities
P (X0 = 0), P (X1 = 0), P (X2 = 0),. . . has the form 0 1 0 . . . , which has no limit.
Example 2.11 (Stationary distribution not unique). A Markov chain can have
many stationary distributions. Take, for example, the transition matrix

$\mathbf{P} = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \end{pmatrix},$

which defines a Markov chain with states $E = \{0, 1, 2\}$. Since

$(0\ \ 1\ \ 0)\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \end{pmatrix} = (0\ \ 1\ \ 0)$

and

$(1/2\ \ 0\ \ 1/2)\begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \end{pmatrix} = (1/2\ \ 0\ \ 1/2)$
we conclude that both vectors (0 1 0) and (1/2 0 1/2) are stationary. But these
are not all. Indeed, if µ and ν are two stationary distributions then any linear
combination (1 − θ)µ + θν, where 0 ≤ θ ≤ 1 is a real number, is another stationary
distribution. Therefore, in this example, all vectors $\pi_\theta = (\theta/2\ \ 1-\theta\ \ \theta/2)$, where the parameter $\theta$ is any number in the interval $[0, 1]$, qualify as stationary distributions.
For an asymptotic distribution to exist, the chain must be such that it is possible in a sequence of jumps to reach any state from any other state. This prop-
erty is called irreducibility. We also need to exclude chains with periodic behavior.
We observe that if a chain is irreducible and one can find a state i with pii > 0,
in other words find a positive element on the diagonal of the transition probability
matrix, then the chain is also aperiodic. The Markov chain in Example 2.11 is not
irreducible. In fact, the chain will never reach state 0 (or state 2) starting from state
1. Example 2.10, case a = b = 1, is not aperiodic. In fact, the greatest common
divisor of the set {n : P (Xn = i|X0 = i) > 0} = {2, 4, 6, . . . } is 2, and the chain is
said to be periodic with period 2.
A full account of limit distributions involves additional theory, not covered here,
for recurrence vs. transience of Markov chains. The following useful criteria, how-
ever, may be rigorously established with the tools we have already introduced. The
condition of the theorem implies that the chain is irreducible. See Fristedt, Jain,
Krylov [8], Theorem 2.33, for an accessible proof.
Theorem 2.13. Consider a Markov chain {Xn } defined on a finite state space.
Assume that there exists an integer m ≥ 1 such that all entries of the matrix Pm
are strictly positive, that is,

$P(X_m = j \mid X_0 = i) > 0 \quad \text{for all } i, j \in E.$
Then there is a unique stationary distribution which is also asymptotic.
This result can be strengthened as follows. If there is an m ≥ 1, such that all entries
in some column of the matrix Pm are strictly positive, then the conclusion of the
theorem remains valid. The next result restricts to chains which are irreducible and
aperiodic but is more general in the sense that it covers Markov chains with infinite
state space.
Example 2.15 (Urn model). The following is a special case of the so called
Bernoulli-Laplace’s diffusion model. At time 0 there are two urns with two black
balls in urn 1 and two white balls in urn 2. At each time n = 1, 2 . . . , a randomly
chosen ball is removed from each urn and put back into the other. Let Xn be the
number of white balls in urn 1 at time n, n ≥ 0. This is a Markov chain with states
{0, 1, 2} and transition probability matrix
$\mathbf{P} = \begin{pmatrix} 0 & 1 & 0 \\ 1/4 & 1/2 & 1/4 \\ 0 & 1 & 0 \end{pmatrix},$
from which it is seen that this Markov chain is aperiodic and irreducible making
Theorem 2.14 applicable. As an alternative we may check that all nine elements of
$\mathbf{P}^2 = \begin{pmatrix} 1/4 & 1/2 & 1/4 \\ 1/8 & 3/4 & 1/8 \\ 1/4 & 1/2 & 1/4 \end{pmatrix}$
are positive and apply Theorem 2.13. Or rely on the remark following Theorem
2.13 and just note that all three elements of the middle column of P itself are
positive. In either case it is straightforward to find the limit distribution as the
unique probability solution $\pi = (\pi_1, \pi_2, \pi_3)$ with $\pi_1 + \pi_2 + \pi_3 = 1$ of the stationarity equation $\pi = \pi\mathbf{P}$, that is

$\pi_1 = \pi_2/4, \qquad \pi_2 = \pi_1 + \pi_2/2 + \pi_3, \qquad \pi_3 = \pi_2/4,$

with solution $\pi = (1/6, 2/3, 1/6)$.
A web page that contains no out links has $c_i = 0$ and is called a dangling node. Now
let $\mathbf{H} = (h_{ij})$ be the $n \times n$ matrix with entries

$h_{ij} = \begin{cases} g_{ij}/c_i, & \text{if } c_i \ge 1, \\ 1/n, & \text{if } i \text{ is a dangling node.} \end{cases}$
Note that the elements in each row of H are nonnegative and sum to one. Hence H
may be viewed as the transition probability matrix of a discrete time Markov chain
with state space W . To visualize this Markov chain, imagine a web surfer that
jumps from one page to another by clicking, at each jump, a link chosen uniformly among all
available out links. In case the web surfer encounters a dangling node it will move
to an arbitrary node chosen uniformly among all possible web pages.
The Google Markov chain is obtained by applying an additional parameter $p$ and introducing the transition probability matrix

$\mathbf{P} = p\mathbf{H} + (1 - p)\frac{1}{n}\mathbf{1}_n,$
where 1n is the n×n matrix with all entries equal to one. The corresponding Markov
chain jumps between the web pages in W as follows. With probability p the chain
either follows one of the outgoing links randomly with equal probabilities or, if the
current page has no outbound link, jumps to a randomly chosen page among all in
W . With probability 1 − p the chain moves to a randomly chosen page among all
the $n$ pages, regardless of whether the current node is dangling or not. The mechanism of letting the chain move freely with probability $1 - p$ is thought to prevent the web surfer from getting stuck for longer periods of time in isolated and perhaps less
representative regions of the web graph. It is known that Google originally used
p = 0.85.
The Google Markov chain is finite, irreducible and aperiodic. Hence there exists
a unique stationary distribution π = (π1 , . . . , πn ). Now write the probabilities πk
in decreasing order r1 ≥ r2 ≥ · · · ≥ rn , where r1 is the largest of all the πk s, r2 is
the second largest, and so on. Thus, r1 is the stationary probability for the “best”
(in Google’s sense) web page, which is therefore assigned PageRank one. The page
corresponding to the second largest stationary probability gets ranked second, and
so on.
Definition 2.18. For a Markov chain $\{X_n\}$ the first passage times $T_{ij}$, $i, j \in E$, are defined as

$T_{ij} = $ the time of the first visit to $j$ given $X_0 = i$.

In particular, the first return to a state $i$ after once leaving $i$ is

$T_{ii} = $ the time of the first return to $i$ given $X_0 = i$.
(A transition from i to i counts as a return.) Moreover, a state r ∈ E such that
prr = 1 is called absorbing. In this case,
Tir = absorption time in r given X0 = i
and the Markov chain is absorbed and remains in r from this time onwards.
Theorem 2.19. If a Markov chain in discrete time {Xn , n ≥ 0} has a limit distri-
bution {πk }, then
$E(T_{ii}) = \frac{1}{\pi_i}.$
Example 2.20 (Mean return time). A Markov chain has transition probability
matrix

$\mathbf{P} = \begin{pmatrix} 1/2 & 1/3 & 1/6 \\ 1/4 & 1/2 & 1/4 \\ 2/3 & 1/6 & 1/6 \end{pmatrix}.$
To find the mean recurrence times for the three states we observe by using Theorem
2.13 or Theorem 2.14 that this finite Markov chain has an asymptotic distribution
such that for any i = 1, 2, 3,
P (Xn = j|X0 = i) → πj , j = 1, 2, 3,
where π = (27/61, 22/61, 12/61) is the unique solution of π = πP with π1 +π2 +π3 =
1. By Theorem 2.19 the limits can be expressed in terms of the mean recurrence
times, $\pi_j = 1/\nu_j$, where $\nu_j = E(T_{jj})$ and $T_{jj}$ is the recurrence time to state $j$.
Hence
$\nu_1 = 61/27 \approx 2.26, \qquad \nu_2 = 61/22 \approx 2.77, \qquad \nu_3 = 61/12 \approx 5.08.$
By conditioning on the outcome of the first jump, each expected first passage time can be expressed in terms of the others, counting the first jump as one time unit. The resulting system of equations is linear and hence it is relatively easy to find the expected first passage times.
Example 2.21 (Expected time to reach a particular state). It is instructive
to continue the previous Example 2.20 with three states E = {1, 2, 3}. If we fix
j = 1 the conditioning principle yields the three equations
$m_{11} = p_{11} + p_{12}(1 + m_{21}) + p_{13}(1 + m_{31}) = 1 + \frac{1}{3}m_{21} + \frac{1}{6}m_{31}$

$m_{21} = p_{21} + p_{22}(1 + m_{21}) + p_{23}(1 + m_{31}) = 1 + \frac{1}{2}m_{21} + \frac{1}{4}m_{31}$

$m_{31} = p_{31} + p_{32}(1 + m_{21}) + p_{33}(1 + m_{31}) = 1 + \frac{1}{6}m_{21} + \frac{1}{6}m_{31}$
The equations are readily solved giving m11 = 61/27, m21 = 26/9, m31 = 16/9,
where obviously the numerical value of m11 = ν1 must coincide with the one obtained
in Example 2.20 using a completely different argument. The expected first passage times $m_{ij}$ for $j = 2$ and $j = 3$ are obtained analogously.
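The system for $j = 1$ is solved in a few lines of Matlab (a sketch of the computation just described):

P = [1/2 1/3 1/6; 1/4 1/2 1/4; 2/3 1/6 1/6];
A = eye(2) - P(2:3, 2:3);      % unknowns m21 and m31
m = A \ [1; 1]                 % m21 = 26/9, m31 = 16/9
m11 = 1 + P(1, 2:3) * m        % 61/27, matching Example 2.20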
Example 2.22 (Mean absorption time). Suppose E = {0, 1, 2} and
$\mathbf{P} = \begin{pmatrix} 1/2 & 1/3 & 1/6 \\ 1/4 & 1/2 & 1/4 \\ 0 & 0 & 1 \end{pmatrix}$
and so, state 2 is absorbing. Put
m0 = E(T02 ) = mean absorption time starting in 0
m1 = E(T12 ) = mean absorption time starting in 1
By conditioning on the first jump,
$m_0 = \frac{1}{2}(1 + m_0) + \frac{1}{3}(1 + m_1) + \frac{1}{6}\cdot 1 = 1 + \frac{1}{2}m_0 + \frac{1}{3}m_1$

$m_1 = \frac{1}{4}(1 + m_0) + \frac{1}{2}(1 + m_1) + \frac{1}{4}\cdot 1 = 1 + \frac{1}{4}m_0 + \frac{1}{2}m_1,$
from which we obtain the solution m0 = 5, m1 = 9/2.
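Equivalently, in matrix form, $(\mathbf{I} - \mathbf{Q})m = \mathbf{1}$, where $\mathbf{Q}$ is the transition matrix restricted to the transient states $\{0, 1\}$; a Matlab sketch:

Q = [1/2 1/3; 1/4 1/2];          % transitions among the transient states
m = (eye(2) - Q) \ [1; 1]        % m0 = 5, m1 = 9/2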
Solved exercises
1. All nonzero transition probabilities of a Markov chain with states E = {0, 1, 2, 3}
are indicated by directed edges in the following transition graph, where some
but not all probabilities are written out. Find the transition probability matrix.
(Figure: transition graph on states 0, 1, 2, 3, with some of the transition probabilities written out.)
Solution. In the graph we may directly read off the matrix elements

$\mathbf{P} = \begin{pmatrix} 0 & 0.1 & \cdot & 0 \\ 0.3 & 0.4 & 0 & \cdot \\ 0.2 & 0 & 0.3 & \cdot \\ 0 & \cdot & 0.4 & 0 \end{pmatrix}$

Then add the remaining elements such that each row sum equals one, to get

$\mathbf{P} = \begin{pmatrix} 0 & 0.1 & 0.9 & 0 \\ 0.3 & 0.4 & 0 & 0.3 \\ 0.2 & 0 & 0.3 & 0.5 \\ 0 & 0.6 & 0.4 & 0 \end{pmatrix}$
2. A Markov chain with state space E = {0, 1, . . . , 6} has the transition matrix
$\mathbf{P} = \begin{pmatrix}
0 & 1/2 & 0 & 1/3 & 0 & 0 & 1/6 \\
3/4 & 0 & 0 & 0 & 1/4 & 0 & 0 \\
0 & 0 & 0 & 2/3 & 1/3 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1/4 & 0 & 0 & 3/4 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}$
Draw a state transition diagram.
Solution. Guided by the position of nonzero elements in P place the nodes of
the graph to help visualize the transitions, for example as done in Figure 2.
(Figure 2: state transition diagram for the seven-state chain.)
which in equilibrium gives P (Xn+1 = N |Xn = H)πH = 0.95 · 0.095. During one
hour one would expect approximately 60 · 0.95 · 0.095 ≈ 5.4 such measurements.
6. Let
$\mathbf{P} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}$
be the transition probability matrix for a Markov chain. The simulation in
Figure 3 shows a typical path of the process.
a) Is the chain irreducible? Is it aperiodic?
b) Find all stationary distributions.
c) Consider the chain restricted to the even time points 0, 2, 4, . . . . Find the
corresponding transition matrix. Is this new chain irreducible? Is it aperi-
odic? Find all stationary distributions.
(Figure 3: simulated path of the chain.)
Solution. Since it is possible to visit the states one by one and return to the
initial state following the cycle 1 → 2 → 3 → 4 → 1, the Markov chain is
irreducible. Returns to state {1} can only occur at times 2, 4, 6, . . . , similarly
for the other states, hence the chain is periodic with period 2. Because of this
Theorem 2.14 is not applicable, but we can still find the stationary distributions by
solving the equation π = πP. In this case there is only one solution, which is
thus the unique stationary distribution π = (1/3, 1/3, 1/6, 1/6).
For the even numbered dynamics of c), the transition matrix becomes

$\mathbf{Q} = \mathbf{P}^2 = \begin{pmatrix} 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$

This chain is not irreducible since $\{1, 3\}$ and $\{2, 4\}$ form irreducible sub-chains with transition matrix $\mathbf{R} = \begin{pmatrix} 1/2 & 1/2 \\ 1 & 0 \end{pmatrix}$. The new Markov chain is aperiodic since states 1 and 2 can be revisited in only one step. For $\mathbf{R}$ the vector $(2/3, 1/3)$ is a stationary distribution, and from this it follows that the stationary distributions for $\mathbf{Q}$ are of the form $(2c/3,\ 2(1-c)/3,\ c/3,\ (1-c)/3)$, where $0 \le c \le 1$.
7. A simple model for DNA is that the sequence is a string of symbols, {Xn }, where
each new nucleotide is independently chosen from {A, C, G, T } with probabilities
0.25, 0.30, 0.15, 0.30, respectively. Suppose that there is an enzyme that breaks
up the sequence as soon as the “word” AC appears in the sequence. To study
the effect of the enzyme, let {Yn } be a process with the three states {A, AC, B},
so that Yn = A if Xn = A, Yn = AC if Xn−1 = A and Xn = C, and Yn = B
otherwise.
(a) Motivate that {Yn } is a Markov chain, and determine its transition matrix.
(b) Compute the asymptotic distribution for {Yn }.
(c) What is the average length of an AC-fragment (in equilibrium)?
Solution. a) Each new symbol $X_{n+1}$ is independent of the previous ones $X_1, \dots, X_n$ in the sequence, and hence $\{X_n\}$ is, trivially, a Markov chain. The transition
probability matrix is
pA pC pT pG
pA pC pT pG
Pold =
pA pC pT pG
pA pC pT pG
where pA = 0.25, pC = 0.30, pT = 0.30, pG = 0.15. Next, we consider the
sequence Y1 , Y2 , . . . . If, for some n, Yn equals A then Yn+1 will be again A with
probability pA , AC with probability pC , and B with the remaining probability
pT + pG . Similarly, If Yn is AC or B we obtain the distribution of Yn+1 , without
using any knowledge of Y1 , . . . Yn−1 . Hence {Yn } is a Markov chain, and the
transition matrix is
pA pC pT + pG 0.25 0.30 0.45
Pnew = pA 0 1 − pA = 0.25 0 0.75
pA 0 1 − pA 0.25 0 0.75
b) Since {Yn } is irreducible and aperiodic the asymptotic distribution is given
by the stationary distribution π = (1/4, 3/40, 27/40), which is obtained as the
solution to π = πPnew .
c) The expected number of nucleotides from one instance of AC to the next
is the “expected return time”, given by the ratio 1/π2 = 40/3 ≈ 13.3 (Theorem
2.19).
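The stationary distribution in b) can be checked numerically, for instance as the normalized left eigenvector of Pnew for eigenvalue 1 (a sketch):

Pnew = [0.25 0.30 0.45; 0.25 0 0.75; 0.25 0 0.75];
[V, D] = eig(Pnew');                  % right eigenvectors of Pnew' = left of Pnew
[~, k] = min(abs(diag(D) - 1));       % locate the eigenvalue 1
piv = V(:, k)' / sum(V(:, k))         % (1/4, 3/40, 27/40)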
8. The following paintball duelling problem has been given as an assignment within
the course Programming for Engineering students at UU, to be studied as a simu-
lation exercise. Here, we find exact solutions by using Markov chain techniques.
Agnes, Beata and Cecilia decide to settle a dispute by performing a paintball
shooting tournament. Cecilia is an excellent shooter who always hits her target.
Beata hits on average every second shot. Least skilled Agnes is known to have a
hitting probability of 0.3. Thus, it is agreed that the firing order will be Agnes
first, then Beata and last Cecilia. Each person is allowed to fire one shot and is
free to choose at whom to aim among the remaining competitors. Any person hit by a
shot is out. Last person remaining is declared the winner. We assume that each
shot is independent of any previous shot.
a) A reasonable strategy for maximizing the chance of winning is to always
aim at the best remaining shooter, since that person is likely to be your
greatest threat. Assuming this strategy find the winning probabilities for
Agnes, Beata and Cecilia, respectively.
b) Agnes might consider the following alternative strategy. As long as both
Beata and Cecilia remain then Agnes fires in the air, deliberately missing
the target, with the motivation that none of Beata or Cecilia would aim
their first shot at her. Suppose Agnes adopts the alternative strategy with
everything else as before. Does this increase her chances to win? Find the
winning probabilities in this case.
Solution for a).
Agnes aims her first shot at Cecilia. If she misses, Beata is also going to
shoot at Cecilia. If Beata misses, she will be hit in the next round by Cecilia,
after which Agnes would take a shot at Cecilia. Thus, the only way for Cecilia
to win the competition is that Agnes misses her first shot, that Beata misses her
first shot, and that Agnes misses her second shot. Since these three events are
independent, the winning probability for Cecilia is 0.7 · 0.5 · 0.7 = 0.245.
Computing Agnes's and Beata's winning probabilities is more involved. We may
represent the entire game by using a Markov chain with 10 states as follows:
ABC BAC CAB AB BA AC CA A B C
Here, the three states A, B and C represent Agnes, Beata or Cecilia winning
the game. A two-letter symbol such as BA means that Agnes and Beata remain
and the first person listed, Beata in this case, is about to shoot at Agnes. Three
letters such as BAC indicate that all three remain with the first listed, Beata,
to aim at the third listed, Cecilia, with the second listed, Agnes, safe for the
moment. By going through the various cases which may occur, we obtain the
transition graph of Figure 4. The initial state is ABC, and the states A, B and C are absorbing.

(Figure 4: transition graph of the paintball tournament Markov chain.)
Let $q_S$ denote the probability that Beata wins starting from state $S$; in particular, the winning probability for Beata is $q_{ABC}$. Now, considering the outcome of the first shot of the tournament,

$q_{ABC} = 0.3\,q_{BA} + 0.7\,q_{BAC}.$
Moreover, starting from BA we have qBA = 0.5 + 0.5qAB and starting from AB
yields qAB = 0.3 · 0 + 0.7qBA . The last two equations together imply qBA = 10/13
and qAB = 7/13. Furthermore, starting from the state BAC the only way for
Beata to win the game is to pass through AB. Hence qBAC = 0.5qAB = 7/26.
Therefore,
$P(\text{Beata wins}) = \frac{3}{10}\cdot\frac{10}{13} + \frac{7}{10}\cdot\frac{7}{26} = \frac{109}{260} \approx 0.4192.$
It follows that the probability for Agnes to win is $1 - 109/260 - 49/200 = 873/2600 \approx 0.3358$.
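The first-step equations can also be solved mechanically in Matlab; the sketch below collects the four equations above in the unknown order $(q_{ABC}, q_{BA}, q_{AB}, q_{BAC})$.

A = [1 -0.3  0   -0.7;
     0  1   -0.5  0;
     0 -0.7  1    0;
     0  0   -0.5  1];
b = [0; 0.5; 0; 0];
q = A \ b;
q(1)                 % 109/260 = 0.4192, Beata's winning probability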
Chapter 3. Continuous time Markov chains
Much of the flexibility and versatility of Markov chain modeling come from the fact
that, just as we have useful theory and methods for chains which jump at discrete,
prespecified, time points, there is a parallel theory of Markov chains which carry
out the jumps at random time points on a continuous time scale. A Markov chain
{X(t), t ≥ 0} in continuous time is defined by a collection of transition intensities
{qij , i, j ∈ E}, such that
(3.1) $\quad P(X(t + h) = j \mid X(t) = i) = q_{ij}h + o(h), \quad h \to 0, \quad i \ne j.$
Since these probabilities do not depend on t we are again considering the time
homogeneous case (c.f. discrete time). By comparison with the property of the
Poisson process in Proposition 4 b), it follows from (3.1) that the waiting time for
the chain to jump from state i to state j is an exponential random variable with
intensity qij . In fact, it is a consequence of the Markov property in continuous time
that all holding times between successive jumps must be exponentially distributed
with state dependent parameters. Moreover, the waiting time until the chain leaves
a state i is also exponential. To see this, observe that
P (jump occurs during (t, t + h]|X(t) = i) = (qi0 + qi1 + . . . )h + o(h), h → 0,
and denote
$q_i = \sum_{j \ne i} q_{ij}.$
When the chain leaves state $i$, the next state is chosen according to the probability distribution

$P(\text{chain jumps to } j \mid \text{chain leaves } i) = q_{ij}/q_i, \quad j \ne i.$
The structure of the continuous time Markov chain is indicated in Figure 1.
(Figure 1: a trajectory of a continuous time Markov chain: the holding time in state $i$ is $\mathrm{Exp}(q_i)$, a jump occurs in $(t, t+h]$ with probability $q_i h + o(h)$, and the jump goes to $j$ with probability $q_{ij}/q_i$.)
Definition 3.1. The infinitesimal generator of the continuous time Markov chain
{X(t)} is the matrix
$\mathbf{Q} = \begin{pmatrix} -q_0 & q_{01} & q_{02} & \cdots \\ q_{10} & -q_1 & q_{12} & \cdots \\ q_{20} & q_{21} & -q_2 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$

where it should be noted that by definition of the diagonal elements all row-sums are equal to zero, $\sum_j q_{ij} = 0$, $i = 0, 1, \dots$.
Example 3.2 (Poisson process is Markov). It follows from Proposition 1.26
and the previous definition that the standard Poisson process is a continuous time
Markov chain with infinitesimal generator
$\mathbf{Q} = \begin{pmatrix} -\lambda & \lambda & 0 & \cdots \\ 0 & -\lambda & \lambda & \cdots \\ 0 & 0 & -\lambda & \cdots \\ \vdots & & & \ddots \end{pmatrix}$
The definitions 2.7 and 2.8 given earlier of stationary and asymptotic distribu-
tions for discrete time Markov chains, have straightforward counterparts for Markov
chains in continuous time. Thus, if P (X(0) = k) = πk , k ∈ E, and {πk } is a station-
ary distribution for a Markov chain {X(t), t ≥ 0}, then P (X(t) = k) = πk , k ∈ E,
for all t ≥ 0. Moreover, if P (X(t) = k) → πk , k ∈ E, as t → ∞ regardless of the
distribution of X(0), then {πk } is said to be an asymptotic distribution. Again, as-
ymptotic distributions are always stationary. It is sometimes convenient to describe
an asymptotic distribution π by saying that the Markov process has a steady state
X∞ with πk = P (X∞ = k), k ∈ E.
Example 3.4 (Spin-flip Markov chain). Consider the two states E = {+, −}
and think of the plus state as spin-up and the minus state as spin-down. The
spin-flip Markov chain is defined by the infinitesimal generator matrix
$\mathbf{Q} = \begin{pmatrix} -\lambda & \lambda \\ \mu & -\mu \end{pmatrix}$
This is a continuous time Markov chain {X(t), t ≥ 0} on E with the following
dynamical behavior. If the chain is spin-down at a fixed point in time then it will
stay there for an exponentially distributed random time with expected value 1/λ
and then flip to spin-up. If the chain is in state + at a fixed time it remains up for
an exponentially distributed random time with expected value 1/µ and then flips
to state −. Because of the memoryless property of the exponential distribution, the
above description is true regardless of which “fixed time” we choose. In conclusion
the trajectory of the chain consists of consecutive cycles where each cycle is composed
of a spin-down period and a subsequent spin-up. Applying Theorem 3.3 to this
example we rename the states E = {0, 1} and seek a solution π = (π0 , π1 ) with
π0 + π1 = 1 of the matrix equation πQ = 0, that is
$$-\lambda\pi_0 + \mu\pi_1 = 0, \qquad \lambda\pi_0 - \mu\pi_1 = 0, \qquad \pi_0 + \pi_1 = 1.$$
It follows that the two-state Markov chain has a unique stationary distribution
π0 = µ/(λ + µ), π1 = λ/(λ + µ). For an arbitrary initial distribution of
X(0), the two-state chain has the asymptotic property
$$\lim_{t\to\infty} P(X(t) = 0) = \pi_0 = \frac{\mu}{\lambda+\mu}, \qquad \lim_{t\to\infty} P(X(t) = 1) = \pi_1 = \frac{\lambda}{\lambda+\mu}.$$
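To illustrate the dynamics, the following minimal MATLAB sketch simulates one trajectory of the spin-flip chain by drawing the exponential holding times directly. The parameter values lambda = 1, mu = 2 and the time horizon are illustrative assumptions, not taken from the text.

lambda=1; mu=2; tmax=20;       % illustrative rates and time horizon
t=0; x=0;                      % start in the spin-down state 0
T=t; X=x;                      % record jump times and states
while t<tmax
  if x==0
    t=t-log(rand)/lambda;      % Exp(lambda) holding time in state 0
  else
    t=t-log(rand)/mu;          % Exp(mu) holding time in state 1
  end
  x=1-x;                       % flip the spin
  T=[T t]; X=[X x];
end
stairs(T,X);                   % plot the trajectory

Averaging many such trajectories at a fixed large time approximates the asymptotic probabilities π0 and π1 .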
The conditioning technique of first-step analysis discussed in Section 2.3 for
discrete time Markov chains applies to continuous time systems as well. We give an
example.
Example 3.5 (Condition on first jump, continuous time). Consider the
Markov chain with states E = {1, 2, 3} and infinitesimal generator matrix
$$Q = \begin{pmatrix} -\lambda & \lambda & 0 \\ 2\mu & -(2\mu+\lambda) & \lambda \\ 0 & 3\mu & -3\mu \end{pmatrix},$$
Let
vij = expected time to reach j starting from i.
To demonstrate the method, let us fix j = 3 and focus on the expected times v13 and
v23 . For each initial state, 1 and 2, by conditioning on the outcome of the first jump,
we obtain the equations
$$v_{13} = \frac{1}{\lambda} + v_{23}, \qquad v_{23} = \frac{1}{2\mu+\lambda} + \frac{\lambda}{2\mu+\lambda}\cdot 0 + \frac{2\mu}{2\mu+\lambda}\,v_{13},$$
and hence
$$v_{13} = 2(\lambda+\mu)/\lambda^2, \qquad v_{23} = (\lambda+2\mu)/\lambda^2.$$
Similarly for other values of j.
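The same equations can be solved numerically from the generator matrix. Below is a minimal MATLAB sketch, with the illustrative values lambda = 1, mu = 2, which uses the standard fact that the vector of expected absorption times v solves Av = −1 where A is Q restricted to the transient states {1, 2}:

lambda=1; mu=2;                % illustrative parameter values
Q=[-lambda lambda 0;
   2*mu -(2*mu+lambda) lambda;
   0 3*mu -3*mu];
A=Q(1:2,1:2);                  % generator restricted to states 1 and 2
v=(-A)\ones(2,1)               % expected times v13 and v23 to reach state 3
% compare with 2*(lambda+mu)/lambda^2 and (lambda+2*mu)/lambda^2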
For non-finite state space it is more complicated to give conditions under which
there exists an asymptotic distribution. We restrict this discussion to a particu-
lar class of processes of great importance for applications, namely birth-and-death
processes.
We have thus proved, by induction, the balance equations for birth-and-death pro-
cesses in the form
(3.2) $\mu_n\pi_n = \lambda_{n-1}\pi_{n-1}, \quad n \geq 1.$
Solving in terms of π0 , this gives
$$\pi_n = \frac{\lambda_{n-1}}{\mu_n}\,\pi_{n-1} = \dots = \frac{\lambda_{n-1}\cdots\lambda_0}{\mu_n\cdots\mu_1}\,\pi_0, \quad n \geq 1.$$
Then it remains to find π0 such that these relations are consistent with the required
normalization
$$\pi_0 + \sum_{k=1}^{\infty}\pi_k = \Bigl(1 + \sum_{k=1}^{\infty}\frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k}\Bigr)\pi_0 = 1.$$
Based on the above motivating derivations we can now state the central result for
stationary distributions of birth-and-death processes. This result gives necessary
and sufficient conditions for an asymptotic distribution to exist and, if it exists, an
explicit product formula for the limiting equilibrium probabilities.
First, however, it is important to point out that in the case of an irreducible
birth-and-death process on a finite state space E = {0, 1, . . . , r}, we have
$$\pi_n = \frac{\lambda_{n-1}\cdots\lambda_0}{\mu_n\cdots\mu_1}\,\pi_0, \quad 1 \leq n \leq r,$$
the normalization becomes $\sum_{k=0}^{r}\pi_k = 1$, and we can deduce already from Theorem
3.3 that these relations define a stationary and asymptotic distribution.
We give two examples. In the first example the state space is finite and the
theory is covered by Theorem 3.3. The second example has countable state space
and therefore requires Theorem 3.6.
Example 3.7 (Lost customers). Cars arrive at a village gas station according to
a Poisson process with an average arrival rate of 30 cars per hour. The gas station
is equipped with two pumps. On average a customer spends three minutes at one
of the pumps. The service times are assumed to be exponential and independent of
each other and of the customer arrival patterns. At the station there is only room for
one car at each of the pumps plus one more waiting. If all three spots are occupied
Figure 2. A simulated trajectory of the number of cars X(t) at the gas station over a period of 2.5 hours.
at the time of arrival of a fourth potential customer, the new car instead continues
to another gas station. We are interested in the proportion of potential customers
leaving the station due to the lack of waiting areas.
Let X(t) ∈ {0, 1, 2, 3}, t ≥ 0, denote the number of cars at the gas station, mod-
eled as an irreducible birth-and-death process with infinitesimal generator matrix
$$\begin{pmatrix} -\lambda & \lambda & 0 & 0 \\ \mu & -(\lambda+\mu) & \lambda & 0 \\ 0 & 2\mu & -(\lambda+2\mu) & \lambda \\ 0 & 0 & 2\mu & -2\mu \end{pmatrix}.$$
The given information says that λ = 30 per hour and µ = 20 per hour. Figure 2 shows
the typical variation of X(t) over a period of 2.5 hours. It is natural to describe the
current number of cars at the station using the stationary distribution of the birth-
and-death process. According to the above discussion the asymptotic probabilities
satisfy
$$\pi_1 = \frac{\lambda}{\mu}\,\pi_0, \qquad \pi_2 = \frac{\lambda^2}{2\mu^2}\,\pi_0, \qquad \pi_3 = \frac{\lambda^3}{4\mu^3}\,\pi_0.$$
After normalization and insertion of the numeric values of the parameters, this
gives the steady state solution π = (32, 48, 36, 27)/143. In particular, E(# cars) =
201/143 ≈ 1.406. Moreover, the probability of losing a customer becomes π3 = 27/143 ≈ 0.189,
which at the arrival rate 30 per hour amounts to approximately 5.7 lost customers per hour.
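As a numerical check, the stationary distribution can be computed in MATLAB by appending the normalization condition to the system πQ = 0; a minimal sketch:

lambda=30; mu=20;
Q=[-lambda lambda 0 0;
   mu -(lambda+mu) lambda 0;
   0 2*mu -(lambda+2*mu) lambda;
   0 0 2*mu -2*mu];
A=[Q'; ones(1,4)];             % transpose of pi*Q = 0 plus normalization row
p=A\[zeros(4,1); 1]            % returns (32 48 36 27)/143 up to floating point
m=(0:3)*p;                     % expected number of cars
ploss=p(4);                    % probability of losing a customer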
Example 3.8 (The service model M/M/1). The M/M/1 model is the birth
and death process {X(t), t ≥ 0} in continuous time with constant jump intensities
$$\lambda_n = \lambda, \quad n \geq 0, \qquad \mu_n = \mu, \quad n \geq 1.$$
For the M/M/1 model, condition (3.3) is fulfilled under the assumption λ/µ < 1. It is customary to introduce
a new parameter ̺, called the traffic intensity, by setting
̺ = λ/µ
and conclude, in the sub-critical regime ̺ < 1, that (3.4) simplifies to
$$\pi_k = \varrho^k \Big/ \sum_{k=0}^{\infty}\varrho^k = \varrho^k(1-\varrho).$$
Hence if ̺ < 1 the M/M/1 model has an asymptotic steady state given by the
Ge(1 − ̺) distribution. We illustrate this fundamental example in Figure 3 which
shows simulated traces of the process for the three cases with traffic intensity ̺ equal
to 0.9, 1.0 and 1.1, respectively.
Figure 3. Simulated traces of the M/M/1 process for traffic intensity ̺ = 0.9, 1.0 and 1.1.
Figure 4. (bar chart over the rating classes AAA, AA, A, BBB, BB, B, CCC)
3.2. Credit rating models

The table below is taken from a report published by Standard & Poor's, Sept. 2011, which uses credit rating data 1990-2010 for a
large number of global corporations and shows average one-year rates for transitions
between the credit classes.
AAA AA A BBB BB B CCC D
AAA 0.8791 0.0808 0.0054 0.0005 0.0008 0.0003 0.0005 0
AA 0.0057 0.8645 0.0819 0.0053 0.0006 0.0008 0.0002 0.0002
A 0.0004 0.0190 0.8730 0.0537 0.0038 0.0017 0.0002 0.0008
BBB 0.0001 0.0013 0.0371 0.8456 0.0399 0.0066 0.0015 0.0025
BB 0.0002 0.0004 0.0017 0.0522 0.7568 0.0733 0.0076 0.0095
B 0 0.0004 0.0014 0.0023 0.0549 0.7318 0.0449 0.0472
CCC 0 0 0.0019 0.0028 0.0083 0.1300 0.4380 0.2743
For example, close to 88% of all corporations rated AAA on January 1 of a given
year managed to keep their top rating one year later. Of the A-rated companies, 2%
were upgraded to AA and 5% downgraded to BBB, and so on.
The credit rating data closely resemble the way we interpret transitions of Markov
chains, except that the probabilities in each row of the table do not sum to one. This
is because the agency for some reason was unable to estimate new ratings for the
remaining percentage of corporations in each class. We will take the simplest ap-
proach in dealing with this issue and essentially ignore the missing data. To proceed
we now implement the Markov property by assuming that the history of a corporation
does not affect its future prospects, only the current standing. For example,
an A-corporation recently downgraded from AA is an equally risky investment as a
corporation graded A for a longer time period.
Now we follow the approach in [16], which applies a continuous time Markov
chain with the objective of estimating corporate lifespans based on the credit rating
data. The corresponding infinitesimal generator matrix is
−0.0883 0.0835 0.0056 0.0005 0.0008 0.0003 0.0005 0
0.0059 −0.0947 0.0854 0.0055 0.0006 0.0008 0.0002 0.0002
0.0004 0.0199 −0.0796 0.0564 0.0040 0.0018 0.0002 0.0008
0.0001 0.0014 0.0397 −0.0890 0.0427 0.0071 0.0016 0.0027
0.0002 0.0004 0.0019 0.0579 −0.1449 0.0813 0.0084 0.0105
0 0.0005 0.0016 0.0026 0.0622 −0.1511 0.0509 0.0535
0 0 0.0022 0.0033 0.0097 0.1520 −0.4173 0.3207
0 0 0 0 0 0 0 0
where the last row of zeros represents the absorbing default state D. Following the
method in Example 3.5 we can now find the expected values mi = E(TiD ), the
expected time to default given a corporation belongs to class i. The numerical
calculation is done in [16] with the resulting mean corporate lifespan ranging from
105.5 years for an AAA-company to about 15 years for CCC.
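The computation sketched in Example 3.5 can be carried out directly in MATLAB for this generator; a sketch, where the matrix is copied from the text and state 8 represents D:

Q8=[-0.0883 0.0835 0.0056 0.0005 0.0008 0.0003 0.0005 0;
    0.0059 -0.0947 0.0854 0.0055 0.0006 0.0008 0.0002 0.0002;
    0.0004 0.0199 -0.0796 0.0564 0.0040 0.0018 0.0002 0.0008;
    0.0001 0.0014 0.0397 -0.0890 0.0427 0.0071 0.0016 0.0027;
    0.0002 0.0004 0.0019 0.0579 -0.1449 0.0813 0.0084 0.0105;
    0 0.0005 0.0016 0.0026 0.0622 -0.1511 0.0509 0.0535;
    0 0 0.0022 0.0033 0.0097 0.1520 -0.4173 0.3207;
    0 0 0 0 0 0 0 0];
A=Q8(1:7,1:7);                 % generator restricted to the rating classes
m=(-A)\ones(7,1)               % expected years to default from AAA,...,CCC
% compare with the mean corporate lifespans reported in [16]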
Since the data reflect annually updated corporate ratings it is also natural to use
the discrete time model. Now we take the view that defaulted corporations have a
chance to regain creditworthiness, with a probability of 5% say. For simplicity, we
suppose this happens by a restructuring procedure leading to an immediate AAA
rating. By renormalizing the diagonal elements we obtain the probability transition
matrix
P =
0.9087 0.0835 0.0056 0.0005 0.0008 0.0003 0.0005 0
0.0059 0.9013 0.0854 0.0055 0.0006 0.0008 0.0002 0.0002
0.0004 0.0199 0.9164 0.0564 0.0040 0.0018 0.0002 0.0008
0.0001 0.0014 0.0397 0.9048 0.0427 0.0071 0.0016 0.0027
0.0002 0.0004 0.0019 0.0579 0.8393 0.0813 0.0084 0.0105
0 0.0005 0.0016 0.0026 0.0622 0.8289 0.0509 0.0535
0 0 0.0022 0.0033 0.0097 0.1520 0.5121 0.3207
0.0500 0 0 0 0 0 0 0.9500
This defines a finite, irreducible and aperiodic Markov chain. By Theorem 2.14
there exists a unique asymptotic distribution π which solves π = πP . The solution
is displayed graphically in Figure 5.
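A minimal MATLAB sketch for this computation, reusing the matrix Q8 from the sketch above; the construction of P from the off-diagonal rates and the 5% restructuring rule follows the description in the text:

P=Q8-diag(diag(Q8));             % off-diagonal transition rates
P=P+diag(1-sum(P,2));            % renormalize the diagonal elements
P(8,:)=[0.05 0 0 0 0 0 0 0.95];  % restructuring: D restarts as AAA w.p. 5%
[V,D]=eig(P');                   % left eigenvector for eigenvalue 1
[~,k]=min(abs(diag(D)-1));
p=V(:,k)/sum(V(:,k))             % asymptotic distribution, cf. Figure 5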
Figure 5. The asymptotic distribution π over the rating classes AAA, AA, A, BBB, BB, B, CCC, D.
Solved exercises
1. Find the asymptotic distribution of the birth-death process defined by the birth
and death intensities
λn = 2, 0 ≤ n ≤ 3, µn = 1, 1 ≤ n ≤ 4.
Determine the expected value and the variance of the asymptotic distribution.
Solution. We have a finite Markov chain with states {0, 1, 2, 3, 4}, which is irre-
ducible since all λn and µn are strictly positive. By Theorem 3.3, there exists a
unique stationary distribution π = (π0 , . . . , π4 ) which is also asymptotic. Here,
the probability vector π solves the system of equations πQ = 0 and we know for
finite birth-death chains that the solution has the product form
$$\pi_k = \frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k}\,\pi_0 = 2^k\,\pi_0, \quad k = 1, 2, 3, 4.$$
By normalizing the solution, meaning that we use the criterion π0 + · · · + π4 = 1,
we obtain
$$\pi = \frac{(1, 2, 4, 8, 16)}{1+2+4+8+16} = \Bigl(\frac{1}{31}, \frac{2}{31}, \frac{4}{31}, \frac{8}{31}, \frac{16}{31}\Bigr).$$
The corresponding expected value is
$$m = \sum_{k=0}^{4} k\pi_k = \frac{0+2+8+24+64}{31} = 98/31 \approx 3.16$$
and the variance
$$\sigma^2 = \sum_{k=0}^{4} k^2\pi_k - m^2 = \frac{0+2+16+72+256}{31} - m^2 = \frac{1122}{31^2} \approx 1.17.$$
2. A continuous time Markov chain X defined on the state space E = {0, 1, 2, 3}
starts in state 0 and has the generator matrix
$$\begin{pmatrix} -5 & 5 & 0 & 1 \\ 0 & -5 & 2 & 3 \\ 0 & 2 & -5 & 3 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$
Find the expected time to absorption in state 3.
It follows that there exists a unique distribution {πk }, stationary and asymptotic,
given by (3.4) as
$$\pi_k = \frac{(\lambda/\mu)^k}{k!}\,\pi_0, \qquad \pi_0 = \exp\{-\lambda/\mu\},$$
which we recognize as the Poisson distribution with mean value λ/µ.
4. A birth-death process has the jump intensities λk = λ(N − k) and µk = µN k.
Here λ and µ are positive parameters, N is a positive integer, and k varies
between 0 and N . Show that there is an asymptotic distribution, which is
given by a particular standard distribution. What happens to the asymptotic
distribution as N → ∞?
Solution. The balance equations for this particular birth-death process give the
result
$$\pi_k = \frac{\lambda_0\cdots\lambda_{k-1}}{\mu_1\cdots\mu_k}\,\pi_0 = \binom{N}{k}(\lambda/\mu N)^k\,\pi_0, \quad 0 \leq k \leq N.$$
Normalization as a probability distribution shows that $\pi_0 = 1/(1+\lambda/\mu N)^N$,
hence
$$\pi_k = \binom{N}{k}\Bigl(\frac{\lambda/\mu N}{1+\lambda/\mu N}\Bigr)^k \Bigl(\frac{1}{1+\lambda/\mu N}\Bigr)^{N-k}.$$
We recognize the stationary distribution as a $\mathrm{Bin}\bigl(N, \frac{\lambda/\mu N}{1+\lambda/\mu N}\bigr)$. Since the process
has finite state space and is irreducible, this distribution is also asymptotic.
A general result in probability theory says that a binomial distribution
Bin(n, a/n) converges in distribution to the Poisson distribution Po(a) as n → ∞.
In this case we obtain the Poisson distribution with mean λ/µ in the limit, since
$$N\cdot\frac{\lambda/\mu N}{1+\lambda/\mu N} = \frac{\lambda}{\mu+\lambda/N} \to \lambda/\mu, \quad N \to \infty.$$
Chapter 4

Some non-Markov models
The Markovian paradigm, that given the past up to now the future depends only
on the present, is mathematically tractable and widely applicable. Nevertheless, the
Markov property singles out a particular class of models and excludes other natural
random mechanisms. In this direction we will discuss two types of non-Markov
extensions, renewal processes and stationary time series models.
The special choice F1 (t) = F (t) = 1 − e−λt in Definition 4.1 is the case where all
variables U1 , U2 , . . . are exponentially distributed with expected value 1/λ. Then,
by Definition 1.24, N (t) is the Poisson process with intensity λ. This is the only
choice of F which makes the renewal process a Markov process, and hence we can
think of renewal processes as non-Markov generalizations of the Poisson counting
process.
Figure 1. A sample path of the renewal counting process N(t), with renewal times T1 , T2 , . . . and holding times U1 , U2 , . . .
Figure 1 is helpful in order to make two useful observations about the quantities
Tn and N (t) in Definition 4.1, namely
$$\{N(t) \geq n\} = \{T_n \leq t\}, \quad n \geq 1,$$
and
$$T_{N(t)} \leq t < T_{N(t)+1}.$$
The latter ordering property shows that
(4.2) $\displaystyle \frac{1}{N(t)}\sum_{i=1}^{N(t)} U_i \;\leq\; \frac{t}{N(t)} \;\leq\; \frac{1}{N(t)+1}\sum_{i=1}^{N(t)+1} U_i \cdot \frac{N(t)+1}{N(t)}$ on the set {N (t) > 0}.
The strong law of large numbers gives us $\frac{1}{n}\sum_{k=1}^{n} U_k \to \nu$ as n → ∞, in the sense of
almost sure convergence. As t → ∞, the number of renewals also tends to infinity.
Actually N (t) → ∞ passing through every integer value one by one. Thus
$$\frac{1}{N(t)}\sum_{k=1}^{N(t)} U_k \to \nu, \quad t \to \infty.$$
But then it follows from (4.2) that t/N (t) is asymptotically squeezed in between two
quantities, below and above, both converging to ν. This shows that N (t)/t → 1/ν
as t → ∞. More advanced methods, see e.g. [10], [21], lead to the corresponding
property for the mean number of renewals. This type of result is known as the
elementary renewal theorem.
4.2. Renewal reward processes
The aim is to find the asymptotic mean reward in the sense of a time average, i.e.
the limit as t → ∞ of
$$\frac{R(t)}{t} = \frac{N(t)}{t}\cdot\frac{1}{N(t)}\sum_{i=1}^{N(t)} R_i + \text{fraction of partial reward}.$$
It is clear from this relation what to expect. The renewal theorem shows that
N (t)/t → 1/ν, and so it should follow from the strong law of large numbers that
R(t)/t → E(R)/ν as t → ∞. The typical assumptions which are imposed on the
rewards in order to guarantee the expected behavior are that for each j, Rj may
depend on Uj but is independent of all Ui , i 6= j, and that {Ri } is an independent
sequence, identically distributed except possibly R1 which is allowed to have a differ-
ent distribution. Moreover, it is assumed that E|Ri | < ∞ for any i. It can be shown
that under these assumptions the details of assigning rewards to renewal events do
not affect the end result. It does not matter whether a reward is counted at the
beginning or at the end of a renewal interval, or if it is gradually allocated contin-
uously over time. In either case the partial rewards vanish asymptotically, and the
renewal reward theorem states that the time averaged total reward converges to the
cycle averaged reward.
For proofs and more general versions related to regenerative processes, see e.g.
Wolff [21].
Example 4.4 (Markov two-state process). Consider the two-state Markov pro-
cess in Example 3.4. Denote the successive on-period durations by S1 , S2 , . . . and
the off-period durations by T1 , T2 , . . . , so that the on and the off periods form i.i.d.
sequences with exponential distributions of parameter µ and λ, respectively. Put
Ui = Si + Ti , i ≥ 2 and define U1 to be S1 + T1 if X(0) = 1 and T1 otherwise. Let
N (t) be the renewal process associated with the sequence {Ui } and define rewards
Ri = Si for each renewal cycle i. Then N (t) counts the number of on-periods up
to time t and the corresponding renewal reward process R(t) gives the total dura-
tion of on-periods up to time t. By the renewal reward theorem the ratio R(t)/t,
which is the fraction of time that the two-state Markov chain spends in the on-state,
converges to the ratio of expected reward to expected cycle length, that is
$$\frac{R(t)}{t} \to \frac{E(R)}{E(S)+E(T)} = \frac{1/\mu}{1/\mu + 1/\lambda} = \frac{\lambda}{\lambda+\mu}.$$
This shows that the limit of the time average studied in this example is the same as
the asymptotic probability π1 in Example 3.4.
Example 4.5 (On-off process). The Markov property of the on-off process in
the previous example is not used for the application of the renewal reward
theorem. In fact, we may take arbitrary distributions with finite expected values
E(S) and E(T ) respectively, to model any system which goes through consecutive
on and off periods which satisfy the independence criteria of the renewal model.
By the renewal reward theorem we have the intuitively appealing result that the
asymptotic availability, meaning the asymptotic fraction of time during which the
system is on, is given by the ratio E(S)/(E(S) + E(T )).
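A quick simulation check of this ratio is sketched below in MATLAB; the on/off distributions, here Exp(2) and Exp(1), are illustrative assumptions for the example only:

ncycles=1e5;
S=-log(rand(1,ncycles))/2;     % on-periods, Exp(2) with mean 1/2
T=-log(rand(1,ncycles));       % off-periods, Exp(1) with mean 1
sum(S)/(sum(S)+sum(T))         % close to E(S)/(E(S)+E(T)) = 1/3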
4.3. Reliable data transfer

In this model a sender transmits data packets over a channel and the receiver returns an acknowledgment packet (ACK) for each packet successfully received. A time-out clock with expiry time T0 is used to handle the loss of an ACK; according to certain rules expiry of
the clock at the sender results in retransmission of one or several unacknowledged
packets. Suppose that with probability p, either the packet or its ACK is lost during
transmission, hence unacknowledged at the sender. Suppose also that packet losses
are independent of each other.
We begin with the Go-Back-1 protocol. The sender transmits a single packet on
the channel. Then waits a round-trip-time for the corresponding ACK to arrive. If
the packet is successfully delivered the return of the ACK packet marks the start of
the next round. If the packet is lost so that no ACK packet arrives in the expected
time period, then the time-out clock is activated and the packet is retransmitted
with an additional delay of T0 round-trip-times.
The time periods between consecutive packet loss events will now form cycles
where each cycle consists of a random number of rounds. Let K denote the number
of such rounds in a cycle, and associate with cycle i a reward Ri counting the
packets successfully transmitted during that cycle. The throughput up to time t
is then
$$\mathrm{Throughput}(t) = \frac{1}{t}\sum_{i=1}^{N(t)} R_i.$$
4.4. Time series models

In discrete time, a linear filter transforms an input signal {x(n)} into the output signal $y(n) = \sum_k h(k)x(n-k)$; in continuous time, the corresponding relation is $y(t) = \int h(s)x(t-s)\,ds$. Here {h(n)} and {h(t)} are called the impulse response functions of the linear filters.
A typical situation is that the input signal {x(n)} is a sequence of independent
random variables and the output {y(n)} hence a sequence of dependent random
variables, where the nature of the dependence structure varies with the choice of
impulse response. Another typical situation is that the input is a signal subject to
disturbing noise and the purpose of letting it through the filter is to obtain a less
noisy output. In fact, filters can be designed so that they change the nature of the
signal in a particular direction.
The stability condition is a natural restriction put on the impulse response function
which makes the output sequence well-defined and allows one to start building up
a mathematical theory for filters. Causality is a natural assumption in particular if
the filter describes evolution over time. In such a case, a filter is causal if the output
depends only on the past and the present of the input but not on future values of
the input.
The most important time series models for engineering applications are those
for which the input sequence is a weakly stationary process. If a weakly stationary
stochastic process is applied as input and fed through a stable, causal, linear filter,
then the output has equally tractable properties. Recalling Definition 1.8, the key
property of a weakly stationary process is that the autocovariance function r(s, t), of
two variables, only depends on the time difference |t − s|. It is therefore convenient
to write the autocovariance as a function rX (τ ) of the time lag τ = t − s alone.
Theorem 4.7. Assume that {X(t)} is a weakly stationary process with mean value
function mX and covariance function rX . Let h be the impulse response function
of a stable, causal, linear filter. Then, in discrete time, the output sequence
$Y_n = \sum_{k=0}^{\infty} h(k)X_{n-k}$ from the filter is weakly stationary with mean value and covariance
given by
$$m_Y = m_X\sum_{k=0}^{\infty} h(k),$$
$$r_Y(\tau) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} h(i)h(j)\,r_X(\tau+i-j).$$
In continuous time, the output $Y(t) = \int_0^{\infty} h(s)X(t-s)\,ds$ is weakly stationary with
$$m_Y = m_X\int_0^{\infty} h(t)\,dt,$$
$$r_Y(\tau) = \int_0^{\infty}\!\!\int_0^{\infty} h(s)h(t)\,r_X(\tau+t-s)\,ds\,dt.$$
A stochastic process {X(t)} is said to be Gaussian if every finite collection of values
(X(t1 ), . . . , X(tn )) has a joint normal distribution. Similarly in discrete time. When we recall from basic probability
theory that the normal distribution is preserved under summation of a finite number
of random variables, it comes as no surprise that linear filters also preserve the
Gaussian property. Indeed, the following result is shown using the corresponding
properties for infinite series and integrals of random processes, see e.g. Hoel et al.
[13].
Theorem 4.8. If the input to a stable, linear filter is a weakly stationary Gaussian
process with mean value mX and covariance rX , then the output from the filter is
again a weakly stationary Gaussian process, with mean and covariance given by
mY and rY in Theorem 4.7.
Example 4.9. The discrete time filter with impulse response function h(k) = $a^k$,
k ≥ 0, and h(k) = 0, k < 0, is causal and stable if |a| < 1, since then $\sum_{k=0}^{\infty}|a|^k = 1/(1-|a|) < \infty$. This filter applied to an input signal {Xn } yields the output signal
{Yn }, given by
(4.5) $Y_n = X_n + aX_{n-1} + a^2 X_{n-2} + \dots$
The effect of this filter is illustrated in Figure 2 for the particular case when the
input {Xn } is a sequence of independent Gaussian random variables with mean 0
and variance 1. The top graph is the input noise, the middle graph is the output for
a = 0.5 and the lower graph is the output in the case when the parameter is set to
a = 0.9. For small values of a the output sequence is more or less the same sequence
as the input, but with increasing values of a the strength of the dependence between
the values of the signal at different time points increases. In the lower graph this
is visible in that some of the fluctuations over short time scales are reduced. The
filter is called a low pass filter since such high frequency fluctuations are dampened.
We apply Theorem 4.7 to this model. The input sequence {Xn } has mX = 0
and rX (0) = 1, rX (τ ) = 0 for τ ≠ 0. Hence mY = 0 and, for τ ≥ 0,
$$r_Y(\tau) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty} a^i a^j\,r_X(\tau+i-j) = \sum_{i=0}^{\infty} a^i a^{i+\tau} = \frac{a^{\tau}}{1-a^2}.$$
Figure 2. Input (top) and output of the linear filter (4.5), a = 0.5 (middle) and
a = 0.9 (bottom)
Moreover, using now Theorem 4.8 it follows that the output sequence from the
filter is Gaussian. Hence for each n, the random variables Yn all have the normal
distribution N(0, (1 − a²)⁻¹) and the covariance of any two elements Yn and Ym is
given by $r_Y(m-n) = a^{|m-n|}/(1-a^2)$.
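The filter (4.5) satisfies the recursion Yn = aYn−1 + Xn , so Figure 2 can be reproduced with MATLAB's filter function; a minimal sketch:

n=400; x=randn(1,n);           % white Gaussian noise input
y1=filter(1,[1 -0.5],x);       % output of (4.5) with a = 0.5
y2=filter(1,[1 -0.9],x);       % output of (4.5) with a = 0.9
subplot(3,1,1); plot(x);
subplot(3,1,2); plot(y1);
subplot(3,1,3); plot(y2);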
Definition 4.10. Let {Xn } denote white noise with variance parameter σ².
• A moving average process of order q (MA(q) for short) is defined as a finite
linear combination
Yn = c0 Xn + c1 Xn−1 + · · · + cq Xn−q , c0 = 1,
of the white noise input sequence.
• An autoregressive process of order p (AR(p) for short) is a sequence {Yn }
defined by the recursion
(4.6) Yn = −a1 Yn−1 − · · · − ap Yn−p + Xn ,
where it is assumed that p initial values are specified. One may take, for
example, Y−1 = · · · = Y−p = 0 and white noise {Xn , n ≥ 0} in order to
generate {Yn , n ≥ 0}.
Theorem 4.11. The MA(q) process is always stable and weakly stationary with
mY = 0 and
$$r_Y(\tau) = \begin{cases} \sigma^2\sum_{j=|\tau|}^{q} c_j c_{j-|\tau|}, & |\tau| \leq q,\\ 0, & \text{otherwise}. \end{cases}$$
Next we turn to the study of AR(p) filters. In order for an autoregressive filter
to be stable and hence generate a well-defined autoregressive sequence of order p,
it is required that the coefficients a1 , . . . , ap are chosen appropriately. To make this
precise, let z ∈ C be the complex variable in the complex plane and let A(z) be the
generating polynomial
$$A(z) = 1 + a_1 z + a_2 z^2 + \dots + a_p z^p.$$
The equation
(4.7) $z^p A(z^{-1}) = z^p + a_1 z^{p-1} + \dots + a_p = 0,$
called the characteristic equation of the AR(p) model, is known to have exactly p
solutions, or roots, z1 , . . . , zp in C.
Theorem 4.12. The AR(p) filter is stable if the coefficients a1 , . . . , ap are such
that all p roots z1 , . . . , zp of the characteristic equation (4.7) are located inside of
the unit circle in the complex plane, that is |zk | < 1, 1 ≤ k ≤ p. Equivalently,
all roots z1′ , . . . , zp′ of the equation A(z ′ ) = 0 must be outside of the unit circle,
that is |zk′ | > 1. In this case the AR(p) process {Yn } is weakly stationary with
mY = 0. The covariance function r = rY is obtained as the solution to the so
called Yule-Walker equations
$$\begin{aligned} r(0) + a_1 r(1) + \dots + a_p r(p) &= \sigma^2\\ r(1) + a_1 r(0) + \dots + a_p r(1-p) &= 0\\ r(2) + a_1 r(1) + \dots + a_p r(2-p) &= 0\\ &\;\vdots\\ r(k) + a_1 r(k-1) + \dots + a_p r(k-p) &= 0, \quad k \geq 1.\end{aligned}$$
Proof. The proof of the stability criterion goes beyond the scope of these notes.
Therefore we assume that the AR(p) filter is stable and verify the other claims of
the theorem.
Take expected values of both sides of (4.6). This gives
$$m_Y + a_1 m_Y + \dots + a_p m_Y = m_X = 0,$$
in other words mY A(1) = 0. Since any solution z of $z^p A(1/z) = 0$ has |z| < 1, we
cannot have A(1) = 0, hence mY = 0.
To show that r satisfies the Yule-Walker equations we consider the covariance of
Yn−k and Xn . By the defining relation for the AR(p) filter,
$$C(Y_{n-k}, X_n) = C(Y_{n-k},\, Y_n + a_1 Y_{n-1} + \dots + a_p Y_{n-p}),$$
where it is seen from the weak stationarity that the right hand side equals
$$r(k) + a_1 r(k-1) + \dots + a_p r(k-p).$$
But since Yn−k is independent of Xn for any k ≥ 1, we have C(Yn−k , Xn ) = 0 for
such k, and hence the left hand side vanishes. Hence
$$r(k) + a_1 r(k-1) + \dots + a_p r(k-p) = 0, \quad k \geq 1.$$
It remains to verify the first of the Yule-Walker equations, for the case k = 0. To do
so we rewrite the covariance C(Yn , Xn ) in two different ways. First, by the causality,
$$C(Y_n, X_n) = C(-a_1 Y_{n-1} - \dots - a_p Y_{n-p} + X_n,\, X_n) = C(X_n, X_n) = \sigma^2.$$
Second,
$$C(Y_n, X_n) = C(Y_n,\, Y_n + a_1 Y_{n-1} + \dots + a_p Y_{n-p}) = r(0) + a_1 r(-1) + \dots + a_p r(-p)$$
and so
$$r(0) + a_1 r(-1) + \dots + a_p r(-p) = r(0) + a_1 r(1) + \dots + a_p r(p) = \sigma^2.$$
Example 4.13. We want to compare two stationary processes, one AR(1)-process
{Yn } and one MA(2)-process {Zn }, defined by the filter relations
Yn + 0.8Yn−1 = Xn
and
Zn = Xn + 0.4Xn−1 + 1.2Xn−2 ,
respectively, where {Xn } is an input signal.

Figure 3. One simulated realization each, a) and b), of the two time series.

The simulations a) and b) of Figure 3 show one realization each of the two time series, where {Xn } is simulated white
noise of mean zero and variance one. Which simulation is the AR(1) process, and
which is the MA(2) process?
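For readers who want to experiment, a minimal MATLAB sketch generating one realization of each process (the white noise seed is arbitrary):

n=100; x=randn(1,n);
y=filter(1,[1 0.8],x);         % AR(1): Yn + 0.8*Y(n-1) = Xn
z=filter([1 0.4 1.2],1,x);     % MA(2): Zn = Xn + 0.4*X(n-1) + 1.2*X(n-2)
subplot(2,1,1); plot(y);
subplot(2,1,2); plot(z);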
Figure 4. The annual sunspot number series for the years 1700-2000.
Of course an obvious drawback of this method is that the filtered output sequence
may attain negative values, whereas sunspot numbers do not.
4.6. Statistical methods

In the prediction problem we form a linear combination $\hat{Y}_{n+k} = \sum_{j=1}^{n} a_j Y_j$ of the known variables, and try to choose the coefficients a1 , . . . , an such that the
resulting mean squared error (variance of the prediction error)
$$E(Y_{n+k} - \hat{Y}_{n+k})^2 = E\Bigl(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Bigr)^2$$
becomes as small as possible.
Figure 5. A filtered version of the sunspot series, 1700-2000, attaining negative values.
Theorem 4.17 (Projection theorem for linear space). Suppose that a finite number
of variables Y1 , . . . , Yn has been observed and we want to predict Yn+k with a linear
combination
$$\hat{Y}_{n+k} = \sum_{j=1}^{n} a_j Y_j.$$
The optimal choice of coefficients a1 , . . . , an which makes the mean squared error
minimal, is obtained as the solution of the normal equations
$$C\Bigl(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j,\, Y_i\Bigr) = 0, \quad i = 1, \dots, n.$$
To verify the optimality, let b1 , . . . , bn be an arbitrary choice of coefficients and
decompose
$$Y_{n+k} - \sum_{j=1}^{n} b_j Y_j = \Bigl(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Bigr) + \sum_{i=1}^{n} (a_i - b_i)Y_i.$$
In this expression the covariance term is zero because of the normal equations. Indeed,
$$C\Bigl(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j,\, \sum_{i=1}^{n}(a_i-b_i)Y_i\Bigr) = E\Bigl[\Bigl(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Bigr)\sum_{i=1}^{n}(a_i-b_i)Y_i\Bigr]$$
$$= \sum_{i=1}^{n}(a_i-b_i)\,E\Bigl[\Bigl(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Bigr)Y_i\Bigr] = 0,$$
and so we have
$$V\Bigl(Y_{n+k} - \sum_{j=1}^{n} b_j Y_j\Bigr) = V\Bigl(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Bigr) + V\Bigl(\sum_{j=1}^{n}(a_j-b_j)Y_j\Bigr) \geq V\Bigl(Y_{n+k} - \sum_{j=1}^{n} a_j Y_j\Bigr),$$
which shows that the coefficients a1 , . . . , an indeed minimize the mean squared error.
Solved exercises
1. A delivery truck repeatedly drives back and forth between Stockholm and Uppsala, located a distance of 70 kilometers apart. On each trip (either Stockholm
to Uppsala or the reverse trip) traffic disturbances vary so that the driver is
equally likely either to see clear roads or congested roads. When the roads are
congested, the average speed for the trip is 60 km/h. When the roads are clear,
the average speed is 80 km/h. Over many hours of travel, what is the average
speed of the truck?
Solution. Some reflection shows that the naive answer 70 km/h is wrong, and the
renewal theorem helps formalize the correct calculation. During each cycle the
truck accumulates a “reward” of 70 km driving distance. The time to complete
this task is random with mean value ν = 0.5 · 70/60 + 0.5 · 70/80 = 49/48
hours and thus, by the renewal reward theorem, the average speed is the ratio
48 · 70/49 ≈ 68.6 km/h.
2. Recall the M/M/1 service system discussed in Example 3.8. Suppose that the
Poisson arrival process has intensity 2 per hour and the service times are expo-
nentially distributed with mean 20 minutes, independent of each other and of the
arrival process. In this system there is unlimited queuing space and customers
who arrive when the server is busy always wait in line.
Seen from the single server’s perspective, time can be divided into busy pe-
riods of actively serving a customer, and vacant periods during which there are
no customers in the system. Consider the system in equilibrium.
(a) What proportion of time is the server active?
(b) Determine the expected length of a vacant period.
(c) Determine the expected length of a busy period.
Hint: It is useful to note that the time points when customers arrive to an
empty system are renewals!
Solution. Figure 6 shows a simulated trace of this M/M/1 process with X(0) = 0.
The traffic intensity is ρ = 2/3. Since ρ < 1 we know from Example 3.8 that there
exists an asymptotic distribution, which is given by the geometric distribution
πk = ρk (1 − ρ), k = 0, 1, . . . .
Figure 6. Vacant periods and busy periods of the queueing system M/M/1
(a) The server is active whenever the system is non-empty, which in equilibrium happens the proportion of time 1 − π0 = ρ = 2/3.
(b) A vacant period starts when the system becomes empty and ends at the next arrival, so by the memoryless property of the Poisson arrival process it is exponentially distributed with expected length 1/2 hour.
(c) Each cycle between arrivals to an empty system consists of one busy and one vacant period. Since the long run fraction of busy time is ρ = 2/3, the renewal reward theorem gives E(busy)/(E(busy) + E(vacant)) = 2/3, hence E(busy) = 2E(vacant) = 1 hour.

3. A data transfer protocol splits the time axis into independent transmission cycles with lengths U1 , U2 , . . . given by random variables all uniformly distributed on the interval [0, 1] (in seconds), and during a cycle of length U the amount U²/2 Mbytes of data is transferred. Find the asymptotic throughput of the protocol, measured in Mbytes per second.
Solution. Let U1 , U2 , . . . form cycles, times between renewals, of a renewal process and associate with cycle i of length Ui a reward Ri = Ui²/2. By the renewal-reward theorem, the total workload of data packets, R(t), transmitted over time
t behaves as R(t)/t → E(R)/E(U ). Here, $E(U) = \int_0^1 u\,du = 1/2$ seconds and
E(R) = E(U²)/2, where $E(U^2) = \int_0^1 u^2\,du = 1/3$. Thus, E(R) = 1/6 and the
asymptotic capacity is obtained as the cycle average 1/3 Mbytes per second.
4. Assume that {Yn } is wide sense stationary with covariance function given by
rY (0) = 3, rY (k) = 2 for |k| = 1 and rY (k) = 0 for |k| ≥ 2. For m ≥ 1 find the
optimal linear filter for prediction $\hat{Y}_{m+1}$ based on the two previous values Ym−1 ,
Ym .
Solution. With k = 1 and n = 2 the normal equations of Theorem 4.17 attain
the form
C(Yn+1 − a1 Yn−1 − a2 Yn , Yn−1 ) = 0
C(Yn+1 − a1 Yn−1 − a2 Yn , Yn ) = 0,
which is the same as
C(Yn+1 , Yn−1 ) − a1 C(Yn−1 , Yn−1 ) − a2 C(Yn , Yn−1 ) = 0
C(Yn+1 , Yn ) − a1 C(Yn−1 , Yn ) − a2 C(Yn , Yn ) = 0.
In terms of the covariance function rY this is the system of equations
rY (2) − a1 rY (0) − a2 rY (1) = −3a1 − 2a2 = 0
rY (1) − a1 rY (1) − a2 rY (0) = 2 − 2a1 − 3a2 = 0
and thus we find the solution a1 = −4/5, a2 = 6/5. Hence the optimal linear
prediction filter is given by $\hat{Y}_{n+1} = -4Y_{n-1}/5 + 6Y_n/5$.
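In matrix form the normal equations read Ra = b, with R the covariance matrix of (Yn−1 , Yn ) and b the covariances with Yn+1 ; a minimal MATLAB check:

R=[3 2; 2 3];                  % covariances of (Y(n-1), Y(n))
b=[0; 2];                      % covariances of Y(n+1) with (Y(n-1), Y(n))
a=R\b                          % returns a1 = -0.8, a2 = 1.2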
Chapter 5

Applications in Biology and Bioinformatics
From the above construction it is clear that {Xn } is a discrete time Markov
chain. The transition probabilities pij = P (Xn = j|Xn−1 = i) are given by
$$p_{ij} = \binom{N}{j} p_i^j (1-p_i)^{N-j}, \qquad p_i = \frac{i}{N}, \quad 0 \leq i, j \leq N.$$
Indeed, if Xn−1 = i then in each of the N independent draws the probability is
pi = i/N to pick an A1 parent and the probability is 1 − pi to pick a parent of type
A2 . The number of A1 offspring is therefore a random variable with the Bin(N, pi )
distribution, explaining the form of the transition probabilities. Based on this we
can continue by deriving the mean value $E(X_n) = \sum_{j=0}^{N} jP(X_n = j)$. Namely, since
the law of total probability shows that
$$P(X_n = j) = \sum_{i=0}^{N} P(X_n = j \mid X_{n-1} = i)P(X_{n-1} = i) = \sum_{i=0}^{N} p_{ij}P(X_{n-1} = i),$$
we have
$$E(X_n) = \sum_{j=0}^{N} j\sum_{i=0}^{N} p_{ij}P(X_{n-1} = i) = \sum_{i=0}^{N} P(X_{n-1} = i)\sum_{j=0}^{N} jp_{ij}.$$
But for each fixed i, the sum $\sum_{j=0}^{N} jp_{ij}$ is the mean value N · i/N = i of a random
variable with the binomial distribution Bin(N, i/N ). Hence
(5.1) $\displaystyle E(X_n) = \sum_{i=0}^{N} P(X_{n-1} = i)\, i = E(X_{n-1}).$
In the same way E(Xn−1 ) = E(Xn−2 ), and so on, which shows that the expected
value of the Wright-Fisher Markov chain is preserved over time, E(Xn ) = E(X0 ),
n ≥ 0, and determined by the expected value E(X0 ) in the initial distribution.
As an alternative the above can be derived using conditional expected values.
Namely, we have
$$E(X_n \mid X_{n-1} = i) = \sum_{j=0}^{N} jP(X_n = j \mid X_{n-1} = i) = \sum_{j=0}^{N} jp_{ij} = i,$$
so that E(Xn ) = E(E(Xn | Xn−1 )) = E(Xn−1 ), in agreement with (5.1).
Fixation of genes. Both of the states {0} and {N } are absorbing. If the Markov
chain gets absorbed in state {0} then allele A1 is extinct and allele A2 has been fixated, whereas absorption in {N } corresponds to fixation of A1 . As for any finite
Markov chain in which the absorbing states can be reached from every other state,
absorption is in fact certain to take place.
This effect, which forces the population into a more homogeneous state (all A1 ’s or
all A2 ’s) is called random genetic drift. There are two natural questions to be asked:
a) How likely is an allele to be lost due to genetic drift?
b) What is the expected time for an allele to fix?
Hence ri = i/N is a solution of (5.2), and in fact the unique solution consistent with
rN = 1, which hence provides the answer to the first question a).
To discuss b), note that the time to fixation in A1 is the absorption time TiN and
the time to fixation in A2 is the absorption time Ti0 (see Section 2.3). The minimum
of these is the ultimate fixation time
Ti = the first time n for which Xn = 0 or Xn = N , given X0 = i.
Again by the principle of conditioning on the first event, it is seen that the collection
of expected values
mi = E(Ti ) = expected time to fixation in either A1 or A2 , given X0 = i,
can be obtained as the unique solution of the system of equations
(5.3) $\displaystyle m_i = p_{i0}\cdot 1 + p_{iN}\cdot 1 + \sum_{j=1}^{N-1} p_{ij}(1+m_j) = 1 + \sum_{j=1}^{N-1} p_{ij}m_j, \quad 1 \leq i \leq N-1$
(m0 = mN = 0). Unfortunately, equation (5.3) becomes complicated even for mod-
erate size N . An approximation formula is known, valid for large N :
mi ≈ −2(i log(i/N ) + (N − i) log(1 − i/N )).
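For moderate N the system (5.3) is easy to solve numerically; a MATLAB sketch comparing with the approximation formula. Here binopdf is from the Statistics Toolbox, and N = 50 is an illustrative choice:

N=50;
P=zeros(N+1);
for k=0:N
  P(k+1,:)=binopdf(0:N,N,k/N); % Wright-Fisher transition probabilities
end
A=eye(N-1)-P(2:N,2:N);         % (I - P) restricted to states 1..N-1
m=A\ones(N-1,1);               % expected fixation times, system (5.3)
i=(1:N-1)';
approx=-2*(i.*log(i/N)+(N-i).*log(1-i/N));
plot(1:N-1,m,1:N-1,approx)     % exact versus large-N approximation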
Effect of mutations. Mutation of genes, which change the allelic type of the
individuals in between reproduction events, cause genetic variability and thus have
the opposite effect compared to genetic drift. The balance of genetic drift and mu-
tation is an important aspect of population genetics for which some understanding
can be gained from stochastic models.
We add to the model mutation probabilities u12 and u21 :
uij = probability that due to mutation an allele shifts from Ai to Aj in a given
generation.
The Wright-Fisher model with mutation is the Markov chain with transition probabilities
$$p_{ij} = \binom{N}{j} p_i^j (1-p_i)^{N-j}, \qquad p_i = \frac{i}{N}(1-u_{12}) + \Bigl(1-\frac{i}{N}\Bigr)u_{21}, \quad 0 \leq i, j \leq N.$$
Mutation changes the nature of the process in a very significant way. One allele
can never remain fixed forever. Sooner or later there will be a mutation event that
prevents alleles from being lost from the population. Since now all transition probabilities
are positive, pij > 0, the Markov chain is irreducible and aperiodic, and possesses a
steady state X∞ with a stationary distribution π. The stationary probabilities are
complicated, but we can rather easily find the expected value E(X∞ ). Indeed, the
analog of (5.1) becomes
(5.4) E(Xn ) = E(Xn−1 )(1 − u12 − u21 ) + N u21 ,
which as n → ∞ yields the limit
$$E(X_\infty) = N\,\frac{u_{21}}{u_{12}+u_{21}}.$$
The Moran model. This is a continuous time version of the Wright-Fisher model.
Under Wright-Fisher dynamics the state of each individual of the population is
updated each discrete generation step. In comparison, the Moran model is suitable
for populations in which only one individual changes type at a time. We
consider again a population of N individuals each of type A1 or type A2 . Each
individual carries an exponential clock with rate one. When a clock rings this
individual randomly selects one individual in the population (including itself). The
chosen individual is subject to mutation (an A1 mutates to A2 with probability
u12 and an A2 mutates to A1 with probability u21 ) and then replaces the original
individual for which the clock rang.
Let X(t) be the number of individuals of type A1 at time t. The Moran model
is the birth and death process with states E = {0, 1, . . . , N } and intensities
$$\lambda_i = (1-i/N)\,p_i, \qquad \mu_i = (i/N)(1-p_i), \qquad p_i = \frac{i}{N}(1-u_{12}) + \Bigl(1-\frac{i}{N}\Bigr)u_{21}.$$
We need to check that these birth and death jump rates correspond to the given
model dynamics. To go from i to i + 1 individuals of type A1 first of all the clock
must ring for a type A2 individual, which occurs with probability 1 − i/N . To be
sure that this A2 is replaced by an A1 , we may either choose an A1 , with probability i/N,
and avoid mutation, with probability 1 − u12 , or choose an A2 and have a mutation to A1 .
Similarly for the death rates.
Figure 1 shows three simulated trajectories of the Moran model in the case of no
mutation, u12 = u21 = 0, and population size N = 200, each population starting
from a different initial number of A1 alleles.

Figure 1. Three simulated trajectories of the Moran model: the number of A1 alleles plotted against time, N = 200, no mutation.
function y=moransim(init,nmb)
% Simulate nmb jump events of the Moran model without mutation
% (u12 = u21 = 0) for population size N = 200. The state i = number of
% A1 alleles is stored as the index i + 1, so init should be given as
% the initial number of A1 alleles plus one.
N=200;
y=zeros(1,nmb);
y(1)=init;
x=1:N+1;
lambda(x)=(x-1).*(1-(x-1)./N);   % birth rates i(1 - i/N)
mu(x)=(x-1).*(1-(x-1)./N);       % death rates, equal to the birth rates here
q(x)=lambda(x)+mu(x);            % total jump rate out of each state
x=init;
for i=2:nmb
  if (x==1 || x==N+1)            % states 0 and N are absorbing
    y(i)=x;
  elseif (lambda(x)/q(x)>rand)
    y(i)=x+1;
    x=y(i);
  else
    y(i)=x-1;
    x=y(i);
  end
end
qy=q(y);
qy(qy==0)=1;                     % dummy rate in absorbing states (path is flat)
z=cumsum(-log(rand(1,nmb))./qy); % exponential holding times
stairs(z,y);
axis([0 z(nmb) 0 N+1]);
In the case u12 > 0, u21 > 0, there is a unique steady-state distribution, which
is obtained in the usual way from Theorem 3.6.
The formula can be understood as giving the total probability for coalescence resulting from the $\binom{k}{2}$ distinct pairs that can be formed out of the k individuals, and each
pair having the same parent with probability 1/N .
The interesting observation now is that we can decompose $T^{(k)}_{\mathrm{MRCA}}$ into the sum
$$T^{(k)}_{\mathrm{MRCA}} = \sum_{j=2}^{k} V_j, \qquad V_j = T^{(j)}_{\mathrm{MRCA}} - T^{(j-1)}_{\mathrm{MRCA}}, \quad j \geq 2, \quad T^{(1)}_{\mathrm{MRCA}} = 0.$$
Hence for any reasonably large sample of individuals taken in a population of size N ,
the expected number of generations that has to be traced backwards in order to find
a common ancestor for the sample is approximately 2N . More exactly, with k = N ,
considering the ancestry of the whole population, we find $E(T^{(N)}_{\mathrm{MRCA}}) = 2(N-1)$.
Note also that in the light of (5.5), the tree depicted in Figure 2 appears to be
nontypical. Indeed, since E(V2 ) = 1/p2 = N , typical ancestral trees are stretched
out toward the root. On average about half of the length of the tree corresponds to
the waiting time for two ancestors to coalesce into a single, common ancestor.
The by now familiar technique of approximating geometric distributions with exponential ones reveals the following structure: if we measure time in units of N
generations, so that time t corresponds to generation [N t], then the time during
which there are j distinct ancestors in the sample is approximately exponential with
parameter $\binom{j}{2}$. This leads us to the coalescent, which was introduced by Kingman
in 1982. The coalescent is a random tree that allows one to characterize ancestral
relationships between genes in a sample when the population size is reasonably large.
The probabilistic structure of Kingman's coalescent is quite simple. If we start
with a sample of k individuals, then after a random time $\tilde{V}_k$, which is exponentially
distributed with parameter $\binom{k}{2}$, two randomly chosen ancestral lineages coalesce,
leaving k − 1 distinct lineages. The lineages continue coalescing in this way until we
reach a single common ancestor for the sample. We thus obtain a sequence $\tilde{V}_k, \dots, \tilde{V}_2$
of intercoalescence times that are independent and exponentially distributed with
expected values $E(\tilde{V}_j) = 1/\binom{j}{2}$. The time to reach the most recent ancestor is the
sum $\tilde{T}^{(k)}_{\mathrm{MRCA}} = \tilde{V}_k + \dots + \tilde{V}_2$ with mean 2(1 − 1/k) (compare (5.5)). If we let X(t)
denote the number of lineages at time t, the standard coalescent process may also
be viewed as the Markov pure death process in continuous time, {X(t), t ≥ 0},
with states E = {1, . . . , k}, initial value X(0) = k, and death intensities $\mu_i = \binom{i}{2}$,
2 ≤ i ≤ k. The absorption time of {X(t)} is the random time $\tilde{T}^{(k)}_{\mathrm{MRCA}}$ with mean
value 2(1 − 1/k) ≈ 2.
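Simulating the intercoalescence times is a one-liner in MATLAB; a sketch for the illustrative sample size k = 10:

k=10;
j=2:k;
V=-log(rand(1,k-1))./(j.*(j-1)/2); % Exp times with parameters binom(j,2)
TMRCA=sum(V)                       % compare with the mean 2*(1-1/k)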
The drawback of the infinite alleles assumption is that two sequences are consid-
ered distinct regardless of the degree to which they differ. In contrast, the infinite
sites model provides a refined measure of how close two or several sequences are to
each other. Again sequences of length L are subject to mutations with rate u per
site (so u = 3α in the Jukes-Cantor model). As in the infinite alleles model we are
ignoring the possibility that a substitution at a site at a later time is followed by a
reversed substitution at the same site, but now we keep record of the number of sites
where two sequences differ. More generally, in a sample consisting of n sequences
we put
Sn = the number of sites where at least two sequences differ.
This is the number of segregating sites in a sample of size n.
To analyze this quantity we refer to the complete coalescence process, starting
with n loci and working backwards in time marking successive coalescence events.
At each event the number of lineages shrinks by one until eventually the common
ancestor has been traced. It was found earlier that the time Vj during which there are
j lineages in the coalescent is approximately exponential with mean $E(V_j) = 1/\binom{j}{2}$.
Using the random variables {Vj }, the total length of all branches in the coalescence
tree (or the total time in the tree) can be expressed as the sum
(5.6) $\displaystyle T_{\mathrm{tot}} = \sum_{j=2}^{n} jV_j.$
Now we combine the above observations with the property of the Jukes-Cantor
model that the successive substitution events at a given site form a Poisson process
with intensity u = 3α. The totality of substitutions in a given sequence is therefore a
Poisson process with intensity uL (recall that sites were supposed to be independent).
Thus each lineage in the coalescent is subject to substitution events with intensity uL.
To visualize, think of the coalescent as given and let independently Poisson events
occur with constant intensity along all branches of the coalescent tree with each
event marking one nucleotide substitution. The total number of such substitutions
in the tree is given by N (Ttot ), where {N (t)} is a Poisson process with intensity
uL and Ttot as in (5.6). But this number must also be the same as the number of
segregating sites. Since N (t) and Ttot are independent, we have
$$E(S_n) = 3\alpha L\, E(T_{\mathrm{tot}}) = 6\alpha L\Bigl(1 + \frac{1}{2} + \dots + \frac{1}{n-1}\Bigr).$$
In principle the above theory can be used as a basis for statistical estimation of the
parameter α. If a count Sn∗ is available from data measurements, then
$$\alpha^* = \frac{S_n^*}{6L\bigl(1 + \frac{1}{2} + \dots + \frac{1}{n-1}\bigr)}$$
is a point estimate of α.
5.4. Recovering the genome from fragments

To study shotgun sequencing we apply the stochastic model that the fragments are
picked randomly and independently over the full length of the DNA. More exactly,
we make a continuous approximation and assume that the left-end of each fragment
is uniformly distributed in (0, G) and that the positions of different fragments are
independent of each other. The result of selecting N fragments in this way is that
a part of the DNA will be covered by a collection of overlapping fragments, contigs,
separated by intervals of DNA that each of the fragments missed. The bases in
between contigs will remain unsequenced. For 0 ≤ x ≤ G, let
Next we find the mean number of contigs. Since each contig has a unique rightmost fragment, we have
$$\text{mean number of contigs} = Nq,$$
where
$$q = P(\text{a given fragment is the rightmost member of a contig}).$$
Now,
$$q = P(\text{no other fragment has its leftmost point on the given fragment}) = e^{-a},$$
and so
$$\text{mean number of contigs} = Ne^{-a} = Ne^{-NL/G}.$$
5.5. Pairwise alignment methods
A R N D C Q E G H I L K M F P S T W Y V
A 4 −1 −2 −2 0 −1 −1 0 −2 −1 −1 −1 −1 −2 −1 1 0 −3 −2 0
R −1 5 0 −2 −3 1 0 −2 0 −3 −2 2 −1 −3 −2 −1 −1 −3 −2 −3
N −2 0 6 1 −3 0 0 0 1 −3 −3 0 −2 −3 −2 1 0 −4 −2 −3
D −2 −2 1 6 −3 0 2 −1 −1 −3 −4 −1 −3 −3 −1 0 −1 −4 −3 −3
C 0 −3 −3 −3 9 −3 −4 −3 −3 −1 −1 −3 −1 −2 −3 −1 −1 −2 −2 −1
Q −1 1 0 0 −3 5 2 −2 0 −3 −2 1 0 −3 −1 0 −1 −2 −1 −2
E −1 0 0 2 −4 2 5 −2 0 −3 −3 1 −2 −3 −1 0 −1 −3 −2 −2
G 0 −2 0 −1 −3 −2 −2 6 −2 −4 −4 −2 −3 −3 −2 0 −2 −2 −3 −3
H −2 0 1 −1 −3 0 0 −2 8 −3 −3 −1 −2 −1 −2 −1 −2 −2 2 −3
I −1 −3 −3 −3 −1 −3 −3 −4 −3 4 2 −3 1 0 −3 −2 −1 −3 −1 3
L −1 −2 −3 −4 −1 −2 −3 −4 −3 2 4 −2 2 0 −3 −2 −1 −2 −1 1
K −1 2 0 −1 −3 1 1 −2 −1 −3 −2 5 −1 −3 −1 0 −1 −3 −2 −2
M −1 −1 −2 −3 −1 0 −2 −3 −2 1 2 −1 5 0 −2 −1 −1 −1 −1 1
F −2 −3 −3 −3 −2 −3 −3 −3 −1 0 0 −3 0 6 −4 −2 −2 1 3 −1
P −1 −2 −2 −1 −3 −1 −1 −2 −2 −3 −3 −1 −2 −4 7 −1 −1 −4 −3 −2
S 1 −1 1 0 −1 0 0 0 −1 −2 −2 0 −1 −2 −1 4 1 −3 −2 −2
T 0 −1 0 −1 −1 −1 −1 −2 −2 −1 −1 −1 −1 −2 −1 1 5 −2 −2 0
W −3 −3 −4 −4 −2 −2 −3 −2 −2 −3 −2 −3 −1 1 −4 −3 −2 11 2 −3
Y −2 −2 −2 −3 2 −1 −2 −3 2 −1 −1 −2 −1 3 −3 −2 −2 2 7 −1
V 0 −3 −3 −3 1 −2 −2 −3 −3 3 1 −2 1 −1 −2 −2 0 −3 −1 4
amino acid. Each diagonal entry in the matrix gives the score for a matching of
the corresponding amino acid. The non-diagonal entries list the scores for each
possible mismatch of a character with one of the 19 other characters. One commonly
used substitution matrix or scoring matrix is called BLOSUM62. This matrix is
reproduced in Table 5.5. The scores in BLOSUM matrices are obtained by rounding
off to integers certain log-likelihood ratios of estimated substitution rates found in
empirical sequence data.
Before using the scoring matrix we need to include the gap penalty in the model.
The simplest assumption is that of a linear gap penalty of the form −dg to be
subtracted from the score for each gap of length g in between two pairs of amino
acids in the alignment, so that d is the cost of inserting a blank character. Typical
values are d = 4 or d = 8.
As an example we apply the BLOSUM62 substitution matrix with a gap penalty
of d = 4 to see that the score of the alignment
R D I S L V − − − K N A G I
R N I − L V S D A K N V G I
adds up to a total of
5 + 1 + 4 − 4 + 4 + 4 − 3 × 4 + 5 + 6 + 0 + 6 + 4 = 23.
Next we discuss the so called dynamic programming approach to pairwise align-
ments, using what is known as a Needleman-Wunsch algorithm. Given two sequences
of amino acids of lengths ℓ1 and ℓ2 , ℓ1 ≤ ℓ2 , we want to find an alignment with at
least ℓ2 − ℓ1 inserted gaps such that the total score calculated from a given substi-
tution matrix is maximized.
The algorithm can be performed using some auxiliary matrices as we discuss
next. We begin by letting the two sequences to be aligned define the rows and the
columns in an ℓ2 × ℓ1 matrix and fill the new matrix with the scores s(i, j) taken
from the substitution matrix and resulting out of pairing the amino acids from each
row i with the amino acids corresponding to each column j. In the example case of
starting with the sequences QGLK and QGKLLK, and using BLOSUM62, we find
Q G L K
Q 5 −2 −2 1
G −2 6 −4 −2
K 1 −2 −2 5
L −2 −4 4 −2
L −2 −4 4 −2
K 1 −2 −2 5
We denote the elements in this matrix
B(i, j), 1 ≤ i ≤ ℓ2 , 1 ≤ j ≤ ℓ1 .
Next we need to initialize the algorithm by noting that each time we make an indel
in one of the sequences the score will be reduced by the amount d. In the example,
using d = 4, we insert this as follows:
Q G L K
0 −4 −8 −12 −16
Q −4 5 −2 −2 1
G −8 −2 6 −4 −2
K −12 1 −2 −2 5
L −16 −2 −4 4 −2
L −20 −2 −4 4 −2
K −24 1 −2 −2 5
We include these extra elements in the matrix by extending the indexing as
B(i, j), 0 ≤ i ≤ ℓ2 , 0 ≤ j ≤ ℓ1 .
Now we are going to search for an optimal alignment by changing the entries
{B(i, j)} systematically, starting with B(1, 1) in the upper left corner and then
moving towards the lower right corner. The update rule is to let in each step B(i, j)
be the maximum of three numbers known from the previous step, namely
(5.7) B(i, j) = max{B(i − 1, j − 1) + s(i, j), B(i − 1, j) − d, B(i, j − 1) − d}
If we fill out in this manner the first row, for amino acid Q, in the example this
gives us the modified matrix
Q G L K
0 −4 −8 −12 −16
Q −4 5 1 −3 −7
G −8 −2 6 −4 −2
K −12 1 −2 −2 5
L −16 −2 −4 4 −2
L −20 −2 −4 4 −2
K −24 1 −2 −2 5
The resulting modified score matrix after updating all elements in the same way is
Q G L K
0 → −4 → −8 → −12 → −16
↓ ↘
Q −4 5 → 1 → −3 → −7
↓ ↓ ↘
G −8 1 11 → 7 → 3
↓ ↓ ↓ ↘ ↘
K −12 −3 7 9 12
↓ ↓ ↓ ↘ ↓
L −16 −7 3 11 8
↓ ↓ ↓ ↘ ↓ ↘
L −20 −11 −1 7 9
↓ ↓ ↓ ↓ ↘
K −24 −15 −5 3 12
In this final matrix we have also as customary indicated by arrows which of the
three choices in (5.7) that gave rise to the new entry. In this way one can trace
the algorithm backwards and read off the optimum alignment(s) that led to the
maximum score. In the example the maximum score is 12 and the two different
paths of arrows leading to the lower right cell show that both of the alignments
Q G − L − K Q G − − L K
Q G K L L K Q G K L L K
are optimal with total score 12.
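The recursion (5.7) is straightforward to implement; a MATLAB sketch for the example above, where S holds the pairwise BLOSUM62 scores of QGKLLK (rows) against QGLK (columns) and d = 4:

S=[ 5 -2 -2  1;
   -2  6 -4 -2;
    1 -2 -2  5;
   -2 -4  4 -2;
   -2 -4  4 -2;
    1 -2 -2  5];
d=4;
[n2,n1]=size(S);
B=zeros(n2+1,n1+1);
B(1,:)=-d*(0:n1);              % initialization: leading gaps
B(:,1)=-d*(0:n2)';
for i=2:n2+1
  for j=2:n1+1
    B(i,j)=max([B(i-1,j-1)+S(i-1,j-1), B(i-1,j)-d, B(i,j-1)-d]);
  end
end
B(n2+1,n1+1)                   % maximum alignment score, 12 in the example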
Example 5.1. In this standard example the Markov chain switches between two
states {A, B}, described in the usual way by a 2 × 2 transition probability matrix P.
The alphabet is {H, T } for Head and Tail of a coin flip, and the emission probabilities
are represented by either a fair coin with fifty-fifty chance for H or T in state A, or
a biased coin with probabilities p and q for H or T in state B. In the case p > 1/2,
an output sequence of the form
HHT T T T HT HHHT T T HHHHHHHHHHHHT HHHHHHHHHT T HHT T H
would suggest that the Markov chain started in A and remained there for approxi-
mately 15 time steps, then visited B for another 20 steps or so, and then returned to
A. Of course, such inference on the behavior of the Markov chain can only be stated
in a statistical sense. Possibly we are just observing a fair coin with some unusually
long sequences of successive heads. To get a sense of what is likely and not likely
in this example we apply a piece of classical probability theory. De Moivre studied
patterns in independent sequences as early as 1738, see Blom, Holst, Sandell [2].
He showed among many other things that if we let N be the number of coin flips
until for the first time either r heads or r tails come up in sequence, then
$$E(N) = 2 + 2^2 + 2^3 + \dots + 2^r.$$
To get r = 12 heads as in the above example, one would thus expect to do on the
average 8 190 coin flips.
Example 5.2. It is a common feature of DNA strings that separate regions differ in
the composition of nucleotides. In one segment perhaps A and T are most common,
in another segment all four nucleotides appear to be equally frequent, and in a third
segment of the DNA chain C and G occur most regularly. A hidden Markov model
provides a framework for modeling such patterns in observed data. For this example
we may assume that there exists an underlying Markov chain with three states
{1, 2, 3}, which represents the current type of segment as we move along the DNA
chain. As long as the Markov chain is in state 1, successive nucleotides are generated
with the emission probabilities p1 (A) = p1 (T ) = 0.3 and p1 (C) = p1 (G) = 0.2.
During periods when the Markov chain visits state 2 the emission probabilities
change into p2 (A) = p2 (T ) = p2 (C) = p2 (G) = 0.25, and during visits of state 3
into p3 (A) = p3 (T ) = 0.1 and p3 (C) = p3 (G) = 0.4. The Markov chain is hidden
in the sense that we are unable to observe the state transitions, only the resulting
sequence of emitted nucleotides.
Similarly, if the Markov chain is in the insert state ij a symbol is emitted representing
the insertion of an amino acid and this time the symbol is drawn using a probability
distribution qj (a). Often one would simply assume that insertions are uniform, that
is qj (a) = 1/20 for each symbol a. Finally, in each state dj the output is always †.
To complete the description of the hidden Markov model it remains to specify the
transition probabilities of the underlying Markov chain. The following is a typical
choice. Assume transitions from match states are given by
$$P(m_j \to i_j) = \epsilon, \quad P(m_j \to d_{j+1}) = \delta, \quad P(m_j \to m_{j+1}) = 1-\epsilon-\delta, \quad 0 \leq j \leq \ell-1,$$
and
$$P(m_{\ell-1} \to i_{\ell-1}) = \epsilon, \quad P(m_{\ell-1} \to m_{\ell}) = 1-\epsilon.$$
For the insert states put
$$P(i_j \to i_j) = \epsilon, \quad P(i_j \to m_{j+1}) = 1-\epsilon, \quad 0 \leq j \leq \ell-1.$$
Finally, the possible transitions from delete states are given by
$$P(d_j \to d_{j+1}) = \delta, \quad P(d_j \to m_{j+1}) = 1-\delta, \quad 0 \leq j \leq \ell-1.$$
The parameters of a HMM are the emission probability distributions pj and qj
and the insertion and deletion probabilities ε and δ. In terms of the parameters of
the model it is now possible to write down the probability of any particular sequence
generated by the HMM. For example, let us take ℓ = 5 and assume that the Markov
chain while successively visiting the states m0 m1 m2 i2 i2 m3 d4 m5 (after starting in m0 )
emits the sequence of symbols a1 a2 a3 a4 a5 †a6 . The corresponding probability is
$$P(m_0 m_1 m_2 i_2 i_2 m_3 d_4 m_5 = a_1 a_2 a_3 a_4 a_5 \dagger a_6) = (1-\epsilon-\delta)^2\epsilon^2(1-\epsilon)\delta(1-\delta)\,\frac{p_1(a_1)p_2(a_2)p_3(a_5)p_5(a_6)}{20^2}.$$
Example 5.3. We assume that the parameters of the model have been fixed. Each
output sequence is the result of a corresponding trajectory of the underlying Markov
chain. Hence the path m0 m1 m2 i2 i2 m3 d4 m5 may have produced the output symbols
GQHHA†A and the path m0 m1 d2 m3 m4 m5 the symbols G†AGA. The alignment
induced in this way is found by aligning positions that were generated by the same
match state:
m0 m1 m2 i2 i2 m3 d4 m5
   G  Q  H  H  A  †  A
   G  †        A  G  A
m0 m1 d2       m3 m4 m5
This leads to the alignment
G Q H H A − A
G − − − A G A
A typical problem in HMM theory is to start with a given family of protein
sequences and try to select the parameters of the model so that output sequences
of the same length with high probability are “close” to the given ones, meaning
that most of the symbols match. The known family of sequences is supposed to
be correctly aligned and is therefore used as “training” data for the HMM in the
procedure of actually fitting the parameters. While a particular choice of parameters
may lead to output sequences that are all very similar, another choice, however, may
result in highly varying outputs. It is not surprising, therefore, that such parameter
estimation algorithms discussed in the HMM research literature are quite complex.
They are based on statistical methods that have been devised for analyzing an
observed string of DNA or protein with the goal of estimating where along the
chain it is most likely that hidden transitions occur. In practice, given an output
sequence from the hidden Markov model, the most likely path to have produced the
output is found by using the Viterbi algorithm. This technique yields the particular
state sequence which has the highest conditional probability to have occurred given
the observed symbol output. Using this information for a number of sequences the
multiple alignment can be found. See Durbin et al. [5] for a detailed account of
these ideas, and also Ewens and Grant [7] for further examples and discussion.
Although we have referred mainly to protein sequences in the previous discussion
the same arguments are applicable to DNA strings. We give an example of multiple
alignments.
Example 5.4. Suppose that we have properly selected parameters of a HMM of
length ℓ = 5 to match the following given family of DNA sequences:
G C G A G
G C G G G
G − G G G
G T G G G
Having observed the new sequence GGAAG we are faced with the problem of de-
termining which of all possible alignments is the most likely one. The alignment
GGAAG has probability
$$P(m_1 m_2 m_3 m_4 m_5 = GGAAG) = (1-\epsilon-\delta)^4(1-\epsilon)\,p_1(G)p_2(G)p_3(A)p_4(A)p_5(G).$$
The alternative alignment G†GA(insert A)G, which corresponds to breaking up the
given family between the last two bases, gives
$$P(m_1 d_2 m_3 m_4 i_4 m_5 = G\dagger GAAG) = (1-\epsilon-\delta)^2\delta(1-\delta)\epsilon(1-\epsilon)\,\frac{p_1(G)p_3(G)p_4(A)p_5(G)}{4},$$
the factor 1/4 being the uniform insertion probability for a nucleotide,
and so on. Select the alignment which has the largest probability. In practice more
refined methods may be used at this point, such as maximizing properly chosen
likelihood ratios rather than plain probabilities.
Solved exercises
1. Solution.
2. Solution.
3. Solution.
4. Solution.
Bibliography
[1] D. Bertsekas, R. Gallager, Data networks, 2nd Ed, Prentice Hall, Englewood Cliffs
NJ, 1992.
[2] G. Blom, L. Holst, D. Sandell, Problems and snapshots from the world of probability,
Springer-Verlag, New York, 1994.
[3] P. Brémaud, Markov chains; Gibbs fields, Monte Carlo simulation, and queues,
Springer-Verlag, New York, 1999.
[4] M. Denny, S. Gaines, Chance in biology; using probability to explore nature, Prince-
ton University Press, Princeton 2002.
[5] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological sequence analysis, proba-
bilistic models of proteins and nucleic acids, Cambridge University Press, Cambridge,
1998.
[6] R. Durrett, Probability models for DNA sequence evolution, Springer-Verlag, New
York, 2002.
[7] W.J. Ewens, G.R. Grant, Statistical methods in bioinformatics, an introduction,
Springer-Verlag, New York, 2001.
[8] B. Fristedt, N. Jain, N. Krylov, Filtering and Prediction: A Primer. AMS Student
Mathematical Library, Vol 38. American Mathematical Society 2007.
[9] R. Gaigalas, I. Kaj, Stochastic simulation using MATLAB, tutorial and code available
at http://www.math.uu.se/research/telecom/software, last updated Dec 2005.
[10] G.R. Grimmett, D.R. Stirzaker, Probability and random processes, 2nd Ed, Oxford
Science Publications, Oxford, 1992.
[11] A. Gut, An intermediate course in probability, Springer, New York, 1995.
[12] P.G. Harrison, N.M. Patel, Performance modelling of communication networks and
computer architectures, Addison-Wesley, Reading MA, 1993.
[13] P.G. Hoel, S.C. Port and C.J. Stone, Introduction to stochastic processes, Houghton
Mifflin Co., Boston 1972.
[14] I. Kaj, Stochastic modeling in broadband communications systems, SIAM Mono-
graphs in Mathematical Modeling and Computation 8, SIAM Philadelphia PA, 2002.
[15] J.F. Kurose, K.W. Ross, Computer networking, a top-down approach featuring the
Internet, Addison-Wesley, Boston MA, 2001.
[16] O. Machek, J. Hnilica, A stochastic model of corporate lifespan based on corporate
credit ratings, Int. J. Eng. Bus. Manag. 5:45 (2013).
[17] E. Renshaw, Modelling biological populations in space and time, Cambridge Univer-
sity Press, Cambridge, 1991.
[18] S.I. Resnick, Adventures in stochastic processes, Birkhäuser, Boston, 1992.
[19] M. Schwartz, Telecommunication networks: protocols, modeling and analysis,
Addison-Wesley, Reading MA, 1987.
[20] T. Söderström, Discrete-time stochastic systems, 2nd Ed., Springer-Verlag, London,
2002.
[21] R. Wolff, Stochastic modeling and the theory of queues, Prentice-Hall, Englewood
Cliffs NJ, 1989.