MSC Notes
D. A. Stephens
Department of Mathematics
Imperial College London
November 2005
Contents
1 PROBABILITY THEORY
1.1 INTRODUCTION AND MOTIVATION
1.2 BASIC PROBABILITY CONCEPTS
1.2.1 EXPERIMENTS AND EVENTS
1.3 THE RULES OF PROBABILITY
1.4 CONDITIONAL PROBABILITY AND INDEPENDENCE
1.5 THE LAW OF TOTAL PROBABILITY
1.6 BAYES RULE
2 PROBABILITY DISTRIBUTIONS
2.1 MOTIVATION
2.2 RANDOM VARIABLES
2.3 PROBABILITY DISTRIBUTIONS
2.4 DISCRETE PROBABILITY DISTRIBUTIONS
2.4.1 PROBABILITY MASS FUNCTION
2.4.2 DISCRETE CUMULATIVE DISTRIBUTION FUNCTION
2.4.3 RELATIONSHIP BETWEEN fX AND FX
2.5 SPECIAL DISCRETE PROBABILITY DISTRIBUTIONS
2.6 CONTINUOUS PROBABILITY DISTRIBUTIONS
2.6.1 CONTINUOUS CUMULATIVE DISTRIBUTION FUNCTION
2.6.2 PROBABILITY DENSITY FUNCTION
2.7 SPECIAL CONTINUOUS PROBABILITY DISTRIBUTIONS
2.8 EXPECTATION AND VARIANCE
2.8.1 EXPECTATIONS OF SUMS OF RANDOM VARIABLES
2.8.2 EXPECTATIONS OF A FUNCTION OF A RANDOM VARIABLE
2.8.3 RESULTS FOR STANDARD DISTRIBUTIONS
2.8.4 ENTROPY
2.8.5 RELATIVE ENTROPY
2.9 TRANSFORMATIONS OF RANDOM VARIABLES
2.9.1 LOCATION/SCALE TRANSFORMATIONS
2.9.2 TRANSFORMATION CONNECTIONS BETWEEN DISTRIBUTIONS
2.10 JOINT PROBABILITY DISTRIBUTIONS
2.10.1 JOINT PROBABILITY MASS/DENSITY FUNCTIONS
2.10.2 MARGINAL MASS/DENSITY FUNCTIONS
2.10.3 CONDITIONAL MASS/DENSITY FUNCTIONS
2.10.4 INDEPENDENCE
2.10.5 THE MULTINOMIAL DISTRIBUTION
2.11 COVARIANCE AND CORRELATION
2.11.1 PROPERTIES OF COVARIANCE AND CORRELATION
2.12 EXTREME VALUES
2.12.1 ORDER STATISTICS, MAXIMA AND MINIMA
2.12.2 GENERAL EXTREME VALUE THEORY
3 STATISTICAL ANALYSIS
3.1 GENERAL FRAMEWORK, NOTATION AND OBJECTIVES
3.1.1 OBJECTIVES OF A STATISTICAL ANALYSIS
3.2 EXPLORATORY DATA ANALYSIS
3.2.1 NUMERICAL SUMMARIES
3.2.2 LINKING SAMPLE STATISTICS AND PROBABILITY MODELS
3.2.3 GRAPHICAL SUMMARIES
3.2.4 OUTLIERS
3.3 PARAMETER ESTIMATION
3.3.1 MAXIMUM LIKELIHOOD ESTIMATION
3.3.2 METHOD OF MOMENTS ESTIMATION
3.4 SAMPLING DISTRIBUTIONS
3.5 HYPOTHESIS TESTING
3.5.1 TESTS FOR NORMAL DATA I - THE Z-TEST (σ KNOWN)
3.5.2 HYPOTHESIS TESTING TERMINOLOGY
3.5.3 TESTS FOR NORMAL DATA II - THE T-TEST (σ UNKNOWN)
3.5.4 TESTS FOR NORMAL DATA III - TESTING σ
3.5.5 TWO SAMPLE TESTS
3.5.6 ONE-SIDED AND TWO-SIDED TESTS
3.5.7 CONFIDENCE INTERVALS
3.6 MODEL TESTING AND VALIDATION
3.6.1 PROBABILITY PLOTS
3.6.2 THE CHI-SQUARED GOODNESS-OF-FIT TEST
3.7 HYPOTHESIS TESTING EXTENSIONS
3.7.1 ANALYSIS OF VARIANCE
3.7.2 NON-NORMAL DATA: COUNTS AND PROPORTIONS
3.7.3 CONTINGENCY TABLES AND THE CHI-SQUARED TEST
3.7.4 2 × 2 TABLES
3.7.5 NON-PARAMETRIC TESTS
3.7.6 EXACT TESTS
3.8 POWER AND SAMPLE SIZE
3.8.1 POWER CALCULATIONS FOR NORMAL SAMPLES
3.8.2 EXTENSIONS: SIMULATION STUDIES
3.9 MULTIPLE TESTING
3.9.1 THE BONFERRONI AND OTHER CORRECTIONS
3.9.2 THE FALSE DISCOVERY RATE
3.9.3 STEP-DOWN AND STEP-UP ADJUSTMENT PROCEDURES
3.10 PERMUTATION TESTS AND RESAMPLING METHODS
3.10.1 PERMUTATION TESTS
3.10.2 MONTE CARLO METHODS
3.10.3 RESAMPLING METHODS AND THE BOOTSTRAP
3.11 REGRESSION ANALYSIS AND THE LINEAR MODEL
3.11.1 TERMINOLOGY
3.11.2 LEAST-SQUARES ESTIMATION
3.11.3 LEAST-SQUARES AS MAXIMUM LIKELIHOOD ESTIMATION
3.11.4 ESTIMATES OF ERROR VARIANCE AND RESIDUALS
3.11.5 PREDICTION FOR A NEW COVARIATE VALUE
3.11.6 STANDARD ERRORS OF ESTIMATORS AND T-STATISTICS
To explain the variation in observed data, we need to introduce the concept of a probability
distribution. Essentially we need to be able to model, or specify, or compute the “chance” of
observing the data that we collect or expect to collect. This will then allow us to assess how likely
the data were to occur by chance alone, that is, how “surprising” the observed data are in light of
an assumed theoretical model.
For example, consider two nucleotide sequences of the same length that we wish to assess for
similarity:
Sequence 1: ATAGTAGATACGCACCGAGGA
point mutation A → C occurs (as in the discordant position 3) in unit evolutionary time. Perhaps
the chance of observing a sub-sequence
ATCTTA
rather than
ATAGTA
(in positions 1-6) is important. Is the hidden (or latent) structure in the sequence, corresponding
to whether the sequence originates from a coding region or otherwise, important? Can we even
infer the hidden structure in light of the data we have observed?
These questions can only really be answered when we have an understanding of randomness
and variation. The framework that we will use to pose and answer such questions formally is given
to us by probability theory.
SIMPLE EXAMPLES:
(a) Coin tossing: S = {H, T }.
(b) Dice : S = {1, 2, 3, 4, 5, 6}.
(c) Proportions: S = {x : 0 ≤ x ≤ 1}
(d) Time measurement: S = {x : x > 0} = R+
(e) Temperature measurement: S = {x : a ≤ x ≤ b} ⊆ R
In biological sequence analysis, the experiment may involve the observation of a nucleotide or
protein sequence, so that the sample space S may comprise all sequences (of bases/amino acids)
up to a given length, and a sample outcome would be a particular observed sequence.
Experimental outcomes arise in two basic ways, by COUNTING and by MEASUREMENT - we shall
see that these two types lead to two distinct ways of specifying probability distributions.
Definition 1.2.1 An event E is a set of the possible outcomes of the experiment, that is, E is a
subset of S, E ⊆ S; E occurs if the actual outcome is in this set.
NOTE: the sets S and E can either be written as a list of items (which may be a finite or infinite
list), or can only be represented by a continuum of outcomes, for example
E = {x : 0.6 < x ≤ 2.3}
Events are manipulated using set theory notation; if E, F are two events, E, F ⊆ S, we can form
the union E ∪ F, the intersection E ∩ F, and the complement E′. We can interpret the events
E ∪ F, E ∩ F, and E′ in terms of collections of sample outcomes, and use Venn diagrams to
represent these concepts.
Another representation for this two event situation is given by the following table:
            E           E′           Union
F           E ∩ F       E′ ∩ F       F
F′          E ∩ F′      E′ ∩ F′      F′
Union       E           E′
E ∩ F = Ø
that is, the collections of sample outcomes E and F have no element in common.
Mutually exclusive events are very important in probability and statistics, as they allow complicated
events to be simplified in such a way as to allow straightforward probability calculations to be made.
(a) Ei ∩ Ej = Ø for all i ≠ j        (b) E1 ∪ E2 ∪ ... ∪ Ek = F.
We are interested in mutually exclusive events and partitions because when we carry out prob-
ability calculations we will essentially be counting or enumerating sample outcomes; to ease this
counting operation, it is desirable to deal with collections of outcomes that are completely distinct
or disjoint.
(1) 0 ≤ P (E) ≤ 1
(2) P (Ω) = 1
If E ∩ F ≠ Ø, then P(E ∪ F) = P(E) + P(F) − P(E ∩ F)
            E              E′               Sum
F           P(E ∩ F)       P(E′ ∩ F)        P(F)
F′          P(E ∩ F′)      P(E′ ∩ F′)       P(F′)
Sum         P(E)           P(E′)
P(E ∩ F) + P(E ∩ F′) = P(E)
P(E′ ∩ F) + P(E′ ∩ F′) = P(E′)
P(E ∩ F′) + P(E′ ∩ F′) = P(F′)
The result of this study is clear: the pass rate for MALES is higher than that for FEMALES.
Further investigation revealed a more complex result: for the essay paper, the results were as
follows;
            PASS    FAIL    PASS RATE
FEMALE      80      20      0.8
MALE        210     90      0.7
so for the multiple choice paper, the pass rate for FEMALES is higher than that for MALES.
Hence we conclude that FEMALES have a higher pass rate on the essay paper, and FEMALES
have a higher pass rate on the multiple choice test, but MALES have a higher pass rate overall.
This apparent contradiction can be resolved by careful use of the probability definitions. First
introduce notation; let E be the event that the student chooses an essay, F be the event that the
student is female, and G be the event that the student passes the selected paper.
Definition 1.4.1 For two events E and F with P (F ) > 0, the conditional probability that E
occurs, given that F occurs, is written P (E|F ), and is defined by
P(E|F) = P(E ∩ F) / P(F)        so that        P(E ∩ F) = P(E|F)P(F)
It is easy to show that this new probability operator P ( . | . ) satisfies the probability axioms.
The probability of the intersection of events E1 , ..., Ek is given by the chain rule
P (E1 ∩ ... ∩ Ek ) = P (E1 )P (E2 |E1 )P (E3 |E1 ∩ E2 )...P (Ek |E1 ∩ E2 ∩ ... ∩ Ek−1 )
A simple way to think about joint and conditional probability is via a probability tree:
The chain rule construction is particularly important in biological sequence analysis; consider one
of the sequences introduced at the start of this chapter
0 ≤ pA , pC , pG , pT ≤ 1 pA + pC + pG + pT = 1
which simplifies to
However, the assumption of independence may not be correct; perhaps knowledge about a base
being in one position influences the probability of the base in the next position. In this case, we
would have to write (in general)
Finally, our estimate (or specified value) for pA , pC , pG , pT may change due to the hidden structure
of the underlying genomic segment; that is, whether the segment is from a codon or otherwise; for
example
P(A|Exon) = pA^(E)        P(A|Intron) = pA^(I)
where it is not necessarily the case that pA^(E) = pA^(I) = pA.
[In the exam results problem, what we really have specified are conditional probabilities. From the
pooled table, we have
P(G|F) = 0.5        P(G|F′) = 0.6,
from the essay results table, we have
P(G|E ∩ F) = 0.4        P(G|E ∩ F′) = 0.3,
P(E|F) = P(F|E)P(E) / P(F)
If events E1 , ..., Ek form a partition of S, with P (Ei ) > 0 for all i, then
P(Ei|F) = P(F|Ei)P(Ei) / P(F)        where        P(F) = P(F|E1)P(E1) + ... + P(F|Ek)P(Ek)
Note that this result follows immediately from the conditional probability definition that
P(E ∩ F) = P(E|F)P(F)        and        P(E ∩ F) = P(F|E)P(E)
and hence equating the right hand sides of the two equations we have
P(E|F)P(F) = P(F|E)P(E)
and hence the result follows. Note also that in the second part of the theorem,
P(Ei|F) = P(F|Ei)P(Ei) / P(F) = [ P(F|Ei) / P(F) ] P(Ei)
P(E|F) ≠ P(F|E)
The test is regarded as a good way of determining guilt, because laboratory testing indicates that
the detection rates are high; for example, it is known that P(T|G) = 0.95 and P(T|G′) = 0.01, where
G is the event that the suspect is guilty and T is the event that the suspect fails the test, and the
prior probability of guilt is P(G) = 0.005.
Suppose that the suspect fails the test. What can be concluded? The probability of real interest
is P(G|T); we do not have this probability but can compute it using Bayes Theorem.
so that
P(G|T) = (0.95 × 0.005) / (0.95 × 0.005 + 0.01 × 0.995) = 0.323
which is still relatively small. So, as a result of the lie-detector test being failed, the probability of
guilt of the suspect has increased from 0.005 to 0.323.
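For illustration, the Bayes Theorem calculation above can be reproduced numerically. The following Python sketch is not part of the original notes; the function name and structure are arbitrary.

    # Posterior probability of guilt given a failed lie-detector test, via Bayes Theorem.
    def bayes_posterior(prior, p_fail_given_guilty, p_fail_given_innocent):
        """P(G|T) = P(T|G)P(G) / [P(T|G)P(G) + P(T|G')P(G')]"""
        numerator = p_fail_given_guilty * prior
        denominator = numerator + p_fail_given_innocent * (1.0 - prior)
        return numerator / denominator

    print(bayes_posterior(0.005, 0.95, 0.01))   # approximately 0.323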
CHAPTER 2
PROBABILITY DISTRIBUTIONS
2.1 MOTIVATION
The probability definitions, rules, and theorems given previously are all framed in terms of events
in a sample space. For example, for an experiment with possible sample outcomes denoted by the
sample space S, an event E was defined as any collection of sample outcomes, that is, any subset
of the set S.
In this framework, it is necessary to consider each experiment with its associated sample space
separately - the nature of sample space S is typically different for different experiments.
EXAMPLE 1: Count the number of days in February which have zero precipitation.
SAMPLE SPACE S = {0, 1, 2, ..., 28}. Let Ei = "i days have zero precipitation"; E0, ..., E28
partition S.
A general notation useful for all such examples can be obtained by considering a sample space
that is equivalent to S for a general experiment, but whose form is more familiar. For example,
for a general sample space S, if it were possible to associate a subset of the integer or real
number systems, X say, with S, then attention could be restricted to considering events in X,
whose structure is more convenient: whereas events in S are collections of sample outcomes, events
in X are intervals of the real numbers. For example, consider an experiment involving counting the
number of breakdowns of a production line in a given month. The experimental sample space S
is therefore the collection of sample outcomes s0 , s1 , s2 , ... where si is the outcome “there were i
breakdowns”; events in S are collections of the si s. Then a useful equivalent sample space is the
set X = {0, 1, 2, ...}, and events in X are collections of non-negative integers. Formally, therefore,
we seek a function or map from S to X. This map is known as a random variable.
A random variable X is a function from experimental sample space S to some set of real numbers
X that maps s ∈ S to a unique x ∈ X
X : S → X ⊆ R,        s ↦ x
Depending on the type of experiment being carried out, there are two possible forms for the set of
values that X can take:
A random variable is DISCRETE if the set X is of the form
X = {x1, x2, ..., xn, ...}
that is, a finite or infinite set of distinct values x1, x2, ..., xn, .... Discrete random variables are
used to describe the outcomes of experiments that involve counting or classification.
A random variable is CONTINUOUS if the set X is of the form
X = (a1, b1) ∪ (a2, b2) ∪ ...
for real numbers ai, bi, that is, the union of intervals in R. Continuous random variables are used
to describe the outcomes of experiments that involve measurement.
For a random variable X, probabilities of the form
P[ X = x ] or P[ X ≤ x ]
can be calculated for each x in a suitable range X. The functions used to specify these probabil-
ities are just real-valued functions of a single real argument, similar to polynomial, exponential,
logarithmic or trigonometric functions such as (for example)
f(x) = e^x
and so on. However, the fundamental rules of probability mean that the functions specifying
P [ X = x ] or P [ X ≤ x ] must exhibit certain properties. As we shall see below, the properties of
these functions, and how they are manipulated mathematically, depend crucially on the nature of
the random variable.
The probability distribution of a discrete random variable X is described by the probability mass
function (pmf) fX, specified by
fX(x) = P[ X = x ]        for x ∈ X
Because of the probability axioms, the function fX must exhibit the following properties:
(i) fX(xi) ≥ 0 for all i        (ii) Σ_i fX(xi) = 1.
FX(x) = P[ X ≤ x ]        for x ∈ R
The functions fX and/or FX can be used to describe the probability distribution of random
variable X.
X = {0, 1, 2, 3, 4, 5, 6}
To specify the probability distribution of X, can use the mass function fX or the cdf FX . For
example,
x          0       1       2       3       4       5       6
fX(x)      1/16    2/16    4/16    4/16    2/16    2/16    1/16
FX(x)      1/16    3/16    7/16    11/16   13/16   15/16   16/16
P [ X ≤ 2.5 ] ≡ P [ X ≤ 2 ]
X = {1, 2, 3, ...}
To specify the probability distribution of X, can use the mass function fX or the cdf FX . Now,
fX(x) = P[ X = x ] = (1 − θ)^(x−1) θ
for x = 1, 2, 3, ... (if the first crash occurs on day x, then we must have a sequence of x − 1 crash-free
days, followed by a crash on day x). Also
FX(x) = P[ X ≤ x ] = Σ_{i=1}^{x} (1 − θ)^(i−1) θ = 1 − (1 − θ)^x
as the terms in the summation are merely a geometric progression with first term θ and common
ratio 1 − θ.
so that
FX(x) = Σ_{xi ≤ x} fX(xi),
and
fX(x1) = FX(x1),        fX(xi) = FX(xi) − FX(xi−1)
Hence, in the discrete case, we can calculate FX from fX by summation, and calculate fX from
FX by differencing.
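The summation/differencing relationship is easy to check numerically; the following Python sketch (an editorial illustration, not part of the original notes) uses the example pmf tabulated above.

    import numpy as np

    # pmf of the discrete example above, on X = {0, 1, ..., 6}
    x = np.arange(7)
    f = np.array([1, 2, 4, 4, 2, 2, 1]) / 16.0

    F = np.cumsum(f)                 # cdf by summation: 1/16, 3/16, 7/16, 11/16, 13/16, 15/16, 1
    f_back = np.diff(F, prepend=0.0) # pmf recovered from the cdf by differencing

    print(np.allclose(f, f_back))    # True
    print(F * 16)                    # [ 1.  3.  7. 11. 13. 15. 16.]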
Discrete probability models are used to model the outcomes of counting experiments. Depending
on the experimental situation, it is often possible to justify the use of one of a class of “Special”
discrete probability distributions. These are listed in this chapter, and are all motivated from the
central concept of a binary or 0-1 trial, where the random variable concerned has range consisting
of only two values with associated probabilities θ and 1 − θ respectively; typically we think of the
possible outcomes as “successes” and “failures”. All of the distributions in this section are derived
by making different modelling assumptions about sequences of 0-1 trials.
The sum of n i.i.d. Geometric(θ) random variables has a Negative Binomial distribution, that is,
If X1, ..., Xn ∼ Geometric(θ) with X1, ..., Xn i.i.d., then X = X1 + ... + Xn ∼ NegBin(n, θ)
that is, the number of trials until the nth 1 is the sum of the number of trials until the first 1,
plus the number of trials between the first and second 1, etc. For this reason, the negative
binomial distribution is also known as the GENERALIZED GEOMETRIC distribution.
fX(x) = ( e^(−λ) λ^x ) / x!        x ∈ {0, 1, 2, ...}
That is, if we write θ = λ/n, and then consider a limiting case as n → ∞, then
fX(x) = (n choose x) (λ/n)^x (1 − λ/n)^(n−x) = (λ^x / x!) (1 − λ/n)^n [ n! / ( (n − x)! (n − λ)^x ) ] → (λ^x / x!) e^(−λ)
Binomial(n, θ) → Poisson(λ)
in this limiting case. The Poisson model is appropriate for count data, where the number of
events (accidents, breakdowns etc.) that occur in a given time period is being counted.
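The limiting relationship can be checked numerically; the sketch below (not from the original notes, with λ = 4 chosen arbitrarily) compares the two mass functions for increasing n.

    from math import comb, exp, factorial

    # Numerical check of the limit Binomial(n, lambda/n) -> Poisson(lambda).
    lam = 4.0

    def binomial_pmf(x, n, theta):
        return comb(n, x) * theta**x * (1 - theta)**(n - x)

    def poisson_pmf(x, lam):
        return exp(-lam) * lam**x / factorial(x)

    for n in (10, 100, 10_000):
        gap = max(abs(binomial_pmf(x, n, lam / n) - poisson_pmf(x, lam)) for x in range(10))
        print(n, gap)     # the maximum discrepancy shrinks as n grows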
It is also related to the Poisson process; consider a sequence of events occurring
independently and at random in time at a constant rate λ per unit time. Let X(t) be the
random variable defined for t > 0 by
fX(t)(x) = P[ X(t) = x ] = ( e^(−λt) (λt)^x ) / x!        x ∈ {0, 1, 2, ...}.
The probability distribution of a continuous random variable X is defined by the continuous cu-
mulative distribution function or c.d.f., FX, specified by
FX(x) = P[ X ≤ x ]        x ∈ R
The associated probability density function (pdf) fX is given by
fX(x) = d/dx { FX(x) }
so that, by a fundamental calculus result,
FX(x) = ∫_{−∞}^{x} fX(t) dt
P[ a ≤ X ≤ b ] = FX(b) − FX(a) → 0        as b → a
so that
P[ X = x ] = 0
if X is continuous. Therefore we must use FX to specify the probability distribution initially, although
it is often easier to think of the “shape” of the distribution via the pdf fX . Any function that
satisfies the properties for a pdf can be used to construct a probability distribution. Note that, for
a continuous random variable
fX(x) ≠ P[ X = x ].
fX(x) = λ e^(−λx)        x ∈ R+
P[ X > x0 + x | X > x0 ] = e^(−λ(x0+x)) / e^(−λx0) = e^(−λx) = P[ X > x ]
Notes :
(1) If α > 1, Γ(α) = (α − 1)Γ(α − 1); if α = 1, 2, ..., Γ(α) = (α − 1)!.
(2) Γ(1/2) = √π.
(3) If α = 1, 2, ..., then the Gamma(α/2, 1/2) distribution is known as the Chi-squared
distribution with α degrees of freedom, denoted χ2α .
Interpretation : The expectation and variance of a probability distribution can be used to aid
description, or to characterize the distribution; the EXPECTATION is a measure of location (that
is, the "centre of mass" of the probability distribution). The VARIANCE is a measure of scale or
spread of the distribution (how widely the probability is distributed).
EXAMPLE Suppose that X is a discrete Poisson random variable taking values on X = {0, 1, 2, ...}
with mass function
fX(x) = ( λ^x / x! ) e^(−λ)        x = 0, 1, 2, ...
and zero otherwise. Then
E_{fX}[ X ] = Σ_{x=0}^{∞} x fX(x) = Σ_{x=0}^{∞} x (λ^x / x!) e^(−λ) = λ e^(−λ) Σ_{x=1}^{∞} λ^(x−1)/(x − 1)! = λ e^(−λ) Σ_{x=0}^{∞} λ^x/x! = λ e^(−λ) e^λ = λ
using the power series expansion definition for the exponential function
e^λ = Σ_{x=0}^{∞} λ^x / x! = 1 + λ + λ²/2! + λ³/3! + ...
EXAMPLE Suppose that X is a continuous random variable taking values on X = R+ with pdf
fX(x) = 2 / (1 + x)^3        x > 0.
Then, integrating by parts,
E_{fX}[ X ] = ∫_{−∞}^{∞} x fX(x) dx = ∫_{0}^{∞} 2x/(1 + x)^3 dx = [ −x/(1 + x)^2 ]_{0}^{∞} + ∫_{0}^{∞} 1/(1 + x)^2 dx = 0 + [ −1/(1 + x) ]_{0}^{∞} = 0 + 1 = 1
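As a quick numerical check of this integration (an editorial sketch, not part of the original notes), the total mass and the expectation of this density can be computed with standard quadrature.

    from scipy.integrate import quad

    # Check E[X] = 1 for the density f(x) = 2/(1+x)^3 on (0, infinity).
    f = lambda x: 2.0 / (1.0 + x) ** 3

    total_mass, _ = quad(f, 0, float("inf"))            # should be 1 (valid pdf)
    expectation, _ = quad(lambda x: x * f(x), 0, float("inf"))

    print(round(total_mass, 6), round(expectation, 6))  # 1.0  1.0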
so we have a simple additive property for expectations and variances. Note also that if a1 = 1, a2 =
−1, then
Sums of random variables crop up naturally in many statistical calculations. Often we are interested
in a random variable Y that is defined as the sum of some other independent and identically
distributed (i.i.d) random variables, X1 , ..., Xn . If
Y = X1 + X2 + ... + Xn = Σ_{i=1}^{n} Xi        with        E_{fXi}[ Xi ] = µ  and  Var_{fXi}[ Xi ] = σ²
we have
E_{fY}[ Y ] = Σ_{i=1}^{n} E_{fXi}[ Xi ] = nµ        and        Var_{fY}[ Y ] = Σ_{i=1}^{n} Var_{fXi}[ Xi ] = nσ²
and also, if
X̄ = (1/n) Σ_{i=1}^{n} Xi        is the sample mean random variable
then, using the properties listed above,
E[ X̄ ] = (1/n) E_{fY}[ Y ] = (1/n) nµ = µ        and        Var[ X̄ ] = (1/n²) Var_{fY}[ Y ] = (1/n²) nσ² = σ²/n
For example, if X is a continuous random variable, and g(x) = exp {−x} then
E_{fX}[ g(X) ] = E_{fX}[ exp{−X} ] = ∫_{−∞}^{∞} exp{−x} fX(x) dx
Note that Y = g(X) is also a random variable whose probability distribution we can calculate from
the probability distribution of X .
The expectations and variances for the special distributions described in previous sections are as
follows:
• DISCRETE DISTRIBUTIONS

  Distribution            Parameters    Expectation      Variance
  Binomial(n, θ)          n, θ          nθ               nθ(1 − θ)
  Poisson(λ)              λ             λ                λ
  Geometric(θ)            θ             1/θ              (1 − θ)/θ²
  NegBinomial(n, θ)       n, θ          n/θ              n(1 − θ)/θ²

• CONTINUOUS DISTRIBUTIONS

  Distribution            Parameters    Expectation      Variance
  Beta(α, β)              α, β          α/(α + β)        αβ / [ (α + β)²(α + β + 1) ]
  Normal(µ, σ²)           µ, σ²         µ                σ²
2.8.4 ENTROPY
For a random variable X with mass or density function fX , the entropy of the distribution is
defined by
H_{fX}[X] = E_{fX}[ −log fX(X) ] = − Σ_x log[ fX(x) ] fX(x)            (DISCRETE CASE)
H_{fX}[X] = E_{fX}[ −log fX(X) ] = − ∫ log[ fX(x) ] fX(x) dx           (CONTINUOUS CASE)
where log in this case can mean logarithm to any base; typically, log2 or ln (natural log) are
used. One interpretation of the entropy of a distribution is that it measures the “evenness” of the
distribution, that is, a distribution with high entropy assigns approximately equal probability to
each value of the random variable.
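The worked examples referred to below are not reproduced in this extract; as an editorial illustration, the following Python sketch contrasts a uniform distribution (maximal evenness) with a highly concentrated one.

    import numpy as np

    def entropy(pmf, base=2.0):
        """Entropy -sum p log p of a discrete distribution (0 log 0 treated as 0)."""
        p = np.asarray(pmf, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p)) / np.log(base)

    uniform = np.full(4, 0.25)                   # maximal "evenness" over 4 values
    peaked  = np.array([0.97, 0.01, 0.01, 0.01])

    print(entropy(uniform))   # 2.0 bits, the maximum for 4 outcomes
    print(entropy(peaked))    # much smaller: the distribution is highly concentrated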
These two examples illustrate another interpretation for the entropy as an overall measure of
uncertainty. In the first example, there is the maximum possible uncertainty, whereas in the
second example the uncertainty is at a minimum.
where the sum extends over values of x for which both f0 and f1 are non-zero. It is also possible to
obtain an overall measure of the difference in entropy terms between the two distributions as the
sum of these two measures.
H_{f0,f1}[X] = H_{f0||f1}[X] + H_{f1||f0}[X] = Σ_x log{ f0(x)/f1(x) } f0(x) + Σ_x log{ f1(x)/f0(x) } f1(x)
It can be shown that Hf0 ||f1 [X], Hf1 ||f0 [X] and hence Hf0 ,f1 [X] are all non-negative. Furthermore,
we can define the support for x in favour of f0 over f1 , denoted S0,1 (x) by
S0,1(x) = log{ f0(x) / f1(x) }
with the equivalent definition for S1,0(x) (where S1,0(x) = −S0,1(x)). Using this definition, we see
that S0,1 (X) is a random variable, and using the general definition of expectation we have that the
expectation of S0,1 (x) is
Σ_x S0,1(x) f0(x) = Σ_x log{ f0(x)/f1(x) } f0(x) = H_{f0||f1}[X]
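The relative entropy calculation is easy to reproduce numerically; the sketch below (an editorial addition, with two arbitrary three-point distributions) follows the definition above, summing only over values where both masses are non-zero.

    import numpy as np

    def relative_entropy(f0, f1):
        """H_{f0||f1} = sum_x log(f0/f1) f0, over x where both masses are non-zero."""
        f0, f1 = np.asarray(f0, float), np.asarray(f1, float)
        keep = (f0 > 0) & (f1 > 0)
        return np.sum(np.log(f0[keep] / f1[keep]) * f0[keep])

    f0 = np.array([0.5, 0.3, 0.2])
    f1 = np.array([0.2, 0.3, 0.5])

    h01, h10 = relative_entropy(f0, f1), relative_entropy(f1, f0)
    print(h01, h10, h01 + h10)    # all non-negative; the sum is the symmetrized measure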
where Ay = { x ∈ X : g(x) ≤ y }. Often, the set Ay is easy to identify for a given y, and this
becomes our main objective in the calculation.
so that Ay = { x ∈ X : ax + b ≤ y} = { x ∈ X : x ≤ (y − b)/a} .
We may be interested in the mass or density function of the newly formed variable Y ; in that
case we could take the cdf formed above and use it to calculate the mass function/pdf. For example,
if Y = aX + b when a > 0 then
FY(y) = FX( (y − b)/a )    ⟹    fY(y) = d/dy { FX( (y − b)/a ) } = (1/a) fX( (y − b)/a )
d/dx { g(h(x)) } = h′(x) g′(h(x))        where        g′(x) = dg(x)/dx,    h′(x) = dh(x)/dx
In the discrete case, it may be easier to consider the mass function directly rather than the
cdf. However, for a particular type of transformation, namely 1-1 transformations, it is possible
to produce a general transformation result that allows direct calculation of the distribution of the
transformed variable.
X ∼ Uniform(0, 1)        Y = −(1/λ) log X        Y ∼ Exponential(λ)
X ∼ Normal(0, 1)         Y = µ + σX              Y ∼ Normal(µ, σ²)
X ∼ Normal(0, 1)         Y = X²                  Y ∼ Gamma(1/2, 1/2) ≡ χ²_1
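The first of these connections can be verified by simulation; the following Python sketch (not part of the original notes, with λ = 2 chosen arbitrarily) checks that the transformed uniforms behave like Exponential(λ) draws.

    import numpy as np
    from scipy import stats

    # X ~ Uniform(0,1), Y = -(1/lambda) log X should follow Exponential(lambda).
    rng = np.random.default_rng(0)
    lam = 2.0

    x = rng.uniform(size=100_000)
    y = -np.log(x) / lam

    ks = stats.kstest(y, "expon", args=(0, 1 / lam))   # scale parameter = 1/lambda
    print(y.mean(), 1 / lam)       # sample mean close to 1/lambda
    print(ks.pvalue)               # typically large: no evidence against Exponential(lam)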
Typically, such a specification is represented by a probability table; for example, for discrete
random variables X1 and X2 each taking values in {1, 2, 3, 4}, the joint probabilities fX1,X2(x1, x2)
can be laid out in a table with a column for each value of X1 and a row for each value of X2.
From such a table, the marginal mass functions can be computed easily, by appending the column
totals fX1(x1) and the row totals fX2(x2),
so that the marginal mass functions are formed by the column and row sums respectively. In this
case, it turns out that fX1 (x) = fX2 (x), for each x = 1, 2, 3, 4, but this will not always be the case.
P [ X1 = x1 | X2 = x2 ]
that is, the conditional probability distribution of X1 , given that X2 = x2 . This conditional
distribution is easily computed from the conditional probability definition, that is
By extending these concepts, we may define the conditional probability distributions for both
variables in the discrete and continuous cases; The two conditional mass/density functions are
fX1 |X2 (x1 |x2 ) and fX2 |X1 (x2 |x1 )
fX1|X2(x1|x2) = P[ X1 = x1 | X2 = x2 ] = P[ (X1 = x1) ∩ (X2 = x2) ] / P[ X2 = x2 ]
In the example table, the column corresponding to X1 = 2 gives the conditional mass function for
X2 given that X1 = 2 after division by the column total fX1(2) = 0.500; from the definition,
fX2|X1(1|2) = 0.200/0.500 = 0.400        fX2|X1(2|2) = 0.250/0.500 = 0.500
fX2|X1(3|2) = 0.050/0.500 = 0.100        fX2|X1(4|2) = 0.000/0.500 = 0.000
Note that
Σ_{x=1}^{4} fX2|X1(x|2) = 0.400 + 0.500 + 0.100 + 0.000 = 1
Note that, in general, the conditional mass functions will be different for different values of
the conditioning variable.
SUMMARY
Suppose that X1 and X2 are discrete random variables that take values {1, 2, ..., n} and {1, 2, ..., m}
respectively. Then the joint mass function can be displayed as a table with n columns and m rows,
where
• the conditional mass function for X1 given X2 = j is given by the jth row divided by the
sum of the jth row
• the conditional mass function for X2 given X1 = i is given by the ith column divided by the
sum of the ith column
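These row/column operations are mechanical; the Python sketch below is an editorial illustration using a hypothetical 4 × 4 joint table in which only the X1 = 2 column (0.200, 0.250, 0.050, 0.000) is taken from the example above.

    import numpy as np

    # Hypothetical joint table: rows index X2, columns index X1.
    joint = np.array([
        [0.10, 0.20, 0.05, 0.05],
        [0.05, 0.25, 0.05, 0.05],
        [0.02, 0.05, 0.03, 0.02],
        [0.03, 0.00, 0.02, 0.03],
    ])
    assert np.isclose(joint.sum(), 1.0)

    f_x1 = joint.sum(axis=0)            # marginal of X1 (column sums)
    f_x2 = joint.sum(axis=1)            # marginal of X2 (row sums)

    # conditional mass function of X2 given X1 = 2 (second column, zero-indexed as 1)
    f_x2_given_x1_2 = joint[:, 1] / f_x1[1]
    print(f_x2_given_x1_2, f_x2_given_x1_2.sum())   # 0.4, 0.5, 0.1, 0.0; sums to 1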
2.10.4 INDEPENDENCE
Random variables X1 and X2 are independent if
(i) the joint mass/density function of X1 and X2 factorizes into the product of the two marginal
pdfs, that is,
fX1 ,X2 (x1 , x2 ) = fX1 (x1 )fX2 (x2 )
(ii) the range of X1 does not conflict/influence/depend on the range of X2 (and vice versa).
The concept of independence for random variables is closely related to the concept of independence
for events.
where
E_{fX1,X2}[ X1 X2 ] = Σ_{x2} Σ_{x1} x1 x2 fX1,X2(x1, x2)                (X1 and X2 discrete)
E_{fX1,X2}[ X1 X2 ] = ∫ ∫ x1 x2 fX1,X2(x1, x2) dx1 dx2                  (X1 and X2 continuous)
is the expectation of the function g (x1 , x2 ) = x1 x2 with respect to the joint probability function
fX1 ,X2 , and where µi = EfXi [Xi ] is the expectation of Xi , for i = 1, 2.
If
CovfX1 ,X2 [X1 , X2 ] = CorrfX1 ,X2 [X1 , X2 ] = 0
then variables X1 and X2 are uncorrelated. Note that if random variables X1 and X2 are
independent then
CovfX1 ,X2 [X1 , X2 ] = EfX1 ,X2 [X1 X2 ]−EfX1 [X1 ]EfX2 [X2 ] = EfX1 [X1 ]EfX2 [X2 ]−EfX1 [X1 ]EfX2 [X2 ] = 0
and so X1 and X2 are also uncorrelated (note that the converse does not necessarily hold).
Key interpretation: covariance and correlation measure the degree of association between X1 and X2;
that is, two variables for which the correlation is large in magnitude are strongly associated,
whereas variables that have low correlation are weakly associated.
(ii) The extension to k variables: covariances can only be calculated for pairs of random variables,
but if k variables have a joint probability structure it is possible to construct a k × k matrix, C
say, of covariance values, whose (i, j)th element is
Cov_{fXi,Xj}[ Xi, Xj ]
for i, j = 1, ..., k (so C is symmetric), that captures the complete covariance structure in the joint
distribution. If i = j,
CovfXi ,Xi [Xi , Xi ] ≡ V arfXi [Xi ]
The matrix C is referred to as the variance-covariance matrix.
X = Σ_{i=1}^{k} ai Xi
Var_{fX}[ X ] = Σ_{i=1}^{k} ai² Var_{fXi}[ Xi ] + 2 Σ_{i=1}^{k} Σ_{j=1}^{i−1} ai aj Cov_{fXi,Xj}[ Xi, Xj ]
(iv) Combining (i) and (iii) when k = 2, and defining standardized variables Z1 and Z2 ,
0 ≤ V arfZ1 ,Z2 [Z1 ± Z2 ] = V arfZ1 [Z1 ] + V arfZ2 [Z2 ] ± 2 CovfZ1 ,Z2 [Z1 , Z2 ]
that is, the correlation is bounded between -1 and 1. We will see later how to compute covari-
ance and correlation for sample data; there is a close relationship between theoretical and sample
covariances and correlations.
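As a preview of the sample versions of these quantities, the sketch below (an editorial addition using simulated data) builds a variance-covariance matrix and a correlation matrix for k = 3 variables.

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=500)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)   # strongly associated with x1
    x3 = rng.normal(size=500)                    # independent of the others

    data = np.vstack([x1, x2, x3])
    C = np.cov(data)        # 3 x 3 symmetric variance-covariance matrix
    R = np.corrcoef(data)   # correlations, bounded between -1 and 1

    print(np.round(C, 2))
    print(np.round(R, 2))   # R[0,1] large; R[0,2], R[1,2] near 0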
For n independent, identically distributed random variables X1 , ..., Xn , with marginal density func-
tion fX , there are two main results to consider; it can be shown that the joint density function of
the order statistics Y1 , ...., Yn is given by
fY1 ,...,Yn (y1 , ..., yn ) = n!fX (y1 )...fX (yn ) y1 < ... < yn
and that the marginal pdf of the jth order statistic Yj for j = 1, ..., n has the form
fYj(yj) = [ n! / ( (j − 1)! (n − j)! ) ] {FX(yj)}^(j−1) {1 − FX(yj)}^(n−j) fX(yj)
To derive the marginal pdf of the maximum Yn , first consider the marginal cdf of Yn ;
FYn(yn) = P[ Yn ≤ yn ] = P[ X1 ≤ yn, ..., Xn ≤ yn ] = Π_{i=1}^{n} P[ Xi ≤ yn ] = {FX(yn)}^n
and so
fYn(yn) = n {FX(yn)}^(n−1) fX(yn)        (differentiating using the chain rule)
By a similar calculation, we can find the marginal pdf/cdf for the minimum Y1 ;
FY1(y1) = P[ Y1 ≤ y1 ] = 1 − P[ Y1 > y1 ] = 1 − P[ X1 > y1, ..., Xn > y1 ] = 1 − Π_{i=1}^{n} P[ Xi > y1 ]
= 1 − {1 − FX(y1)}^n
and so
fY1(y1) = n {1 − FX(y1)}^(n−1) fX(y1)        (differentiating using the chain rule)
Hence
FY1(y1) = 1 − {1 − FX(y1)}^n        fY1(y1) = n {1 − FX(y1)}^(n−1) fX(y1)
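Both cdf results are easy to confirm by simulation; the sketch below (an editorial illustration with Uniform(0,1) variables and arbitrary n and y) compares empirical probabilities with the formulas above.

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps, y = 5, 200_000, 0.8

    samples = rng.uniform(size=(reps, n))
    maxima = samples.max(axis=1)
    minima = samples.min(axis=1)

    print(np.mean(maxima <= y), y ** n)               # empirical vs  {F_X(y)}^n = 0.8^5
    print(np.mean(minima <= y), 1 - (1 - y) ** n)     # empirical vs  1 - {1 - F_X(y)}^n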
Type (I) and Type (II) distributions are transformed versions of Type (III) distributions; the
transformations are
log (Z − µ) and − log (µ − Z)
respectively. In addition, the distribution of the variable (−Z) has, in each case, an extreme value
distribution.
The GEV distribution describes the probability distribution of maximum order statistics. We now
consider threshold-exceedance distributions; that is, the distribution of observed values beyond
a certain high threshold value. Let X be a random variable with cdf FX , and for some fixed u, let
Y = (X − u) IX>u
FY(y; u) = P[ Y ≤ y; u ] = P[ X ≤ u + y | X > u ] = ( FX(u + y) − FX(u) ) / ( 1 − FX(u) )        y > 0
The distribution of Y as u approaches some upper endpoint is known as the Generalized Pareto
Distribution.
For events occurring in time or space, the number N of events that exceed a threshold u in any
time interval t, X(t), is often adequately modelled using a Poisson distribution with parameter
λt; we say that the events occur at rate λ. Given that N ≥ 1, the exceedances themselves are
distributed according to the GPD model, and the largest exceedance is well modelled using a GEV
distribution.
CHAPTER 3
STATISTICAL ANALYSIS
In practice, it is commonly assumed that f takes one of the familiar forms (Binomial, P oisson,
Exponential, N ormal etc.). Thus f depends on one or more parameters (θ, λ, (µ, σ) etc.). The
role of these parameters could be indicated by re-writing the function f(x) as f(x; θ).
• SUMMARY : Describe and summarize the sample {x1 , ..., xn } in such a way that allows
a specific probability model to be proposed.
• INFERENCE : Deduce and make inference about the parameter(s) of the probability
model θ.
• GOODNESS OF FIT : Test whether the probability model encapsulated in the mass/density
function f , and the other model assumptions are adequate to explain the experimental re-
sults.
The first objective can be viewed as an exploratory data analysis exercise - it is crucially important
to understand whether a proposed probability distribution is suitable for modelling the observed
data, otherwise the subsequent formal inference procedures (estimation, hypothesis testing, model
checking) cannot be used.
These features of the sample are important because we can relate them directly to features of
probability distributions.
• Sample mean
x̄ = (1/n) Σ_{i=1}^{n} xi
• Sample quantiles: suppose that the sample has been sorted into ascending order and re-
labelled x(1) < ... < x(n). Then the pth quantile, 0 < p < 100, is given by
x(p) = x(k)
• Sample skewness
A = [ Σ_{i=1}^{n} (xi − x̄)³ ] / [ Σ_{i=1}^{n} (xi − x̄)² ]
• Sample kurtosis
K = n [ Σ_{i=1}^{n} (xi − x̄)⁴ ] / [ Σ_{i=1}^{n} (xi − x̄)² ]²
f(n)(x) = 1/n        x ∈ {x1, ..., xn}.
Then the expectation and variance of this probability distribution are given by
E_{f(n)}[X] = Σ_{i=1}^{n} xi f(n)(xi) = Σ_{i=1}^{n} xi (1/n) = (1/n) Σ_{i=1}^{n} xi = x̄        Var_{f(n)}[X] = (1/n) Σ_{i=1}^{n} (xi − x̄)² = S²
that is, the sample mean. Similarly, the variance of this probability distribution is equal to sample
variance. In fact, each of the summary statistics listed above can be viewed as a feature of the
probability distribution described by mass function f(n) .
Now, consider this probability distribution as n increases to infinity. Then the sample mass function
f(n) tends to a function f which can be regarded as the “true” mass/density function, and the sample
mean, variance, percentiles etc. tend to the true mean, variance, percentiles of the distribution from
which the data are generated. In practice, of course, n is always finite, and thus the true distribution,
true mean etc., cannot be known exactly. Therefore, we approximate the true distribution by an
appropriately chosen distribution (Poisson, Exponential, Normal etc.) with parameters chosen to
correspond to the observed sample properties.
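For illustration (not part of the original notes), the link between the sample summaries and the empirical distribution f(n) can be checked directly; the data here are the first sample of eight observations from the table given in the sampling distributions section below.

    import numpy as np

    x = np.array([10.4, 11.2, 9.8, 10.2, 10.5, 8.9, 11.0, 10.3])

    xbar = x.mean()                          # sample mean = expectation of f_(n)
    S2 = np.mean((x - xbar) ** 2)            # variance of f_(n): (1/n) * sum (xi - xbar)^2

    print(round(xbar, 2), round(S2, 3))      # 10.29 and the corresponding sample variance
    print(np.percentile(x, [25, 50, 75]))    # sample quartiles (one of several quantile conventions)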
3.2.4 OUTLIERS
Sometimes, for example due to slight variation in experimental conditions, one or two values in the
sample may be much larger or much smaller in magnitude than the remainder of the sample. Such
observations are termed outliers and must be treated with care, as they can distort the impression
given by some of the summary statistics. For example, the sample mean and variance are extremely
sensitive to the presence of outliers in the sample. Other summary statistics, for example those
based on sample percentiles, are less sensitive to outliers. Outliers can usually be identified by
inspection of the raw data, or from careful plotting of histograms.
that is, the product of the n mass/density function terms (where the ith term is the mass/density
function evaluated at xi ) viewed as a function of θ.
STEP 2 Take the natural log of the likelihood, and collect terms involving θ.
STEP 3 Find the value θ̂ of θ ∈ Θ for which log L(θ) is maximized, for example by differentiation.
If θ is a single parameter, find θ̂ by solving
d/dθ { log L(θ) } = 0
If θ = (θ1, ..., θd) is a vector of d parameters, find θ̂ by solving
∂/∂θj { log L(θ) } = 0        j = 1, ..., d
in parameter space Θ.
Note that, if parameter space Θ is a bounded interval, then the maximum likelihood estimate may
lie on the boundary of Θ.
STEP 4 Check that the estimate θ̂ obtained in STEP 3 truly corresponds to a maximum in the
(log) likelihood function by inspecting the second derivative of log L(θ) with respect to θ. If
d²/dθ² { log L(θ) } < 0
at θ = θ̂, then θ̂ is confirmed as the m.l.e. of θ (other techniques may be used to verify that the
likelihood is maximized at θ̂).
This procedure is a systematic way of producing parameter estimates from sample data and a
probability model; it can be shown that such an approach produces estimates that have good
properties. After they have been obtained, the estimates can be used to carry out prediction of
behaviour for future samples.
EXAMPLE A sample x1 , ..., xn is modelled by a Poisson distribution with parameter denoted λ
f(x; θ) ≡ f(x; λ) = ( λ^x / x! ) e^(−λ)        x = 0, 1, 2, ...
for some λ > 0.
STEP 1 Calculate the likelihood function L(λ). For λ > 0,
L(λ) = Π_{i=1}^{n} f(xi; λ) = Π_{i=1}^{n} { ( λ^xi / xi! ) e^(−λ) } = ( λ^(x1+...+xn) / (x1! ... xn!) ) e^(−nλ)
STEP 2 Calculate the log-likelihood log L(λ).
log L(λ) = Σ_{i=1}^{n} xi log λ − nλ − Σ_{i=1}^{n} log(xi!)
STEP 3 Differentiate log L(λ) with respect to λ, and equate the derivative to zero.
d/dλ { log L(λ) } = (1/λ) Σ_{i=1}^{n} xi − n = 0    ⟹    λ̂ = (1/n) Σ_{i=1}^{n} xi = x̄
STEP 4 Check that the second derivative of log L(λ) with respect to λ is negative at λ = λ̂.
d²/dλ² { log L(λ) } = −(1/λ²) Σ_{i=1}^{n} xi < 0        at λ = λ̂
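The Poisson log-likelihood is easy to evaluate and maximize numerically; the sketch below (an editorial illustration with arbitrary count data, not taken from the notes) confirms that the maximizer coincides with x̄.

    import numpy as np
    from scipy.special import gammaln

    x = np.array([3, 1, 4, 2, 2, 5, 0, 3, 4, 2])      # illustrative count data

    def log_likelihood(lam, x):
        return np.sum(x) * np.log(lam) - len(x) * lam - np.sum(gammaln(x + 1))

    lam_grid = np.linspace(0.5, 6.0, 1101)
    loglik = np.array([log_likelihood(l, x) for l in lam_grid])

    print(x.mean())                         # analytical m.l.e. lambda-hat = xbar
    print(lam_grid[np.argmax(loglik)])      # numerical maximizer, close to xbar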
If five samples of eight observations are collected, however, we might get five different sample means
x1      x2      x3      x4      x5      x6      x7      x8      x̄
10.4    11.2    9.8     10.2    10.5    8.9     11.0    10.3    10.29
9.7     12.2    10.4    11.1    10.3    10.2    10.4    11.1    10.66
12.1    7.9     8.6     9.6     11.0    11.1    8.8     11.7    10.10
10.0    9.2     11.1    10.8    9.1     12.3    10.3    9.7     10.31
9.2     9.7     10.8    10.3    8.9     10.1    9.7     10.4    9.89
and so the estimate µ̂ of µ is different each time. We attempt to understand how x̄ varies by
calculating the probability distribution of the corresponding estimator, X̄.
The estimator X̄ is a random variable, the value of which is unknown before the experiment
is carried out. As a random variable, X̄ has a probability distribution, known as the sampling
distribution. The form of this distribution can often be calculated, and used to understand how x̄
varies. In the case where the sample data have a Normal distribution, the following theorem gives
the sampling distributions of the maximum likelihood estimators;
Interpretation : This theorem tells us how the sample mean and variance will behave if the
original random sample is assumed to come from a Normal distribution. In particular, it tells us
that
E[ X̄ ] = µ        E[ S² ] = ( (n − 1)/n ) σ²        E[ s² ] = σ²
If we believe that X1, ..., X10 are i.i.d random variables from a Normal distribution with parameters
µ = 10.0 and σ² = 25, then X̄ has a Normal distribution with parameters µ = 10.0 and σ² =
25/10 = 2.5.
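This sampling distribution can be seen directly by simulation; the Python sketch below is an editorial illustration using the parameter values of this example.

    import numpy as np

    # Simulate many samples of size n = 10 from Normal(mu = 10, sigma^2 = 25)
    # and compare the variance of the sample means with sigma^2 / n = 2.5.
    rng = np.random.default_rng(3)
    mu, sigma, n, reps = 10.0, 5.0, 10, 100_000

    xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

    print(xbars.mean())   # close to 10.0
    print(xbars.var())    # close to 2.5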
The result will be used to facilitate formal tests about model parameters. For example, given a
sample of experimental data, we wish to answer specific questions about parameters in a proposed
probability model.
Figure 3.1: CRITICAL REGIONS IN A Z-TEST (taken from Schaum's ELEMENTS OF STATISTICS II, Bernstein & Bernstein)
• TEST STATISTIC
• NULL DISTRIBUTION
• P-VALUE, denoted p.
• CRITICAL VALUE(S)
• α = 0.05 is the significance level of the test (we could use α = 0.01 if we require a “stronger”
test)
• The solution CR of Φ(CR ) = 1 − α/2 (CR = 1.96 above) gives the critical values of the test
±CR .
EXAMPLE : A sample of size 10 has sample mean x̄ = 19.7. To test the hypothesis
H0 : µ = 20.0
H1 : µ 6= 20.0
under the assumption that the data follow a Normal distribution with σ = 1.0. We have
z = (19.7 − 20.0) / (1.0/√10) = −0.95
which lies between the critical values ±1.96, and therefore we have no reason to reject H0 . Also,
the p-value is given by p = 2Φ(−0.95) = 0.342, which is greater than α = 0.05, which confirms that
we have no reason to reject H0 .
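The Z-test above can be reproduced in a few lines; the following Python sketch is an editorial illustration using the numbers of this example.

    from math import sqrt
    from scipy.stats import norm

    # One-sample Z-test (sigma known): statistic, two-sided p-value and critical value.
    xbar, mu0, sigma, n, alpha = 19.7, 20.0, 1.0, 10, 0.05

    z = (xbar - mu0) / (sigma / sqrt(n))
    p = 2 * norm.cdf(-abs(z))
    c = norm.ppf(1 - alpha / 2)            # critical value, 1.96

    print(round(z, 2), round(p, 3))        # approximately -0.95 and 0.34
    print(abs(z) > c)                      # False: no reason to reject H0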
and T has a Student-t distribution with n − 1 degrees of freedom, denoted St(n − 1). Thus we can
repeat the procedure used in the σ known case, but use the sampling distribution of T rather than
that of Z to assess whether the test statistic is “surprising” or not. Specifically, we calculate
(x̄ − µ)
t= √
s/ n
and find the critical values for a α = 0.05 significance test by finding the ordinates corresponding
to the 0.025 and 0.975 percentiles of a Student-t distribution, St(n − 1) (rather than a N (0, 1))
distribution.
EXAMPLE : A sample of size 10 has sample mean x̄ = 19.7 and s² = 0.78². To test
H0 : µ = 20.0
H1 : µ 6= 20.0
under the assumption that the data follow a Normal distribution with σ unknown. We have test
statistic t given by
t = (19.7 − 20.0) / (0.78/√10) = −1.22.
Figure 3.2: Student-t distribution for different values of the degrees of freedom.
The upper critical value CR is obtained by solving FSt(n−1) (CR ) = 0.975, where FSt(n−1) is the
c.d.f. of a Student-t distribution with n − 1 degrees of freedom; here n = 10, so we can use the
statistical tables to find CR = 2.262, and note that, as Student-t distributions are symmetric, the
lower critical value is −CR. Thus t lies between the critical values, and therefore we have no reason
to reject H0 . The p-value is given by
p = 2 FSt(n−1)(t)  if t < 0,        p = 2 ( 1 − FSt(n−1)(t) )  if t ≥ 0
so here, p = 2FSt(n−1) (−1.22) which we can find to give p = 0.253; this confirms that we have no
reason to reject H0 .
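Again for illustration (not part of the original notes), the t-test calculation can be checked with standard library routines.

    from math import sqrt
    from scipy import stats

    # One-sample t-test (sigma unknown): statistic, critical value and p-value from St(n-1).
    xbar, mu0, s, n, alpha = 19.7, 20.0, 0.78, 10, 0.05

    t = (xbar - mu0) / (s / sqrt(n))
    c = stats.t.ppf(1 - alpha / 2, df=n - 1)      # upper critical value, about 2.262
    p = 2 * stats.t.cdf(-abs(t), df=n - 1)

    print(round(t, 2), round(c, 3), round(p, 2))  # approximately -1.22, 2.262, 0.25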
Q = (n − 1)s² / σ²  ∼  χ²_{n−1}
q = (n − 1)s² / c
Figure 3.3: Chi-squared distribution for different values of the degrees of freedom.
then we can compare q with the critical values derived from a χ²_{n−1} distribution; we look for the
0.025 and 0.975 quantiles - note that the Chi-squared distribution is not symmetric, so we need
two distinct critical values.
so q is not a surprising observation from a χ²_{n−1} distribution, and hence we cannot reject H0.
which we can compare with the standard normal distribution; if z is a surprising observation
from N (0, 1), and lies outside of the critical region, then we can reject H0 . This procedure is
the Two Sample Z-Test.
It can be shown that, if H0 is true then t should be an observation from a Student-t distribution
with nX + nY − 2 degrees of freedom. Hence we can derive the critical values from the tables
of the Student-t distribution.
3. If σX ≠ σY, but both parameters are known, we can use a similar approach to the one above
to derive test statistic z defined by
z = (x̄ − ȳ) / √( σX²/nX + σY²/nY )
4. If σX ≠ σY, but both parameters are unknown, we can use a similar approach to the one
above to derive test statistic t defined by
t = (x̄ − ȳ) / √( sX²/nX + sY²/nY )
for which the distribution if H0 is true is not analytically available, but can be adequately
approximated by a Student(m) distribution, where
m = (wX + wY)² / ( wX²/(nX − 1) + wY²/(nY − 1) )
where
wX = sX²/nX        wY = sY²/nY
Clearly, the choice of test depends on whether σ X = σ Y or otherwise; we may test this hypothesis
formally; to test
H0 : σ X = σ Y
H1 : σX ≠ σY
we compute the test statistic q = s2X /s2Y , which has a null distribution known as the Fisher
or F distribution with (nX − 1, nY − 1) degrees of freedom; this distribution can be denoted
F (nX − 1, nY − 1), and its quantiles are tabulated. Hence we can look up the 0.025 and 0.975
quantiles of this distribution (the F distribution is not symmetric), and hence define the critical
region; informally, if the test statistic q is very small or very large, then it is a surprising observation
from the F distribution and hence we reject the hypothesis of equal variances.
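The two-sample workflow described above can be sketched as follows; this is an editorial illustration using the Segment A and B measurements from the ANOVA example in section 3.7.1 below, and the choice between pooled and Welch t-tests is driven by the F-test result.

    import numpy as np
    from scipy import stats

    x = np.array([42.7, 45.6, 43.1, 41.6])          # Segment A
    y = np.array([44.9, 48.3, 46.2])                # Segment B

    q = x.var(ddof=1) / y.var(ddof=1)               # F statistic s_X^2 / s_Y^2
    f_lo = stats.f.ppf(0.025, len(x) - 1, len(y) - 1)
    f_hi = stats.f.ppf(0.975, len(x) - 1, len(y) - 1)
    equal_var = f_lo < q < f_hi                     # q not surprising: treat variances as equal

    t_stat, p_val = stats.ttest_ind(x, y, equal_var=equal_var)
    print(round(q, 2), equal_var, round(t_stat, 2), round(p_val, 3))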
H0 : µ = c
H1 : µ ≠ c
which is referred to as a two-sided test, that is, the alternative hypothesis is supported by an
extreme test statistic in either tail of the distribution. We may also consider a one-sided test of
the form
H0 : µ = c,  H1 : µ > c        or        H0 : µ = c,  H1 : µ < c.
Such a test proceeds exactly as the two-sided test, except that a significant result can only occur
in the right (or left) tail of the null distribution, and there is a single critical value, placed, for
example, at the 0.95 (or 0.05) probability point of the null distribution.
Z = (X̄ − µ) / (σ/√n) ∼ N(0, 1)
that is, for critical values ±CR in the test at the 5% significance level,
P[ −CR ≤ Z ≤ CR ] = P[ −CR ≤ (X̄ − µ)/(σ/√n) ≤ CR ] = 0.95
from which we deduce a 95 % Confidence Interval for µ based on the sample mean x̄ of
x̄ ± 1.96 σ/√n
We can derive other confidence intervals (corresponding to different significance levels in the equiv-
alent tests) by looking up the appropriate values of the critical values. The general approach for
construction of confidence interval for generic parameter θ proceeds as follows. From the modelling
assumptions, we derive a pivotal quantity, that is, a statistic, TP Q , say, (usually the test statistic
random variable) that depends on θ, but whose sampling distribution is “parameter-free” (that is,
does not depend on θ). We then look up the critical values CR1 and CR2 , such that
P [CR1 ≤ TP Q ≤ CR2 ] = 1 − α
where α is the significance level of the corresponding test. We then rearrange this expression to
the form
P [c1 ≤ θ ≤ c2 ] = 1 − α
where c1 and c2 are functions of CR1 and CR2 respectively. Then a 1 − α Confidence Interval for
θ is [c1 , c2 ].
SUMMARY
For the tests discussed in previous sections, the calculation of the form of the confidence intervals
is straightforward: in each case, CR1 and CR2 are the α/2 and 1 − α/2 quantiles of the distribution
of the pivotal quantity.
Test        Pivotal quantity                Null distribution    Parameter    Confidence interval
T-TEST      T = (X̄ − µ)/(s/√n)             St(n − 1)            µ            x̄ ± CR s/√n
Q-TEST      Q = (n − 1)s²/σ²                χ²_{n−1}             σ²           [ (n − 1)s²/CR2 , (n − 1)s²/CR1 ]
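Both intervals are straightforward to compute; the sketch below is an editorial illustration using the numbers from the t-test example of section 3.5.3 (x̄ = 19.7, s = 0.78, n = 10).

    from math import sqrt
    from scipy import stats

    xbar, s2, n, alpha = 19.7, 0.78**2, 10, 0.05
    s = sqrt(s2)

    t_cr = stats.t.ppf(1 - alpha / 2, df=n - 1)
    ci_mu = (xbar - t_cr * s / sqrt(n), xbar + t_cr * s / sqrt(n))

    q_lo = stats.chi2.ppf(alpha / 2, df=n - 1)
    q_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    ci_sigma2 = ((n - 1) * s2 / q_hi, (n - 1) * s2 / q_lo)

    print([round(v, 3) for v in ci_mu])        # 95% interval for mu
    print([round(v, 4) for v in ci_sigma2])    # 95% interval for sigma^2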
EXAMPLE For the N (0, 1) model, FX ≡ Φ is only available numerically (for example via statis-
tical tables). Here the probability plot consists of examining {(qi , xi ) : i = 1, ..., n} where
Φ(qi) = i/(n + 1)    ⟹    qi = Φ⁻¹( i/(n + 1) )
EXAMPLE For the N(µ, σ²) model, FX is again only available numerically (for example via statistical
tables). Here the probability plot consists of examining {(qi, xi) : i = 1, ..., n} where
FX(qi) = Φ( (qi − µ)/σ ) = i/(n + 1)
so that qi = µ + σ qi*, where qi* = Φ⁻¹( i/(n + 1) ); then if the model is correct, a plot of {(qi*, xi) : i = 1, ..., n} should be approximately a straight
line with intercept µ and slope σ; hence µ, σ can again be estimated from this plot by using linear
regression.
Suppose that the data are recorded as the number of observations, Oi , say in a sample of size
n that fall into each of k categories or “bins”. Suppose that under the hypothesized model with
mass/density function fX or c.d.f. FX , the data follow a specific probability distribution specified
by probabilities {pi : i = 1, ..., k}. These probabilities can be calculated directly from fX or FX ,
possibly after parameters in the model have been estimated using maximum likelihood. Then, if
the hypothesized model is correct, Ei = npi observations would be expected to fall into category i.
An intuitively sensible measure of the goodness-of-fit of the data to the hypothesized distribution
is given by the chi-squared statistic
χ² = Σ_{i=1}^{k} (Oi − Ei)² / Ei
A formal hypothesis test of model adequacy can be carried out in the usual framework; here the chi-
squared statistic is the test statistic, and the null distribution (the distribution of the test statistic
if the hypothesis is TRUE) is approximately a chi-squared distribution with k − d − 1 degrees of
freedom, where d is the number of parameters in fX or FX that were estimated in order to calculate
the probabilities p1 , ..., pk .
To test the hypothesis that the data follow a Poisson distribution, a chi-squared test can be per-
formed. First, we estimate Poisson parameter λ by its m.l.e., which is λ̂ = x̄ = 10126/2612 = 3.877.
Secondly, we calculate probabilities pi using the Poisson formula. Thirdly, we calculate the theo-
retical (expected) frequencies Ei = npi for each category. Finally, we calculate the χ2 statistic as
the sum of the (standardized) squared differences between observed and expected frequencies.
In this case, χ² = 12.129. To complete the test we find that the 95th percentile of a Chi-squared
distribution with k − 1 − 1 = 12 degrees of freedom is 21.03. This implies that the χ² statistic would
only be surprising at a significance level of 0.05 if it were larger than 21.03. Here χ² = 12.129, which
is therefore not surprising. Hence there is no evidence to indicate that the data are not from a
Poisson distribution.
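The procedure is easy to script; the sketch below is an editorial illustration of the same steps (estimate λ, compute expected counts, form the χ² statistic) using hypothetical grouped counts, since the original data table is not reproduced in this extract.

    import numpy as np
    from scipy import stats

    observed = np.array([20, 70, 140, 180, 170, 130, 90, 50, 30, 20])   # illustrative counts for x = 0, 1, ...
    x = np.arange(len(observed))
    n = observed.sum()

    lam_hat = (x * observed).sum() / n            # m.l.e. of lambda from the grouped counts
    p = stats.poisson.pmf(x, lam_hat)
    p[-1] += 1 - p.sum()                          # fold the upper tail into the last category
    expected = n * p

    chi2 = np.sum((observed - expected) ** 2 / expected)
    df = len(observed) - 1 - 1                    # k - d - 1, with d = 1 estimated parameter
    print(round(chi2, 3), round(stats.chi2.ppf(0.95, df), 2))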
NOTES
• Clearly, the categorization is arbitrary, and several of the categories in example 1 could be
combined. As a general rule, the categories should be chosen so that there are at least five
observed counts in each.
• Hence, to carry out a Chi-squared goodness of fit test, we use the following logic. If a given
hypothesis is true, it can be shown that the chi-squared statistic χ2 for a sample of data has
a particular Chi-squared distribution. If χ2 takes a value that is surprising or unlikely under
that probability distribution (for example if its value lies in the extreme right-hand tail and
is larger, say, than the 95th percentile of the distribution) it is very likely that the hypothesis
is false and should be rejected.
2. define a suitable test statistic T = T (X1 , ..., Xn ) (that is, some function of the original
random variables; this will define the test statistic), and a related pivotal random variable
TP Q = TP Q (X)
3. assume that H0 is true, and compute the sampling distribution of T , fT or FT ; this is the
null distribution
4. compute the observed value of T , t = T (x1 , ..., xn ); this is the test statistic
This strategy can be applied to more complicated normal examples, and also non-normal and
non-parametric testing situations. It is a general strategy for assessing the statistical evidence for
or against a hypothesis.
ONE-WAY ANOVA
The T-test can be extended to allow a test for differences between more than two data samples.
Suppose there are K samples of sizes n1 , ..., nK from different populations. The model can be
represented as follows: let ykj be the jth observation in the kth sample, then
ykj = µk + εkj
for k = 1, ..., K, and εkj ∼ N(0, σ²). This model assumes that
Ykj ∼ N(µk, σ²)
and that the expectations for the different samples are different. We can view the data as a table
comprising K columns, with each column corresponding to a sample.
To test the hypothesis that each population has the same mean, that is, the hypotheses
H0 : µ1 = µ2 = ... = µK
H1 : not H0
To carry out a test of the hypothesis, the following ANOVA table should be completed;
TSS = Σ_{k=1}^{K} Σ_{j=1}^{nk} (ykj − ȳ..)²        RSS = Σ_{k=1}^{K} Σ_{j=1}^{nk} (ykj − ȳk)²        FSS = Σ_{k=1}^{K} nk (ȳk − ȳ..)²
where TSS is the total sum-of-squares (i.e. total deviation from the overall data mean ȳ..), RSS is
the residual sum-of-squares (i.e. sum of deviations from individual sample means ȳk, k = 1, ..., K),
and FSS is the fitted sum-of-squares (i.e. weighted sum of deviations of sample means from the
overall data mean, with weights equal to the number of data points in the individual samples). Note
that TSS = FSS + RSS.
If the F statistic is calculated in this way, and compared with an F distribution with parameters
K − 1, n − K, the hypothesis that all the individual samples have the same mean can be tested.
EXAMPLE Three genomic segments were studied in order to discover whether the dis-
tances (in kB) between successive occurrences of a particular motif were substantially different.
Several measurements were taken for each segment:
                SEGMENT A    SEGMENT B    SEGMENT C
                42.7         44.9         41.9
                45.6         48.3         44.2
                43.1         46.2         40.5
                41.6                      43.7
                                          41.0
Mean            43.25        46.47        42.26
Variance        2.86         2.94         2.66
and the F statistic must be compared with an F2,9 distribution. For a significance test at the 0.05
level, F must be compared with the 95th percentile (in a one-sided test) of the F2,9 distribution.
This value is 4.26. Therefore, the F statistic is surprising, given the hypothesized model, and
therefore there is evidence to reject the hypothesis that the segments are identical.
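For illustration (not part of the original notes), the one-way ANOVA F-test for these data can be computed directly; the group sizes used here (4, 3 and 5) are inferred from the reported means and variances.

    import numpy as np
    from scipy import stats

    a = [42.7, 45.6, 43.1, 41.6]              # Segment A
    b = [44.9, 48.3, 46.2]                    # Segment B
    c = [41.9, 44.2, 40.5, 43.7, 41.0]        # Segment C

    f_stat, p_val = stats.f_oneway(a, b, c)
    critical = stats.f.ppf(0.95, 2, 9)        # 95th percentile of F(2, 9), about 4.26

    print(round(f_stat, 2), round(p_val, 4))
    print(f_stat > critical)                  # True: reject the hypothesis of equal segment means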
TWO-WAY ANOVA
One-way ANOVA can be used to test whether the underlying means of several groups of ob-
servations are equal. Now consider the following data collection situation. Suppose there are K
treatments, and L groups of observations that are believed to have different responses, that all
treatments are administered to all groups, and measurement samples of size n are made for each
of the K × L combinations of treatments × groups. The experiment can be represented as follows:
let yklj be the jth observation in the kth treatment on the lth group, then
yklj = µk + δl + εklj
for k = 1, ..., K, l = 1, ..., L, and again εklj ∼ N(0, σ²). This model assumes that Yklj ∼
N(µk + δl, σ²) and that the expectations for the different samples are different. We can view
the data as a 3-dimensional table comprising K columns and L rows, with n observations for each
column × row combination, corresponding to a sample.
It is possible to test the hypothesis that each treatment, and/or that each group has the
same mean, that is, the two null hypotheses
H0 : µ1 = µ2 = ... = µK
H0 : δ 1 = δ 2 = ... = δ L
against the alternative H1 : not H0 in each case. For these tests, a Two-way Analysis of
Variance (ANOVA) F-test may be carried out. The Two-Way ANOVA table is computed as
follows
Source          D.F.      Sum of squares    Mean square       F
TREATMENTS      K − 1     FSS1              FSS1/(K − 1)      [FSS1/(K − 1)] / [RSS/(R + 1)]
GROUPS          L − 1     FSS2              FSS2/(L − 1)      [FSS2/(L − 1)] / [RSS/(R + 1)]
Residual        R + 1     RSS               RSS/(R + 1)
Total           N − 1     TSS
where N = K × L × n, R = N − L − K, and again
Two-way analysis of variance, with the rows and columns representing two sources of variation,
can be used to analyze such data. Two-way analysis of variance studies the variability due to
each source and calibrates it against the average level of variability in the data overall. For example,
for the data above we have the following two-way ANOVA table
• The first F statistic (F = 31.54) is the test statistic for the test of equal means in the
rows, that is, that there is no difference between TREATMENTS. This statistic must be
compared with an F5,10 distribution (the two degrees of freedom being the entries in the
degrees of freedom column in the treatments and residual rows of the ANOVA table). The
95th percentile of the F5,10 distribution is 3.33, and thus the test statistic is more extreme
than this critical value, and thus the hypothesis that each treatment has the same mean can
be rejected.
• The second F statistic (F = 5.57) is the test statistic for the test of equal means in the
columns, that is, that there is no difference between GROUPS. This statistic must be
compared with an F2,10 distribution (the two degrees of freedom being the entries in the
degrees of freedom column in the groups and residual rows of the ANOVA table). The
95th percentile of the F2,10 distribution is 4.10, and thus the test statistic is more extreme
than this critical value, and thus the hypothesis that each group has the same mean can be
rejected.
If replicate data are available, it is possible also to fit an interaction, that is, to discover whether the
pattern of variability is significantly different amongst the different TREATMENTS or GROUPS.
ANOVA F-tests allow the comparison of between-group and within-group variability:
• significant between-group variability indicates a systematic difference between the groups.
where

    Σ_{i=1}^{k1} Σ_{j=1}^{k2} n_ij = n.

It is often of interest to test whether the row classification is independent of the column classification, as this would indicate independence between the row and column factors. An approximate test of this hypothesis can be carried out using a Chi-Squared Goodness-of-Fit statistic; if the independence model is correct, the expected cell frequencies n̂_ij can be calculated as

    n̂_ij = (n_i. × n_.j) / n,        i = 1, ..., k1,   j = 1, ..., k2,

where n_i. is the total of the cell counts in row i and n_.j is the total of the cell counts in column j. Then, under independence, the χ² test statistic is

    χ² = Σ_{i=1}^{k1} Σ_{j=1}^{k2} (n_ij − n̂_ij)² / n̂_ij.

This statistic has an approximate χ²_{(k1−1)(k2−1)} distribution, again given that H0 is true.
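A minimal sketch of this calculation in R, for a purely hypothetical 2 × 3 table of counts; chisq.test gives the equivalent built-in test:

# chi-squared test of independence for a two-way table of counts
tab <- matrix(c(12,  7, 9,
                 5, 11, 8), nrow = 2, byrow = T)   # hypothetical 2 x 3 table

n     <- sum(tab)
n.row <- rowSums(tab)                 # n_i.
n.col <- colSums(tab)                 # n_.j
e     <- outer(n.row, n.col) / n      # expected frequencies under independence
x2    <- sum((tab - e)^2 / e)         # chi-squared statistic
df    <- (nrow(tab) - 1) * (ncol(tab) - 1)
1 - pchisq(x2, df)                    # approximate p-value

chisq.test(tab)                       # built-in equivalent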
We will see further analysis of count data in section (3.7.3) below, and in Chapter 4.
3.7.4 2 × 2 TABLES
When k1 = k2 = 2, the contingency table reduces to a two-way binary classification
                      COLUMN
               1        2        Total
ROW     1      n11      n12      n1.
        2      n21      n22      n2.
        Total  n.1      n.2      n
In this case we can obtain some more explicit tests: one is again an exact test, the other is based
on a normal approximation. The chi-squared test described above is feasible, but other tests may
also be constructed:
Fisher's Exact Test is based on the exact conditional distribution of n11 given the row and column totals, which is hypergeometric:

    p(x) = P[n11 = x] = ( n1.! n2.! n.1! n.2! ) / ( n! x! (n1. − x)! (n.1 − x)! (n − n1. − n.1 + x)! )

where n! = 1 × 2 × 3 × · · · × (n − 1) × n.
For the p-value, we need to assess whether or not the observed table is surprising under this null distribution; suppose we observe n11 = x, then we can compare p(x) with p(y) for all feasible y, that is, y in the range max{0, n1. − (n − n.1)} ≤ y ≤ min{n1., n.1}. We are thus calculating the null distribution exactly, given the null distribution assumptions and the row and column totals; if the observed test statistic lies in the tail of the distribution, we can reject the null hypothesis of independent factors.
                      A
               YES      NO       Total
B       YES    n11      n12      n1.
        NO     n21      n22      n2.
        Total  n.1      n.2      n
that is, n11 pairs were observed for which both A and B classified individuals had dis-
ease/survival status YES, whereas n12 pairs were observed for which the A individual had
status NO, but the B individual had status YES, and so on.
An appropriate test statistic here for a test of symmetry or “discordancy” in these results
(that is, whether the two classifications are significantly different in terms of outcome) is
    χ² = (n12 − n21)² / (n12 + n21)
which effectively measures how different the off-diagonal entries in the table are. This statistic
is an adjusted Chi-squared statistic, and has a χ21 distribution under the null hypothesis that
there is no asymmetry. Again a one-tailed test is carried out: “surprising” values of the test
statistic are large.
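A sketch of this test in R, using hypothetical off-diagonal counts n12 and n21:

# test of symmetry / discordancy in a paired 2 x 2 table
n12 <- 15                      # hypothetical discordant counts
n21 <- 6
x2  <- (n12 - n21)^2 / (n12 + n21)   # adjusted chi-squared statistic
1 - pchisq(x2, df = 1)               # p-value from the chi-squared(1) null distribution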
It is easy to show that 0 ≤ T ≤ 1, but the null distribution of T is not available in closed form. Fortunately, the p-value in the test, p = P[T > t] for an observed test statistic t, can be obtained for various sample sizes using statistical tables or packages.
where Oi and Ei are the observed and expected/predicted counts in each of k (cross) classifications
or categories (see section (3.6.2) for further details). The distribution of the test-statistic is typically
approximated by a Chi-squared distribution with an appropriately chosen degrees of freedom. This
approximation is good when the sample size is large, but not good when the table is “sparse”, with
some low (expected) cell entries (under the null hypothesis). The approximation breaks down for
small sample sizes due to the inappropriateness of the Normal approximation referred to in section
( 3.7.2)
We have also seen two examples of Exact Tests: the exact binomial test in section (3.7.2)
and Fisher’s Exact Test in section (3.7.4). For these tests, we proceeded as follows, mimicking the
general hypothesis strategy outlined at the start of the section.
1. Formulate the null and alternative hypotheses, H0 and H1.
2. Construct a test statistic T deemed appropriate for the hypothesis under study.
3. Determine the null distribution fT of T, that is, the distribution of T when H0 is true.
4. Compare the observed value of T, t = T(x) for sample data x = (x1, ..., xn), with the null distribution and assess whether the observed test statistic is a surprising observation from fT; if it is, reject H0.
Step 3 is crucial: for some tests (for example, one and two sample tests based on the Normal
distribution assumption), it is possible to find fT analytically for appropriate choices of T in
Step 2. For others, such as the chi-squared goodness of fit and related tests, fT is only available
approximately. However, the null distribution (and hence the critical regions and p-value) can,
in theory, always be found : it is the probability distribution of the statistic T under the model
restriction imposed by the null hypothesis.
EXAMPLE Suppose a data sample is collected and believed to be from a Poisson(λ) distribution, and we wish to test H0 : λ = 2. We might regard the sample mean statistic T = x̄ as an appropriate test statistic. Then
    F_T(t; λ) = P[T ≤ t; λ] = P[X̄ ≤ t; λ] = P[ Σ_{i=1}^{n} X_i ≤ nt; λ ] = P[Y ≤ nt; λ]

where Y = Σ_{i=1}^{n} X_i. But a result from elementary distribution theory tells us that Y ∼ Poisson(nλ), so if H0 is true, we have the null distribution c.d.f. as

    F_T(t; λ = 2) = Σ_{x=0}^{⌊nt⌋} e^{−2n} (2n)^x / x!

and thus the critical values for the test, and the p-value, are available numerically.
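A sketch of this calculation in R; the sample size n and the observed sample mean are hypothetical:

# exact test of H0: lambda = 2 based on T = sample mean of n Poisson counts
lambda0 <- 2
n       <- 20                       # hypothetical sample size
tbar    <- 2.6                      # hypothetical observed sample mean
y       <- n * tbar                 # observed total; Y ~ Poisson(n * lambda0) under H0

1 - ppois(y - 1, n * lambda0)       # one-sided p-value P[Y >= y] under H0

FT <- function(t) ppois(floor(n * t), n * lambda0)   # null c.d.f. F_T(t)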
Note:
• The main difficulty with exact tests is in the computation of FT (t; λ); in the example above,
we can compute this easily, but in general it is rarely analytically possible.
• We will see in a later section that this null c.d.f. can be approximated using simulation
methods, or Permutation Tests.
Recall that to construct a test, the distribution of the test statistic under H0 is used to find
a critical region which will ensure the probability of committing a type I error does not exceed
some predetermined significance level, α. The power of the test is its ability to correctly reject the null hypothesis, that is,

    Power = 1 − β,        β = P(Type II Error),

which is based on the distribution of the test statistic under H1. The required sample size is then a function of:
• the target significance level α;
• the target power (equivalently, the Type II error rate β);
• the size of the difference that is to be detected;
• the variability of the observations.
Our objective here is to find a relationship between the above factors and the sample size that
enables us to select a sample size consistent with the desired α and β.
Consider a single sample of Normal data, Xi ∼ N(µ, σ²) for i = 1, ..., n. If σ² is known, then for a two-sided test of the mean, with the power evaluated at a specific alternative value, the hypotheses would be as follows:

    H0 : µ = c0
    H1 : µ = c1
The maximum likelihood estimate of µ is the sample mean, which is normally distributed, X̄ ∼ N(µ, σ²/n). The test statistic is

    Z = (x̄ − µ) / (σ/√n)

and under H0,

    Z = (x̄ − c0) / (σ/√n) ∼ N(0, 1).

We reject H0 at significance level α if |Z| > CR, or equivalently if x̄ lies outside the critical values of the test, which are

    c0 ± CR σ/√n,        where CR = Φ^{-1}(1 − α/2).
Now, if H1 is true, X̄ ∼ N(c1, σ²/n), and hence

    Z = (x̄ − c0) / (σ/√n) ∼ N( (c1 − c0)/(σ/√n), 1 ).

Thus for fixed α, c0, c1 and n, we can compute the power. Similar calculations are available for the other normal-distribution-based tests.
In fact, the power equation can be rearranged to be explicit in one of the other parameters if the power is regarded as fixed. For example, if α, β, c0 and c1 are fixed, we can rearrange to obtain a sample size calculation for detecting a fixed difference Δ = c1 − c0:

    n = σ² ( CR + Φ^{-1}(1 − β) )² / (c1 − c0)².
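A sketch of this calculation in R; all numerical inputs are hypothetical:

# sample size for a two-sided z-test of H0: mu = c0 against a specific alternative c1
alpha <- 0.05
beta  <- 0.10                 # target Type II error rate (power 0.90)
sigma <- 2.5                  # hypothetical known standard deviation
delta <- 1.0                  # difference c1 - c0 to be detected

CR <- qnorm(1 - alpha / 2)
n  <- sigma^2 * (CR + qnorm(1 - beta))^2 / delta^2
ceiling(n)                    # round up to the next whole observation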
When no analytical power formula is available, the power can be approximated by simulation:
• We would begin as usual by considering the null hypothesis, the test statistic and the null distribution to derive the critical values.
• Then, the probability of correctly rejecting the null hypothesis for a specified alternative is estimated by simulating data from the model under the alternative, and recording the proportion of times that the null hypothesis is (correctly) rejected.
Thus the actual significance level for the series of tests is 1 − (1 − α)^k. For example, with α = 0.05 and k = 10 we get (1 − α)^k = 0.95^10 ≈ 0.60. This means, however, that

    P[ at least one test Ti rejects its H0^(i) | all H0^(i) are TRUE ] = 1 − (1 − α)^k ≈ 0.40,

so that we now have a probability of about 0.40 that at least one of these 10 tests will turn out significant, and one of the H0 will be falsely rejected.
In order to guarantee that the overall significance level is still α, we have to adapt the common significance level α0 of the individual tests. This results in the following relation between the overall significance level α and the individual significance levels α0, satisfying (1 − α0)^k = 1 − α, so that the Bonferroni correction αB(k) is defined by

    αB(k) = α0 = 1 − (1 − α)^{1/k} ≈ α/k.

Thus, given k tests, if we reject only those tests whose individual p-values are less than or equal to αB(k), then the experiment-wide significance level is less than or equal to α. Another justification for this result follows from a probability result called the Bonferroni inequality
    P(E1 ∪ E2 ∪ ... ∪ Ek) ≤ Σ_{i=1}^{k} P(Ei).
The Bonferroni correction is a conservative correction, in that it is overly stringent in reducing the test-by-test significance level α. This is not the only correction that could be used; the package SPLUS has other options, and an extensive list is given below in section (3.9.2).
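A sketch in R with hypothetical p-values; both the exact threshold 1 − (1 − α)^{1/k} and the built-in Bonferroni adjustment are shown:

# Bonferroni correction for k tests
alpha <- 0.05
p     <- c(0.001, 0.013, 0.021, 0.040, 0.160)   # hypothetical raw p-values
k     <- length(p)

alphaB <- 1 - (1 - alpha)^(1 / k)    # exact Bonferroni-type threshold, approx alpha / k
p <= alphaB                          # which individual tests are declared significant

p.adjust(p, method = "bonferroni")   # equivalently, adjust the p-values and compare to alpha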
for appropriately chosen significance level α. This type of control is termed familywise control, and the quantity α is termed the Type I, or Familywise, error rate (FWER). An alternative approach is to control the expected level of false discoveries, or the False Discovery Rate (FDR). The FDR is defined by FDR = N10/R if R > 0, with FDR = 0 if R = 0, where R denotes the total number of hypotheses rejected and N10 the number of true null hypotheses among those rejected.
A standard procedure, the BENJAMINI-HOCHBERG (BH) procedure, adjusts the k
p-values that result from the tests in a sequential fashion such that the expected false discovery
rate is bounded above by α, for any suitably chosen α in the range (0, 1). The BH procedure is
described below:
1. Compute the p-values in the k original hypothesis tests: p1 , ..., pk , and sort them into ascend-
ing order so that
p(1) < p(2) < ... < p(k) .
2. Define p(0) = 0 and

       RBH = max{ 0 ≤ i ≤ k : p(i) ≤ (i/k) α }.

3. Reject H0^(j) for each test j where pj ≤ p(RBH).
This procedure guarantees that the expected FDR is bounded above by α, a result that holds for independent tests (where the samples themselves are independent) and also for some samples that are not independent. An adjusted procedure can be used for false negative, or Type II, results, via the False Non-Discovery Rate (FNDR), defined by FNDR = N01/(k − R) if R < k, with FNDR = 0 if R = k.
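A sketch of the BH procedure in R, with hypothetical p-values; p.adjust(..., method = "BH") gives the equivalent built-in adjustment:

# Benjamini-Hochberg procedure at FDR level alpha
alpha <- 0.05
p     <- c(0.0002, 0.009, 0.012, 0.019, 0.041, 0.210, 0.640)   # hypothetical p-values
k     <- length(p)

ps     <- sort(p)                                 # p_(1) <= ... <= p_(k)
ok     <- which(ps <= (1:k) / k * alpha)
RBH    <- if (length(ok) > 0) max(ok) else 0      # largest i with p_(i) <= (i/k) alpha
reject <- if (RBH > 0) p <= ps[RBH] else rep(FALSE, k)
reject

p.adjust(p, method = "BH") <= alpha               # built-in equivalent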
In summary, therefore, a variety of corrections can be made. Let pi be the p-value derived from the ith hypothesis test; then the following p-value thresholds may be used:

    IDENTITY      αI(k)  = α
    BONFERRONI    αB(k)  = α/k
    THRESHOLD     αT(k)  = t
    FIRST r       αTr(k) = p(r), the rth smallest p-value
    BH            αBH(k) = p(RBH)
• Holm (Bonferroni step-down)

    p*_(j) = max_{1 ≤ i ≤ j} min{ (k − i + 1) p_(i), 1 }

• Hochberg (step-up)

    p*_(j) = min_{j ≤ i ≤ k} min{ (k − i + 1) p_(i), 1 }
These adjustments to observed p-values are all attempts to preserve the integrity of the tests in large multiple testing situations. The final two, the Westfall and Young procedures, can often only be computed in a simulation study.
Note that these methods do not alter the ordering of the test results from "most significant" to "least significant"; it may be sensible, therefore, to fix on a number of results to report.
Here are the steps we will follow to use a permutation test to analyze the differences between
the two groups. For the original ordering, the sum for Group 1 is 173. In this example, if the groups were truly equal (and the null hypothesis were true), then randomly moving the observations among the groups would make no difference to the sum for Group 1: some of the sums would be a little larger than the original sum and some would be a bit smaller. For the six observations there are 720 permutations, of which there are 20 distinct combinations for which we can compute the sum of Group 1.
Of these 20 different combinations, only one has a Group 1 sum that is greater than or equal to the Group 1 sum from our original ordering. Therefore the probability that a sum this large or larger would occur by chance alone is 1/20 = 0.05, and this can be considered to be statistically significant.
1. For two sample tests for samples of size n1 and n2 , compute the value of the test statistic for
the observed sample t∗
2. Randomly select one of the (n1 + n2 )! permutations, re-arrange the data according to this
permutation, allocate the first n1 to pseudo-sample 1 and the remaining n2 to pseudo-sample
2, and then compute the test statistic t1
3. Repeat step 2 N times to obtain a random sample t1, t2, ..., tN of test statistics from the TRUE null distribution.
4. Compute the p-value as

       (number of t1, t2, ..., tN more extreme than t*) / N.
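A minimal sketch of this procedure in R for two hypothetical samples, using the difference in sample means as the test statistic (any other statistic could be substituted):

# two-sample permutation test (difference in means as test statistic)
x1 <- c(24, 43, 58, 71, 43, 49)      # hypothetical sample 1
x2 <- c(30, 38, 61, 33, 41)          # hypothetical sample 2
n1 <- length(x1); n2 <- length(x2)
y  <- c(x1, x2)

tstar <- mean(x1) - mean(x2)                        # observed statistic
N     <- 10000                                      # number of random permutations
tperm <- replicate(N, {
  idx <- sample(n1 + n2)                            # a random permutation
  mean(y[idx[1:n1]]) - mean(y[idx[-(1:n1)]])
})
mean(abs(tperm) >= abs(tstar))                      # two-sided permutation p-value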
• THE BOOTSTRAP: In bootstrap resampling, B new samples, each of the same size
as the observed data, are drawn with replacement from the observed data. The statistic is
first calculated using the observed data and then recalculated using each of the new samples,
yielding a bootstrap distribution. The resulting replicates are used to calculate the bootstrap
estimates of bias, mean, and standard error for the statistic.
Using the bootstrap and jackknife procedures, all informative summaries (mean, variance, quantiles, etc.) of the sampling distribution of sample-based estimates can be approximated.
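A minimal sketch of the bootstrap in R, using the sample median of a small hypothetical sample as the statistic of interest:

# bootstrap estimates of bias and standard error for the sample median
x <- c(4.1, 5.7, 3.9, 6.2, 5.0, 4.8, 7.3, 5.5)   # hypothetical observed data
B <- 2000

theta.hat  <- median(x)
theta.boot <- replicate(B, median(sample(x, replace = TRUE)))

mean(theta.boot) - theta.hat            # bootstrap estimate of bias
sd(theta.boot)                          # bootstrap estimate of the standard error
quantile(theta.boot, c(0.025, 0.975))   # a simple percentile interval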
3.11.1 TERMINOLOGY
Y is the response or dependent variable
X is the covariate or independent variable
A simple relationship between Y and X is the linear regression model, where
E[Y |X = x] = α + βx,
that is, conditional on X = x, the expected or “predicted” value of Y is given by α + βx, where
α and β are unknown parameters; in other words, we model the relationship between Y and X
as a straight line with intercept α and slope β. For data {(xi, yi) : i = 1, ..., n}, the objective is to estimate the unknown parameters α and β. A simple estimation technique is least-squares estimation.
Let S(α, β) denote the error in fitting a linear regression model with parameters α and β. Then
    S(α, β) = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − y_i^(P))² = Σ_{i=1}^{n} (y_i − α − β x_i)²
To calculate the least-squares estimates, we have to minimize S(α, β) as a function of α and β. This
can be achieved in the usual way by taking partial derivatives with respect to the two parameters,
and equating the partial derivatives to zero simultaneously.
    (1)   ∂S(α, β)/∂α = −2 Σ_{i=1}^{n} (y_i − α − β x_i) = 0

    (2)   ∂S(α, β)/∂β = −2 Σ_{i=1}^{n} x_i (y_i − α − β x_i) = 0
Solving (1) gives α̂ = ȳ − β̂ x̄; solving (2) in the same way, combining the two equations, and solving for β̂ gives

    β̂ = [ n Σ_{i=1}^{n} x_i y_i − (Σ_{i=1}^{n} x_i)(Σ_{i=1}^{n} y_i) ] / [ n Σ_{i=1}^{n} x_i² − (Σ_{i=1}^{n} x_i)² ]
       = (n S_xy − S_x S_y) / (n S_xx − {S_x}²)        ⟹        α̂ = ȳ − β̂ x̄

where

    S_x = Σ_{i=1}^{n} x_i,    S_y = Σ_{i=1}^{n} y_i,    S_xx = Σ_{i=1}^{n} x_i²,    S_xy = Σ_{i=1}^{n} x_i y_i.
Therefore it is possible to produce estimates of parameters in a linear regression model using least-
squares, without any specific reference to probability models. In fact, the least-squares approach is
very closely related to maximum likelihood estimation for a specific probability model.
The correlation coefficient, r, measures the degree of association between X and Y variables
and is given by
    r = (n S_xy − S_x S_y) / √{ (n S_xx − S_x²)(n S_yy − S_y²) }

and therefore is quite closely related to β̂.
Recall that the model specifies E[Y|X = x] = α + βx, and define the error in fit at each data point as

    e_i = y_i − α − β x_i.
Now, e_i is the vertical discrepancy between observed and expected behaviour, and thus e_i can be interpreted as the observed version of a random variable, say ε_i, which represents the random uncertainty involved in measuring Y for a given X. A plausible probability model might therefore be that the random variables ε_i, i = 1, ..., n, are independent and identically distributed, with ε_i ∼ N(0, σ²) for some error variance parameter σ². Implicit in this assumption is that the distribution of the random error in measuring Y does not depend on the value of X at which the measurement is made. This distributional assumption about the error terms leads to a probability model for the variable Y. As we can write Y = α + βX + ε, where ε ∼ N(0, σ²), then, given X = x_i, we have the conditional distribution of Y_i as

    Y_i | X = x_i ∼ N(α + β x_i, σ²),

where the random variables Y_i and Y_j are independent (as ε_i and ε_j are independent). On the basis of this probability model, we can derive a likelihood function, and hence derive maximum likelihood
estimates. For example, we have the likelihood L(θ) = L(α, β, σ 2 ) defined as the product of the n
conditional density terms derived as the conditional density of the observed yi given xi ,
    L(θ) = Π_{i=1}^{n} f(y_i; x_i, θ) = Π_{i=1}^{n} (1/√(2πσ²)) exp{ −(1/(2σ²)) (y_i − α − β x_i)² }

         = (1/(2πσ²))^{n/2} exp{ −(1/(2σ²)) Σ_{i=1}^{n} (y_i − α − β x_i)² }
The maximum likelihood estimates of α and β, and error variance σ 2 , are obtained as the values
at which L(α, β, σ 2 ) is maximized. But, L(α, β, σ 2 ) is maximized when the term in the exponent,
that is
    Σ_{i=1}^{n} (y_i − α − β x_i)²

is minimized. But this is precisely the least-squares criterion described above, and thus the m.l.e.s of α and β under a Normal error model are exactly equivalent to the least-squares estimates.
where ŷ_i = α̂ + β̂ x_i is the fitted value of Y at X = x_i. Note also that, having fitted a model with parameters α̂ and β̂, we can calculate the error in fit at each data point, or residual, denoted e_i, i = 1, ..., n, where e_i = y_i − ŷ_i = y_i − α̂ − β̂ x_i. The fitted model also provides a prediction of the response at any new value x*, namely

    y* = α̂ + β̂ x*
where s is the square-root of the corrected estimate of the error variance. It is good statistical
practice to report standard errors whenever estimates are reported. The standard error of a parameter also allows a test of the hypothesis "parameter is equal to zero". The test is carried out by
calculation of the t-statistic, that is, the ratio of a parameter estimate to its standard error. The
t-statistic must be compared with the 0.025 and 0.975 percentiles of a Student-t distribution with
n − 2 degrees of freedom as described below.
    t_α = (α̂ − c) / s.e.(α̂)        t_β = (β̂ − c) / s.e.(β̂)

to test the null hypothesis that the parameter is equal to c.
Typically, we use a test at the 5% significance level, so the appropriate critical values are the 0.025 and 0.975 quantiles of a St(n − 2) distribution. It is also useful to report, for each parameter, a confidence interval in which we think the true parameter value (that we have estimated by α̂ or β̂) lies with high probability. It can be shown that the 95% confidence intervals are given by

    α : α̂ ± t_{n−2}(0.975) s.e.(α̂)        β : β̂ ± t_{n−2}(0.975) s.e.(β̂)
where tn−2 (0.975) is the 97.5th percentile of a Student-t distribution with n − 2 degrees of freedom.
The confidence intervals are useful because they provide an alternative method for carrying out
hypothesis tests. For example, if we want to test the hypothesis that α = c, say, we simply note
whether the 95% confidence interval contains c. If it does, the hypothesis can be accepted; if not, the hypothesis should be rejected, as the confidence interval provides evidence that α ≠ c.
We may carry out a hypothesis test to determine whether there is significant correlation between two variables. We denote by ρ the true correlation; then we test the hypotheses

    H0 : ρ = 0
    H1 : ρ ≠ 0.
Again, we can use maximum likelihood estimation to obtain estimates of the parameters in the
model, that is, parameter vector (α, β 1 , ..., β p , σ 2 ), but the details are slightly more complex, as we
have to solve p + 1 equations simultaneously. The procedure is simplified if we write the parameters
as a single vector, and perform matrix manipulation and calculus to obtain the estimates.
x 0.54 2.03 3.15 3.96 6.25 8.17 11.08 12.44 14.04 14.34 18.71 19.90
y 11.37 11.21 11.61 8.26 14.08 16.25 11.00 14.94 16.91 15.78 21.26 20.25
We want to calculate estimates of α and β from these data. First, we calculate the summary
statistics;
    S_x = Σ_{i=1}^{n} x_i = 118.63,    S_y = Σ_{i=1}^{n} y_i = 172.92,    S_xx = Σ_{i=1}^{n} x_i² = 1598.6,    S_xy = Σ_{i=1}^{n} x_i y_i = 1930.9
These give estimates α̂ = 9.269 and β̂ = 0.520, with standard errors 1.304 and 0.113 respectively, so that

    t_α = α̂ / s.e.(α̂) = 9.269 / 1.304 = 7.109        t_β = β̂ / s.e.(β̂) = 0.520 / 0.113 = 4.604.
The 0.975 percentile of a Student-t distribution with n − 2 = 10 degrees of freedom is found from
tables to be 2.228. Both t-statistics are more extreme than this critical value, and hence it can be
concluded that both parameters are significantly different from zero.
To calculate the confidence intervals for the two parameters, we need the 0.975 percentile of a St(10) distribution. From above, St(10)(0.975) = 2.228, and so the confidence intervals are given by

    α : α̂ ± t_{n−2}(0.975) s.e.(α̂) = 9.269 ± 2.228 × 1.304 = (6.364, 12.174)
    β : β̂ ± t_{n−2}(0.975) s.e.(β̂) = 0.5201 ± 2.228 × 0.113 = (0.268, 0.772)

so that, informally, we are 95% certain that the true value of α lies in the interval (6.364, 12.174), and that the true value of β lies in the interval (0.268, 0.772). This amounts to evidence that, for example, α ≠ 0 (as the confidence interval for α does not contain 0), and evidence that β ≠ 1 (as the confidence interval for β does not contain 1).
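The same analysis can be carried out in R with lm(); this sketch extracts the estimates, standard errors, t statistics and 95% intervals (small differences from the hand calculations may arise from rounding in the quantities as printed):

# simple linear regression fit for the (x, y) data above
x <- c(0.54, 2.03, 3.15, 3.96, 6.25, 8.17, 11.08, 12.44, 14.04, 14.34, 18.71, 19.90)
y <- c(11.37, 11.21, 11.61, 8.26, 14.08, 16.25, 11.00, 14.94, 16.91, 15.78, 21.26, 20.25)

fit <- lm(y ~ x)
summary(fit)         # estimates, standard errors and t statistics
confint(fit)         # 95% confidence intervals, using the St(n - 2) quantile
qt(0.975, df = 10)   # the critical value 2.228 used above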
In this section, we demonstrate how this model can be represented in matrix form, and demonstrate
that many of the simple models and tests studied previously in section (3.5) can be viewed as special
cases of the more general class of Linear Models.
In addition, we demonstrate how the linear model can be extended to allow for the modelling
of data that are not normally distributed; often we collect discrete data for example, or data where
the normality assumption ( 3.1) is not appropriate.
This form of the regression model illustrates the fact that the model is linear in β (that is, the
elements of β appear in their untransformed form). This is important as it allows particularly
straightforward calculation of parameter estimates and standard errors, and also makes clear that
some of the other models that we have already studied, such as ANOVA models, also fall into the
linear model class.
It is reasonably straightforward to show that the least-squares/maximum likelihood estimates
of β for any linear model take the form:
    β̂ = (X^T X)^{-1} X^T y        σ̂² = (1/(n − p)) (y − X β̂)^T (y − X β̂)
where XT is the transpose of matrix X : the (i, j + 1) element of X is xij for j = 1, 2, ..., p, which
is the (j + 1, i) element of XT . The p × p variance-covariance matrix is
    σ̂² (X^T X)^{-1}
and the diagonal elements of this matrix give the squared standard errors for the estimates and
hence quantify uncertainty. A goodness-of-fit measure that records the adequacy of the model in
representing that data is the log-likelihood value evaluated at the maximum likelihood estimates
    −2 log L(β̂, σ̂²) = n log σ̂² + (1/σ̂²) (y − X β̂)^T (y − X β̂)
Note that, here, the entries in the design matrix are merely the raw values for the p predictors
and the n data points. However, these entries can be replaced by any functions of the predictor
values, such as polynomial or non-linear functions of the xij , for example
    x_ij², x_ij³, ...,        g_ij(x_ij) = e^{x_ij}.

The most important feature is that the model is still linear in β.
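These matrix formulae translate directly into R; a minimal sketch with a small simulated (hypothetical) design matrix:

# least-squares / maximum likelihood estimates for a linear model in matrix form
set.seed(1)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with intercept column
beta.true <- c(2, 1, -0.5)
y <- X %*% beta.true + rnorm(n)

beta.hat   <- solve(t(X) %*% X) %*% t(X) %*% y        # (X'X)^{-1} X'y
resid      <- y - X %*% beta.hat
sigma2.hat <- sum(resid^2) / (n - p)                  # error variance estimate
V          <- sigma2.hat * solve(t(X) %*% X)          # variance-covariance matrix
sqrt(diag(V))                                         # standard errors of the estimates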
EXAMPLE In the case of Binomial data, each individual data point Y_i is Bernoulli(θ) distributed, so that
µ = E [Y ] = θ
where 0 ≤ θ ≤ 1. Hence suitable link functions must be mappings from the range [0, 1] to R. Such
functions include
• the logistic link g(µ) = log( µ / (1 − µ) )
• the probit link g(µ) = Φ^{-1}(µ)
EXAMPLE In the case of Poisson data, each individual data point Y_i is Poisson(λ) distributed, so that
µ = E [Y ] = λ
where λ > 0. Hence suitable link functions must be mappings from the range (0, ∞) to R. One
such function is the log link
g(µ) = log (µ)
Inference for GLMs can be carried out using similar techniques to those studied already, such
as the maximum likelihood procedure. Usually, the maximum likelihood estimates are obtained by
numerical maximization; GLM estimation functions are readily available in most statistics packages
such as SPLUS. The results of a GLM fit are of a similar form to those for the ordinary linear model,
that is, including
• a set of parameter estimates β̂ and standard errors s.e.(β̂),
• a set of linear predictors X β̂ and fitted values ŷ = g^{-1}(X β̂),
• a goodness-of-fit measure −2 log L(β̂).
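As an illustration, a binomial (logistic) GLM can be fitted in R with glm(); the data here are simulated and purely hypothetical:

# fitting a binomial GLM with the logistic link
set.seed(2)
n  <- 100
x  <- rnorm(n)
pr <- 1 / (1 + exp(-(-0.5 + 1.2 * x)))     # true success probabilities
y  <- rbinom(n, size = 1, prob = pr)

fit <- glm(y ~ x, family = binomial(link = "logit"))
summary(fit)                       # parameter estimates and standard errors
fitted(fit)                        # fitted values g^{-1}(X beta-hat)
-2 * as.numeric(logLik(fit))       # goodness-of-fit measure -2 log L(beta-hat)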
3.13 CLASSIFICATION
Classification is a common statistical task in which the objective is to allocate or categorize an
object to one of a number of classes or categories on the basis of a set of predictor or covariate
measurements. Typically, the predictor measurements relate to two or more variables, and
the response is univariate, and often a nominal variable, or label. Specifically, we aim to explain the observed variability in a response variable Y via consideration of predictors X = (X1, ..., XK). The
principal difference between classification and conventional regression is that the response variable
is a nominal categorical variable, that is, for data item i
Yi ∈ {0, 1, 2, ...K}
so that the value of Yi is a label rather than a numerical value, where the label represents the
group or class to which that item belongs.
We again wish to use the predictor information in X to allocate Y to one of the classes. There are two main goals:
• to partition the observations into two or more labelled classes. The emphasis is on deriving
a rule that can be used to optimally assign a new object to the labeled classes.
• we have a set of data available where both the response and predictor information is known
• we also have a set of data where only the predictor information is known, and the response
is to be predicted
• often we will carry out an exercise of model-building and model-testing on a given data
set by extracting a training set, building a model using the training data, whilst holding
back a proportion (the test set) for model-testing.
An important distinction can be drawn between such classification problems in which training
data, that is response and predictor pairs for known cases, are available, which are referred to as
supervised learning problems, and problems where no such training data are available, and all
inferences about substructure within the data must be extracted from the test data alone, possibly
only with some background or prior knowledge.
The conditional probability, P(2|1), of classifying an object into class 2 when, in fact, it is from class 1 is

    P(2|1) = ∫_{R2} f1(x) dx.

Similarly, the conditional probability, P(1|2), of classifying an object into class 1 when, in fact, it is from class 2 is

    P(1|2) = ∫_{R1} f2(x) dx.
Let p1 be the prior probability of being in class 1 and p2 be the prior probability of class 2, where p1 + p2 = 1; the total probability of misclassification is then p1 P(2|1) + p2 P(1|2). Now suppose that the costs of misclassifying a class 2 object as a class 1 object, and vice versa, are respectively c(1|2) and c(2|1). Then the expected cost of misclassification is

    ECM = c(2|1) p1 P(2|1) + c(1|2) p2 P(1|2).

The idea is to choose the regions R1 and R2 so that this expected cost is minimized. This can be achieved by comparing the predictive probability density functions at each point x:
    R1 ≡ { x : f1(x)/f2(x) ≥ (p2 c(1|2)) / (p1 c(2|1)) }        R2 ≡ { x : f1(x)/f2(x) < (p2 c(1|2)) / (p1 c(2|1)) }
or by minimizing the total probability of misclassification

    p1 ∫_{R2} f1(x) dx + p2 ∫_{R1} f2(x) dx.

If p1 = p2, then

    R1 ≡ { x : f1(x)/f2(x) ≥ c(1|2)/c(2|1) }
and if c(1|2) = c(2|1), equivalently,

    R1 ≡ { x : f1(x)/f2(x) ≥ p2/p1 }.
• class 1: X ∼ N_d(µ1, Σ1)

    f1(x) = (1/(2π))^{d/2} (1/|Σ1|^{1/2}) exp{ −(1/2) (x − µ1)^T Σ1^{-1} (x − µ1) }

• class 2: X ∼ N_d(µ2, Σ2)

    f2(x) = (1/(2π))^{d/2} (1/|Σ2|^{1/2}) exp{ −(1/2) (x − µ2)^T Σ2^{-1} (x − µ2) }

where

    k = (1/2) log( |Σ1| / |Σ2| ) + (1/2) ( µ1^T Σ1^{-1} µ1 − µ2^T Σ2^{-1} µ2 )
Note that this model has 2d + d(d + 1) parameters to estimate. Thus, with limited data in d dimensions, we may be limited in the type of analysis that can be done. In fact, we may have to restrict further the type of covariance structure that we assume; for example, we might have to restrict attention to a common covariance matrix for the two classes, or to a diagonal covariance structure.
3.13.3 DISCRIMINATION
Discriminant analysis works in a very similar fashion; from equations (3.3) and (3.4) we note that the boundary between regions R1 and R2 takes one of two forms: either

    A1 x + a0,

where A1 is a d × d matrix, or

    x^T B2 x + B1 x + b0.
• the within-sample classification error: the proportion of elements in the training sample
that are misclassified by the rule
• the leave-one-out classification error: the proportion of elements in the training sample that are misclassified when, for each point in turn, the model is built (that is, the parameters are estimated) on a training sample that omits that single data point, and the trained model is then used to classify the omitted point
• m-fold cross-validation: the data are split into m subsamples of equal size, and one is selected at random to act as a pseudo-test sample. The remaining data are used as training data to build the model, and the prediction accuracy on the pseudo-test sample is computed. This procedure is repeated for all possible splits, and the prediction accuracy is computed as an average of the accuracies over all of the splits.
The classification procedures above reduce to a simple rule; we classify an individual to class 1 if
x < t0
for some threshold t0 , and to class 2 otherwise. We then consider the following quantities:
• Sensitivity: probability that a test result will be positive when the disease is present (true
positive rate, expressed as a percentage).
• Specificity: probability that a test result will be negative when the disease is not present
(true negative rate, expressed as a percentage).
• Positive likelihood ratio: ratio between the probability of a positive test result given the presence of the disease and the probability of a positive test result given the absence of the disease,

    True Positive Rate / False Positive Rate

• Negative likelihood ratio: ratio between the probability of a negative test result given the presence of the disease and the probability of a negative test result given the absence of the disease,

    False Negative Rate / True Negative Rate
• Positive predictive value: probability that the disease is present when the test is positive
(expressed as a percentage).
• Negative predictive value: probability that the disease is not present when the test is
negative (expressed as a percentage).
                        Disease Class
                   1         2         Total
Predicted    1     a         c         a + c
Class        2     b         d         b + d
             Total a + b     c + d     a + b + c + d
• Sensitivity and Specificity:

    Sensitivity = a / (a + b)        Specificity = d / (c + d)

• Likelihood Ratios:

    PLR = Sensitivity / (1 − Specificity)        NLR = (1 − Sensitivity) / Specificity

• Predictive Values:

    PPV = a / (a + c)        NPV = d / (b + d)
As the classifier producing the predicted class depends on the threshold t0 , we can produce a plot
of how these quantities change as t0 changes.
If we plot

    x(t0) = 1 − Specificity at t0,        y(t0) = Sensitivity at t0,

then we obtain an ROC curve:
• the ROC curve for a good classifier rises steeply and then flattens off; such a curve has a large area underneath it on the unit square (the domain of (x(t0), y(t0)));
• the ROC curve for a poor classifier lies near the line y = x.
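A sketch in R of these quantities and the resulting ROC curve, for hypothetical scores from two classes (classifying to class 1 when x < t0, as above):

# sensitivity, specificity and an ROC curve for a simple threshold classifier
set.seed(3)
x     <- c(rnorm(100, mean = 0), rnorm(100, mean = 1.5))   # hypothetical scores
truth <- rep(c(1, 2), each = 100)                          # true classes

roc <- t(sapply(sort(unique(x)), function(t0) {
  pred <- ifelse(x < t0, 1, 2)              # classify to class 1 if x < t0
  sens <- mean(pred[truth == 1] == 1)       # true positive rate for class 1
  spec <- mean(pred[truth == 2] == 2)       # true negative rate
  c(x = 1 - spec, y = sens)
}))
plot(roc, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)                       # the y = x line of a poor classifier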
Mathematical Construction
In PCA the data matrix is typically arranged with observations in rows, and different predictors
in columns. In a classification context, we might wish to see how much information the predictor
variables contained. Suppose that the N × p data matrix X is so arranged, but also that X is
centred, so that the mean within a column is zero; this is achieved by taking the raw predictor data matrix and subtracting from each element in a column that column's sample mean. In linear regression, the matrix X was referred to as the design matrix, and used to estimate parameters in the regression model, for response vector y, using the formula
    β̂ = (X^T X)^{-1} X^T y        (3.5)

with prediction

    ŷ = X β̂ = X (X^T X)^{-1} X^T y.
Note that if X is the centred matrix as defined, we have that

    S = X^T X / N
is the sample covariance matrix. Now using standard matrix techniques, we may (uniquely)
write X in the following way:

    X = U D V^T        (3.6)

where U is (N × p), V is (p × p) such that

    U^T U = V^T V = I_p
for p-dimensional identity matrix Ip (that is, U and V are orthogonal matrices), and D is a
(p × p) matrix with diagonal elements d1 ≥ d2 ≥ ... ≥ dp ≥ 0 and zero elements elsewhere. The
representation in ( 3.6) is termed the singular value decomposition (SVD ) of X. Note that,
using this form, we have X^T X = V D² V^T = V L V^T, where L = D² = diag(l1, ..., lp) with

    l1 ≥ l2 ≥ ... ≥ lp ≥ 0;

these are termed the eigenvalues of X^T X. The columns of the matrix V are termed the eigenvectors of X^T X, and the jth column, vj, is the eigenvector associated with eigenvalue lj. The
principal components of X^T X are defined via the columns of V, v1, ..., vp. The jth principal component is zj, defined by

    zj = X vj = dj uj

for normalized vector uj (so that zj^T zj = dj² = lj). The first principal component z1 has the largest sample variance amongst all normalized linear combinations of the columns of X; we have that

    Var[z1] = l1 / N.
Now, recall that in the SVD, VT V = Ip , that is the columns of V are orthogonal. Hence the
principal components z1 , ..., zp are also orthogonal. The total variance explained by the data is
a straightforward function of the centered design matrix; it is the sum of the diagonal elements (or
trace) of the matrix S, given by
    trace(S) = Σ_{j=1}^{p} [S]_jj = trace(X^T X)/N = trace(L)/N = (1/N) Σ_{j=1}^{p} lj,

so that the jth principal component explains a proportion

    lj / Σ_{k=1}^{p} lk        (3.8)
of the total variance. Using principal components, therefore, it is possible to find the “directions”
of largest variability in terms of a linear combination of the columns of the design matrix; a
linear combination of column vectors x1 , ..., xp is a vector w of the form
    w = Σ_{j=1}^{p} π_j x_j
Statistical Properties
It is of practical use to be able to see how many principal components are needed to explain the
variability in the data. To do this we need to study the statistical properties of the elements of
the decomposition used. If the predictor variables have a multivariate normal distribution,
we have the following statistical result. Suppose that predictor vector Xi = (Xi1 , ..., Xip ) have a
multivariate normal distribution for i = 1, ..., N ,
Xi ∼ Np (µ, Σ)
    Σ = Γ Λ Γ^T

for eigenvalue matrix Λ = diag(λ1, ..., λp) and eigenvector matrix Γ = (γ1, ..., γp). Then the
centred sample covariance matrix

    S = X^T X / N,

with eigendecomposition

    S = V L V^T

for sample eigenvalue matrix L = diag(l1, ..., lp) and eigenvector matrix V = (v1, ..., vp), is such that, approximately, as N → ∞,

    l = (l1, ..., lp)^T ∼ N( λ, 2Λ²/(N − 1) ),    where λ = (λ1, ..., λp)^T;

that is, the sample eigenvalues are approximately independently normally distributed, with lj having variance

    2λj² / (N − 1).
Uses
The main use of principal components decomposition is in data reduction or feature extraction.
It is a method for looking for the main sources of variability in the predictor variables, and the
argument follows that the first few principal components contain the majority of the explanatory
power in the predictor variables. Thus, instead of using the original predictor variables in the linear
(regression) model
    Y = X β_X + ε,

we can use instead the principal components as predictors:

    Y = Z β_Z + ε,

where Z = XV, and β_X and β_Z are the parameter vectors in the regression models, both of dimension (p × 1). The data compression or feature extraction arises if, instead of taking all p of the principal components, we take only the first k; that is, we extract the first k columns of matrix Z, and reduce β_Z to a (k × 1) vector. Choosing k can be done by inspection of the "scree" plot of the successive scaled eigenvalues as in (3.8).
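A sketch of the construction in R using svd() on a centred (hypothetical) data matrix; prcomp() gives the equivalent built-in computation:

# principal components via the singular value decomposition
set.seed(4)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), N, p)
X <- scale(X, center = TRUE, scale = FALSE)   # centre each column

s <- svd(X)                    # X = U D V', with s$d the singular values d_j
L <- s$d^2                     # eigenvalues l_j of X'X
Z <- X %*% s$v                 # principal components z_j = X v_j
L / sum(L)                     # proportion of variance explained, as in (3.8)
plot(L / sum(L), type = "b")   # a simple "scree" plot

prcomp(X)                      # built-in equivalent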
1. Let xj = (xj1 , ...xjn )T be the j th column of the design matrix X, appropriately centred (by
subtracting the column mean x̄j ) and scaled (by column sample standard deviation sj ) to
have sample mean zero and sample variance one.
2. Set ŷ^(0) = 1 ȳ and x_j^(0) = x_j.
3. For m = 1, 2, ..., p:

   • z_m = Σ_{j=1}^{p} φ̂_mj x_j^(m−1), where φ̂_mj = ⟨x_j^(m−1), y⟩ and, for n-vectors v1 and v2,

         ⟨v1, v2⟩ = Σ_{i=1}^{n} v_{1i} v_{2i}
   • ŷ^(m) is defined by

         ŷ^(m) = ŷ^(m−1) + θ̂_m z_m,    where θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩

   • x_j^(m) is defined by

         x_j^(m) = x_j^(m−1) − [ ⟨z_m, x_j^(m−1)⟩ / ⟨z_m, z_m⟩ ] z_m

     so that, for each j = 1, ..., p, x_j^(m) is orthogonal to z_m.
4. Record the sequence of fitted vectors ŷ^(1), ŷ^(2), ..., ŷ^(p).
In the construction of each PLS direction z_m, the predictors are weighted by the strength of their univariate impact on y. The algorithm first regresses y on z1, giving coefficient θ̂1, and then orthogonalizes x1, ..., xp with respect to z1; it then regresses y on z2, constructed from these orthogonalized vectors, and so on. After M ≤ p steps, the vectors

    z1, ..., zM

have been produced, and can be used as the inputs in a regression-type model, to give the coefficients

    β̂_1^(PLS), ..., β̂_M^(PLS).
CHAPTER 4
STATISTICAL MODELS AND METHODS IN BIOINFORMATICS
• observable variables usually consist of nucleotides for DNA sequences and amino-acid residues
for protein sequences, that is, quantities that we can “measure” or observe at all points in the
sequence. These quantities are usually observed without error, but occasionally can be subject to uncertainty (such as the uncertainty arising from base-calling algorithms). Occasionally,
other, larger scale observable quantities might be available, such as DNA motifs, or protein
secondary structure.
• unobservable variables correspond to hidden or latent structure that is not observable, such
as CpG island categories, regulatory regions, introns/exons, protein secondary and higher-
order structures. These variables are the main focus of our interest.
P[X1 = x1, X2 = x2, X3 = x3, ...] = P[X1 = x1] × P[X2 = x2 | X1 = x1] × P[X3 = x3 | X1 = x1, X2 = x2] × · · ·
Such a model will form the basis of much of the statistical analysis of biological sequences, as it
allows us to build up the probability distribution for an observed sequence.
1. X(0) = 0;
2. the process has independent increments; that is, the numbers of events occurring in disjoint intervals are probabilistically independent.
It can be shown that the discrete random variables {X(t), t ≥ 0} each follow a Poisson distribution; that is, if P_n(t) = P[precisely n events occur in (0, t]], it can be shown that

    P_n(t) = P[X(t) = n] = e^{−λt} (λt)^n / n!,        n = 0, 1, 2, ....

The sequence {X(t), t ≥ 0} forms a homogeneous Poisson Process with rate λ (that is, a Poisson process with constant rate λ).
2. If T1, T2 , T3 , ... define the inter-event times of events occurring as a Poisson process, that is,
for n = 1, 2, 3, ...
Tn = "time between the (n − 1)st and nth events"
then T1, T2 , T3 , ... are a sequence of independent and identically distributed random vari-
ables with
Tn ∼ Exponential(λ)
3. If Y1, Y2, Y3, ... define the times of events occurring as a Poisson process, that is, for n = 1, 2, 3, ..., Yn = "time at which the nth event occurs", then Y1, Y2, Y3, ... are a sequence of random
variables with
Yn ∼ Gamma(n, λ)
4. Consider the interval of length L, (0, L], and suppose that k events occur (according to the
Poisson process rules) in that interval. If V1, V2 , ..., Vk define the (ordered) event times of the
k events occurring in that interval, then, given L and k, V1, V2 , ..., Vk are the order statistics
derived from an independent and identically distributed random sample U1, U2 , ..., Uk
where
    Ui ∼ Uniform(0, L)
The homogeneous Poisson process is the standard model for discrete events that occur in con-
tinuous time. It can be thought of as a limiting case of an independent Bernoulli process (a model
for discrete events occurring in discrete time); let
    Xt ∼ Bernoulli(θ),        t = 1, 2, 3, ...
0010101000100101000010010001.....
• the number of 1s that occur in any finite and disjoint subsequences are independent random
variables
• the number of 1s that occur in any finite subsequence of n time points is a Binomial (n, θ)
random variable
• the numbers of trials between successive 1s are i.i.d. Geometric (θ) random variables.
Now consider a large sequence where θ is small; the Bernoulli sequence approximates a continuous-time sequence, where the events (i.e. the 1s) occur at a constant rate of λ = nθ per n trials, and the Poisson process requirements are met.
The homogeneous Poisson process is a common model that is used often in many scientific fields.
For example, in genetics, the Poisson process is used to model the occurrences of crossings-over
in meiosis. In sections below we will see how the Poisson process model can be used to represent
occurrences of motifs in a biological sequence.
for i, j ∈ {1, 2, 3, ..., S}, say, that does not depend on t. The probabilistic specification can be
encapsulated in the S × S matrix
          p11  p12  p13  · · ·  p1S
          p21  p22  p23  · · ·  p2S
    P =    .    .    .            .
           .    .    .            .
          pS1  pS2  pS3  · · ·  pSS

which is called the transition matrix, where the element in row i and column j defines the probability of moving from state i to state j. Note that the row totals must equal 1.
The sequence of random variables described by the matrix P form a Markov Chain. Thinking
back to the chain rule, in order to complete the specification, a probability specification for the
initial state random variable X1 is required; we can denote this discrete probability distribution by the row vector of probabilities π^(1) = (π_1^(1), π_2^(1), ..., π_S^(1)). To compute the (marginal) probability of the random variable X_t taking the value i, we can use matrix algebra and an iterative calculation as follows; let π^(t) = (π_1^(t), π_2^(t), ..., π_S^(t)) denote the probability distribution of X_t. First, using the Theorem of Total Probability (chapter 1), conditioning on the different possible values of X_{t−1},

    P[X_t = j] = Σ_{i=1}^{S} P[X_t = j | X_{t−1} = i] P[X_{t−1} = i]

or, in matrix form,

    π^(t) = π^(t−1) P.

Using this definition recursively, we have

    π^(t) = π^(t−1) P = π^(t−2) P² = · · · = π^(1) P^{t−1},

which gives a mechanism for computing the marginal probability after t steps.
so the chain gets “stuck” in state S. Here the states 0 and S are termed absorbing states. This
type of Markov process is termed a random walk.
Then, the equilibrium distribution π can be obtained by solving the system of equations

    π P = π        subject to        Σ_{i=1}^{S} π_i = 1.

Note that one of the equations in π P = π is redundant and is replaced by the constraint that the probabilities must sum to 1. This system can be solved (using SPLUS or MAPLE) to find the equilibrium distribution.
This stationary distribution can also be obtained easily by computing the n-step-ahead transition matrix P_n = P^n = P × P × ... × P in the limit as n → ∞; some SPLUS code to do this is below:
# 4-state transition matrix (each row sums to 1)
p<-matrix(c(0.6,0.1,0.2,0.1,0.1,0.7,0.1,0.1,0.2,0.2,0.5,0.1,0.1,0.3,0.1,0.5),nrow=4,byrow=T)
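Continuing the snippet, a sketch of the n-step calculation (the rows of a high power of p converge to the stationary distribution); the direct linear solve shown after it is an equivalent route:

# n-step transition matrix: the rows converge to the stationary distribution
pn <- p
for (i in 1:100) pn <- pn %*% p   # p^(101), effectively the limit here
pn[1, ]                           # an approximation to the equilibrium distribution pi

# equivalently, solve pi P = pi subject to sum(pi) = 1
A <- rbind(t(diag(4) - p)[1:3, ], rep(1, 4))
solve(A, c(0, 0, 0, 1))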
For a DNA sequence with states {A, C, G, T}, the transition matrix takes the form

          pAA  pAC  pAG  pAT
          pCA  pCC  pCG  pCT
    P =
          pGA  pGC  pGG  pGT
          pTA  pTC  pTG  pTT
The S-Plus Exercise 1 was concerned with estimating the parameters in this matrix; in fact, Maximum Likelihood Estimation gives a formal justification for the technique used there. However, there is little justification for the assumption of homogeneity across large-scale genomic regions.
where fX and FX define the distribution of the original variables. For some of the distributions we have studied, the distributions of the extreme order statistics also have simple forms. For example, for the Uniform(0, 1) distribution,

    fX(x) = 1,        FX(x) = x,

so that

    FYn(yn) = {yn}^n,        FY1(y1) = 1 − {1 − y1}^n,

and for the Uniform(0, L) distribution

    fX(x) = 1/L,        FX(x) = x/L.
EXAMPLE: The lifetime (until degradation) of cellular proteins or RNA molecules can be well
modelled by an Exponential distribution (Ewens and Grant, p43). Suppose n such molecules are
to be studied, and their respective lifetimes represented by random variables X1 , ..., Xn , regarded
as independent and identically distributed. Now, if X1, ..., Xn ∼ Exponential(λ), then for x > 0, FX(x) = 1 − e^{−λx}, so that

    FYn(yn) = {1 − e^{−λ yn}}^n,        FY1(y1) = 1 − e^{−nλ y1},

and it can be shown that

    E_fYn[Yn] ≈ (γ + log n)/λ,        Var_fYn[Yn] ≈ π²/(6λ²),

where γ = 0.577216 (Euler's constant).
EXAMPLE: For discrete random variables X1, ..., Xn ∼ Geometric(θ), for x = 1, 2, 3, ...,

    FX(x) = 1 − (1 − θ)^x.

For convenience we adjust the distribution (as in the SPLUS package) so that, for x = 0, 1, 2, ...,

    FX(x) = 1 − φ^{x+1},

where φ = 1 − θ. In this adjusted distribution, the c.d.f.s of the extreme order statistics are given by

    MAXIMUM:    FYn(yn) = {1 − φ^{yn+1}}^n
    MINIMUM:    FY1(y1) = 1 − {1 − (1 − φ^{y1+1})}^n = 1 − φ^{n(y1+1)}
and hence, for Yn, we have

    P[Yn ≤ yn] = {1 − φ^{yn+1}}^n

and thus

    P[Yn = yn] = P[Yn ≤ yn] − P[Yn ≤ yn − 1] = {1 − φ^{yn+1}}^n − {1 − φ^{yn}}^n.

Now, φ is a probability, so we can re-parameterize by writing φ = e^{−λ} for some λ > 0, and hence

    P[Yn ≤ yn] = {1 − e^{−λ(yn+1)}}^n
    P[Yn ≥ yn] = 1 − {1 − e^{−λ yn}}^n
    P[Yn = yn] = {1 − e^{−λ(yn+1)}}^n − {1 − e^{−λ yn}}^n
which are similar formulae to the Exponential case above. It can be shown (again after some work) that

    E_fYn[Yn] ≈ (γ + log n)/λ − 1/2,        Var_fYn[Yn] ≈ π²/(6λ²) + 1/12,

that is, very similar to the results for the Exponential distribution above.
SOME APPROXIMATIONS
For large n, some further approximations can be made. If we let

    µn ≈ (γ + log n)/λ,        σ²_max ≈ π²/(6λ²),

then

    P[Yn ≤ yn] ≈ exp{ −n exp{−λ yn} },

which can be compared with the exact result above,

    P[Yn ≤ yn] = {1 − e^{−λ yn}}^n.
In the discrete Geometric case, it can be shown using similar approaches that, for large n,

    exp{−nC exp{−λ yn}} ≤ P[Yn ≤ yn] ≤ exp{−nC exp{−λ(yn + 1)}}
    1 − exp{−nC exp{−λ yn}} ≤ P[Yn ≥ yn] ≤ 1 − exp{−nC exp{−λ(yn − 1)}}

where C is a constant to be defined. These results will provide the probabilistic basis for sequence analysis via BLAST.
    U1 + U2 + ... + U_{n+1} = 1.

The random variables U1, U2, ..., U_{n+1} are not independent, so the theory derived above is not applicable. It can be shown that if U_min = min{U1, U2, ..., U_{n+1}}, then

    P[U_min ≤ u] = 1 − (1 − (n + 1)u)^n,        0 < u < 1/(n + 1),

so that

    F_Umin(u) = 1 − (1 − (n + 1)u)^n.
then under H0 it can be shown that the distribution of run-lengths should follow an adjusted Geometric(1 − pA) distribution; that is, if Xi is defined as the run-length of run i, then

    FXi(x) = P[Xi ≤ x] = 1 − pA^{x+1},        x = 0, 1, 2, ....

Now suppose that Yn = max{X1, ..., Xn}; then, using the extreme value theory results, it can be shown that (under H0)

    FYn(y − 1) = P[Yn < y] = (1 − pA^y)^n    ⟹    P[Yn ≥ y] = 1 − (1 − pA^y)^n.

Hence, for a formal significance test of the hypothesis H0, we may use the observed version of Yn (that is, the sample maximum) as the test statistic, and compute a p-value p = P[Yn ≥ yn] = 1 − (1 − pA^{yn})^n. Note that n must be specified before this p-value can be computed (effectively, we need to choose n large enough). A recommended choice is n ≈ (1 − pA)N, giving

    p ≈ 1 − (1 − pA^{yn})^{(1−pA)N},

which, using an exponential approximation, gives p ≈ 1 − exp{−(1 − pA) N pA^{yn}}. Hence, for a test at the α = 0.05 significance level, we must check whether the computed p is less than α. If it is, we reject the hypothesis H0 that the sequence is random and independent.
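A sketch of this p-value calculation in R; the sequence length N, the base probability pA and the observed longest run yn are hypothetical values:

# significance test for the longest run of a given base in a sequence
N  <- 1000        # hypothetical sequence length
pA <- 0.25        # hypothesized probability of the base under H0
yn <- 8           # hypothetical observed longest run length

n <- round((1 - pA) * N)                     # recommended number of runs
p.exact  <- 1 - (1 - pA^yn)^n                # p-value from the extreme value result
p.approx <- 1 - exp(-(1 - pA) * N * pA^yn)   # exponential approximation
c(p.exact, p.approx)
p.exact < 0.05                               # reject H0 if TRUE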
4.3.3 R-SCANS
In r-scan analysis, interest lies in detecting short nucleotide “words” in a long genomic segment of
length L. If L is very large, then the locations of the occurrences of the words, can be regarded as
points in the (continuous) interval (0, L), or, without loss of generality, the interval (0, 1) . Suppose
that the word is detected a total of k times. An appropriate test of the hypothesis H0 that the
locations are uniformly distributed in (0, 1) can be based on the test statistic random variable Yk+1
where
Yk+1 = “the maximum inter-location length”
(note that the k points segment the genomic interval into k + 1 regions of lengths U1, U2, ..., U_{k+1}).
Then
Yk+1 = max {U1 , U2 , ..., Uk+1 }
Again, to construct the significance test and calculate a p-value, we seek the distribution of this
maximum order statistic. However, previous techniques cannot be used, as the random variables
U1 , U2 , ..., Uk+1 are not independent. With much work, it can be shown that, as k becomes large,
under H0,

    p = P[Y_{k+1} ≥ y] ≈ 1 − exp{ −(k + 2) e^{−(k+2)y} },
which may be re-written

    p = P[ Y_{k+1} ≥ (log(k + 2) + y) / (k + 2) ] ≈ 1 − exp{ −e^{−y} }.
This may be generalized to enable tests based on other test statistics to be carried out, such as those based on "r-scan" values (maxima of the sums of r adjacent inter-point intervals). Finally, as an alternative test, we could instead use the minimum order statistic Y1. Again, after much work, it can be demonstrated that under H0 the p-value is given (approximately) by

    p = P[Y1 ≤ y] ≈ 1 − exp{ −y (k + 2)² }.
Both of these statistics have an approximate Chi-squared χ²_{(r−1)(c−1)} = χ²_3 distribution, again given that H0 is true. Typically, a significance level of α = 0.05 is used for this test, and the critical value in a one-tailed test of the hypothesis is the 0.95 point of this distribution, that is, 7.81.
that match exactly in positions 1, 3, 6, 9, 10, 11, 13, 14, 23, 24 and 25, allowing for the fact that evolutionary forces (substitution/mutation, deletion and insertion) will disrupt the exact matching of truly homologous sequences. In fact, it is important to allow for partially aligned sequences

    CGGGTA--TCCAA
    CCC-TAGGTCCCA
Again, considering the random variable Yn = max{X1, ..., Xn} as the test statistic, then under H0

    P[Yn ≥ y] = 1 − (1 − p_m^y)^n.

For a formal significance test of the hypothesis H0, therefore, we may again use the maximum run length as the test statistic, and compute a p-value

    p = P[Yn ≥ yn] = 1 − (1 − p_m^{yn})^n.
For sequences of total length N, choosing n ≈ (1 − p_m)N gives the approximation

    p ≈ 1 − (1 − p_m^{yn})^{(1−p_m)N}.

For a test at the α = 0.05 significance level, we must check whether the computed p-value is less than α. If it is, we reject the hypothesis H0 that the sequence is random and independent. For the sequences above, yn = 3, and if one of the assumptions of H0 is that pA = pC = pG = pT = 1/4, so that p_m = 1/4, then the approximate p-value is

    1 − ( 1 − (1/4)³ )^{26×(3/4)} = 0.264,

which is not less than α = 0.05, and so there is no evidence to reject H0, and hence there is no evidence of alignment (or homology). In fact, using this approximation, we would need a run length

    y_CRIT ≥ log( 1 − (1 − 0.05)^{(1/26)×(4/3)} ) / log(1/4) = 4.29

to reject the null model.
the probability distribution of such a variable is the Negative Binomial (generalized geometric) distribution, and the probability distribution (c.d.f.) of X is essentially given by the negative binomial mass function as

    F_X(x) = P[X ≤ x] = Σ_{j=k}^{x} C(j, k) p_m^{j−k} (1 − p_m)^{k+1},        x = k, k + 1, k + 2, ...,

where C(j, k) denotes the binomial coefficient.
For a test statistic, we will again consider a maximum order statistic Yn = max {X1 , ..., Xn } derived
from the analysis of n subsequences.
It is clear from the above formulae that the definition of the test statistic, the calculation of the null distribution, the p-value, and so on, are complex, and will become more complicated if the model for sequence generation is generalized to be more realistic. However, the principles of statistical hypothesis testing can be applied quite generally.
1. Generate two sequences of length N under the model implied by H0 , that is, two independent
sequences with probabilities pA , pC , pG , pT for each nucleotide.
2. For different values of k (k = 0, 1, 2, 3, ...) trawl the sequences to find the longest (contiguous)
subsequence which contains at most k mismatches. Record the length y of this longest
subsequence.
3. Return to (1) and repeat a large number (1000000 say) of times and form a two-way table of
the frequencies with which the longest subsequence containing k mismatches was of length y
(y = k, k + 1, k + 2, ...)
4. Convert the frequencies into probabilities by dividing by the number of times the simulation
is repeated.
This Monte Carlo simulation procedure often gives a probability distribution of sufficient accu-
racy to allow a p-value to be computed.
Both algorithms can be modified to incorporate gaps in alignments and to penalize them appro-
priately. The Smith-Waterman algorithm is regarded as a superior algorithm as it computes the
optimal alignment between two sequences in a shorter computational time. Neither algorithm is
explicitly probabilistic in nature, but can be regarded (loosely) as a form of maximum probabil-
ity/maximum likelihood procedure.
• the null distribution is an extreme value distribution, of a similar form to those derived
in section 4.3.1
• the p-value is computed in terms of the E-VALUE which, for two sequences of length n and
m is defined as
    E = K m n exp{−λS},        p = 1 − e^{−E}
E is the expected number of “high scoring segment” pairs of sequences, and K and λ are
parameters to be specified or estimated from appropriate sequence databases
• Critical values in the test are evaluated from the null distribution in the usual way
Assessment of the statistical significance of the alignment of two biological sequences is based
on properties of a discrete state stochastic (Markov) process similar to the Markov chains
introduced in Section 4.1. Consider first two DNA sequences of equal length, and the positions at
which they match/align (coded 1) similar to the notation of previous sections:
G G A G A C T G T A G A C A G C T A A T G C T A T A
G A A C G C C C T A G C C A C G A G C C C T T A T C
1 0 1 0 0 1 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0
Now, suppose that when the Match sequence is read from left to right a Match is given an alignment
“score” of +1, whereas a non Match is given a score of -1, and a running or cumulative score is
recorded for each position
Sequence 1 G G A G A C T G T ··· G C T A T A
Sequence 2 G A A C G C C T A ··· C T T T T C
Match 1 0 1 0 0 1 0 0 1 ··· 0 0 1 1 1 0
Score +1 −1 +1 −1 −1 +1 −1 −1 +1 ··· −1 −1 +1 +1 +1 −1
Cumulative 1 0 1 0 −1 0 −1 −2 −1 ··· −5 −6 −5 −4 −3 −4
If Xi is the discrete random variable recording the match score at position i, so that

    Xi = +1 for a Match,    Xi = −1 for a non-Match,

and Si is the discrete random variable recording the cumulative match score at position i, then by definition

    Si = Σ_{j=1}^{i} Xj = S_{i−1} + Xi,

and hence the sequence of random variables S1, S2, S3, ... forms a Markov process that is in fact a random walk on the integers (note that this random walk does not have any absorbing states).
For the two sequences above, the complete observed sequence s1, s2, s3, ..., s25, s26 of cumulative scores is given by

    1, 0, 1, 0, −1, 0, −1, −2, −1, 0, 1, 0, 1, 2, 1, 0, −1, −2, −3, −4, −5, −6, −5, −4, −3, −4
Note that such a sequence of random variables can be defined and observed whatever scoring method
is used to associate alignment scores with positions in the sequence; this is of vital importance when
alignment of protein sequences is considered.
We wish to use the observed sequence s1 , s2 , s3 , ... to quantify the degree of alignment between
the sequences; this is achieved as follows. An exact local alignment between the sequences is a
subsequence where the two sequences match exactly. In the sequences above, the exact local
alignments are observed at positions 1,3,6,9-11,13-14 and 23-25. Next consider the ladder points,
that is, those positions in the sequence at which the cumulative score is lower than at any previous point. The ladder points here are

    LADDER POINTS    0    5    8    19    20    21    22
    SCORE            0   −1   −2   −3    −4    −5    −6
Finally, consider the successive sections of the walk between the ladder points. In particular, con-
sider the excursions of the random walk, that is, the successive differences between the maximum
cumulative score for that subsection and the score at the previous ladder point.
    SUBSECTION                   1     2     3     4     5     6     7
    Begins at Ladder Point       0     5     8     19    20    21    22
    Ladder Point Score           0    −1    −2    −3    −4    −5    −6      (1)
    Ends at                      4     7     18    19    20    21    26
    Maximum subsection score     1     0     2    −3    −4    −5    −3      (2)
    Achieved at position(s)     1, 3   6     14    19    20    21    25
    Excursion                    1     1     4     0     0     0     3      (2)−(1)
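These quantities can be computed directly; a sketch in R using the match indicator sequence from the alignment above:

# cumulative score, ladder points and excursions for the match sequence
match <- c(1,0,1,0,0,1,0,0,1,1,1,0,1,1,0,0,0,0,0,0,0,0,1,1,1,0)
score <- ifelse(match == 1, +1, -1)
s     <- c(0, cumsum(score))                  # cumulative score; s[1] is "position 0"

# ladder points: positions where the score is lower than at all previous positions
is.ladder <- c(TRUE, sapply(2:length(s), function(i) s[i] < min(s[1:(i - 1)])))
ladder    <- which(is.ladder) - 1             # 0 5 8 19 20 21 22, as in the table

# excursions: maximum score in each subsection minus the ladder point score
start.idx <- which(is.ladder)
end.idx   <- c(start.idx[-1] - 1, length(s))
mapply(function(a, b) max(s[a:b]) - s[a], start.idx, end.idx)   # 1 1 4 0 0 0 3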
The alignment, ladder points, maximum subsection scores and excursions can be displayed graph-
ically, as in Figure 4.1;
[Figure 4.1: the cumulative alignment score (Score) plotted against Position for the two sequences, showing the ladder points and the excursions between them.]
The excursions measure the degree of local alignment between the two sequences; reading from left
to right along the alignment, within each subsection (between the ladder points) the magnitude of
the excursion measures the cumulative evidence that the sequences are evolutionarily related within
a localized window. Consider, for example, subsection 3, which starts at the ladder point at position 8 (cumulative score −2) and extends to position 18 (before the next ladder point at position 19). There is a close (but not exact) alignment between the first 7 positions, and the degree of support for true alignment peaks at position 14 (cumulative score +2, an excursion of 4 above the ladder point), before declining. Note that, due
to the Markovian nature of the underlying process, the sequence subsections between the ladder
points have identical probabilistic properties.
In biological sequence analysis, the objective is to optimize the alignment between two arbitrary sequences and to quantify its statistical significance. Some generalizations of the stochastic process model described above are necessary (to allow for gapped and multiple alignments, analysis of sequences of different lengths, etc.), but in principle the method is straightforward. Practically, when, for example, interrogating a protein database for matches to a potentially novel sequence, it is often sufficient to specify some threshold cumulative score value that indicates an acceptable alignment. The choice of such a threshold is clearly quite arbitrary, but some guidance as to what constitutes a sensible threshold value may be obtained by studying the probabilistic properties of the stochastic process models described above.
which is merely the relative entropy of the two distributions. Here $E_q[S]$ is the expected score under the alternative hypothesis. For high-scoring segments (i.e. where the degree of alignment is high), we therefore have that the expected score is $H/\lambda$.
The degree of alignment between two sequences can be quantified using statistical significance
testing. If the maximum alignment statistic Ymax is obtained, then for any y
$$1 - e^{-Ke^{-\lambda y}} \;\leq\; P\left[Y_{\max} > \frac{1}{\lambda}\log N + y\right] \;\leq\; 1 - e^{-Ke^{-\lambda(y-1)}} \qquad (4.5)$$
where N is the sequence length and $K = Ce^{-\lambda}/A$, where C and A are the constants defined previously. The central probability can be re-written
$$P\left[Y_{\max} > \frac{1}{\lambda}\log N + y\right] = P\left[\lambda Y_{\max} - \log N > \lambda y\right]$$
Another quantity reported by BLAST is the expected number of high-scoring excursions, E', defined by
$$E' = NKe^{-\lambda y_{\max}} \qquad (4.8)$$
that is, the expected number of excursions that (would) exceed the observed maximum excursion for aligned sequences of length N. It is easily seen that
$$S' = -\log E' \qquad \therefore \qquad p \approx 1 - \exp\{-E'\} \;\Rightarrow\; E' = -\log(1-p) \approx p \text{ if } p \text{ is small} \qquad (4.9)$$
thus we have a further approximation to the p-value in terms of E' (which is merely a function of the observed data and modelling assumptions).
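As a purely numerical illustration of (4.8) and (4.9), with made-up values of K, λ, N and the observed maximum excursion (the notes do not fix numerical values at this point):

```python
import math

# Sketch: E-value and approximate p-value from (4.8)-(4.9),
#   E' = N K exp(-lambda * y_max),  p ~ 1 - exp(-E').
# K, lam, N and y_max are illustrative values only.
K, lam, N, y_max = 0.1, 0.7, 1000, 15

E_prime = N * K * math.exp(-lam * y_max)   # expected number of excursions exceeding y_max
p = 1.0 - math.exp(-E_prime)               # approximate p-value; p ~ E' when E' is small
print(E_prime, p)
```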
where $N_i' = N_i - \lambda Y_{\max}/H$, for i = 1, 2, and where H is the entropy quantity that appears in (4.4).
For the Karlin-Altschul Sum statistic, the correction is different, with
$$(r-1) + \frac{\lambda}{H}\left(1 - \frac{r+1}{r}\,f\right)\sum_{i=1}^{r} Y_i$$
being subtracted from $N_1$ and $N_2$, where f is a fixed overlap adjustment factor taken to be around 0.1-0.2.
Then, if the total length of the database interrogated is D, the expected number of segments that score at least $\upsilon$ is approximately
$$E_D = \frac{D}{N_2}\left(1 - e^{-E}\right)$$
as the entire database is a factor of $D/N_2$ times longer than the sequence that gave the highest alignment score. Hence an appropriate approximation to the required p-value is
$$p_D = 1 - e^{-E_D}$$
For tests involving the sum statistics of section 4.5.4, a similar correction to the expected value and p-value is obtained.
$$P[H_{t+1} = j \mid H_t = i] = \Pr(\text{Region type } i \text{ at time } t \rightarrow \text{Region type } j \text{ at time } t+1) = \theta_{ij}, \qquad i, j \in \mathcal{H} \qquad (4.11)$$
which may be represented as
For illustration consider a nucleotide sequence, with Xi ∈ {A, C, G, T }, near the boundary of a
coding/non-coding region. Let H = {0, 1} be the set of available region types, where 0 classifies a
non-coding region and 1 classifies a coding region. If the coding region begins at nucleotide 7, then
a possible configuration would be
Observed sequence A C T C G A A C C G
Latent sequence 0 0 0 0 0 0 1 1 1 1
so that the realized latent sequence is h = (h1 , h2 , ..., h10 ) = (0, 0, 0, 0, 0, 0, 1, 1, 1, 1). However, of
course, in practice the latent sequence is not observed and therefore the statistical analysis issues
centre on inference about (estimation of) the latent sequence {Ht , t ≥ 1} and the parameters in the
Markov transition matrices Ph , h ∈ H, and Pθ .
be the nX × nX Markov transition matrix for region type k, and the (nH + 1) × (nH + 1) Markov
transition matrix between regions respectively. Recall that, in each case, the rows of these matrices
are conditional probability distributions and therefore must sum to one. Typically, we will be
considering nX = 4 (for DNA sequences) and nX = 20 for protein sequences, and nH up to 5.
Given the latent sequence h and using the notation $P = (P_0, P_1, \ldots, P_{n_H}, P_\theta)$, the likelihood function derived from the observed data x can be defined in the spirit of earlier sections, and using the chain rule for probabilities, as
Now, because of the Markov assumption for the observed data, the conditional probability expres-
sions can be simplified. Specifically, for t = 2, 3, ..., n
as the observation in position t conditional on previous values is dependent only on the observation
in position t − 1. Furthermore, if ht = ht−1 = k say (that is, there is no change in region type
between position t − 1 and position t), then
where
$$f(x_t; x_{t-1}, h_{t-1}, h_t, P_k) = p^{(k)}_{ij} \qquad \text{if } x_{t-1} = i \text{ and } x_t = j$$
is the probability of a transition between states i and j within region type k between positions t − 1 and t. If $h_t \neq h_{t-1}$, say $h_{t-1} = k_1$ and $h_t = k_2$ with $k_1 \neq k_2$, then it is assumed that
Observed sequence   A C T C G A A C C G
Coded               1 2 4 2 3 1 1 2 2 3
Latent sequence     0 0 0 0 0 0 1 1 1 1

$$= \underbrace{p^{(0)}_{1} \times p^{(0)}_{12} \times p^{(0)}_{24} \times p^{(0)}_{42} \times p^{(0)}_{23} \times p^{(0)}_{31}}_{\text{Region type 0 (positions 1-6)}} \;\times\; \underbrace{p^{(1)}_{1} \times p^{(1)}_{12} \times p^{(1)}_{22} \times p^{(1)}_{23}}_{\text{Region type 1 (positions 7-10)}}$$
Previously, such a likelihood has formed the basis for statistical inference. Using maximum
likelihood estimation we could estimate the unobserved parameters (h1 , h2 , ..., hn , P) by choosing
those values at which L (h1 , h2 , ..., hn , P) is maximized, that is, we choose
$$\left(\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_n, \hat{P}\right) = \arg\max L(h_1, h_2, \ldots, h_n, P)$$
where, recall,
$$L(h_1, h_2, \ldots, h_n, P) = f(x \mid h, P)$$
The inference problem now is twofold. We wish to
(a) report the most probable states $h = (h_1, h_2, \ldots, h_n)$ in light of the data x, and
(b) estimate the parameters P.
The estimation of the most probable states is complicated by the structure in this latent sequence.
Remember that the Markov assumption means that the joint distribution of random variables
H = (H1 , H2 , ..., Hn ) should be written (using the chain rule) as
$$f(h_1, h_2, h_3, \ldots, h_n) = f(h_1) \times f(h_2 \mid h_1) \times f(h_3 \mid h_1, h_2) \times \cdots \times f(h_n \mid h_1, h_2, h_3, \ldots, h_{n-1}) \qquad (4.13)$$
$$\phantom{f(h_1, h_2, h_3, \ldots, h_n)} = f(h_1) \times f(h_2 \mid h_1) \times f(h_3 \mid h_2) \times \cdots \times f(h_n \mid h_{n-1})$$
and this dependence structure should be taken into account. Recall also that the joint distribution of the vector H depends on the transition matrix $P_\theta$; the terms in (4.13) will be either transition probabilities $\theta_{ij}$ or equilibrium probabilities $\theta_i$ derived from $P_\theta$.
For HMMs, likelihood based inference is carried out via Bayes Rule that allows the posterior
probability of the states in the latent sequence to be computed. The key quantity is the joint
conditional probability of the hidden states, given the observed sequence, that is p (h|x) where
$$f(h \mid x) = \frac{f(x \mid h)\, f(h)}{f(x)} \qquad (4.14)$$
suppressing the dependence on the other parameters, where the first term in the numerator comes from (4.12) and the second term comes from (4.13). The denominator is the joint (unconditional) probability of observing the data sequence x, which can be computed via the Total Probability result as
$$f(x) = \sum_{h} f(x \mid h)\, f(h) \qquad (4.15)$$
where the summation is over all possible state vector configurations. Inference will require efficient computational methods, as the summation in (4.15) and the maximizations that are required both involve large numbers of terms. Specifically, we need to
(i) compute the (unconditional) probability f(x) via the summation in (4.15), and
(ii) find the state vector that maximizes the joint conditional probability in (4.14), that is, $\hat{h} = \arg\max_h f(h \mid x)$.
The doubly-Markov model described above, that is, with a Markov structure in the observed
data and a Markov structure in the unobserved states is a model that requires much computational
effort. Typically, a simplification is made, in that the matrices $P_0, P_1, \ldots, P_{n_H}$ are assumed to be diagonal, that is, we may write $P_k = \mathrm{diag}(p^{(k)}_1, \ldots, p^{(k)}_{n_X})$ for $k = 0, 1, \ldots, n_H$, so that the observed data are conditionally independent given the unobserved states, and there is no dependence between characters in adjacent positions in the sequence. In this case the likelihood is formed (in the example) as
$$= \underbrace{p^{(0)}_{1} \times p^{(0)}_{2} \times p^{(0)}_{4} \times p^{(0)}_{2} \times p^{(0)}_{3} \times p^{(0)}_{1}}_{\text{Region type 0 (positions 1-6)}} \;\times\; \underbrace{p^{(1)}_{1} \times p^{(1)}_{2} \times p^{(1)}_{2} \times p^{(1)}_{3}}_{\text{Region type 1 (positions 7-10)}}$$
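Under this conditional-independence simplification the likelihood is just a product of per-position emission probabilities. A minimal sketch, with illustrative emission probabilities (none are specified in the notes), for the example sequence and latent configuration:

```python
import numpy as np

# Sketch: likelihood under the conditional-independence (diagonal P_k) simplification,
# a product of per-position emission probabilities. Emission probabilities are illustrative.
emission = {
    0: {"A": 0.30, "C": 0.30, "G": 0.20, "T": 0.20},   # region type 0 (non-coding)
    1: {"A": 0.20, "C": 0.35, "G": 0.35, "T": 0.10},   # region type 1 (coding)
}
observed = "ACTCGAACCG"
latent   = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

likelihood = np.prod([emission[h][c] for c, h in zip(observed, latent)])
print(likelihood)
```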
Even if this assumption is made, the computational task is still considerable. For example, for task (i), it is possible merely to list all possible state vector configurations that appear in the summand in (4.15), and to sum over them. However, this is a calculation requiring a large number of computations; for a sequence of length n, the direct calculation requires
$$2n \times (n_H + 1)^n$$
operations, and for a sequence of length 10 with $n_H = 1$ as in example 1, this number is $20 \times 2^{10} = 20480$, but this number increases quickly as the sequence length/number of region types increases. For example:
               nH = 1           nH = 2           nH = 3           nH = 4
n = 10     2.048 × 10^4     1.181 × 10^6     2.097 × 10^7     1.953 × 10^8
n = 100    2.535 × 10^32    2.103 × 10^50    3.321 × 10^62    4.158 × 10^72
n = 500    3.273 × 10^153   3.636 × 10^241   1.072 × 10^304   3.055 × 10^352
so even for moderate-sized problems the number of computations is large. Thus, instead of direct calculation, the Forward and Backward algorithms are used; see section B for full details.
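The entries of the table, and the contrast with the order $n \times n_H^2$ cost of the forward recursion quoted in section B, can be checked directly:

```python
# Sketch: operation counts -- direct summation over all latent configurations,
# 2n(nH + 1)^n, versus the order n * nH^2 cost of the forward recursion.
def direct_cost(n, nH):
    return 2 * n * (nH + 1) ** n

def forward_cost(n, nH):
    return n * nH ** 2

for n in (10, 100, 500):
    for nH in (1, 2, 3, 4):
        print(n, nH, f"{direct_cost(n, nH):.3e}", forward_cost(n, nH))
# n = 10, nH = 1 gives 2.048e+04, the first entry of the table above.
```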
For the next stage of the inferential process, it is required to compute the most likely sequence of unobserved states, given the observed data, that is,
$$\hat{h} = \arg\max_h f(h \mid x) = \arg\max_h \frac{f(x \mid h)\, f(h)}{f(x)} = \arg\max_h f(x \mid h)\, f(h) = \arg\max_h f(x, h)$$
This is achieved via the Viterbi Algorithm. The final stage involves parameter estimation, via the Baum-Welch Algorithm; see section B for full details.
the expression data for each of N genes, with ni replicate observations of a time series of length T
for gene i. The hybridization experiments are carried out under strict protocols, and every effort
is made to regularize the production procedures, from the preparation stage through to imaging.
Typically, replicate experiments are carried out; the same array gene/oligo set is used to investigate portions of the same test sample.
3. Discovery of gene clusters: the partitioning of large sets of genes into smaller sets that
have common patterns of regulation.
There are typically several key issues and models that arise in the analysis of microarray data; we
have previously studied these techniques in a general statistical context.
• array normalization: arrays are often imaged under slightly different experimental condi-
tions, and therefore the data are often very different even from replicate to replicate. This
is a systematic experimental effect, and therefore needs to be adjusted for in the analysis of
differential expression. A misdiagnosis of differential expression may be made purely due to
this systematic experimental effect.
• measurement error: the reported (relative) gene expression levels are in fact only proxies for the true level of gene expression in the sample. This requires a further level of variability to be incorporated into the model.
• modelling: the sources of variability present in the data can be explained using conventional statistical tools of linear and non-linear models. In addition, it may be necessary also to use mixed regression models, where gene-specific random-effects terms are incorporated into the model. For example, a common linear mixed model for non time-course data is as follows: for gene i under condition j, in replicate (array) l, we have that
$$y^{(l)}_{ij} = \alpha^{(l)} + \gamma_{ij} Z_{ij} + \varepsilon^{(l)}_{ij}$$
where $\alpha^{(l)}$ is an array effect, $\gamma_{ij}$ is a gene-specific (random) effect for gene i under condition j, $Z_{ij}$ is an indicator variable determining whether the ith gene is in fact differentially expressed under the jth condition, and $\varepsilon^{(l)}_{ij}$ is an uncorrelated random effect.
• testing: one- and two-sample hypothesis testing techniques, based on parametric and non-
parametric testing procedures can be used in the assessment of the presence of differential
expression. For detecting more complex (patterns of) differential expression, in more general
structured models, the tools of analysis of variance (ANOVA) can be used to identify the
chief sources of variability.
• classification: the genetic information contained in a gene expression profile derived from
microarray experiments for, say, an individual tissue or tumour type may be sufficient to
enable the construction of a classification rule that will enable subsequent classification of
new tissue or tumour samples.
• cluster analysis: the discovery of subsets of larger sets of genes that have common patterns of regulation can be achieved using the statistical techniques of cluster analysis (see section 4.9).
• computer-intensive inference: for many testing and estimation procedures needed for
microarray data analysis, simulation-based methods (bootstrap estimation, Monte Carlo and
permutation tests, Monte Carlo and Markov chain Monte Carlo) are often necessary to enable
the appropriate calibration of the inferences being made. This is especially true when complex
and hierarchical or multi-level models are used to represent the different sources of variability
in the data.
• experimental design: statistical experimental design can assist in determining the number
of replicates, the number of samples, the choice of time points at which the array data are
collected and many other aspects of microarray experiments. In addition, power and sample
size assessments can inform the experimenter as to the statistical worth of the microarray
experiments that have been carried out.
Typically, data derived from both types of microarray are highly corrupted by noise and artefacts. The statistical analysis of such data is therefore quite a challenging process. In many cases, the replicate
experiments are very variable. The other main difficulty that arises in the statistical analysis of
microarray data is the dimensionality; a vast number of gene expression measurements are available,
usually only on a relatively small number of individual observations or samples, and thus it is hard
to establish any general distributional models for the expression of a single gene.
Data sets for clustering of N observations can have either of the following structures:
• an N × p data matrix, where rows contain the different observations, and columns contain
the different variables.
• an N × N dissimilarity matrix, whose (i, j)th element is dij , the distance or dissimilarity
between observations i and j that has the properties
– dii = 0
– dij ≥ 0
– dji = dij
• Typical distance measures between two data points i and j with measurement vectors $x_i$ and $x_j$ include, for example, the Euclidean distance
$$d_{ij} = \left\{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2\right\}^{1/2}$$
For ordinal (categorical) or nominal (label) data, other dissimilarities can be defined.
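For numerical expression data, the Euclidean and Manhattan distances are the usual choices; a brief sketch constructing the corresponding N × N dissimilarity matrices (with randomly generated data standing in for a real N × p expression matrix) is:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Sketch: N x N dissimilarity matrices from an N x p data matrix
# using Euclidean and Manhattan (city-block) distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                  # illustrative data: N = 6, p = 4

D_euclid = squareform(pdist(X, metric="euclidean"))
D_manhat = squareform(pdist(X, metric="cityblock"))

# d_ii = 0, d_ij >= 0 and d_ij = d_ji, as required of a dissimilarity matrix
assert np.allclose(np.diag(D_euclid), 0.0) and np.allclose(D_euclid, D_euclid.T)
print(D_euclid.round(2))
```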
1. The k-Means algorithm: In the k-means algorithm the observations are classified as belong-
ing to one of k groups. Group membership is determined by calculating the centroid for
each group (the multidimensional version of the mean) and assigning each observation to the
group with the closest centroid. The k-means algorithm alternates between calculating the
centroids based on the current group memberships, and reassigning observations to groups
based on the new centroids. Centroids are calculated using least-squares, and observations
are assigned to the closest centroid based on least-squares. This assignment is performed in
an iterative fashion, either from a starting allocation or configuration, or from a set of starting
centroids.
2. Partitioning around medoids (PAM): The PAM method uses medoids rather than centroids (that is, medians rather than means in each dimension). This approach increases robustness relative to the least-squares approach given above. (A short sketch of the k-means alternation of item 1 is given below.)
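A minimal sketch of the k-means alternation described in item 1 (illustrative data; empty groups are handled crudely by retaining the old centroid):

```python
import numpy as np

# Sketch: the k-means alternation -- assign each observation to its closest
# centroid (squared Euclidean distance), then recompute each centroid as the
# group mean -- iterated from a set of starting centroids.
def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                     # assignment step
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:                       # keep old centroid if group is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):      # converged
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```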
• Heuristic Criteria The basic hierarchical agglomeration algorithm starts with each object
in a group of its own. At each iteration it merges two groups to form a new group; the merger
chosen is the one that leads to the smallest increase in the sum of within-group sums of
squares. The number of iterations is equal to the number of objects minus one, and at the
end all the objects are together in a single group. This is known as Ward’s method, the sum of
squares method, or the trace method. The hierarchical agglomeration algorithm can be used
with criteria other than the sum of squares criterion, such as the average, single or complete
linkage methods described below.
• Model-Based Criteria Model-based clustering is based on the assumption that the data are generated by a mixture of underlying probability distributions. Specifically, it is assumed that the population of interest consists of k different subpopulations, and that the density of an observation x from the kth subpopulation is $f_k(x; \theta_k)$ for some unknown vector of parameters $\theta_k$.
Hence, hierarchical clustering is a method of organizing a set of objects into sets of clusters using a similarity/discrepancy measure or some overall potential function. Agglomerative clustering initially places each of the N items in its own cluster. At the first level, two objects are clustered together, and the pair is selected such that the potential function increases by the largest amount, leaving N − 1 clusters, one with two members and the remaining N − 2 each with one. At the next level, the optimal configuration of N − 2 clusters is found by joining two of the existing clusters. This process continues until a single cluster remains containing all N items.
In conventional hierarchical clustering, the method of agglomeration or combining clusters is
determined by the distance between the clusters themselves, and there are several available choices.
For merging two clusters Ci and Cj , with N1 and N2 elements respectively, the following criteria
can be used
• In average (or average linkage) clustering, the two clusters that have the smallest average distance between the points in one cluster and the points in the other,
$$d(C_i, C_j) = \frac{1}{N_1 N_2} \sum_{k \in C_i,\, l \in C_j} d_{kl},$$
are merged.
• In connected (single linkage, nearest-neighbour) clustering, the two clusters that have the
smallest distance between a point in the first cluster and a point in the second cluster
are merged.
• In compact (complete linkage, furthest-neighbour) clustering, the two clusters that have the
largest distance between a point in the first cluster and a point in the second cluster
are merged.
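All of these agglomeration criteria are available in standard software; for example, using scipy (with illustrative data again), the merge tree and a cut into two clusters under Ward's, average, single and complete linkage can be obtained as follows.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Sketch: hierarchical agglomerative clustering under the criteria above.
# 'ward' is the sum-of-squares (Ward's) criterion; 'average', 'single' and
# 'complete' are average, connected and compact linkage respectively.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (15, 3)), rng.normal(4, 0.5, (15, 3))])

for method in ("ward", "average", "single", "complete"):
    Z = linkage(X, method=method)                      # (N-1) x 4 merge table
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into two clusters
    print(method, labels[:5], labels[-5:])
```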
$\gamma = (\gamma_1, \ldots, \gamma_N)$
then maximizing the likelihood is the same as minimizing the sum of within-group sums of squares that underlies Ward's method. Thus, Ward's method corresponds to the situation where clusters are hyperspherical with the same variance. If clusters are not of this kind (for example, if they are thin and elongated), Ward's method tends to break them up into hyperspherical blobs. Other forms of $\Sigma_k$ yield clustering methods that are appropriate in different situations. The key to specifying this is the eigendecomposition of $\Sigma_k$, given by eigenvalues $\lambda_1, \ldots, \lambda_p$ and eigenvectors $v_1, \ldots, v_p$, as in Principal Components Analysis (section 3.14.1, equation (3.7)). The eigenvectors of $\Sigma_k$ specify the orientation of the kth cluster, the largest eigenvalue $\lambda_1$ specifies
its variance or size, and the ratios of the other eigenvalues to the largest one specify its shape.
Further, if
$$\Sigma_k = \sigma^2_k I_p$$
the criterion corresponds to hyperspherical clusters of different sizes; this is known as the Spherical criterion.
Another criterion results from constraining only the shape to be the same across clusters. This is achieved by fixing the eigenvalue ratios
$$\alpha_j = \frac{\lambda_j}{\lambda_1}, \qquad j = 2, 3, \ldots, p$$
across clusters; common choices for the specification are
$$y_t = X_t \beta + \varepsilon_t$$
or, in vector-matrix form,
$$y = X\beta + \varepsilon \qquad (4.17)$$
which is a classical linear model. The precise form of the design matrix X is at the moment left unspecified. Typically we take the random error terms $\{\varepsilon_t\}$ as independent and identically distributed Normal variables with variance $\sigma^2$, implying that the conditional distribution of the responses Y is multivariate normal
$$Y \mid X, \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_T) \qquad (4.18)$$
where now X is T × p and $I_T$ is the T × T identity matrix. For this model, the maximum likelihood/ordinary least squares estimates of β and $\sigma^2$ are
$$\hat{\beta}_{ML} = \left(X^T X\right)^{-1} X^T y \qquad\qquad \hat{\sigma}^2 = \frac{1}{T - p}\,(y - \hat{y})^T (y - \hat{y})$$
for fitted values $\hat{y} = X\hat{\beta}_{ML} = X\left(X^T X\right)^{-1} X^T y$, as seen in Chapter 3.
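A direct numerical check of these formulae, with a simulated design matrix and response (purely illustrative), is:

```python
import numpy as np

# Sketch: ML / least squares estimates for y = X beta + eps,
#   beta_hat = (X'X)^{-1} X'y,  sigma2_hat = (y - y_hat)'(y - y_hat) / (T - p).
rng = np.random.default_rng(3)
T, p = 50, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, p - 1))])   # design matrix with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
sigma2_hat = (y - y_hat) @ (y - y_hat) / (T - p)
print(beta_hat, sigma2_hat)
```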
where $L(y; x, \beta, \sigma^2)$ is the likelihood function from section 3.3.1. In the linear model context, typically, a so-called conjugate prior specification is used where
$$p(\beta \mid \sigma^2) \equiv N(v, \sigma^2 V) \qquad\qquad p(\sigma^2) \equiv \text{InverseGamma}\left(\frac{\alpha}{2}, \frac{\gamma}{2}\right) \qquad (4.19)$$
(v is p × 1, V is p × p positive definite and symmetric, all other parameters are scalars) and, using this prior, standard Bayesian calculations show that conditional on the data
$$p(\beta \mid y, \sigma^2) \equiv N(v^*, \sigma^2 V^*) \qquad\qquad p(\sigma^2 \mid y) \equiv \text{InverseGamma}\left(\frac{T + \alpha}{2}, \frac{c + \gamma}{2}\right) \qquad (4.20)$$
where
$$V^* = \left(X^T X + V^{-1}\right)^{-1} \qquad\qquad v^* = \left(X^T X + V^{-1}\right)^{-1}\left(X^T y + V^{-1} v\right) \qquad (4.21)$$
$$c = y^T y + v^T V^{-1} v - \left(X^T y + V^{-1} v\right)^T \left(X^T X + V^{-1}\right)^{-1}\left(X^T y + V^{-1} v\right)$$
In regression modelling, it is usual to consider a centred parameterization for β so that v = 0, giving
$$v^* = \left(X^T X + V^{-1}\right)^{-1} X^T y$$
$$c = y^T y - y^T X\left(X^T X + V^{-1}\right)^{-1} X^T y = y^T\left(I_T - X\left(X^T X + V^{-1}\right)^{-1} X^T\right) y$$
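The posterior quantities in (4.20)-(4.21) under the centred parameterization are simple matrix computations; a sketch, with illustrative data and prior settings ($V = 10 I_p$, $\alpha = \gamma = 1$):

```python
import numpy as np

# Sketch: conjugate Bayesian update for the linear model with centred prior (v = 0),
# computing V*, v* and c of (4.20)-(4.21). Data and prior settings are illustrative.
rng = np.random.default_rng(3)
T, p = 50, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=T)

V = 10.0 * np.eye(p)                         # prior covariance scale for beta
alpha, gamma = 1.0, 1.0                      # InverseGamma prior parameters for sigma^2

V_star = np.linalg.inv(X.T @ X + np.linalg.inv(V))
v_star = V_star @ (X.T @ y)                  # posterior mean of beta (since v = 0)
c = y @ y - y @ X @ V_star @ X.T @ y         # residual term in the InverseGamma update

# beta | y, sigma^2 ~ N(v_star, sigma^2 V_star);
# sigma^2 | y ~ InverseGamma((T + alpha)/2, (c + gamma)/2)
print(v_star, (T + alpha) / 2, (c + gamma) / 2)
```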
The critical quantity in a Bayesian clustering procedure is the marginal likelihood or prior predictive distribution for the data in light of the model,
$$p(y) = \int p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, p(\sigma^2)\, d\beta\, d\sigma^2. \qquad (4.22)$$
• The Bayesian Information Criterion (BIC) A more reliable approximation to twice the log Bayes factor is the Bayesian Information Criterion which, for model M, is given by
$$\text{BIC}_M = 2 \log L_M(\hat{\theta}_M) - d_M \log N$$
where $L_M(\hat{\theta}_M)$ is the maximized likelihood under model M, $d_M$ is the number of parameters in model M and N is the number of observations.
[Figure 4.2: dendrograms from hierarchical clustering of the expression profiles under Average Linkage and Compact Linkage (leaf labels omitted).]
the organism from which they were derived. A new objective could be to allocate a novel gene and expression profile to one of the subsets, and to one of the clusters within that subset. Let $y^*$ denote a new profile to be classified, and $\xi^*$ be the binary classification-to-subset variable; then
$$P[\xi^* = i \mid y^*, y, z] \propto p(y^* \mid \xi^* = i, y, z)\, P[\xi^* = i \mid y, z] \qquad (4.24)$$
The two terms in (4.24) are to be determined on the basis of the clustering output.
CHAPTER A
STOCHASTIC PROCESSES AND RANDOM WALKS
The case a = −1 is relevant to previous discussions of the BLAST stochastic process {Sn } if
we consider the subsections determined by the ladder points; a random walk starting at the most
recent ladder point carries out an excursion that comes to a halt when the next ladder point is
reached. Thus, if the most recent ladder point is at position i and has cumulative score si , then
the new random walk $\{S'_n\}$ defined by
$$S'_{n-i} = S_n - s_i, \qquad n \geq i$$
starts at h = 0 and comes to a halt (is absorbed) at a = −1 when the next ladder point is reached.
(iii) The step sizes that have non-zero probability also have no common divisor other than 1.
- such a chain forms the basis for the PSI-BLAST analysis of protein sequences.
Again, assuming that such a random walk starts at h = 0, and is absorbed at state a = −1 or
reaches threshold b = y ≥ 1, it is straightforward to see that the random walk will come to a halt
at one of the states −c, −c + 1, ..., −1, y, y + 1, ..., y + d − 1, and if
Equation (A.5) can be used to compute θ for a given set of $p_j$. Clearly, $P_k$ for $k = -c, -c+1, \ldots, -1, y, y+1, \ldots, y+d-1$ depends on the threshold value y, but the limiting probability $R_j = \lim_{y\to\infty} P_j$ can be defined and computed, and used to derive the limiting expected absorption time
$$A = -\frac{1}{m_{STEP}} \sum_{j=1}^{c} j\, R_{-j} \qquad (A.6)$$
This general Markov random walk has properties that are essentially the same as for the simple random walk described above. For example, it can be shown that if Y is the value at the maximum state reached by the walk then, analogously to (A.4) above, by considering an absorbing state at threshold y and the probability of being absorbed at y, we have
$$P[Y \geq y] \approx C e^{-\theta y} \qquad (A.7)$$
The quantities $Q_1, Q_2, \ldots, Q_d$ are probabilities defined by the behaviour of an unrestricted general random walk with the step sizes defined above and with a negative expected step size (and hence a downward drift). In fact, for $k \geq 0$ the $Q_k$ satisfy
$$\sum_{k=1}^{d} Q_k e^{k\theta} = 1.$$
Finally Q is the probability that the random walk never reaches a positive state, that is
$$Q = 1 - Q_1 - Q_2 - \cdots - Q_d$$
Although these expressions are complicated, and require detailed computation, the most important
facts are contained in the formula in (A.7) and the knowledge that the constants C, θ and A can
be computed.
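For instance, θ can be obtained numerically as the positive solution of $\sum_j p_j e^{j\theta} = 1$, the type of calculation referred to via equation (A.5); the sketch below uses an illustrative ±1 step distribution with downward drift, for which the root is log(0.6/0.4).

```python
import numpy as np
from scipy.optimize import brentq

# Sketch: theta as the positive root of  sum_j p_j exp(j * theta) = 1
# for a random walk with step sizes j and probabilities p_j (negative mean step).
# The +/-1 step distribution below is illustrative.
steps = np.array([-1, +1])
probs = np.array([0.6, 0.4])                       # mean step = -0.2 < 0

f = lambda theta: np.sum(probs * np.exp(steps * theta)) - 1.0
theta = brentq(f, 1e-9, 10.0)                      # unique positive root under downward drift
print(theta, np.log(0.6 / 0.4))                    # both equal log(3/2) here
```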
CHAPTER B
ALGORITHMS FOR HMMs
that is, the joint probability of observing the actual data up to position t and having the region type at position t equal to i. Now, if, for all i, the values of α(n, i) are known, then the terms in (4.15) can be rewritten in terms of them. Moreover,
$$\alpha(t+1, i) = \sum_{j=0}^{n_H} P[X_1 = x_1, \ldots, X_{t+1} = x_{t+1}, H_{t+1} = h_{t+1} = i, H_t = h_t = j] \qquad (B.2)$$
using the Total Probability rule, partitioning with respect to the state in position t. However, using conditional probability arguments, the summand can be rewritten
which can be further simplified since, by assumption, the first term is merely
where
and
$$P[X_1 = x_1, \ldots, X_t = x_t, H_t = h_t = j] = \alpha(t, j) \qquad (B.5)$$
Hence combining (B.2)-(B.5) gives
$$\alpha(t+1, i) = \sum_{j=0}^{n_H} p^{(i)}_{x_{t+1}}\, \theta_{ji}\, \alpha(t, j) \qquad (B.6)$$
and so we have a recursion formula. In fact (B.1) and (B.6) combined give a method of computing the (conditional) likelihood
$$f(x \mid P) = \sum_{h} f(x \mid h, P)\, f(h \mid P) \qquad (B.7)$$
required for (i), which can be completed in $n \times n_H^2$ steps. This number is relatively small compared to $2n \times (n_H + 1)^n$.
and
$$\beta(t-1, i) = \sum_{j=0}^{n_H} p^{(j)}_{x_t}\, \theta_{ij}\, \beta(t, j)$$
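A compact implementation of the forward recursion (B.6) and the backward recursion above, for the simplified conditionally-independent emission model, is sketched below; the transition matrix, emission probabilities and initial state probabilities are illustrative, and the initialization α(1, i) = π_i p^{(i)}_{x_1} with initial state probabilities π_i is assumed.

```python
import numpy as np

# Sketch: forward and backward recursions for the simplified (conditionally
# independent emission) hidden Markov model; all numerical values are illustrative.
theta = np.array([[0.9, 0.1],                 # theta[i, j] = P(H_{t+1} = j | H_t = i)
                  [0.2, 0.8]])
emit = np.array([[0.30, 0.30, 0.20, 0.20],    # p^(0)_x for x in {A, C, G, T}
                 [0.20, 0.35, 0.35, 0.10]])   # p^(1)_x
pi = np.array([0.5, 0.5])                     # initial state probabilities
x = [0, 1, 3, 1, 2, 0, 0, 1, 1, 2]            # ACTCGAACCG coded A=0, C=1, G=2, T=3

n, nH = len(x), theta.shape[0]
alpha = np.zeros((n, nH))
alpha[0] = pi * emit[:, x[0]]                 # initialization, alpha(1, i)
for t in range(1, n):
    # forward step: alpha(t+1, i) = p^(i)_{x_{t+1}} sum_j theta_{ji} alpha(t, j), cf. (B.6)
    alpha[t] = emit[:, x[t]] * (alpha[t - 1] @ theta)

beta = np.ones((n, nH))
for t in range(n - 2, -1, -1):
    # backward step: beta(t, i) = sum_j theta_{ij} p^(j)_{x_{t+1}} beta(t+1, j)
    beta[t] = theta @ (emit[:, x[t + 1]] * beta[t + 1])

print(alpha[-1].sum())                        # f(x | P), from the forward pass
print((alpha[0] * beta[0]).sum())             # the same likelihood, via the backward pass
```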
The quantity required for step (ii) is
$$\hat{h} = \arg\max_h f(h \mid x) = \arg\max_h \frac{f(x \mid h)\, f(h)}{f(x)} = \arg\max_h f(x \mid h)\, f(h) = \arg\max_h f(x, h)$$
First, for each i define
$$\delta_1(i) = P[H_1 = h_1 = i, X_1 = x_1]$$
and
$$\delta_t(i) = \max_{h_1, \ldots, h_{t-1}} P[H_1 = h_1, \ldots, H_{t-1} = h_{t-1}, H_t = h_t = i, X_1 = x_1, \ldots, X_t = x_t]$$
so that $\delta_t(i)$ is the maximum probability, over all possible routes, of ending up in unobserved state i at time t. Then
$$\max_i \delta_n(i) = \max_{h_1, \ldots, h_n} P[H_1 = h_1, \ldots, H_n = h_n = i, X_1 = x_1, \ldots, X_n = x_n]$$
is the maximum probability, over all possible routes, of ending in unobserved state i at time n.
Secondly, compute the δs recursively; for each i define
$$\delta_1(i) = \theta_i\, p^{(i)}_{x_1}$$
and for t = 2, 3, ..., n, and $0 \leq j \leq n_H$,
$$\delta_t(j) = \max_i \delta_{t-1}(i)\, \theta_{ij}\, p^{(j)}_{x_t}$$
Finally, let
$$\hat{h}_n = \arg\max_i \delta_n(i)$$
and for t = n − 1, n − 2, ..., 2, 1 define
$$\hat{h}_t = \arg\max_i \delta_t(i)\, \theta_{i \hat{h}_{t+1}}$$
so that $\hat{h}_t$ for each t is the state that maximizes the joint probability. Eventually we have computed a vector
$$\hat{h} = \left(\hat{h}_1, \ldots, \hat{h}_n\right) = \arg\max_h f(x, h)$$
that is required for step (ii).
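The Viterbi recursion translates directly into code. The sketch below uses the same illustrative transition and emission matrices as the forward/backward sketch; log probabilities are used to avoid underflow, and stored back-pointers replace the arg max traceback written above (the two are equivalent).

```python
import numpy as np

# Sketch: Viterbi algorithm for the most probable latent sequence h-hat,
# following delta_t(j) = max_i delta_{t-1}(i) theta_{ij} p^(j)_{x_t}.
# All numerical values are illustrative; log probabilities avoid underflow.
theta = np.log(np.array([[0.9, 0.1], [0.2, 0.8]]))
emit = np.log(np.array([[0.30, 0.30, 0.20, 0.20],
                        [0.20, 0.35, 0.35, 0.10]]))
pi = np.log(np.array([0.5, 0.5]))
x = [0, 1, 3, 1, 2, 0, 0, 1, 1, 2]           # ACTCGAACCG coded A=0, C=1, G=2, T=3

n, nH = len(x), theta.shape[0]
delta = np.zeros((n, nH))
back = np.zeros((n, nH), dtype=int)
delta[0] = pi + emit[:, x[0]]                # delta_1(i)
for t in range(1, n):
    scores = delta[t - 1][:, None] + theta   # scores[i, j] = delta_{t-1}(i) + log theta_{ij}
    back[t] = scores.argmax(axis=0)          # best previous state for each current state j
    delta[t] = scores.max(axis=0) + emit[:, x[t]]

h_hat = np.zeros(n, dtype=int)
h_hat[-1] = delta[-1].argmax()               # h-hat_n = arg max_i delta_n(i)
for t in range(n - 2, -1, -1):
    h_hat[t] = back[t + 1][h_hat[t + 1]]     # trace back through the stored arg maxes
print(h_hat)
```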
• Initialization: choose initial values for $\theta_{ij}$, $p^{(i)}_j$ and $\pi_i$ from some appropriate probability distribution, or from prior knowledge of the modelling situation,
for i, j ∈ H, where the superscript (d) indicates calculation from the training data sample d.
From the conditional probability definition, this expression can be re-written
$$\xi^{(d)}_t(i, j) = \frac{P\left[q^{(d)}_t = i,\, q^{(d)}_{t+1} = j,\, \mathbf{Y}\right]}{P[\mathbf{Y}]} \qquad (B.9)$$
where the denominator can be computed using the Forward or Backward algorithm above, and the numerator can be calculated using the Forward and Backward variables α(., .) and β(., .) of the previous algorithms:
$$P\left[q^{(d)}_t = i,\, q^{(d)}_{t+1} = j,\, \mathbf{Y}\right] = \alpha(t, i)\, \theta_{ij}\, p^{(d,j)}_{y_{t+1}}\, \beta(t+1, j)$$
where $p^{(d,j)}_{y_{t+1}}$ is the probability of observing character $y_{t+1}$ in position t + 1 in region type j in the training data sample d. Let
$$I^{(d)}_t(i) = \begin{cases} 1 & \text{if } q^{(d)}_t = i \\ 0 & \text{otherwise} \end{cases}$$
be an indicator random variable. Then the number of times region type i is observed in the training sample is
$$\sum_{d} \sum_{t} I^{(d)}_t(i)$$
(recall d indexes training sample sequences) and the expected number of times is
$$\sum_{d}\sum_{t} E\left[I^{(d)}_t(i) \mid Q^{(d)}\right] = \sum_{d}\sum_{t} P\left[I^{(d)}_t(i) = 1 \mid Q^{(d)}\right] = \sum_{d}\sum_{t} P\left[q^{(d)}_t = i \mid Q^{(d)}\right]$$
and hence the expected number of times region type i is observed in the training sample is
$$\sum_{d}\sum_{t}\sum_{j=1}^{n_H} \xi^{(d)}_t(i, j). \qquad (B.10)$$
Similarly, the expected number of transitions from region type i to region type j is
$$\sum_{d}\sum_{t} \xi^{(d)}_t(i, j) \qquad (B.11)$$
These formulae can be substituted into (B.8) to compute the iterative procedure. The only remaining quantity to be estimated is
$$E\left[N_i(j) \mid \mathbf{Y}^{(D)}\right]$$
that appears in the numerator in the final iterative formula for $\hat{p}^{(i)}_j$. This is estimated in a similar fashion to the other quantities; let
similar fashion to the other quantities; let
( (d) (d)
(d) 1 if qt = iand Yt = j
It (i, j) =
0 otherwise
be the indicator variable that is equal to one if, for training sample d, character j occurs in
region type i at position t. Then
$$E\left[N_i(j) \mid \mathbf{Y}^{(D)}\right] = \sum_{d} \sum_{t:\, Y^{(d)}_t = j} \sum_{j=1}^{n_H} \xi^{(d)}_t(i, j)$$