Statistics
With a view towards the natural sciences
Lecture notes
The present lecture notes have been developed over the last couple of years for a
course aimed primarily at the students taking a Master’s in bioinformatics at the
University of Copenhagen. There is an increasing demand for a general introductory
statistics course at the Master’s level at the university, and the course has also
become a compulsory course for the Master’s in eScience. Both programmes emphasize a computational and data-oriented approach to science – in particular the natural
sciences.
The aim of the notes is to combine the mathematical and theoretical underpinning
of statistics and statistical data analysis with computational methodology and prac-
tical applications. Hopefully the notes pave the way for an understanding of the
foundation of data analysis with a focus on the probabilistic model and the method-
ology that we can develop from this point of view. In a single course there is no
hope that we can present all models and all relevant methods that the students will
need in the future, and for this reason we develop general ideas so that new models
and methods can be more easily approached by students after the course. We can,
on the other hand, not develop the theory without a number of good examples to illustrate its use. Due to the history of the course, most examples in the notes are biological in nature, but they span a range of different areas, from molecular biology and biological sequence analysis through molecular evolution and genetics to toxicology and various assay procedures.
Students who take the course are expected to become users of statistical methodology
in a subject matter field and potentially also developers of models and methodology
in such a field. It is therefore intentional that we focus on the fundamental principles
and develop these principles, which are by nature mathematical. Advanced mathematics is, however, kept out of the main text. Instead, a number of math boxes can be found in the notes, in which relevant but mathematically more sophisticated issues are treated. The main text does not depend on results developed in
the math boxes, but the interested and capable reader may find them illuminating.
The formal mathematical prerequisites for reading the notes are a standard calculus
course in addition to a few useful mathematical facts collected in an appendix. The
reader who is not so accustomed to the symbolic language of mathematics may,
however, find the material challenging to begin with.
To fully benefit from the notes it is also necessary to obtain and install the statisti-
cal computing environment R. It is evident that almost all applications of statistics
today require the use of computers for computations and very often also simula-
tions. The program R is a free, full-fledged programming language and should be
regarded as such. Previous experience with programming is thus beneficial but not
necessary. R is a language developed for statistical data analysis and it comes with
a huge number of packages, which makes it a convenient framework for handling
most standard statistical analyses, for implementing novel statistical procedures, for
doing simulation studies, and last but not least it does a fairly good job at producing
high quality graphics.
We all have to crawl before we can walk – let alone run. We begin the notes with
the simplest models but develop a sustainable theory that can embrace the more
advanced ones too.
Last, but not least, I owe special thanks to Jessica Kasza for detailed comments on
an earlier version of the notes and for correcting a number of grammatical mistakes.
November 2010
Niels Richard Hansen
Contents

1 Introduction
1.1 Notion of probabilities
1.2 Statistics and statistical models

2 Probability Theory
2.1 Introduction
2.2 Sample spaces
2.3 Probability measures
2.4 Probability measures on discrete sets
2.5 Descriptive methods
2.5.1 Mean and variance
2.6 Probability measures on the real line
2.7 Descriptive methods
2.7.1 Histograms and kernel density estimation
2.7.2 Mean and variance
2.7.3 Quantiles
2.8 Conditional probabilities and independence
2.9 Random variables
2.9.1 Transformations of random variables
2.10 Joint distributions, conditional distributions and independence

A R
A.1 Obtaining and running R
A.2 Manuals, FAQs and online help
A.3 The R language, functions and scripts
A.3.1 Functions, expression evaluation, and objects
A.3.2 Writing functions and scripts
A.4 Graphics
A.5 Packages
A.5.1 Bioconductor
A.6 Literature
A.7 Other resources

B Mathematics
B.1 Sets
B.2 Combinatorics
B.3 Limits and infinite sums
B.4 Integration
B.4.1 Gamma and beta integrals
B.4.2 Multiple integrals
1 Introduction
Flipping coins and throwing dice are two commonly occurring examples in an in-
troductory course on probability theory and statistics. They represent archetypical
experiments where the outcome is uncertain – no matter how many times we roll
the dice we are unable to predict the outcome of the next roll. We use probabilities
to describe the uncertainty; a fair, classical dice has probability 1/6 for each side to
turn up. Elementary probability computations can to some extent be handled based
on intuition, common sense and high school mathematics. In the popular dice game
Yahtzee the probability of getting a Yahtzee (five of a kind) in a single throw is for
instance
6/6^5 = 1/6^4 = 0.0007716.
The argument for this and many similar computations is based on the pseudo theorem
that the probability for any event equals
(number of favourable outcomes) / (number of possible outcomes).
Getting a Yahtzee consists of the six favorable outcomes with all five dice facing the
same side upwards. We call the formula above a pseudo theorem because, as we will
show in Section 2.4, it is only the correct way of assigning probabilities to events
under a very special assumption about the probabilities describing our experiment.
The special assumption is that all outcomes are equally probable – something we
tend to believe if we don’t know any better, or can see no way that one outcome
should be more likely than others.
However, without some training most people will either get it wrong or have to give
up if they try computing the probability of anything except the most elementary
events – even when the pseudo theorem applies. There exist numerous tricky prob-
ability questions where intuition somehow breaks down and wrong conclusions can
be drawn if one is not extremely careful. A good challenge could be to compute the
probability of getting a Yahtzee in three throws with the usual rules and provided
that we always hold as many equal dice as possible.
Figure 1.1: The relative frequency of times that the dice sequence ¥ ¨ § comes out before the sequence ¦ ¥ ¨ as a function of the number of times the dice game has been played.
The Yahtzee problem can in principle be solved by counting – simply write down all
combinations and count the number of favorable and possible combinations. Then
the pseudo theorem applies. It is a futile task but in principle a possibility.
In many cases it is, however, impossible to rely on counting – even in principle. As
an example we consider a simple dice game with two participants: First I choose
a sequence of three dice throws, ¥ ¨ §, say, and then you choose ¦ ¥ ¨, say. We
throw the dice until one of the two sequences comes out, and I win if ¥ ¨ § comes
out first and otherwise you win. If the outcome is
¨ª¦§©¥§¦¨©¥¨§
then I win. It is natural to ask with what probability you will win this game. In
addition, it is clearly a quite boring game, since we have to throw a lot of dice and
simply wait for one of the two sequences to occur. Another question could therefore
be to ask how boring the game is. Can we for instance compute the probability of
having to throw the dice more than 100, or perhaps 500, times before any of the
two sequences shows up? The problem that we encounter here is first of all that the
pseudo theorem does not apply simply because there is an infinite number of favor-
able as well as possible outcomes. The event that you win consists of the outcomes
being all finite sequences of throws ending with ¦ ¥ ¨ without ¥ ¨ § occurring
somewhere as three subsequent throws. Moreover, these outcomes are certainly not
equally probable. By developing the theory of probabilities we obtain a framework
for solving problems like this and doing many other even more subtle computations.
And if we cannot compute the solution we might be able to obtain an answer to
our questions using computer simulations. Moreover, the notes introduce probabil-
ity theory as the foundation for doing statistics. The probability theory will provide
a framework, where it becomes possible to clearly formulate our statistical questions
and to clearly express the assumptions upon which the answers rest.
Figure 1.2: Playing the dice game 5000 times, this graph shows how the games are distributed according to the number of times, n, we had to throw the dice before one of the sequences ¥ ¨ § or ¦ ¥ ¨ occurred.
Enough about dice games! After all, these notes are about probability theory and
statistics with applications to the natural sciences. Therefore we will try to take
examples and motivations from real biological, physical and chemical problems, but
it can also be rewarding intellectually to focus on simple problems like those from
a dice game to really understand the fundamental issues. Moreover, some of the
questions that we encounter, especially in biological sequence analysis, are similar
in nature to those we asked above. If we don’t know any better – and in many cases
we don’t – we may regard a sequence of DNA as simply being random. If we don’t
have a clue about whether the sequence encodes anything useful, it may just as well
be random evolutionary leftovers. Whether this is the case for (fractions of) the
intergenic regions, say, of eukaryote genomes is still a good question. But what do
we actually mean by random? A typical interpretation of a sequence being random
is to regard the DNA sequence as the outcome of throwing a four sided dice, with
sides A, C, G, and T, a tremendously large number of times.
One purpose of regarding DNA sequences as being random is to have a background
model. If we have developed a method for detecting novel protein coding genes, say,
in DNA sequences, a background model for the DNA that is not protein coding is
useful. Otherwise we cannot tell how likely it is that our method finds false protein
coding genes, i.e. that the method claims that a segment of DNA is protein coding
even though it is not. Thus we need to compute the probability that our method
claims that a random DNA sequence – with random having the meaning above – is
a protein coding gene. If this is unlikely, we believe that our method indeed finds
truly protein coding genes.
A simple gene finder can be constructed as follows: After the start codon, ATG, a
number of nucleotides occur before one of the stop codons, TAA, TAG, TGA is reached
for the first time. Our protein coding gene finder then claims that if more than 99
(33 codons) nucleotides occur before any of the stop codons is reached then we have
a gene. So what is the chance of getting more than 99 nucleotides before reaching a
stop codon for the first time? The similarity to the question of determining how boring our little dice game is should be clear. The sequence of nucleotides occurring between a
start and a stop codon is called an open reading frame, and what we are interested
in is thus how the lengths of open reading frames are distributed in random DNA
sequences.
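To make this concrete, here is a small simulation sketch in R – not the notes' own gene finder, and assuming the simple model where each of the four letters occurs independently with probability 1/4 – that records how many codons occur before the first stop codon:

set.seed(1)                                  # for reproducibility
stop.codons <- c("TAA", "TAG", "TGA")
codons.before.stop <- function() {
  n <- 0
  repeat {                                   # draw codons until a stop codon appears
    codon <- paste(sample(c("A", "C", "G", "T"), 3, replace = TRUE), collapse = "")
    if (codon %in% stop.codons) return(n)
    n <- n + 1
  }
}
orf.lengths <- replicate(10000, codons.before.stop())
mean(orf.lengths > 33)                       # fraction of reading frames longer than 33 codons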
The probability theory provides the tools for computing probabilities. If we know
the probability measure – the assignment of probabilities to events – the rest is in
principle just mathematical deduction. Sometimes this is what we need, but in many
cases we are actually interested in questions that go beyond what we can obtain by
pure deduction. Situations where we need the interplay between data – experimental
data or computational data – and the probability theory, are where statistics comes
into the picture. Hopefully the reader will realize upon reading these notes that the
four sided dice model of DNA is rarely useful quantitatively, but may provide some
qualitative insight into more practical problems we really want to solve.
How do we find the mechanisms that generated those outcomes, and how do we under-
stand those mechanisms? What brings the topics of probability theory and statistics
together is the use of probability theory to model the generating mechanisms, and
the corresponding mathematical framework that comes along provides us with meth-
ods for doing inductive inference and for understanding and interpreting the results.
Probability measures have the ability to capture unpredictable or uncontrollable
variation (randomness) together with systematic variability, and are therefore ideal
objects to model almost any kind of experimental data – with “experiments” to be
understood in the widest possible sense. Theoretical statistics offers a range of sta-
tistical models, which for the present section can be thought of as families of probability
measures, and methods for transferring data into probability measures. The transfer
of data into a probability measure is a cornerstone in statistics, which is often called
statistical inference. The primary aim of the notes is to give a solid understanding of
the concept of a statistical model based on probability theory and how the models
can be used to carry out the inductive step of doing statistical inference.
To get a good understanding of what a statistical model is, one needs to see some
examples. We provide here a number of examples that will be taken up and recon-
sidered throughout the notes. In this section we describe some of the background
and the experiments behind the data. In the rest of the notes these examples will
be elaborated on to give a fully fledged development of the corresponding statistical
models and inference methods as the proper theoretical tools are developed.
Example 1.2.1 (Neuronal interspike times). Neuron cells in the brain are very well
studied and it is known that neurons transmit electrochemical signals. Measurements
of a cell's membrane potential show how the membrane potential can activate voltage-gated ion channels in the cell membrane and trigger an electrical signal known as a
spike.
At the most basic level it is of interest to understand the interspike times, that
is, the times between spikes, for a single neuron in a steady state situation. The
interspike times behave in an intrinsically stochastic manner meaning that if we
want to describe the typical interspike times we have to rely on a probabilistic
description.
A more ambitious goal is to relate interspike times to external events such as visual
stimuli and another goal is to relate the interspike times of several neurons. ⋄
Example 1.2.2 (Motif counting). Special motifs or patterns in the genomic DNA-
sequences play an important role in the development of a cell – especially in the way
the regulation of gene expression takes place. A motif can be a small word, in the
DNA-alphabet, or a collection of words as given by for instance a regular expres-
sion or by other means. One or more proteins involved in the expression regulation
mechanisms are then capable of binding to the DNA-segment(s) that corresponds
to the word or word collection (the binding site), and in doing so either enhance or
suppress the expression of a protein coding gene, say.
Example 1.2.3 (Forensic Statistics). In forensic science one of the interesting prob-
lems from the point of view of molecular biology and statistics is the ability to iden-
tify the person who committed a crime based on DNA-samples found at the crime
scene. One approach known as the short tandem repeat (STR) analysis is to con-
sider certain specific tandem repeat positions in the genome and count how many
times the pattern has been repeated. The technique is based on tandem repeats with
non-varying flanking regions to identify the repeat but with the number of pattern
repeats varying from person to person. These counts of repeat repetitions are useful
as “genetic fingerprints” because the repeats are not expected to have any function
and the mutation of repeat counts is therefore neutral and not under selective pres-
sure. Moreover, the mutations occur (and have occurred) frequently enough so that
there is a sufficient variation in a population for discriminative purposes. It would
be of limited use if half the population, say, have a repeat count of 10 with the other
half having a repeat count of 11.
Without going into too many technical details, the procedure for a DNA-sample
from a crime scene is quite simple. First one amplifies the tandem repeat(s) using
PCR with primers that match the flanking regions, and second, one extracts the
sequence for each of the repeats of interest and simply counts the number of repeats.
Examples of STRs used include TH01, which has the pattern AATG and occurs in
intron 1 of the human tyrosine hydroxylase gene, and TPOX, which has the same
repeat pattern but is located in intron 10 of the human thyroid peroxidase gene.
One tandem repeat is not enough to uniquely characterize an individual, so several
tandem repeats are used. A major question remains. Once we have counted the
number of repeats for k, say, different repeat patterns we have a vector (n1 , . . . , nk )
of repeat counts. The Federal Bureau of Investigation (FBI) uses for instance a
standard set of 13 specific STR regions. If a suspect happens to have an identical
vector of repeat counts – the same genetic fingerprint – we need to ask ourselves
what the chance is that a “random” individual from the population has precisely this
genetic fingerprint. This raises a number of statistical questions. First, what kind of
random procedure is the most relevant – a suspect is hardly selected completely at
random – and second, what population is going to be the reference population? And
even if we can come up with a bulletproof solution to these questions, it is a huge
task and certainly not a practical solution to go out and count the occurrences of
the fingerprint (n1 , . . . , nk ) in the entire population. So we have to rely on smaller
samples from the population to estimate the probability. This will necessarily involve
model assumptions – assumptions on the probabilistic models that we will use.
One of the fundamental model assumptions is the classical Hardy-Weinberg equilib-
rium assumption, which is an assumption about independence of the repeat counts
at the two different chromosomes in an individual. ⋄
Example 1.2.4 (Sequence evolution). We will consider some of the most rudimen-
tary models for evolution of biological sequences. Sequences evolve according to a
complicated interaction of random mutations and selection, where the random mu-
tations can be single nucleotide substitutions, deletions or insertions, or higher order
events like inversions or crossovers. We will only consider the substitution process.
Thus we consider two sequences, DNA or proteins, that are evolutionarily related via
a number of nucleotide or amino acid substitutions. We will regard each nucleotide
position (perhaps each codon) or each amino acid position as unrelated to each other,
meaning that the substitution processes at each position are independent. We are
interested in a model of these substitution processes, or alternatively a model of the
evolutionarily related pairs of nucleotides or amino acids. We are especially interested
in how the evolutionary distance – measured for instance in calendar time – enters
into the models.
The models will from a biological point of view be very simplistic and far from
realistic as models of real sequence evolution processes. However, they form the
starting point for more serious models, if one, for instance, wants to enter the area
of phylogenetics, and they are well suited to illustrate and train the fundamental
concepts of a statistical model and the methods of statistical inference.
Obtaining data is also a little tricky, since we can rarely go out and read off evolutionarily related sequences where we know the relation “letter by letter” – such relations
are on the contrary established computationally using alignment programs. How-
ever, in some special cases, one can actually observe real evolution as a number
of substitutions in the genome. This is for instance the case for rapidly evolving
RNA-viruses.
Such a dataset was obtained for the H strain of the hepatitis C virus (HCV) (Ogata
et al., Proc. Natl. Acad. Sci., 1991 (88), 3392-3396). A patient, called patient H, was
infected by HCV in 1977 and remained infected at least until 1990 – for a period
of 13 years. In 1990 a research group sequenced three segments of the HCV genome
obtained from plasma collected in 1977 as well as in 1990. The three segments, de-
noted segment A, B and C, were all directly alignable without the need to introduce
Position 42 275 348 447 556 557 594 652 735 888 891 973 979 1008 1011 1020 1050 1059 1083 1149 1191 1195 1224 1266
H77 G C C A G C C C T C T G G C G C T T C T T T T A
H90 A T T G A T T T C T C A A T A T A C T C C A C G
Table 1.1: The segment position and nucleotides for 24 mutations on segment A of
the hepatitis C virus.
insertions or deletions. The lengths of the three segments are 2610 (A), 1284 (B)
and 1029 (C) respectively.
Segment A (H77 rows, H90 columns):
       A    C    G    T
  A    ·    1   11    1
  C    4    ·    1   20
  G   13    3    ·    1
  T    3   19    1    ·

Segment B (H77 rows, H90 columns):
       A    C    G    T
  A    ·    0    5    0
  C    1    ·    0    8
  G    1    1    ·    1
  T    2    6    0    ·

Segment C (H77 rows, H90 columns):
       A    C    G    T
  A    ·    1    2    0
  C    1    ·    2    5
  G    4    0    ·    0
  T    1    3    1    ·

Table 1.2: Tabulation of all mutations in the three segments A, B and C of the hepatitis C virus genome from the 1977 H strain to the 1990 H strain.
In Table 1.1 we see the position for the first 24 mutations as read from the 5’-end of segment A out of the total of 78 mutations on segment A. In Table 1.2 we have tabulated all the mutations in the three segments. ⋄
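For later computations it can be convenient to have the counts available in R. The following sketch (the matrix layout, with H77 nucleotides as rows and H90 nucleotides as columns, is our own choice) enters the segment A counts from Table 1.2:

## Substitution counts for segment A; diagonal entries are zero since only
## positions that mutated are tabulated.
segA <- matrix(c( 0,  1, 11,  1,
                  4,  0,  1, 20,
                 13,  3,  0,  1,
                  3, 19,  1,  0),
               nrow = 4, byrow = TRUE,
               dimnames = list(H77 = c("A", "C", "G", "T"),
                               H90 = c("A", "C", "G", "T")))
sum(segA)    # 78, the total number of mutations on segment A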
Example 1.2.5 (Toxicology). “Alle Dinge sind Gift und nichts ist ohne Gift; allein die Dosis macht, dass ein Ding kein Gift ist.” (“All things are poison and nothing is without poison; only the dose makes a thing not a poison.”) Theophrastus Philippus Aureolus Bombastus von Hohenheim (1493-1541). All compounds are toxic in the right dose – the
question is just what the dose is.
The dose-response experiment is a classic. In a controlled experiment, a collection of
organisms is subdivided into different groups and each group is given a specific dose
of a compound under investigation. We focus here on whether the concentration is
lethal, and thus subsequently we count the number of dead organisms in each group.
How can we relate the dose to the probability of death? What is the smallest con-
centration where there is more than 2% chance of dying? What is the concentration
where 50% of the organisms die – the so-called LD50 value?
We want a model for each of the groups that captures the probability of dying in that
particular group. In addition, we want to relate this probability across the groups
to the concentration – or dose – of the compound.
Figure 1.3 shows the data from an experiment with flies that are given different doses
of the insecticide dimethoat. The figure also shows a curve that is inferred from the
given dataset, which can be used to answer the questions phrased above. A major
topic of the present set of notes is to develop the theory and methodology for how
such a curve is inferred and how we assess the uncertainty involved in reporting e.g.
the inferred value of LD50.
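As a hedged preview of the kind of methodology developed later in the notes – the numbers below are made up for illustration, and this is not necessarily how the curve in Figure 1.3 was produced – such a dose-response curve can be fitted in R with a logistic regression:

logconc <- c(-4, -3, -2, -1, 0)              # log(concentration), hypothetical values
dead    <- c( 1,  4, 12, 18, 20)             # dead flies per group, hypothetical counts
alive   <- 20 - dead                         # assuming 20 flies per group
fit <- glm(cbind(dead, alive) ~ logconc, family = binomial)
predict(fit, newdata = data.frame(logconc = -2.5), type = "response")  # fitted death probability
-coef(fit)[1] / coef(fit)[2]                 # log(concentration) where the fitted probability is 0.5 (LD50)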
Figure 1.3: The relative frequency of dead flies plotted against the log(concentration) of the insecticide dimethoat together with a curve inferred from the data that provides a relation between the probability of fly death and the concentration.
In a location-scale model an observation is represented as µ + σε, with a location parameter µ, a scale parameter σ > 0 and a residual term ε, whose distribution is sometimes taken to be the standard normal distribution, but which may in principle be anything. Specification of the residual distribution or clearly announcing the lack of such a specification is a part of the location-scale model. ⋄
The raw data from such a microarray experiment is an image, which shows light intensities measured over the slide.
The image is preprocessed to yield a single light intensity measurement for each
probe on the slide. It should be emphasized that there are numerous details in
the experimental design that can give rise to errors of very different nature. We
are not going to focus on these details, but one remark is appropriate. For this
technology it is not common to infer standard curves from data for each array.
Consequently it is not possible to obtain absolute concentration measurements, only relative measurements. In the comparison between measurements on different arrays,
elaborate normalization techniques are necessary to replace the lack of knowledge
of standard curves. Still we will only be able to discuss relative differences between
measurements across arrays and not absolute differences.
The resulting dataset from a microarray experiment is a high-dimensional vector of
measurements for each gene represented on the array. A typical experimental design
will include several arrays consisting of measurements for several individuals, say,
that may be grouped for instance according to different disease conditions. The mi-
croarray technology presents us with a lot of statistical challenges. High-dimensional
data, few replications per individual (if any), few individuals per group etc., and all this gives rise to high uncertainty in the statistical procedures for inference from
the data. This is then combined with the desire to ask biologically quite difficult
questions. Indeed a challenging task. We will in these notes mostly use microarray
data to illustrate various simple models on a gene-by-gene level. That is, we essentially consider each gene represented on the array independently. This is an im-
portant entry to the whole area of microarray data analysis, but these considerations
are only the beginning of a more serious analysis.
The use of microarrays for gene expression measurements is a well developed tech-
nique. It is, however, challenged by other experimental techniques. At the time of
writing the so-called next-generation sequencing techniques promise to replace mi-
croarrays in the near future. Though fundamentally different from a technical point
of view, the nature of the resulting data offers many of the same challenges as men-
tioned above – high-dimensional data with few replications per individual, and still
the biological questions are difficult. The hope is that the sequencing based tech-
niques have smaller measurement noise and that the need for normalization is less
pronounced. In other words, that we get closer to a situation where a single standard
curve is globally valid. Whether this is actually the case only time will show. ⋄
2 Probability Theory
2.1 Introduction
Probability theory provides the foundation for doing statistics. It is the mathemat-
ical framework for discussing experiments with an outcome that is uncertain. With
probability theory we capture the mathematical essence of the quantification of un-
certainty by abstractly specifying what properties such a quantification should have.
Subsequently, based on the abstract definition we derive properties about probabil-
ities and give a number of examples. This approach is axiomatic and mathematical
and the mathematical treatment is self-contained and independent of any interpre-
tation of probabilities we might have. The interpretation is, however, what gives
probability theory its special flavor and makes it applicable. We give a mathemat-
ical presentation of probability theory to develop a proper language, and to get
accustomed to the vocabulary used in probability theory and statistics. However, we
cannot and will not try to derive everything we need along the way. Derivations of
results are made when they can illustrate how we work with the probabilities and
perhaps illuminate the relation between the many concepts we introduce.
We will throughout use E to denote a set, called the sample space, such that elements
x ∈ E represent the outcome of an experiment we want to conduct. We use small
letters like x, y, z to denote elements in E. An event, A ⊆ E, is a subset of E, and
we will use capital letters like A, B, C to denote events. We use the word experiment
in a wide sense. We may have a real wet lab experiment in mind or another classical
empirical data collection process in mind. But we may also have a database search or
some other algorithmic treatment of existing data in mind – or even an experiment
Figure 2.1: A raw mass spectrum. The sample is regarded as a (peaky and oscillating) continuous intensity of molecules as a function of the ratio between the mass and charge of the molecule (the m/z-ratio).
After scanning the array one obtains a light intensity as a function of the position on the
slide. The light intensity in the vicinity of a given probe is a measure of the amount
of RNA that binds to the probe and thus a measure of the (relative) amount of
that particular RNA-molecule in the sample. The sample space is a set of two-
dimensional functions – the set of potential light intensities as a function of position
on the slide. In most cases the actual outcome of the experiment is not stored, but
only some discretized version or representation of the outcome. This is mostly due
to technical reasons, but it may also be due to preprocessing of the samples by
computer programs. For microarrays, the light intensities are first of all stored as
an image of a certain resolution, but this representation is typically reduced even
further to a single quantity for each probe on the array.
Though the choice of sample space in many cases is given by the experiment we
consider, we may face situations where several choices seem appropriate. The raw
data from a microarray experiment is an image, but do we want to model the ac-
tual image produced from the scanning? Or are we satisfied with a model of the
summarized per probe measurements? In particular when we encounter complicated
measurement processes we may in practice need different preprocessing or adapta-
tion steps of the collected data before we actually try to model and further analyze
the data. It is always good practice then to clearly specify how these preprocessing
steps are carried out and what the resulting final sample space and thus the actual
data structure is.
If we are about to conduct an uncertain experiment with outcome in the sample space
E we use probabilities to describe the result of the experiment prior to actually
performing the experiment. Since the outcome of the experiment is uncertain we
cannot pinpoint any particular element x ∈ E and say that x will be the outcome.
Rather, we assign to any event A ⊆ E a measure of how likely it is that the event
will occur, that is, how likely it is that the outcome of the experiment will be an
element x belonging to A.
A probability measure P assigns to every event A ⊆ E a probability P(A) ∈ [0, 1] such that
(i) P(E) = 1,
(ii) P(A ∪ B) = P(A) + P(B) whenever A and B are disjoint events (additivity).
We will throughout use the name probability distribution, or just distribution, inter-
changeably with probability measure. We can immediately derive a number of useful
and direct consequences of the definition.
For any event A, we can write E = A ∪ Ac with A and Ac disjoint, hence by additivity
1 = P(E) = P(A) + P(Ac),
so that P(Ac) = 1 − P(A). Since E c = ∅, it follows in particular that
P(∅) = 1 − P(E) = 0.
If A ⊆ B are two events, then B = A ∪ (B\A) with A and B\A disjoint, hence by additivity P(B) = P(A) + P(B\A), and in particular
P(A) ≤ P(B). (2.1)
Finally, if A, B ⊆ E are any two events – not necessarily disjoint – then with C = A ∩ B we have that A = (A\C) ∪ C with A\C and C disjoint, thus by additivity
P(A\C) = P(A) − P(C) = P(A) − P(A ∩ B).
Moreover,
A ∪ B = (A\C) ∪ B
with the two sets on the right hand side being disjoint, thus by additivity again
P (A ∪ B) = P (A\C) + P (B)
= P (A) + P (B) − P (A ∩ B). (2.2)
Intuitively speaking, the result states that the probability of the union A ∪ B is the
sum of the probabilities for A and B, but when the sets are not disjoint we have
“counted” the probability of the intersection A ∩ B twice. Thus we have to subtract
it.
We summarize the results derived: P(∅) = 0, P(Ac) = 1 − P(A), P(A) ≤ P(B) whenever A ⊆ B, and P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any two events A and B.
Probability measures are usually required to be σ-additive, meaning that for any infinite sequence A1 , A2 , A3 , . . . of disjoint events it holds that
P(A1 ∪ A2 ∪ A3 ∪ . . .) = P(A1) + P(A2) + P(A3) + . . . .
In the fully rigorous mathematical theory the events are, moreover, restricted to a collection of subsets of E, a so-called σ-algebra, which must satisfy
• if A belongs to the collection then so does Ac
• if A1 , A2 , . . . belong to the collection then so does A1 ∪ A2 ∪ . . .
Note that the abstract definition of a probability measure doesn’t say anything about
how to compute the probability of an event in a concrete case. But we are on the
other hand assured that if we manage to come up with a probability measure, it
assigns a probability to any event, no matter how weird it may be – even in cases where there is no chance that we would ever be able to come up with an analytic computation of the resulting probability. Moreover, any general relation or result
derived for probability measures, such as those derived above, apply to the concrete
situation.
The assignment of a probability P (A) to any event A is a quantification of the
uncertainty involved in our experiment. The closer to one P (A) is the more certain
it is that the event A occurs and the closer to zero the more uncertain. Some people,
especially those involved in gambling, find it useful to express uncertainty in terms
of odds. Given a probability measure P we define for any event A the odds that A
occurs as
ξ(A) = P(A) / (1 − P(A)) = P(A) / P(Ac). (2.4)
Thus we assign to the event A ⊆ E a value ξ(A) ∈ [0, ∞], and like the probability
measure this provides a quantification of the uncertainty. The larger ξ(A) is the more
certain it is that A occurs. A certain event (when P (A) = 1) is assigned odds ∞. It
follows that we can get the probabilities back from the odds by the formula
P(A) = 1 / (1 + ξ(Ac)). (2.5)
Odds are used in betting situations because the odds tell how fair¹ bets should
be constructed. If two persons, player one and player two, say, make a bet about
whether event A or event Ac occurs, how much should the loser pay the winner for
this to be a fair bet? If player one believes that A occurs and is willing to bet 1
kroner then for the bet to be fair player two must bet ξ(Ac ) kroner on event Ac . For
gambling, this is the way British bookmakers report odds – they say that odds for
event A are ξ(Ac ) to 1 against. With 1 kroner at stake and winning you are paid
ξ(Ac ) back in addition to the 1 kroner. Continental European bookmakers report
the odds as ξ(Ac ) + 1, which include what you staked.
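A small sketch in R of the conversion between probabilities and odds given by (2.4) and (2.5), with an arbitrary example probability:

odds <- function(p) p / (1 - p)              # xi(A) as in (2.4)
p <- 0.75
odds(p)                                      # 3, the odds of the event
1 / (1 + odds(1 - p))                        # 0.75, recovering P(A) from xi(A^c) as in (2.5)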
The frequency interpretation states that the probability of an event A should
equal the long run frequency of the occurrence of event A if we repeat the exper-
iment indefinitely. That is, suppose that we perform n independent and identical
experiments all governed by the probability measure P and with outcome in the
sample space E, and suppose that we observe x1 , . . . , xn . Then we can compute the
relative frequency of occurrences of event A
εn(A) = (1/n) ∑_{i=1}^{n} 1A(xi). (2.6)
¹ Fair means that on average the players should both win (and lose) 0 kroner, cf. the frequency interpretation.
Probability measures 19
Here 1A is the indicator function for the event A, so that 1A (xi ) equals 1 if xi ∈ A
and 0 otherwise. We sometimes also write 1(xi ∈ A) instead of 1A (xi ). We see
that εn (A) is the fraction of experiments in which the event A occurs. As n grows
large, the frequency interpretation says that εn (A) must be approximately equal to
P (A). Note that this is not a mathematical result! It is the interpretation of what
we want the probabilities to mean. To underpin the interpretation we can show
that the mathematical theory based on probability measures really is suitable for
approximating relative frequencies from experiments, but that is another story. We
also need to make a precise definition of what we mean by independent in the first
place.
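A small simulation sketch of the frequency interpretation, using a fair die and the event A = {5, 6} (any other event would do):

set.seed(7)
n <- 10000
x <- sample(1:6, n, replace = TRUE)          # n rolls of a fair die
mean(x %in% c(5, 6))                         # eps_n(A), close to P(A) = 1/3 for large n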
The frequency interpretation provides the rationale for using odds in the construction
of fair bets. If the two players repeat the same bet n times with n suitably large, and
if the bet is fair according to the definition above with player one betting 1 kroner
on event A each time, then in the i’th bet, player one wins
ξ(Ac)1A(xi) − 1Ac(xi),
because this equals ξ(Ac ) if event A comes out and −1 if Ac comes out. Considering
all n bets, and excluding the case P (A) = 0, player one will on average win
(1/n) ∑_{i=1}^{n} [ξ(Ac)1A(xi) − 1Ac(xi)] = ξ(Ac)εn(A) − εn(Ac)
≃ ξ(Ac)P(A) − P(Ac)
= (P(Ac)/P(A)) P(A) − P(Ac) = 0.
which may deviate substantively from 0 – and the deviation will become larger as n
increases.
The Bayesian interpretation allows us to choose the probability measure de-
scribing the experiment subjectively. In this interpretation the probability measure
is not given objectively by the experiment but reflects our minds and our knowledge
of the experiment before conducting it. One theoretical justification of the Bayesian
interpretation is that we play a mind game and make everything into a betting sit-
uation. Thus we ask ourselves which odds (for all events) we believe to be fair in
a betting situation before conducting the experiment. Note that this is an entirely
subjective process – there is no theory dictating what fairness means – but we are
20 Probability Theory
nevertheless likely to have an opinion about what is fair and what is not. It is pos-
sible to show that if we make up our minds about fair odds in a consistent manner²
we necessarily end up with a probability measure defined by (2.5). This is the so-
called Dutch book argument. The probabilities don’t represent the long term relative
frequencies when repeating the experiment. After having conducted the experiment
once we have gained new information, which we might want to take into account
when deciding what we then believe to be fair odds.
The two interpretations are fundamentally different in nature and this has consequently led to different opinions in the statistical literature about how to develop a
suitable and well-founded statistical theory. The methodologies developed based on
either interpretation differ a lot – at least in principle. On the other hand there are
many practical similarities and most Bayesian methods have a frequency interpre-
tation and vice versa. Discussions about which of the interpretations, if any, is the
correct one is of a philosophical and meta mathematical nature – we cannot prove
that either interpretation is correct, though pros and cons for the two interpreta-
tions are often based on mathematical results. We will not pursue this discussion
further. The interested reader can find a balanced and excellent treatment in the book
Comparative Statistical Inference, Wiley, by Vic Barnett.
Throughout, probabilities are given the frequency interpretation, and methodology
is justified from the point of view of the frequency interpretation. This does not by
any means rule out the use of Bayesian methodology a priori, but the methodology
must then be justified within the framework of the frequency interpretation.
Example 2.3.3 (Coin Flipping). When we flip a coin it lands with either heads or
tails upwards. The sample space for describing such an experiment is
E = {heads, tails}
A commonly used sample space for encoding the outcome of a binary experiment is E = {0, 1}. When
using this naming convention we talk about a Bernoulli experiment. In the coin
flipping context we can let 1 denote heads and 0 tails, then if x1 , . . . , xn denote the
outcomes of n flips of the coin we see that x1 + . . . + xn is the total number of heads.
Moreover, if p ∈ [0, 1] denotes the probability of heads, we see that
P(x) = p^x (1 − p)^(1−x) (2.7)
because either x = 1 (heads), in which case p^x = p and (1 − p)^(1−x) = 1, or x = 0 (tails), in which case p^x = 1 and (1 − p)^(1−x) = 1 − p. ⋄
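A minimal sketch of how such Bernoulli experiments can be simulated in R, with an arbitrarily chosen p:

p <- 0.3
n <- 10
x <- rbinom(n, size = 1, prob = p)           # each x[i] is 1 (heads) or 0 (tails)
sum(x)                                       # the total number of heads x_1 + ... + x_n
p^x * (1 - p)^(1 - x)                        # the point probabilities from (2.7)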
Exercises
Exercise 2.3.1. If P (A) = 0.5, P (B) = 0.4 and P (A∪B) = 0.6, compute P ((A∩B)c ).
Exercise 2.3.2. Compute the odds ξ(A) of the event A if P (A) = 0.5, if P (A) = 0.9
and if P (Ac ) = 0.9. Compute the probability P (A) of the event A if ξ(A) = 10 and if
ξ(A) = 2.
Exercise 2.3.3. Consider the sample space E = {1, . . . , n} and let xi(k) = 1 if i = k and = 0 if i ≠ k. Show that
The DNA, RNA and amino acid alphabets are three examples of finite and hence
discrete sets.
Example 2.4.2. Considering any finite set E we can define the set
E ∗ = {x1 x2 . . . xn | n ∈ N, xi ∈ E},
which is the set of all sequences of finite length from E. We claim that E ∗ is discrete.
If E is the DNA-alphabet, say, it is clear that there is an infinite number of DNA-
sequences of finite length in E ∗ , but it is no problem to list them as a sequence in
the following way: we first list all sequences of length one, then those of length two, those of
length three, four, five and so on and so forth. We can encounter the use of E ∗ as
sample space if we want a probabilistic model of all protein coding DNA-sequences.
⋄
Example 2.4.3. In genetics we find genes occurring in different versions. The ver-
sions are referred to as alleles. If a gene exists in two versions it is quite common to
refer to the two versions as a and A or b and B. Thus if we sample an individual
from a population and check whether the person carries allele a or A we have the
outcome taking values in the sample space
E = {a, A}.
However, we remember that humans are diploid organisms so we actually carry two
copies around. Thus the real outcome is an element in the sample space
E = {aa, Aa, AA}.
The reason that aA is not also listed as an outcome is simply that we cannot distinguish between aA and Aa. The information we can obtain is just that there is an a and an A, but not in any particular order.
If we also consider another gene simultaneously, with alleles b and B, the sample we get belongs to the sample space
E = {aa, Aa, AA} × {bb, Bb, BB}. ⋄
The sum as written above over all x ∈ E can be understood in the following way:
Since E is discrete it is either finite or countably infinite. In the former case it is
just a finite sum so we concentrate on the latter case. If E is countable infinite we
know that we can order the elements as x1 , x2 , x3 , . . .. Then we simply define
∑_{x∈E} p(x) = ∑_{n=1}^{∞} p(xn).
It even holds that the infinite sum on the right hand side is a sum of all positive
numbers. The careful reader may get worried here, because the arrangement of the
elements of E is arbitrary. What if we choose another way of listing the elements
will that affect the sum? The answer to this is no, because the terms are all positive, in which case we can live happily with writing ∑_{x∈E} p(x) without being specific on
the order of summation. The same does not hold if the terms can be positive as well
as negative, but that is not a problem here.
p(x) = 0.05
for all x ∈ E. Clearly p(x) ∈ [0, 1] and since there are 20 amino acids
∑_{x∈E} p(x) = 20 × 0.05 = 1.
This results in a vector x of length five. Even a single number is always regarded as a vector of length one in R, so the c above should be understood as a concatenation of five vectors of length one. Vectors of different lengths can be concatenated in the same way.
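A minimal sketch of the vector operations described above; the numerical values are arbitrary and only serve as an illustration:

x <- c(0.1, 0.2, 0.3, 0.2, 0.2)              # c concatenates five vectors of length one
length(x)                                    # 5
y <- c(0.05, 0.15)                           # a vector of length two
c(x, y)                                      # concatenation gives a vector of length seven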
Under this probability measure all amino acids are equally likely, and it is known as
the uniform distribution on E, cf. the example below. A more reasonable probability
measure on E is given by the relative frequencies of the occurrence of the amino acids
in real proteins, cf. the frequency interpretation of probabilities. The Robinson-
Robinson probabilities come from a survey of a large number of proteins.
Using the Robinson-Robinson probabilities, some amino acids are much more prob-
able than others. For instance, Leucine (L) is the most probable with probability
0.091 whereas Tryptophan (W) is the least probable with probability 0.013. ⋄
Example 2.4.6 (Uniform distribution). If E is a finite set containing n elements
we can define a probability measure on E by the point probabilities
p(x) = 1/n
for all x ∈ E. Clearly p(x) ∈ [0, 1] and ∑_{x∈E} p(x) = 1. This distribution is called
the uniform distribution on E.
If P is the uniform distribution on E and A ⊆ E is any event it follows by the
definition of P that
P(A) = ∑_{x∈A} 1/n = |A|/n
with |A| denoting the number of elements in A. Since the elements in E are all
possible but we only regard those in A as favorable, this result gives rise to the
formula
P(A) = (number of favourable outcomes) / (number of possible outcomes),
which is valid only when P is the uniform distribution. Even though the formula
looks innocent, it can be quite involved to apply it in practice. It may be easy to
specify the sample space E and the favorable outcomes in the event A, but counting
the elements in A can be difficult. Even counting the elements in E can sometimes
be difficult too. ⋄
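As a small sketch (not code from the notes), the Yahtzee probability from Chapter 1 can be computed with this formula and checked by simulating from the uniform distribution:

6 / 6^5                                      # six favourable outcomes out of 6^5 possible
set.seed(1)
throws <- replicate(100000, sample(1:6, 5, replace = TRUE))   # columns are throws of five dice
mean(apply(throws, 2, function(d) length(unique(d)) == 1))    # relative frequency of Yahtzees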
> library(Biostrings)
Figure 2.2: The point probabilities for the Poisson distribution with λ = 5 (left) and λ = 10 (right).
Example 2.4.7 (The Poisson Distribution). The (infinite) Taylor expansion of the exponential function, exp(λ) = ∑_{n=0}^{∞} λ^n/n!, shows that the point probabilities p(n) = exp(−λ) λ^n/n!, n ∈ N0, sum to one; the resulting probability measure on N0 is called the Poisson distribution with parameter λ > 0.
If x1 , . . . , xn is a sample with values in a discrete set E, the number nx of observations equal to x is the frequency of the occurrence of x in the sample. Note that the relation between the tabulation of (absolute) frequencies and the computation of the relative frequencies εn(x) is
εn(x) = nx / n.
For variables with values in a discrete sample space we have in general only the
possibility of displaying the data in a summarized version by tabulations. Depending
on the structure of the sample space E the tables can be more or less informative. If
E is a product space, that is, if the data are multivariate, we can consider marginal
tables for each coordinate as well as multivariate tables where we cross-tabulate
Figure 2.3: Example of a theoretical bar plot. Here we see the Poisson point probabilities (λ = 4) plotted using either the plot function with type="h" (left) or the barplot function (right).
two or more variables. Tables in dimension three and above are quite difficult to
comprehend.
For the special case with variables with values in Z we can also use bar plots to display
the tabulated data. We compare the bar plot with a bar plot of the corresponding
point probabilities p(n) as a function of n ∈ Z to get a picture of where the measure
is concentrated and what the “shape” of the measure is. Usually, the frequencies and
point probabilities are plotted as vertical bars. It is preferable to plot the relative
frequencies in the bar plot so that the y-axis becomes comparable to the y-axis for
the theoretical bar plot of the point probabilities.
For probability measures on a discrete sample space E ⊆ R we can define the mean
and variance in terms of the point probabilities.
Definition 2.5.1. If P is a probability measure on E with point probabilities (p(x))x∈E that fulfill
∑_{x∈E} |x| p(x) < ∞
then we define the mean under P as
µ = ∑_{x∈E} x p(x).
R Box 2.5.1 (Bar plots). We can use either the standard plot function in
R or the barplot function to produce the bar plot. The theoretical bar plot
of the point probabilities for the Poisson distribution with parameter λ = 4
is seen in Figure 2.3. The figure is produced by:
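One possibility, sketched here with the dpois function (the exact code is an assumption), is:

p <- dpois(0:25, lambda = 4)                                   # Poisson point probabilities
plot(0:25, p, type = "h", xlab = "n", ylab = "Probability")    # left plot in Figure 2.3
barplot(p, names.arg = 0:25, xlab = "n", ylab = "Probability") # right plot in Figure 2.3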
We can make R generate the 500 outcomes from the Poisson distribution
and tabulate the result using table, and then we can easily make a bar plot
of the frequencies. Figure 2.4 shows such a bar plot produced by:
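Again a sketch of code along these lines, using rpois and table:

x <- rpois(500, lambda = 4)                   # 500 simulated Poisson outcomes
tab <- table(x)                               # tabulated frequencies
plot(tab, xlab = "n", ylab = "Frequency")     # left plot in Figure 2.4
barplot(tab, xlab = "n", ylab = "Frequency")  # right plot in Figure 2.4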
One should note that the empirical bar plot – besides the y-axis – should
look approximately like the theoretical bar plot. One may choose to nor-
malize the empirical bar plot (use relative frequencies) so that it becomes
directly comparable with the theoretical bar plot. Normalization can in par-
ticular be useful, if one wants to compare two bar plots from two samples of
unequal size. Note that if we simulate another 500 times from the Poisson
distribution, we get a different empirical bar plot.
If, moreover,
∑_{x∈E} x² p(x) < ∞
we define the variance under P as
σ² = ∑_{x∈E} (x − µ)² p(x).
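As a small worked example of the definition – a sketch using the uniform distribution on {1, . . . , 6} – the mean and variance can be computed directly from the point probabilities in R:

x <- 1:6
p <- rep(1/6, 6)                              # point probabilities of the uniform distribution
mu <- sum(x * p)                              # mean: 3.5
sigma2 <- sum((x - mu)^2 * p)                 # variance: 35/12, i.e. (n+1)(n-1)/12 with n = 6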
Figure 2.4: Example of an empirical bar plot. This is a plot of the tabulated values of 500 variables from the Poisson distribution (λ = 4) plotted using either the plot function (left) or the barplot function (right).
where we have used the formula 1² + 2² + 3² + . . . + (n−1)² + n² = n(n+1)(2n+1)/6. Rearranging the result into a single fraction yields that the variance for the uniform distribution is
σ² = (n + 1)(n − 1) / 12.
⋄
If the observations are all realizations from the same probability measure P the
sample mean µ̂n = (1/n) ∑_{k=1}^{n} xk is an estimate of the (unknown) mean under P – provided there is a mean. Likewise, the sample variance
σ̃²n = (1/n) ∑_{k=1}^{n} (xk − µ̂n)²
is an estimate of the variance under P provided there is a variance.
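In R the sample versions can be computed directly; note that the built-in var divides by n − 1 rather than n, so it differs slightly from σ̃²n above. A sketch with simulated data:

x <- rpois(100, lambda = 4)                   # a sample of size n = 100
n <- length(x)
mean(x)                                       # the sample mean
sum((x - mean(x))^2) / n                      # the sample variance with divisor n, as above
var(x) * (n - 1) / n                          # the same value via the built-in var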
Exercises
or
P (A) = 0.3, P (C) = 0.4, P (G) = 0.1, P (T) = 0.2.
What happens to the number of codons before the occurrence of the first stop codon?
Exercise 2.5.6. An open reading frame in a DNA-sequence is a segment of codons
between a start and a stop codon. A long open reading frame is an indication that
the open reading frame is actually coding for a protein since long open reading frames
would be unlikely by chance. Discuss whether you believe that an open reading frame
of length more than 33 is likely to be a protein coding gene.
Exercise 2.5.7. Compute the mean, variance and standard deviation for the uniform
distribution on {1, . . . , 977}.
⋆ Exercise 2.5.8. Show by using the definition of mean and variance that for the
Poisson distribution with parameter λ > 0
µ = λ and σ² = λ. (2.10)
Compute the mean under the geometric distribution for p = 0.1, 0.2, . . . , 0.9.
Hint: If you are not able to compute a theoretical formula for the mean try to compute
the value of the infinite sum approximately using a finite sum approximation – the
computation can be done in R.
Ï Exercise 2.5.10. Plot the point probabilities for the Poisson distribution, dpois,
with λ = 1, 2, 5, 10, 100 using the type="h". In all five cases compute the probability
of the events A1 = {n ∈ N0 | − σ ≤ n − µ ≤ σ}, A2 = {n ∈ N0 | − 2σ ≤ n − µ ≤ 2σ},
A3 = {n ∈ N0 | − 3σ ≤ n − µ ≤ 3σ}.
Ï Exercise 2.5.11. Generate a random DNA sequence of length 10000 in R with each
letter having probability 1/4. Find out how many times the pattern ACGTTG occurs in
the sequence.
Ï Exercise 2.5.12. Repeat the experiment above 1000 times. That is, for 1000 se-
quences of length 10000 find the number of times that the pattern ACGTTG occurs
in each of the sequences. Compute the average number of patterns occurring per se-
quence. Make a bar plot of the relative frequencies of the number of occurrences and
compare with a theoretical bar plot of a Poisson distribution with λ chosen suitably.
2.6 Probability measures on the real line

Defining a probability measure on the real line R yields, to an even larger extent than
in the previous section, the problem: How are we going to represent the assignment
of a probability to all events in a manageable way? One way of doing so is through
distribution functions.
The distribution function for a probability measure P on R is the function F : R → [0, 1] defined by F(x) = P((−∞, x]) for x ∈ R. That is, F(x) is the probability that under P the outcome is less than or equal to x.
We immediately observe that since (−∞, y] ∪ (y, x] = (−∞, x] for y < x and that
the sets (−∞, y] and (y, x] are disjoint, the additive property implies that
F (x) = P ((−∞, x]) = P ((−∞, y]) + P ((y, x]) = F (y) + P ((y, x]),
or in other words
P ((y, x]) = F (x) − F (y).
We can derive more useful properties from the definition. If x1 ≤ x2 then (−∞, x1 ] ⊆
(−∞, x2 ], and therefore from (2.1)
F(x1) ≤ F(x2).
Likewise, by similar arguments, when ε > 0 tends to 0 the set (−∞, x + ε] shrinks towards (−∞, x], hence F(x + ε) tends to F(x).
It is of course useful from time to time to know that a distribution function satisfies
property (i), (ii), and (iii) in Result 2.6.2, but that these three properties completely
characterize the probability measure is more surprising.
Result 2.6.3. If F : R → [0, 1] is a function that has property (i), (ii), and (iii)
in Result 2.6.2 there is precisely one probability measure P on R such that F is the
distribution function for P .
Figure 2.5: The logistic distribution function (left, see Example 2.6.4). The Gumbel distribution function (right, see Example 2.6.5). Note the characteristic S-shape of both distribution functions.
This result not only tells us that the distribution function completely characterizes P
but also that we can specify a probability measure just by specifying its distribution
function. This is a useful result but also a result of considerable depth, and a formal
derivation of the result is beyond the scope of these notes.
If our sample space E is discrete but actually a subset of the real line, E ⊆ R, like N
or Z, we have two different ways of defining and characterizing probability measures
on E: through point probabilities or through a distribution function. The connection
is given by
F(x) = P((−∞, x]) = ∑_{y≤x} p(y).
Figure 2.6: The distribution function for the Poisson distribution, with λ = 5 (left) and λ = 10 (right).
Example 2.6.6. The distribution function for the Poisson distribution with param-
eter λ > 0 is given by
F(x) = ∑_{n=0}^{⌊x⌋} exp(−λ) λ^n / n!
where ⌊x⌋ is the largest integer smaller than or equal to x. It is a step function with steps at each of the non-negative integers n ∈ N0 and step size at n being p(n) = exp(−λ) λ^n / n!.
⋄
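In R the Poisson distribution function is available as ppois; a quick check of the formula with arbitrarily chosen λ and x:

lambda <- 5
x <- 7.3
ppois(x, lambda)                                                   # F(x) via the built-in function
sum(exp(-lambda) * lambda^(0:floor(x)) / factorial(0:floor(x)))    # the sum above; both give 0.8666283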
The reader may be unfamiliar with doing integrations over an arbitrary event A. If
f is a continuous function and A = [a, b] is an interval it should be well known that
the integral
∫_a^b f(y) dy = ∫_[a,b] f(y) dy
is the area under the graph of f from a to b. It is possible for more complicated sets
A to assign a kind of generalized area to the set under the graph of f over A. We
will not go into any further details. An important observation is that we can specify
a distribution function F by
F(x) = ∫_{−∞}^{x} f(y) dy (2.11)
Indeed, if the total area from −∞ to ∞ under the graph of f equals 1, the area under
f from −∞ to x is smaller (but always positive since f is positive) and therefore
F(x) = ∫_{−∞}^{x} f(y) dy ∈ [0, 1].
For small h > 0 we thus have P([x − h, x + h]) ≃ f(x) · 2h, where 2h is the length of the interval [x − h, x + h]. Rearranging, we can also write this approximate equality as
f(x) ≃ P([x − h, x + h]) / (2h).
Figure 2.7: The density (left) and the distribution function (right) for the normal
distribution.
The distribution function for the normal distribution is

Φ(x) = (1/√(2π)) ∫_{−∞}^x exp(−y²/2) dy,

and it is unfortunately not possible to give a (more) closed form expression for this integral. It is, however, common usage to always denote this particular distribution function by Φ.
The normal distribution is the single most important distribution in statistics. There
are several reasons for this. One reason is that a rich and detailed theory about the
normal distribution and a large number of statistical models based on the normal
distribution can be developed. Another reason is that the normal distribution ac-
tually turns out to be a reasonable approximation of many other distributions of
interest – that being a practical observation as well as a theoretical result known
as the Central Limit Theorem, see Result 4.7.1. The systematic development of the
statistical theory based on the normal distribution is a very well studied subject in
the literature. ⋄
Figure 2.8: The density (left) and the distribution function (right) for the exponential
distribution with intensity parameter λ = 1 (Example 2.6.9).
Example 2.6.9 (The exponential distribution). The exponential distribution with parameter λ > 0 has density

f(x) = λ exp(−λx),   x ≥ 0.

Let f(x) = 0 for x < 0. Clearly, f(x) is positive, and we find that
∫_{−∞}^∞ f(x) dx = ∫_0^∞ λ exp(−λx) dx = [−exp(−λx)]_0^∞ = 1.
For the last equality we use the convention exp(−∞) = 0 together with exp(0) = 1. The distribution function becomes

F(x) = 1 − exp(−λx)

for x ≥ 0 (and F(x) = 0 for x < 0). The parameter λ is sometimes called the
intensity parameter. This is because the exponential distribution is often used to
model waiting times between the occurrences of events. The larger λ is, the shorter the waiting times will be, and thus the higher the intensity with which the events occur. ⋄
It is quite common, as for the exponential distribution above, that we only want to
specify a probability measure living on an interval I ⊆ R. By “living on” we mean
that P (I) = 1. If the interval is of the form [a, b], say, we will usually only specify the
density f (x) (or alternatively the distribution function F (x)) for x ∈ [a, b] with the
understanding that f(x) = 0 for x ∉ [a, b] (for the distribution function, F(x) = 0 for x < a and F(x) = 1 for x > b).
Figure 2.9: The density (left) and the distribution function (right) for the uniform
distribution on the interval [0, 1] (Example 2.6.10).
Example 2.6.10 (The uniform distribution). Let a < b and define f(x) = 1/(b − a) for x ∈ [a, b] and f(x) = 0 otherwise. That is, f is constantly equal to 1/(b − a) on [a, b] and 0 outside. Then we find that
∫_{−∞}^∞ f(x) dx = ∫_a^b f(x) dx = ∫_a^b 1/(b − a) dx = (1/(b − a)) × (b − a) = 1.
Since f is clearly positive it is a density for a probability measure on R. This probabil-
ity measure is called the uniform distribution on the interval [a, b]. The distribution
function can be computed (for a ≤ x ≤ b) as
F(x) = ∫_{−∞}^x f(y) dy = ∫_a^x 1/(b − a) dy = (x − a)/(b − a).
In addition, F (x) = 0 for x < a and F (x) = 1 for x > b. ⋄
R Box 2.6.1. Distribution functions and densities for a number of standard probability measures on R are directly available within R. The convention is that if a distribution has the R-name name, then pname(x) gives the distribution function evaluated at x and dname(x) gives the density evaluated at x. The normal distribution has the R-name norm, so pnorm(x) and dnorm(x) give the distribution function and the density, respectively, for the normal distribution. Likewise, the R-name for the exponential distribution is exp, so pexp(x) and dexp(x) give the distribution function and the density, respectively, for the exponential distribution. Additional parameters can be supplied; for instance, pexp(x,3) gives the distribution function at x with intensity parameter λ = 3, and dexp(x,3) the corresponding density.
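A short usage sketch of the naming convention described in the box:
x <- c(0.5, 1, 2)
dnorm(x)        # standard normal density at x
pnorm(x)        # standard normal distribution function at x
dexp(x, 3)      # exponential density, intensity lambda = 3
pexp(x, 3)      # exponential distribution function, intensity lambda = 3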
Example 2.6.11 (The Γ-distribution). The Γ-distribution with shape parameter λ > 0 and scale parameter β > 0 has density

f(x) = (1/(β^λ Γ(λ))) x^{λ−1} exp(−x/β)

for x ≥ 0, where Γ(λ) is the Γ-function evaluated at λ, cf. Appendix B. The Γ-distribution with λ = 1 is the exponential distribution. The Γ-distribution with shape λ = f/2 for f ∈ N and scale β = 2 is known as the χ²-distribution with f degrees of freedom. The σ²χ²-distribution with f degrees of freedom is the χ²-distribution with f degrees of freedom and scale parameter σ²; thus it is the Γ-distribution with shape parameter λ = f/2 and scale parameter β = 2σ². ⋄
Figure 2.10: The density for the B-distribution (Example 2.6.12) with parameters
(λ1 , λ2 ) = (4, 2) (left) and (λ1 , λ2 ) = (0.5, 3) (right)
Example 2.6.12 (The B-distribution). The density for the B-distribution (pro-
nounced β-distribution) with parameters λ1 , λ2 > 0 is given by
f(x) = (1/B(λ1, λ2)) x^{λ1−1} (1 − x)^{λ2−1}
for x ∈ [0, 1]. Here B(λ1 , λ2 ) is the B-function, cf. Appendix B. This two-parameter
class of distributions on the unit interval [0, 1] is quite flexible. For λ1 = λ2 = 1
we obtain the uniform distribution on [0, 1], but for other parameters we can get a
diverse set of shapes for the density – see Figure 2.10 for two particular examples.
Since the B-distribution always lives on the interval [0, 1] it is frequently encountered
as a model of a random probability – or rather a random frequency. In population
genetics for instance, the B-distribution is found as a model for the frequency of
occurrences of one out of two alleles in a population. The shape of the distribution,
i.e. the proper values of λ1 and λ2 , then depends upon issues such as the mutation
rate and the migration rates. ⋄
From a basic calculus course the intimate relation between integration and differen-
tiation should be well known.
Result 2.6.13. If F is a differentiable distribution function, the derivative

f(x) = F′(x)

is a density for the probability measure given by F.
Figure 2.11: The density for the logistic distribution (left, see Example 2.6.14) and
the density for the Gumbel distribution (right, see Example 2.6.15). The density
for the Gumbel distribution is clearly skewed, whereas the density for the logistic
distribution is symmetric and quite similar to the density for the normal distribution.
Example 2.6.14 (Logistic distribution). The density for the logistic distribution is
found to be
f(x) = F′(x) = exp(−x)/(1 + exp(−x))². ⋄
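As a quick sanity check (a sketch, not part of the notes' exercises), the formula can be compared with R's built-in dlogis and with a numerical derivative of plogis:
x <- seq(-4, 4, 0.5)
f <- exp(-x) / (1 + exp(-x))^2                            # the formula above
max(abs(f - dlogis(x)))                                   # agrees with the built-in density
h <- 1e-6
max(abs(f - (plogis(x + h) - plogis(x - h)) / (2 * h)))   # numerical derivative of F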
Example 2.6.15 (Gumbel distribution). The density for the Gumbel distribution
is found to be
f (x) = F ′ (x) = exp(−x) exp(− exp(−x)).
Exercises
⋆ Exercise 2.6.1. Argue that the function

F(x) = 1 − exp(−x^β),   x ≥ 0,

for β > 0 is a distribution function. It is called the Weibull distribution with parameter β. Find the density on the interval [0, ∞) for the Weibull distribution.
Exercise 2.6.2. Argue that the function

F(x) = 1 − (x0/x)^β,   x ≥ x0,

for β > 0 is a distribution function on [x0, ∞). It is called the Pareto distribution on the interval [x0, ∞). Find the density on the interval [x0, ∞) for the Pareto distribution.
Ï Exercise 2.6.3. Write two functions in R, pgumb and dgumb, that compute the distribution function and the density for the Gumbel distribution.
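One possible sketch of a solution, assuming the standard Gumbel distribution function F(x) = exp(−exp(−x)) whose derivative is the density given in Example 2.6.15:
pgumb <- function(x) exp(-exp(-x))                # Gumbel distribution function
dgumb <- function(x) exp(-x) * exp(-exp(-x))      # Gumbel density, cf. Example 2.6.15
pgumb(c(-1, 0, 1))
dgumb(c(-1, 0, 1))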
Ï Exercise 2.6.4. Let

fλ(x) = (1 + x²/(2λ))^{−(λ + 1/2)}

for x ∈ R and λ > 0. Argue that fλ(x) > 0 for all x ∈ R. Use numerical integration in R, integrate, to compute

c(λ) = ∫_{−∞}^∞ fλ(x) dx

for λ = 1/2, 1, 2, 10, 100. Compare the results with π and √(2π). Argue that c(λ)^{−1} fλ(x) is a density and compare it, numerically, with the density for the normal distribution. The probability measure with density c(λ)^{−1} fλ(x) is called the t-distribution with shape parameter λ, and it is possible to show that

c(λ) = √(2λ) B(λ, 1/2),

where B is the B-function.
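A sketch of the numerical integration asked for in the exercise:
f <- function(x, lambda) (1 + x^2 / (2 * lambda))^(-(lambda + 1/2))
c_lambda <- sapply(c(0.5, 1, 2, 10, 100), function(lambda)
  integrate(f, -Inf, Inf, lambda = lambda)$value)
c_lambda                  # compare with pi (lambda = 1/2) and sqrt(2 * pi) (large lambda)
c(pi, sqrt(2 * pi))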
We use the notation

εn(A) = (1/n) ∑_{i=1}^n 1_A(xi)

to denote the fraction or relative frequency of observations that belong to the event A ⊆ R.
The underlying assumption in this section is that we have a data set of observations x1, …, xn and that these observations are generated by a distribution P that possesses a density f.
Definition 2.7.1. The histogram with break points q1 < q2 < … < qk, chosen so that

q1 < min_{i=1,…,n} xi ≤ max_{i=1,…,n} xi < qk,

is the function f̂ given by

f̂(x) = nj / (n (q_{j+1} − q_j))   for x ∈ (q_j, q_{j+1}],

where nj is the number of observations falling in (q_j, q_{j+1}], together with f̂(x) = 0 for x ∉ (q1, qk]. Usually one plots f̂ as a box of height f̂(q_{j+1}) located over the interval (q_j, q_{j+1}], and this is what most people associate with a histogram.
The function f̂ integrates to 1:

∫ f̂(x) dx = ∑_{j=1}^{k−1} (nj / (n (q_{j+1} − q_j))) (q_{j+1} − q_j) = ∑_{j=1}^{k−1} nj / n = 1,

where we use the fact that all the data points are contained within the interval (q1, qk]. Since the function f̂ integrates to 1 it is a probability density. The purpose of the histogram is to approximate the density of the true distribution of X – assuming that the distribution has a density.
Sometimes one encounters the unnormalized histogram, given by the function

f̃(x) = nj   for x ∈ (q_j, q_{j+1}].

Here f̃(x) is constantly equal to the number of observations falling in the interval (q_j, q_{j+1}]. Since the function doesn't integrate to 1 it cannot be compared directly
with a density, nor is it possible to compare unnormalized histograms directly for
two samples of unequal size. Moreover, if the break points are not equidistant, the
unnormalized histogram is of little value.
It can be of visual help to add a rug plot to the histogram – especially for small or moderate size datasets. A rug plot is a plot of the observations as small “tick marks”
along the first coordinate axis.
Example 2.7.2. We consider the histogram of 100 and 1000 random variables whose distribution is N(0, 1). They are generated by the computer. We choose the breaks to be equidistant from −4 to 4 with a distance of 0.5, thus the break points are −4, −3.5, −3, …, 3.5, 4. We find the histograms in Figure 2.12. Note how the histogram corresponding to the 1000 random variables approximates the density more closely. ⋄
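A sketch in R of a histogram like those in Figure 2.12 (using 100 simulated observations; widen the break points if any simulated value falls outside ±4):
x <- rnorm(100)                       # 100 simulated N(0,1) variables
hist(x, breaks = seq(-4, 4, 0.5),     # equidistant breaks from -4 to 4
     probability = TRUE)              # normalized histogram
rug(x)                                # rug plot of the observations
curve(dnorm(x), add = TRUE)           # overlay the true N(0,1) density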
Figure 2.12: The histograms for the realization of 100 (left) and 1000 (right) sim-
ulated N (0, 1) random variables. A rug plot is added to both histograms, and we
compare the histograms with the corresponding density for the normal distribution.
Example 2.7.3. Throughout this section we will consider data from a microarray
experiment. It is the so-called ALL dataset (Chiaretti et al., Blood, vol. 103, No.
7, 2004). It consists of samples from patients suffering from Acute Lymphoblastic
Leukemia. We will consider only those patients with B-cell ALL, and we will group
the patients according to presence or absence of the BCR/ABL fusion gene.
On Figure 2.13 we see the histogram of the log (base 2) expression levels3 for six
(arbitrary) genes for the group of samples without BCR/ABL.
On Figure 2.14 we have singled out the signal from the gene probe set with the
poetic name 1635_at, and we see the histograms for the log expression levels for the
two groups with or without BCR/ABL. The figure also shows examples of kernel
density estimates. ⋄
³Some further normalization has also been done that we will not go into here.
> hist(x)
This plots a histogram using default settings. The break points are by default
chosen by R in a suitable way. It is possible to explicitly set the break points
by hand, for instance
> hist(x,breaks=c(0,1,2,3,4,5))
Figure 2.13: Examples of histograms for the log (base 2) expression levels for a
random subset of six genes from the ALL dataset (non BCR/ABL fusion gene).
Figure 2.14: Histograms and examples of kernel density estimates (with a Gaussian kernel, using the default bandwidths bw = 0.25 and bw = 0.35, respectively, as well as the bandwidths bw = 1 and bw = 0.1) of the log (base 2) expression levels for the gene 1635_at from the ALL microarray experiment with (right) or without (left) presence of the BCR/ABL fusion gene.
The histogram is a crude and in some ways unsatisfactory approximation of the den-
sity f . The plot we get is sensitive to the choice of break points – the number as well
as their location. Moreover, the real density is often thought to be a rather smooth
function whereas the histogram by definition is very non-smooth. The use of kernel
density estimation remedies some of these problems. Kernel density estimation or
smoothing is computationally a little harder than computing the histogram, which
for moderate datasets can be done by hand. However, with modern computers, uni-
variate kernel density estimation can be done just as rapidly as drawing a histogram
– that is, apparently instantaneously.
Kernel density estimation is based on the interpretation of the density as given by
(2.13). Rearranging that equation then says that for small h > 0
f(x) ≃ (1/(2h)) P([x − h, x + h]),

and if we then use P([x − h, x + h]) ≃ εn([x − h, x + h]) from the frequency interpretation, we get that for small h > 0

f(x) ≃ (1/(2h)) εn([x − h, x + h]).
The function

f̂(x) = (1/(2h)) εn([x − h, x + h])

is in fact an example of a kernel density estimator using the rectangular kernel. If we define the kernel K_h : R × R → [0, ∞) by

K_h(x, y) = (1/(2h)) 1_{[x−h,x+h]}(y),

we can see by the definition of εn([x − h, x + h]) that

f̂(x) = (1/n) ∑_{i=1}^n (1/(2h)) 1_{[x−h,x+h]}(xi) = (1/n) ∑_{i=1}^n K_h(x, xi).
One may be unsatisfied with the sharp cut-offs at ±h – and how to choose h in
the first place? We may therefore choose smoother kernel functions but with similar
properties as Kh , that is, they are largest for x = y, they fall off, but perhaps more
smoothly than the rectangular kernel, when x moves away from y, and they integrate
to 1 over x for all y.
A function

K : R × R → [0, ∞)

is called a kernel if

∫_{−∞}^∞ K(x, y) dx = 1

for all y. If K is a kernel, we define the kernel density estimate for our dataset x1, …, xn ∈ R by

f̂(x) = (1/n) ∑_{i=1}^n K(x, xi).
Both of these kernels as well as the rectangular kernel are examples of choosing a
“mother” kernel, k : R → [0, ∞), and then taking
K_h(x, y) = (1/h) k((x − y)/h).
Note that the mother kernel k needs to be a probability density itself. In this case
the optional parameter, h > 0, is called the bandwidth of the kernel. This parameter
determines how quickly the kernel falls off, and plays in general the same role as h
does for the rectangular kernel. Indeed, if we take the mother kernel k(x) = (1/2) 1_{[−1,1]}(x), the kernel K_h defined above is precisely the rectangular kernel. Qualitatively we can say that large values of h give smooth density estimates and small values give density estimates that wiggle up and down.
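As an illustration (a sketch, not the notes' own implementation), a kernel density estimate with the Gaussian mother kernel k = dnorm can be coded directly and compared with R's built-in density:
kde <- function(x, data, h) {
  ## average of the scaled mother kernel over all observations
  sapply(x, function(z) mean(dnorm((z - data) / h) / h))
}
data <- rnorm(100)
x <- seq(-4, 4, length.out = 200)
plot(x, kde(x, data, h = 0.5), type = "l")                     # our estimate, bandwidth 0.5
lines(density(data, bw = 0.5, kernel = "gaussian"), lty = 2)   # R's implementation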
As a curiosity we may note that the definition of the kernel density estimate fˆ does
not really include the computation of anything! Every computation is carried out at
evaluation time. Each time we evaluate fˆ(x) we carry out the evaluation of K(x, xi ),
i = 1, . . . , n – we cannot do this before we know which x to evaluate at – and then the
summation. So the definition of the kernel density estimate fˆ can mostly be viewed
as a definition of what we intend to do at evaluation time. However, implementations
such as density in R do the evaluation of the kernel density estimate once and for
all at call time in a number of pre-specified points.
The mean and variance for probability measures on discrete subsets of R were defined
through the point probabilities, and for probability measures on R given in terms of
a density there is an analogous definition of the mean and variance.
Definition 2.7.5. If P is a probability measure on R with density f that fulfills

∫_{−∞}^∞ |x| f(x) dx < ∞,

then the mean of P is defined as μ = ∫_{−∞}^∞ x f(x) dx.
R Box (density estimation). The call
> plot(density(rnorm(100), bw = 1, n = 1024))
produces a plot of the density estimate for the realization of 100 standard normally distributed random variables using bandwidth 1 and 1024 points for evaluation.
If, moreover,

∫_{−∞}^∞ x² f(x) dx < ∞,

the variance of P is defined as σ² = ∫_{−∞}^∞ (x − μ)² f(x) dx.
Example 2.7.6. The exponential distribution with intensity parameter λ > 0 has density f(x) = λ exp(−λx) for x ≥ 0 (and f(x) = 0 for x < 0). We find by partial integration that

μ = ∫_0^∞ x λ exp(−λx) dx = [−x exp(−λx)]_0^∞ + ∫_0^∞ exp(−λx) dx = [−(1/λ) exp(−λx)]_0^∞ = 1/λ.
⋄
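A quick numerical check of this mean (a sketch, with λ = 3 chosen arbitrarily):
lambda <- 3
integrate(function(x) x * dexp(x, lambda), 0, Inf)$value   # numerical mean
1 / lambda                                                 # the formula above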
Example 2.7.7. Consider the uniform distribution on [a, b]. Then the density is
f(x) = (1/(b − a)) 1_{[a,b]}(x).
Example 2.7.8. The Γ-distribution on [0, ∞) with shape parameter λ > 0 and
scale parameter β > 0 has finite mean
μ = (1/(β^λ Γ(λ))) ∫_0^∞ x x^{λ−1} exp(−x/β) dx
  = (1/(β^λ Γ(λ))) ∫_0^∞ x^λ exp(−x/β) dx
  = β^{λ+1} Γ(λ + 1) / (β^λ Γ(λ)) = β λΓ(λ)/Γ(λ) = βλ.
A similar computation reveals that the variance is β 2 λ. In particular, the mean and
variance of the σ 2 χ2 -distribution with f degrees of freedom is σ 2 f and 2σ 4 f . ⋄
Example 2.7.10. The normal distribution N(0, 1) has density

f(x) = (1/√(2π)) exp(−x²/2),

which is clearly seen to be symmetric, cf. Example 2.7.9. Moreover, with the substitution y = x²/2 we have that dy = x dx, so

∫_0^∞ x f(x) dx = (1/√(2π)) ∫_0^∞ exp(−y) dy = 1/√(2π) < ∞.

Thus by Example 2.7.9 the normal distribution has mean μ = 0.
Regarding the variance, we use the symmetry argument again and the same substitution to obtain

σ² = ∫_{−∞}^∞ x² f(x) dx = 2 ∫_0^∞ x² f(x) dx = (2/√(2π)) ∫_0^∞ √(2y) exp(−y) dy = 2Γ(3/2)/√π = 1,

where we use that Γ(3/2) = √π/2, cf. Appendix B. ⋄
As for discrete probability measures the sample mean and sample variance for a
dataset of observations from a distribution P work as estimates of the unknown
theoretical mean and variance under P – provided they exist.
2.7.3 Quantiles
While histograms and other density estimates may be suitable for getting an idea
about the location, spread and shape of a distribution, other descriptive methods
are more suitable for comparisons between two datasets, say, or between a dataset
and a theoretical distribution. Moreover, the use of histograms and density estimates
builds on the assumption that the distribution is actually given by a density. As an
alternative to density estimation we can try to estimate the distribution function
directly.
The empirical distribution function is defined as

F̂n(x) = εn((−∞, x]) = (1/n) ∑_{i=1}^n 1(xi ≤ x)

for x ∈ R.
The empirical distribution function qualifies for the name “distribution function”, because it really is a distribution function. It is increasing, with limits 0 and 1 when x → −∞ and x → ∞ respectively, and it is right continuous. But it is not continuous! It has jumps at each xi for i = 1, …, n and is constant in between. Since distribution functions all have a similar S-shape, empirical distribution functions in themselves are not really ideal for comparisons either. We will instead develop methods based
on the quantiles, which are more suitable. We define the quantiles for the dataset
first and then subsequently we define quantiles for a theoretical distribution in terms
of its distribution function. Finally we show how the theoretical definition applied
to the empirical distribution function yields the quantiles for the dataset.
> library(stats)
Then
> edf <- ecdf(x)
gives the empirical distribution function for the data in x. One can evaluate this function like any other function:
> edf(1.95)
> plot(edf)
Figure 2.15: The empirical distribution function for the log (base 2) expression levels for the gene 40480_s_at from the ALL microarray experiment with (right) or without (left) presence of the BCR/ABL fusion gene.
A quantile function for the dataset is a function Q : (0, 1) → R such that for all q ∈ (0, 1), Q(q) is a q-quantile. Whether one prefers
x(n/2) , x(n/2+1) , or perhaps (x(n/2) + x(n/2+1) )/2 as the median if n is even is largely
a matter of taste.
Quantiles can also be defined for theoretical distributions. We prefer here to consider
the definition of a quantile function only.
Definition 2.7.12. If F : R → [0, 1] is a distribution function for a probability measure P on R, then Q : (0, 1) → R is a quantile function for P if

F(Q(q) − ε) ≤ q ≤ F(Q(q))

for all q ∈ (0, 1) and all ε > 0.
> quantile(x)
computes the 0%, 25%, 50%, 75%, and 100% quantiles. That is, the mini-
mum, the lower quartile, the median, the upper quartile, and the maximum.
> quantile(x,probs=c(0.1,0.9))
computes the 0.1 and 0.9 quantile instead, and by setting the type param-
eter to an integer between 1 and 9, one can select how the function handles
the non-uniqueness of the quantiles. If type=1 the quantiles are the gener-
alized inverse of the empirical distribution function, which we will deal with
in Section 2.11. Note that with type being 4 to 9 the result is not a quantile
for the empirical distribution function – though it may still be a reason-
able approximation of the theoretical quantile for the unknown distribution
function.
The R function quantile has in fact 9 different types of quantile computations. Not all of these computations give quantiles for the empirical distribution function, though, but three of them do.
The median of F is defined as

q2 = median(F) = Q(0.5).

In addition we call q1 = Q(0.25) and q3 = Q(0.75) the first and third quartiles of F. The difference

IQR = q3 − q1

is called the interquartile range.
Note that the definitions of the median and the quartiles depend on the choice
of quantile function. If the quantile function is not unique these numbers are not
necessarily uniquely defined. The median summarizes in a single number the location
of the probability measure given by F . The interquartile range summarizes how
spread out around the median the distribution is.
The definition of a quantile function for any distribution function F and the quantiles defined for a given dataset are closely related. With F̂n denoting the empirical distribution function for the data, any quantile function for the distribution function F̂n also gives empirical quantiles as defined for the dataset.
We will as mentioned use quantiles to compare two distributions – that being either
two empirical distributions or an empirical and a theoretical distribution. In princi-
ple, one can also use quantiles to compare two theoretical distributions, but that is
not so interesting – after all, we then know whether they are different or not – but
the quantiles may tell something about the nature of a difference.
Definition 2.7.14. If F1 and F2 are two distribution functions with Q1 and Q2
their corresponding quantile functions a QQ-plot is a plot of Q2 against Q1 .
> qqplot(x,y)
produces a QQ-plot of the empirical quantiles of y against those of x, and
> qqnorm(x)
results in a QQ-plot of the empirical quantiles for x against the quantiles for the normal distribution.
Figure 2.16: A QQ-plot for 100 simulated data points from the standard normal
distribution (left) and a QQ-plot for 100 simulated data points from the Poisson
distribution with parameter λ = 10 (right). In the latter case the sample points are
for visual reasons “jittered”, that is, small random noise is added to visually separate
the sample quantiles.
When making a QQ-plot with one of the distributions, F2, say, being empirical, it is common to plot the points

( Q1((2i − 1)/(2n)), x_(i) ),   i = 1, …, n.
That is, we compare the smallest observation x_(1) with the 1/(2n) quantile, the second smallest observation x_(2) with the 3/(2n) quantile, and so on, ending with the largest observation x_(n) and the 1 − 1/(2n) quantile.
If the empirical quantile function Q2 is created from a dataset with n data points, all being realizations from the distribution with distribution function F1 and quantile function Q1, then the points in the QQ-plot should lie close to a straight line with slope 1 and intercept 0. It can be beneficial to plot a straight line, for instance through suitably chosen quantiles, to be able to visualize any discrepancies from the straight line.
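A small sketch of such a plot with a reference line through the quartiles (the exponential data here are deliberately non-normal):
x <- rexp(100)          # stand-in data; clearly not normally distributed
qqnorm(x)               # empirical quantiles against normal quantiles
qqline(x)               # straight line through the first and third quartiles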
Figure 2.17: Comparing the empirical distributions of the six genes with probe set names 1635_at, 1636_g_at, 39730_at, 40480_s_at, 2039_s_at, and 36643_at for those with BCR/ABL (right) and those without (left) using box plots.
Another visualization that is quite useful – especially for comparing three or more empirical distributions – is the box plot, which is also based on quantiles. Historically it was also useful for visualizing just a single empirical distribution to get a rough idea about location and scale, but with current computational power there are more informative plots for single distributions. As a comparative tool for many distributions, box plots are on the other hand quite effective.
A box plot using quantile function Q is given in terms of a five-dimensional vector
(w1 , q1 , q2 , q3 , w2 )
with w1 ≤ q1 ≤ q2 ≤ q3 ≤ w2 . Here
R Box 2.7.6 (Box plots). For a numeric vector x we get a single box plot by
> boxplot(x)
If x is a dataframe the command will instead produce (in one figure) a box plot of
each column. By specifying the range parameter (= whisker coefficient), which by
default equals 1.5, we can change the length of the whiskers.
> boxplot(x,range=1)
The values w1 and w2, typically taken to be the most extreme observations lying within a distance of c × IQR from q1 and q3, respectively, are called the whiskers. The parameter c > 0 is the whisker coefficient. The box plot is drawn as a vertical box from q1 to q3 with “whiskers” going out to w1 and w2. If data points lie outside the whiskers they are often plotted as points.
Exercises
Argue that f is a density and compute the constant c. Then compute the mean and
the variance under P .
Ï Exercise 2.7.2. Compute the mean and variance for the Gumbel distribution.
Hint: You are welcome to try to compute the integrals – it’s difficult. Alternatively,
you can compute the integrals numerically in R using the integrate function.
Exercise 2.7.3. Find the quantile function for the Gumbel distribution.
Exercise 2.7.4. Find the quantile function for the Weibull distribution, cf. Exercise
2.6.1. Make a QQ-plot of the quantiles from the Weibull distribution against the
quantiles from the Gumbel distribution.
If we know that the event A has occurred, but don’t have additional information
about the outcome of our experiment, we want to assign a conditional probability
to all other events B ⊆ E – conditioning on the event A. For a given event A we
aim at defining a conditional probability measure P ( · |A) such that P (B|A) is the
conditional probability of B given A for any event B ⊆ E.
Provided that P(A) > 0, we define

P(B | A) = P(B ∩ A) / P(A)    (2.17)

for any event B ⊆ E.
What we claim here is that P(· | A) really is a probability measure, but we need to show that. By the definition above we have that

P(E | A) = P(E ∩ A)/P(A) = P(A)/P(A) = 1,

and if B1, …, Bn are disjoint events

P(B1 ∪ … ∪ Bn | A) = P((B1 ∪ … ∪ Bn) ∩ A) / P(A)
                   = P((B1 ∩ A) ∪ … ∪ (Bn ∩ A)) / P(A)
                   = P(B1 ∩ A)/P(A) + … + P(Bn ∩ A)/P(A)
                   = P(B1 | A) + … + P(Bn | A),
where the third equality follows from the additivity property of P . This shows that
P (· |A) is a probability measure and we have chosen to call it the conditional prob-
ability measure given A. It should be understood that this is a definition – though
a completely reasonable and obvious one – and not a derivation of what conditional
probabilities are. The frequency interpretation is in concordance with this definition:
With n repeated experiments, εn(A) is the fraction of outcomes where A occurs and εn(B ∩ A) is the fraction of outcomes where B ∩ A occurs, hence

εn(B ∩ A) / εn(A)

is the fraction of outcomes where B occurs among those outcomes where A occurs.
When believing in the frequency interpretation this fraction is approximately equal
to P (B|A) for n large.
Associated with conditional probabilities there are two major results known as the
Total Probability Theorem and Bayes Theorem. These results tell us how to compute
some probabilities from knowledge of other (conditional) probabilities. They are easy
to derive from the definition.
If A1, …, An are disjoint events in E with A1 ∪ … ∪ An = E, and if B ⊆ E is any event, then from the definition of conditional probabilities

P(B | Ai)P(Ai) = P(B ∩ Ai),

and since the events B ∩ A1, …, B ∩ An are disjoint with union B, additivity gives the Total Probability Theorem

P(B) = P(B ∩ A1) + … + P(B ∩ An) = P(B | A1)P(A1) + … + P(B | An)P(An).

Moreover,

P(Ai | B) = P(Ai ∩ B)/P(B) = P(B | Ai)P(Ai)/P(B),

and if we use the Total Probability Theorem to express P(B) we have derived the Bayes Theorem

P(Ai | B) = P(B | Ai)P(Ai) / ( P(B | A1)P(A1) + … + P(B | An)P(An) )    (2.19)

for all i = 1, …, n.
The Bayes Theorem is of central importance when calculating with conditional prob-
abilities – whether or not we adopt a frequency or a Bayesian interpretation of
probabilities. The example below shows a classical setup where Bayes Theorem is
required.
Example 2.8.4. Drug tests are, for instance, used in cycling to test if cyclists
take illegal drugs. Suppose that we have a test for the illegal drug EPO, with the
property that 99% of the time it reveals (is positive) if a cyclist has taken EPO. This
sounds like a good test – or does it? The 0.99 is the conditional probability that the
test will be positive given that the cyclist uses EPO. To completely understand the
merits of the test, we need some additional information about the test and about
the percentage of cyclists that use the drug. To formalize, let E = {tp, fp, tn, fn} be
the sample space where tp = true positive, fp = false positive, tn = true negative,
and fn = false negative. By tp we mean that the cyclist has taken EPO and the test
shows that (is positive), by fp that the cyclist hasn’t taken EPO but the test shows
that anyway (is positive), by fn that the cyclist has taken EPO but the test doesn’t
show that (is negative), and finally by tn we mean that the cyclist hasn’t taken
EPO and that the test shows that (is negative). Furthermore, let A1 = {tp, fn} (the
cyclist uses EPO) and let A2 = {fp, tn} (the cyclist does not use EPO). Finally, let
B = {tp, fp} (the test is positive). Assume that the conditional probability that the
test is positive given that the cyclist is not using EPO is rather low, 0.04 say. Then
what we know is:
P (B|A1 ) = 0.99
P (B|A2 ) = 0.04.
If we have high thoughts about professional cyclists we might think that only a
small fraction, 7%, say, of them use EPO. Choosing cyclists for testing uniformly at
random gives that P (A1 ) = 0.07 and P (A2 ) = 0.93. From Bayes Theorem we find
that
P(A1 | B) = (0.99 × 0.07) / (0.99 × 0.07 + 0.04 × 0.93) = 0.65.
Thus, conditionally on the test being positive, there is only a 65% chance that the cyclist actually did take EPO. If a larger fraction, 30%, say, use EPO, we find instead that
P(A1 | B) = (0.99 × 0.3) / (0.99 × 0.3 + 0.04 × 0.7) = 0.91,
which makes us more certain that the positive test actually caught an EPO user.
Besides the fact that “99% probability of revealing an EPO user” is insufficient
information for judging whether the test is good, the point is that Bayes Theorem
pops up in computations like these. Here we are given some information in terms of
conditional probabilities, and we want to know some other conditional probabilities,
which can then be computed from Bayes Theorem. ⋄
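A small sketch of the computations from this example in R (the function name bayes is of course arbitrary):
bayes <- function(sens, fpr, prev) sens * prev / (sens * prev + fpr * (1 - prev))
bayes(sens = 0.99, fpr = 0.04, prev = 0.07)   # approximately 0.65
bayes(sens = 0.99, fpr = 0.04, prev = 0.30)   # approximately 0.91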
Two events A, B ⊆ E are said to be independent if

P(A ∩ B) = P(A)P(B).

A Bernoulli variable is a random variable X taking values in {0, 1}, and its distribution is determined by the success probability

p = P(X = 1).
A transformation is a map from one sample space into another sample space. Ran-
dom variables can then be transformed using the map, and we are interested in the
distribution of resulting random variable. Transformations are the bread-and-butter
for doing statistics – it is crucial to understand how random variables and distribu-
tions on one sample space give rise to a range of transformed random variables and
distributions. Abstractly there is not much to say, but once the reader recognizes
how transformations play a key role throughout these notes, the importance of being
able to handle transformations correctly should become clear.
If E and E′ are two sample spaces, a transformation is a map

h : E → E′,

and we use the notation

h⁻¹(A) = {x ∈ E | h(x) ∈ A}

for A ⊆ E′ to denote the event of outcomes in E for which the transformed outcome ends up in A.
Example 2.9.2. Transformations are done to data all the time. Any computation
is essentially a transformation. One of the basic ones is computation of the sample
mean. If X1 and X2 are two real valued random variables, their sample mean is
1
Y = (X1 + X2 ).
2
The general treatment of transformations to follow will help us understand how we
derive the distribution of the transformed random variable Y from the distribution
of X1 and X2 . ⋄
This notation, P(h(X) ∈ A) = P(X ∈ h−1 (A)), is quite suggestive – to find the
distribution of h(X) we “move” h from the variable to the set by taking the “inverse”.
Indeed, if h has an inverse, i.e. there is a function h−1 : E ′ → E such that
h : E → {0, 1}
by
h(x) = 1_A(x) = 1 if x ∈ A, and h(x) = 0 if x ∈ Aᶜ.
Thus h is the indicator function for the event A. The corresponding transformed
random variable
Y = h(X) = 1A (X)
is called an indicator random variable. We sometimes also write
Y = 1(X ∈ A)
to show that Y indicates whether X takes its value in A or not. Since Y takes values in {0, 1} it is a Bernoulli variable with success probability

p = P(Y = 1) = P(X ∈ A) = P(A). ⋄
Example 2.9.5. If E is a discrete set and P is a probability measure on E given
by the point probabilities p(x) for x ∈ E then if h : E → E ′ the probability measure
h(P) has point probabilities

q(y) = ∑_{x : h(x) = y} p(x),   y ∈ h(E).

Indeed, for y ∈ h(E) the set h⁻¹(y) is non-empty and contains precisely those points whose image is y. Hence

q(y) = h(P)({y}) = P(h⁻¹(y)) = ∑_{x ∈ h⁻¹(y)} p(x) = ∑_{x : h(x) = y} p(x).
As a consequence, if h : E → R and if

∑_{x ∈ E} |h(x)| p(x) < ∞,

then the mean of h(X) is

∑_{y ∈ h(E)} y q(y) = ∑_{y ∈ h(E)} ∑_{x : h(x) = y} h(x) p(x) = ∑_{x ∈ E} h(x) p(x),

where we have used that the double sum is a sum over every x ∈ E – just organized so that for each y ∈ h(E) we first sum all values of h(x)p(x) for those x with h(x) = y, and then sum these contributions over y ∈ h(E).
Likewise, if

∑_{x ∈ E} h(x)² p(x) < ∞,

then the variance of h(X) can be computed by summing over E in the same way.
Example 2.9.6 (Sign and symmetry). Let X be a real valued random variable
whose distribution is given by the distribution function F and consider h(x) = −x.
Then the distribution function for Y = h(X) = −X is
G(x) = P(Y ≤ x) = P(X ≥ −x) = 1 − P(X < −x) = 1 − F (−x) + P(X = −x).
We say that (the distribution of) X is symmetric if G(x) = F (x) for all x ∈ R. That
is, X is symmetric if X and −X have the same distribution. If F has density f we
know that P(X = −x) = 0 and it follows that the distribution of h(X) has density
g(x) = f (−x) by differentiation of G(x) = 1 − F (−x), and in this case it follows
that X is symmetric if f (x) = f (−x) for all x ∈ R. ⋄
Example 2.9.7. Let X be a random variable with values in R and with distribution given by the distribution function F. Consider the random variable |X| – the absolute value of X. We find that the distribution function for |X| is

G(x) = P(|X| ≤ x) = P(−x ≤ X ≤ x) = F(x) − F(−x) + P(X = −x)

for x ≥ 0.
If we instead consider X², we find the distribution function to be

G(x) = P(X² ≤ x) = P(−√x ≤ X ≤ √x) = F(√x) − F(−√x) + P(X = −√x)

for x ≥ 0. ⋄
Example 2.9.8 (Median absolute deviation). We have previously defined the in-
terquartile range as a measure of the spread of a distribution given in terms of
quantiles. We introduce here an alternative measure. If X is a real valued random
variable whose distribution has distribution function F and median q2 we can con-
sider the transformed random variable
Y = |X − q2 |,
which is the absolute deviation from the median. The distribution function for Y is

F_absdev(x) = P(|X − q2| ≤ x) = F(q2 + x) − F(q2 − x) + P(X = q2 − x),

and the median absolute deviation is defined as

MAD = median(F_absdev).

The median absolute deviation is, like the interquartile range, a number that represents how spread out around the median the distribution is.
For a symmetric distribution we always have the median q2 = 0, and therefore F_absdev = 2F − 1. Using the definition of the median we get that if x is MAD then

2F(x − ε) − 1 ≤ 1/2 ≤ 2F(x) − 1

for every ε > 0, from which it follows that F(x − ε) ≤ 3/4 ≤ F(x). In this case it follows that MAD is in fact equal to the upper quartile q3. Using symmetry again one can observe that the lower quartile q1 = −q3, and hence for a symmetric distribution MAD equals half the interquartile range. ⋄
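R has a built-in function mad, which by default rescales the median absolute deviation by the constant 1.4826 so that it estimates the standard deviation for normally distributed data; setting constant = 1 gives the raw MAD discussed here. A small sketch:
x <- rnorm(1000)
mad(x, constant = 1)     # raw median absolute deviation
IQR(x) / 2               # half the interquartile range; similar for symmetric data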
Example 2.9.9 (Location and Scale). Let X denote a real valued random variable with distribution given by the distribution function F : R → [0, 1]. Consider the transformation h : R → R given by

h(x) = σx + µ

for σ > 0 and µ ∈ R. The distribution function for

Y = h(X) = σX + µ

is

G(x) = P(σX + µ ≤ x) = P(X ≤ (x − µ)/σ) = F((x − µ)/σ).

Assume now that F is differentiable with density

f(x) = F′(x).
We observe that G is also differentiable, and applying the chain rule for differentiation we find that the distribution of Y has density

g(x) = G′(x) = (1/σ) f((x − µ)/σ).

It follows that the normal distribution with location parameter µ and scale parameter σ has density

(1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).    (2.20)
The abbreviation

X ∼ N(µ, σ²)

is often used to denote a random variable X, which is normally distributed with lo-
cation parameter µ and scale parameter σ. In the light of Example 2.7.10 the normal
distribution N (0, 1) has mean 0 and variance 1, thus the N (µ, σ 2 ) normal distribu-
tion with location parameter µ and scale parameter σ has mean µ and variance σ 2 .
⋄
R Box 2.9.1. For some of the standard distributions on the real line R, one can easily supply additional parameters specifying the location and scale in R. For the normal distribution pnorm(x,1,2) gives the distribution function at x with location parameter µ = 1 and scale parameter σ = 2, and dnorm(x,1,2) the corresponding density. Similarly, plogis(x,1,2) gives the distribution function for the logistic distribution at x with location parameter µ = 1 and scale parameter σ = 2.
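A small sketch checking that the extra parameters agree with the explicit location-scale transformation from Example 2.9.9:
x <- 2.5
pnorm(x, 1, 2)                 # N(1, 2^2) distribution function at x
pnorm((x - 1) / 2)             # same value via the standard normal
dnorm(x, 1, 2)                 # density
dnorm((x - 1) / 2) / 2         # (1/sigma) f((x - mu)/sigma), cf. (2.20)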
Figure 2.18: QQ-plots for gene 1635_at from the ALL dataset. Here we see the expression levels and log (base 2) expression levels against the normal distribution with (right) or without (left) presence of the BCR/ABL fusion gene.
straight line, but with slope σ and intercept µ. This is because Qµ,σ (y) defined by
Qµ,σ (y) = µ + σQ(y)
is actually a quantile function for the distribution of µ + σX. To see this, recall that
the distribution function for µ + σX is
F_{µ,σ}(x) = F((x − µ)/σ),

hence for any y ∈ (0, 1) and ε > 0

F_{µ,σ}(Q_{µ,σ}(y) − ε) = F((σQ(y) + µ − µ − ε)/σ) = F(Q(y) − ε/σ) ≤ y ≤ F(Q(y)) = F_{µ,σ}(Q_{µ,σ}(y)).
This adds considerable value to the QQ-plot as it can now be used to justify the
choice of the shape of the distribution without having to discuss how the unknown
scale and location parameters are chosen or perhaps estimated from data. ⋄
Exercises
Exercise 2.9.1. Compute the interquartile range and the median absolute deviation
for the N (0, 1) standard normal distribution.
Exercise 2.9.2. Find the density for the distribution of exp(X) if X is normally
distributed. This is the log-normal distribution. Make a QQ-plot of the normal distri-
bution against the log-normal distribution.
⋆ Exercise 2.9.3. Assume that X is a real valued random variable, whose distribution has density f and distribution function F. Let h : R → [0, ∞) be the transformation h(x) = x². Show that

P(h(X) ≤ y) = F(√y) − F(−√y)

for y > 0. Argue that if the distribution of X is symmetric this equals 2F(√y) − 1, and that the density for the distribution of h(X) then reduces to

g(y) = f(√y)/√y.
⋆ Exercise 2.9.4. Assume that the distribution of X is the normal distribution. Show that the distribution of X² has density

g(y) = (1/√(2πy)) exp(−y/2)

for y > 0.
For all practical purposes we have more than one random variable in the game when
we do statistics. A single random variable carries a probability distribution on a
sample space, but if we hope to infer unknown aspects of this distribution, it is
rarely sufficient to have just one observation. Indeed, in practice we have a data set
consisting of several observations, as in all the examples previously considered. The question we need to ask is which assumptions we want to make about the interrelation between the random variables representing the entire data set.
If we have n random variables, X1, X2, …, Xn, they each have a distribution, which we refer to as their marginal distributions. Are the marginal distributions enough? No, not at all! Let's consider the following example.
Example 2.10.1. Let X and Y be two random variables representing a particular DNA letter in the genome in two different but related organisms. Let's assume for simplicity that each letter has the uniform distribution as the marginal distribution. What is the probability of observing (X = A, Y = A)? If the events (X = A) and (Y = A) were independent, we know that the correct answer is 0.25 × 0.25 = 0.0625. However, we claimed that the organisms are related, and the interpretation is that the occurrence of a given letter in one of the organisms, at this particular position, gives us information about the letter in the other organism. What the correct probability is depends on many details, e.g. the evolutionary distance between the organisms. In Subsection 2.10.2 we consider two examples of such models, the Jukes-Cantor model and the Kimura model. ⋄
The probability measure that governs the combined behaviour of the pair (X, Y) above is often referred to as the joint distribution of the variables. We need to know the joint distribution to compute probabilities involving more than one of the random variables. The marginal probability distributions, given in general as
If the (n-dimensional) density for the distribution of (X1, …, Xn) factorizes as the product f1(x1) ⋯ fn(xn) of the marginal densities, we get that

P(X1 ∈ A1, …, Xn ∈ An) = ∫_{A1} ⋯ ∫_{An} f1(x1) ⋯ fn(xn) dxn … dx1
                       = ∫_{A1} f1(x1) dx1 ⋯ ∫_{An} fn(xn) dxn
                       = P(X1 ∈ A1) ⋯ P(Xn ∈ An).
A similar computation holds with point probabilities, which replaces the densities
on a discrete sample space, and where the integrals are replaced with sums. We
summarize this as follows.
Result 2.10.3. Let X1 , X2 , . . . , Xn be n random variables. If the sample spaces Ei
are discrete, if the marginal distributions have point probabilities pi (x), x ∈ Ei and
i = 1, . . . , n, and the point probabilities for the distribution of (X1 , . . . , Xn ) factorize
as
P (X1 = x1 , . . . , Xn = xn ) = p(x1 , . . . , xn ) = p1 (x1 ) · . . . · pn (xn )
then the Xi ’s are independent.
If Ei = R, if the marginal distributions have densities fi : R → [0, ∞), i = 1, …, n, and if the (n-dimensional) density for the distribution of (X1, …, Xn) factorizes as

f(x1, …, xn) = f1(x1) ⋯ fn(xn),

then the Xi's are independent.
The result above is mostly used as follows. If we want to construct a joint distribution
of X1 , . . . , Xn and we want the marginal distribution of Xi to be N (µ, σ 2 ), say, for
i = 1, . . . , n and we want X1 , . . . , Xn to be independent, the theorem above says that
this is actually what we obtain by taking the joint distribution to have the density
that is the product of the marginal densities. Using the properties of the exponential
function we see that the density for the joint distribution of n iid N (µ, σ 2 ) distributed
random variables is then

f(x1, …, xn) = (1/(σ^n (√(2π))^n)) exp( −∑_{i=1}^n (xi − µ)²/(2σ²) ).
Example 2.10.4. One example of a computation where we need the joint distri-
bution is the following. If X1 and X2 are two real valued random variables, we are
interested in computing the distribution function for their sum, Y = X1 + X2 . If the
joint density is f this computation is

F(y) = P(Y ≤ y) = P(X1 + X2 ≤ y) = ∫∫_{{(x1,x2) | x1 + x2 ≤ y}} f(x1, x2) dx1 dx2.

If X1 and X2 are independent we have that f(x1, x2) = f1(x1)f2(x2) and the double integral can be written as

F(y) = ∫_{−∞}^∞ ( ∫_{−∞}^{y − x2} f1(x1) dx1 ) f2(x2) dx2.
Whether we can compute this double integral in practice is another story, which
depends on the densities f1 and f2 . ⋄
Example 2.10.5. We consider the random variables X and Y that take values in
E0 × E0 with E0 = {A, C, G, T} and with point probabilities given by the matrix
                    Y
           A        C        G        T
     A   0.0401   0.0537   0.0512   0.0400   0.1850
  X  C   0.0654   0.0874   0.0833   0.0651   0.3012
     G   0.0634   0.0848   0.0809   0.0632   0.2923
     T   0.0481   0.0643   0.0613   0.0479   0.2215
         0.2170   0.2901   0.2766   0.2163
The marginal distribution of X and Y respectively are also given as the row sums
(right column) and the column sums (bottom row). It is in fact this matrix rep-
resentation of discrete, multivariate distributions with “the marginal distributions
in the margin” that caused the name “marginal distributions”. One verifies easily –
or by using R – that all entries in this matrix are products of their marginals. For
instance,
P (X = G, Y = C) = 0.0848 = 0.2923 × 0.2901.
⋄
Alternatively, rmp %*% t(cmp) or outer(rmp,cmp,"*") does the same job in this
case. The former works correctly for vectors only. The binary operator %o% is a
wrapper for the latter, which works for any function f of two variables as well. If
we want to compute f(rmp[i],cmp[j]), say, for all combinations of i and j, this
is done by outer(rmp,cmp,f). Note that when we use outer with the arithmetic
operators like multiplication, *, we need the quotation marks.
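As a sketch (assuming rmp and cmp hold the row and column marginal point probabilities from Example 2.10.5), the matrix of products of the marginals can be computed and compared entry by entry with the joint point probabilities:
rmp <- c(A = 0.1850, C = 0.3012, G = 0.2923, T = 0.2215)   # marginal of X
cmp <- c(A = 0.2170, C = 0.2901, G = 0.2766, T = 0.2163)   # marginal of Y
round(outer(rmp, cmp, "*"), 4)    # products of the marginals
rmp %o% cmp                       # the same, using the %o% wrapper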
This distribution of the allele combination given in terms of the proportions of the
two alleles in the population is the Hardy-Weinberg equilibrium. In reality this is
not really a question of whether the population is in any sort of equilibrium. It is a
distributional consequence of an independence assumption. ⋄
Example 2.10.7 (Linkage). The alleles of two genes, or markers for that matter,
that are located on the same chromosome may occur in an associated way. For
example, consider a gene that occurs as allele a or A and another gene that occurs
as allele b and B, and we let X1 and X2 be random variables that represent the
allele we find on (one of) the chromosomes in a random individual. The marginal
distribution of X1 and X2 are given by
where p, q ∈ [0, 1] are the proportions of alleles a and b, respectively, in the popu-
lation. If X1 and X2 are independent, we have that the distribution of (X1 , X2 ) is
given by
P(X1 = a, X2 = b) = pq
P(X1 = a, X2 = B) = p(1 − q)
P(X1 = A, X2 = b) = (1 − p)q
P(X1 = A, X2 = B) = (1 − p)(1 − q).
If the distribution of (X1 , X2 ) deviates from this we have linkage. This is another
way of saying that we have dependence. ⋄
many – from ignorance or lack of skills over lack of better alternatives to deliberate
choices for efficiency reasons, say. In general, to use a model that does not fit the
data actually analyzed can potentially lead to biased or downright wrong results. If
one is in doubt about such things, the best advice is to seek assistance from a more
experienced person. There are no magic words or spells that can make problems
with lack of model fit go away.
where the third equality follows from the fact that the marginal distribution of X has point probabilities

p1(x) = ∑_{y ∈ E2} p(x, y).
Math Box 2.10.1 (More about conditional distributions). If P denotes the joint distribution of (X, Y) on E1 × E2, then the conditional probability measure given the event A × E2 for A ⊆ E1 takes the form

P(B1 × B2 | A × E2) = P((B1 × B2) ∩ (A × E2)) / P(A × E2) = P((B1 ∩ A) × B2) / P1(A)
                    Y
           A        C        G        T
     A   0.1272   0.0063   0.0464   0.0051   0.1850
  X  C   0.0196   0.2008   0.0082   0.0726   0.3012
     G   0.0556   0.0145   0.2151   0.0071   0.2923
     T   0.0146   0.0685   0.0069   0.1315   0.2215
         0.2170   0.2901   0.2766   0.2163
Here the additional right column and the bottom row shows the marginal distribu-
tions of X and Y respectively. Note that these marginals are the same as considered
in Example 2.10.5, but that the point probabilities for the joint distribution of X
and Y certainly differ from the product of the marginals. Thus X and Y are not
independent. We can compute the conditional distribution of Y given X as having
point probabilities

P(Y = y | X = x) = P(X = x, Y = y) / P(X = x) = p(x, y) / ∑_{y′ ∈ E2} p(x, y′).
Note that we have to divide by precisely the row sums above. The resulting matrix
of conditional distributions is
                    Y
           A        C        G        T
     A   0.6874   0.0343   0.2507   0.0276
  X  C   0.0649   0.6667   0.0273   0.2411
     G   0.1904   0.0495   0.7359   0.0242
     T   0.0658   0.3093   0.0311   0.5938
The rows in this matrix are conditional point probabilities for the distribution of
Y conditionally on X being equal to the letter on the left hand side of the row.
Each row sums by definition to 1. Such a matrix of conditional probabilities is called
a transition probability matrix. In terms of mutations (substitutions) in a DNA sequence the interpretation is that, fixing one nucleic acid (the X) in the first sequence, we can read off from this matrix the probability of finding any of the four nucleic acids at the corresponding position in a second evolutionarily related sequence. Note
that the nomenclature in molecular evolution traditionally is that a transition is a
change from a pyrimidine to a pyrimidine or a purine to a purine, whereas a change
from a pyrimidine to a purine or vice versa is called a transversion. ⋄
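A sketch of this computation in R (assuming the joint point probabilities above are stored in a matrix P with the DNA letters as row and column names): dividing each row by its row sum gives the matrix of conditional distributions.
P <- matrix(c(0.1272, 0.0063, 0.0464, 0.0051,
              0.0196, 0.2008, 0.0082, 0.0726,
              0.0556, 0.0145, 0.2151, 0.0071,
              0.0146, 0.0685, 0.0069, 0.1315),
            nrow = 4, byrow = TRUE,
            dimnames = list(X = c("A", "C", "G", "T"), Y = c("A", "C", "G", "T")))
round(P / rowSums(P), 4)     # conditional distributions of Y given X, row by row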
In probability theory the conditional probabilities are computed from the definition
and formulas above. In statistics, however, conditional probabilities are often used as
the fundamental building blocks when specifying a probability model. We consider
two concrete examples from molecular evolution and then the general concept of
structural equations.
Example 2.10.10 (The Jukes-Cantor model). The simplest of the two models is the Jukes-Cantor model, which specifies the conditional probabilities P^t(x, y) directly by a formula, where P^t(x, y) is the conditional probability P(Y = y | X = x) when X and Y are separated by the evolutionary time t. ⋄
Example 2.10.11 (The Kimura model). Another slightly more complicated model
of molecular evolution is the Kimura model. The model captures the observed fact
that a substitution of a purine with a purine or a pyrimidine with a pyrimidine
(a transition) is happening in the course of evolution with a different rate than a
substitution of a purine with pyrimidine or pyrimidine with purine (a transversion).
If we let α > 0 be a parameter determining the rate of transitions and β > 0 a
Figure 2.19: Two examples of transition and transversion probabilities for the
Kimura model as a function of evolutionary distance in units of time.
parameter determining the rate of transversions, then the Kimura model specifies the conditional probabilities P^t(x, y), where P^t(x, y), as above, is the conditional probability P(Y = y | X = x) when X and Y are separated by the evolutionary time t. ⋄
The two models, the Jukes-Cantor model and the Kimura model, may seem a little
arbitrary when encountered as given above. They are not, but they fit into a more
general and systematic model construction that is deferred to Section 2.15. There it
is shown that the models arise as solutions to a system of differential equations with
specific interpretations.
Both of the previous examples gave the conditional distributions directly in terms
of a formula for the conditional point probabilities. The following examples use a
different strategy. Here the conditional probabilities are given by a transformation.
Example 2.10.12. If X and ε are real valued random variables, a structural equa-
tion defining Y is an equation of the form
Y = h(X, ε)
From a statistical point of view there are at least two practical interpretations of
the structural equation model. One is that the values of X are fixed by us – we
design the experiment and choose the values. We could, for instance, administer a
(toxic) compound at different doses to a bunch of flies and observe what happens,
cf. Example 1.2.5. The variable ε captures the uncertainty in the experiment – at
a given dose only a fraction of the flies will die. Another possibility is to observe
(X, Y ) jointly where, again, Y may be the death or survival of an insect but X is
the concentration of a given compound as measured from the dirt sample where the
insect is collected. Thus we do not decide the concentration levels. In the latter case
the structural equation model may still provide a useful model of the observed, con-
ditional distribution. However, we should be careful with the interpretation. If we
intervene and force X to take a particular value, it may not have the expected effect,
as predicted by the structural equation model, on the outcome Y. An explanation why this is so is that X and ε may be dependent. Thus if we want to interpret the
structural equation correctly when we make interventions, the X and ε variables
should be independent. One consequence of this is, that when we design an exper-
iment it is essential to break any relation between the dose level X and the noise
variable ε. The usual way to achieve this is to randomize, that is, to assign the dose
levels randomly to the flies.
The discussion about interpretations of structural equations only scratches the surface of the whole discussion of causal inference and causal conclusions from
statistical data analysis. We also touched upon this issue in Example 1.2.5. Briefly,
probability distributions are descriptive tools that are able to capture association,
but by themselves they provide no explanations about the causal nature of things.
Additional assumptions or controlled experimental designs are necessary to facilitate
causal conclusions.
We treat two concrete examples of structural equation models, the probit regression
model and the linear regression model.
Example 2.10.13. If Φ denotes the distribution function for the normal distribution
and ε is uniformly distributed on [0, 1], we define

Y = 1(ε ≤ Φ(α + βX)).

Thus the sample space for Y is {0, 1}, and h(x, ε) = 1(ε ≤ Φ(α + βx)) is the indicator function that ε is smaller than or equal to Φ(α + βx). We find that

P(Y = 1) = P(ε ≤ Φ(α + βx)) = Φ(α + βx).
The resulting model of the conditional distribution of the Bernoulli variable Y given
X = x is known as the probit regression model. ⋄
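A small simulation sketch of this structural equation (the parameter values α = −1 and β = 2 are chosen arbitrarily, and the x values are fixed by design):
alpha <- -1; beta <- 2
x <- runif(1000, 0, 2)                             # design values of X
eps <- runif(1000)                                 # uniform noise, independent of x
y <- as.numeric(eps <= pnorm(alpha + beta * x))    # the structural equation
## the empirical frequency of Y = 1 for x near 1 should be close to pnorm(alpha + beta)
mean(y[abs(x - 1) < 0.1]); pnorm(alpha + beta * 1)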
Example 2.10.14. If ε ∼ N (0, σ 2 ) and
Y = α + βX + ε
we have a structural equation model with
h(x, ε) = α + βx + ε.
This model is the linear regression model, and is perhaps the single most impor-
tant statistical model. For fixed x this is a location transformation of the normal
distribution, thus the model specifies that
Y | X = x ∼ N (α + βx, σ 2 ).
In words, the structural equation says that there is a linear relationship between
the Y variable and the X variable plus some additional “noise” as given by ε. One
important observation is that there is an embedded asymmetry in the model. Even
though we could “solve” the equation above in terms of X and write

X = −α/β + (1/β)Y − (1/β)ε,
this formula does not qualify for being a structural equation too. The explanation
is that if X and ε are independent then Y and ε are not! ⋄
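A simulation sketch of the linear regression structural equation (arbitrary parameter values α = 1, β = 0.5 and σ = 0.3):
alpha <- 1; beta <- 0.5; sigma <- 0.3
x <- rnorm(200)                            # the X variable
eps <- rnorm(200, 0, sigma)                # noise, independent of x
y <- alpha + beta * x + eps                # the structural equation
plot(x, y)                                 # given X = x, Y is N(alpha + beta * x, sigma^2)
abline(alpha, beta)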
To illustrate the point with causality and the asymmetry of structural equations lets
elaborate a little on the saying that “mud does not cause rain”. Assume that the
measurable level of mud is Y and that the measurable amount of rain is X and that
the model
Y = α + βX + ε
is a good model of how the mud level increases (β > 0) when it rains. Thus if we
perform the rain dance and it starts raining, then the mud level increases accordingly,
and by making it rain we have caused more mud. However, we cannot turn this
around. It won’t start raining just because we add more mud. The point is that
the correct interpretation of a structural equation is tied closely together with the
quantities that we model, and the equality sign in a structural equation should
be interpreted as an assignment from right to left. Arguably, the more suggestive
notation
Y ← α + βX + ε
could be used – similarly to the assignment operator used in R.
The result should be intuitive and good to have in mind. In words it says that
marginal transformations of independent random variables result in independent
random variables. Intuitive as this may sound it was worth a derivation, since the
definition of independence is a purely mathematical one. The result is therefore just
as much a reassurance that our concept of independence is not counter intuitive in the
sense that we cannot introduce dependence by marginally transforming independent
variables.
Other typical transformations that we will deal with are summation and taking the
maximum or taking the minimum of independent random variables.
We start with the summation and consider the transformation h : Z × Z → Z given
by
h(x1 , x2 ) = x1 + x2 .
We can use Result 2.8.2 to obtain that

p(y) = P(Y = y) = P(X1 + X2 = y) = ∑_{x ∈ Z} P(X1 + X2 = y, X2 = x) = ∑_{x ∈ Z} P(X1 = y − x, X2 = x),

hence, by independence of X1 and X2,

p(y) = ∑_{x ∈ Z} p1(y − x) p2(x).    (2.24)
Result 2.10.16. Let X1 and X2 be two independent random variables each taking values in Z and with (p1(x))_{x∈Z} and (p2(x))_{x∈Z} denoting the point probabilities for their respective distributions. Then the distribution of Y = X1 + X2 has point probabilities given by (2.24).
Remark 2.10.17. Finding the distribution of the sum of three or more random
variables can then be done iteratively. That is, if we want to find the distribution of
X1 + X2 + X3 where X1 , X2 and X3 are independent then we rewrite
X1 + X2 + X3 = (X1 + X2 ) + X3
and first find the distribution (i.e. the point probabilities) of Y = X1 + X2 using
Result 2.10.16. Then the distribution of Y + X3 is found again using Result 2.10.16
(and independence of Y and X3 , cf. Result 2.10.15). Note that it doesn’t matter how
we place the parentheses.
Example 2.10.18 (Sums of Poisson Variables). Let X1 and X2 be independent
Poisson distributed with parameter λ1 and λ2 respectively. The point probabilities
for the distribution of X1 + X2 are then given by
y
X λy−x λx
p(y) = exp(−λ1 ) 1
exp(−λ2 ) 2
(y − x)! x!
x=0
y
X λy−x
1 λx2
= exp(−(λ1 + λ2 ))
(y − x)!x!
x=0
y
1 X y!
= exp(−(λ1 + λ2 )) λx λy−x
y! x=0 x!(y − x)! 2 1
Using the binomial formula for the last sum we obtain that
p(y) = exp(−(λ1 + λ2)) (λ1 + λ2)^y / y!.
This shows that the distribution of X1 + X2 is a Poisson distribution with parameter
λ1 + λ2 . ⋄
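As a small numerical check of the example, the convolution formula (2.24) applied to two Poisson point probability vectors can be compared with the Poisson point probabilities with parameter λ1 + λ2; the sketch below uses arbitrary illustrative values.

  lambda1 <- 2; lambda2 <- 3
  y <- 0:10
  # convolution (2.24): p(y) = sum over x of p1(y - x) * p2(x)
  p <- sapply(y, function(k) sum(dpois(k - 0:k, lambda1) * dpois(0:k, lambda2)))
  max(abs(p - dpois(y, lambda1 + lambda2)))   # essentially zero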
A derivation similar to the derivation for variables taking integer values is possible
for real valued random variables whose distributions are given by densities f1 and
f2, respectively. The difference is that the sum is replaced by an integral and the point probabilities by densities.
Result 2.10.19. If X1 and X2 are independent real valued random variables with
distributions having density f1 and f2 , respectively, the density, g, for the distribution
of Y = X1 + X2 is given by
g(y) = ∫_{−∞}^{∞} f1(y − x) f2(x) dx.
Math Box 2.10.2 (Sums of continuous variables). There is, in fact, a technical
problem just copying the derivation for integer valued variables to real valued
variables when we want to compute the density for the sum of two variables. The
problem is that all outcomes x have probability zero of occurring. An alternative derivation is based on Example 2.10.4. A substitution of x1 with x1 − x2 and an interchange of the integration order yields the formula
F(y) = ∫_{−∞}^{y} ∫_{−∞}^{∞} f1(x1 − x2) f2(x2) dx2 dx1
for the distribution function of the sum. From this we read off directly that the density is the inner integral
∫_{−∞}^{∞} f1(y − x2) f2(x2) dx2
as a function of y.
Example 2.10.20. If X1 ∼ N (µ1 , σ12 ) and X2 ∼ N (µ2 , σ22 ) are two independent
normally distributed random variables then
X1 + X2 ∼ N(µ1 + µ2, σ1² + σ2²),
thus the sum is also normally distributed. Indeed, to simplify computations assume
that µ1 = µ2 = 0 and σ1 = σ2 = 1, in which case the claim is that Y ∼ N (0, 2). The
trick is to make the following observation
(y − x)² + x² = y² − 2xy + 2x² = y²/2 + 2(x − y/2)²,
which leads to
g(y) = ∫_{−∞}^{∞} (1/√(2π)) exp(−(y − x)²/2) · (1/√(2π)) exp(−x²/2) dx
= (1/√(4π)) exp(−y²/4) ∫_{−∞}^{∞} (1/√π) exp(−(x − y/2)²) dx
= (1/√(4π)) exp(−y²/4).
We used that the integral above for fixed y is the integral of the density for the
N(y/2, 1/2) distribution, which is thus 1. Then observe that g is the density for the
N (0, 2) distribution as claimed. For general µ1 , µ2 , σ1 and σ2 we can make the same
computation with a lot of bookkeeping, but this is not really the interesting message
here. The interesting message is that the sum of two normals is normal. ⋄
If we continue adding normally distributed random variables we just end up with normally distributed random variables whose means and variances add up accordingly. If X1, . . . , Xn are iid N(µ, σ²) then we find that
S = X1 + . . . + Xn ∼ N(nµ, nσ²).
Thus from Example 2.9.9 the distribution of the sample mean
Y = (1/n) S = (1/n)(X1 + . . . + Xn)
is N(µ, σ²/n). This property of the normal distribution is quite exceptional. It is so
exceptional that the sample mean of virtually any collection of n iid variables strives
towards being normally distributed. The result explaining this fact is one of the
gems, if not the biggest then at least among the really great ones, in probability
theory. This is the Central Limit Theorem or CLT for short.
Result 2.10.21. If X1 , . . . , Xn are iid with mean µ and variance σ 2 then
(1/n)(X1 + . . . + Xn) ∼ N(µ, σ²/n)   (approximately).   (2.27)
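The CLT is easy to illustrate by simulation. The sketch below repeatedly computes the mean of n iid exponential variables (which have mean 1 and variance 1 when the rate is 1) and compares the resulting distribution with the normal approximation; all numbers are chosen for illustration.

  n <- 50; B <- 2000
  means <- replicate(B, mean(rexp(n, rate = 1)))
  qqnorm(means); qqline(means)          # points fall close to a straight line
  c(mean(means), var(means), 1 / n)     # empirical mean near 1, variance near 1/n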
Result 2.10.22. If X1, . . . , Xn are n iid real valued random variables with distribution function F then the distribution function, G1, for
Y1 = max(X1, X2, . . . , Xn)
is given as
G1(x) = F(x)^n,
and the distribution function, G2, for
Y2 = min(X1, X2, . . . , Xn)
is given as
G2(x) = 1 − (1 − F(x))^n.
From Result 2.10.22 we find that the distribution function for Y = max{X1, . . . , Xn} is
G(x) = F(x)^n = exp(−e^{−x})^n = exp(−n e^{−x}) = exp(−e^{−(x − log n)}).
From this we see that the distribution of Y is also a Gumbel distribution with location parameter log n. ⋄
If instead X1, . . . , Xn are iid exponentially distributed with parameter λ > 0, the distribution function is
F(x) = 1 − exp(−λx).
From Result 2.10.22 we find that the distribution function for Y = min{X1, . . . , Xn} is
G2(x) = 1 − (1 − F(x))^n = 1 − exp(−nλx),
that is, the minimum is exponentially distributed with parameter nλ.
2.11 Simulations
random numbers, which can lead to wrong results. If you use the standard generator
provided by R you are not likely to run into problems, but it is not guaranteed
that all programming languages provide a useful standard pseudo random number
generator. If in doubt you should seek qualified assistance. Secondly, the pseudo
random number generator is always initialized with a seed – a number that tells the
generator how to start. Providing the same seed will lead to the same sequence of
numbers. If we need to run independent simulations we should be cautious not to
restart the pseudo random number generator with the same seed. It is, however, an
advantage when debugging programs that you can always provide the same sequence
of random numbers. Most pseudo random number generators have some default way
of setting the seed if no seed is provided by the user.
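In R the seed is set with set.seed, and the following small sketch shows the reproducibility property described above.

  set.seed(112)
  u1 <- runif(3)
  set.seed(112)       # restarting the generator with the same seed ...
  u2 <- runif(3)
  identical(u1, u2)   # ... reproduces exactly the same numbers: TRUE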
For the rest of this section we will assume that we have access to a sequence
U1 , U2 , . . . , Un of iid random variables uniformly distributed on [0, 1], which in prac-
tice are generated using a pseudo random number generator. This means that we can regard the numbers u1, . . . , un produced by the generator as realizations of U1, . . . , Un. To simulate a discrete random variable X with point probabilities (p(x)) we choose disjoint intervals I(x) ⊆ [0, 1] of length p(x) such that
xi = x if ui ∈ I(x).
Then P(Xi = x) = P(Ui ∈ I(x)) = p(x). This shows that the algorithm indeed simulates random variables with the desired distribution.
In practice we need to choose the intervals suitably. If the random variable, X, that we want to simulate takes values in N a possible choice of I(x) is
I(x) = (F(x − 1), F(x)],
which has length F(x) − F(x − 1) = p(x) as required, and the intervals are clearly disjoint. To easily compute these intervals we need easy access to the distribution function
For the simulation of real valued random variables there is also a generic solution
that relies on the knowledge of the distribution function.
Definition 2.11.2. Let F : R → [0, 1] be a distribution function. A function
F ← : (0, 1) → R
that satisfies
F (x) ≥ y ⇔ x ≥ F ← (y) (2.28)
for all x ∈ R and y ∈ (0, 1) is called a generalized inverse of F .
There are a few important comments related to the definition above. First of all
suppose that we can solve the equation
F (x) = y
for all x ∈ R and y ∈ (0, 1), yielding an inverse function F −1 : (0, 1) → R of F that
satisfies
F (x) = y ⇔ x = F −1 (y). (2.29)
Then the inverse function is also a generalized inverse function of F . However, not
all distribution functions have an inverse, but all distribution functions have a gen-
eralized inverse and this generalized inverse is in fact unique. We will not show this
although it is not particularly difficult. What matters in practice is whether we can
find the (generalized) inverse of the distribution function. Note also, that we do not
really want to define the value of F ← in 0 or 1. Often the only possible definition is
F ← (0) = −∞ and F ← (1) = +∞.
At this point the generalized inverse is useful because it is, in fact, a quantile function.
To see this, first observe that with x = F←(y) we have x ≥ F←(y), hence
F(F←(y)) ≥ y
by the definition of F←. On the other hand, suppose that there exists a y ∈ (0, 1)
and an ε > 0 such that F (F ← (y) − ε) ≥ y then again by the definition of F ← it
follows that
F ← (y) − ε ≥ F ← (y),
which cannot be the case. Hence there is no such y ∈ (0, 1) and ε > 0 and
F (F ← (y) − ε) < y
for all y ∈ (0, 1) and ε > 0. This shows that F ← is a quantile function.
Result 2.11.3. The generalized inverse distribution function F ← is a quantile func-
tion.
There may exist other quantile functions besides the generalized inverse of the distri-
bution function, which are preferred from time to time. If F has an inverse function
then the inverse is the only quantile function, and it is equal to the generalized
inverse.
We will need the generalized inverse to transform the uniform distribution into
any distribution we would like. The uniform distribution on [0, 1] has distribution function
G(x) = x
for x ∈ [0, 1]. If U is uniformly distributed on [0, 1], then by (2.28)
F←(U) ≤ x ⇔ U ≤ F(x),
hence P(F←(U) ≤ x) = P(U ≤ F(x)) = G(F(x)) = F(x).
Note that it doesn't matter that we have defined F← on (0, 1) only, since the uniform random variable U takes values in (0, 1) with probability 1. We have derived the
following result.
Result 2.11.4. If F ← : (0, 1) → R is the generalized inverse of the distribution
function F : R → [0, 1] and if U is uniformly distributed on [0, 1] then the distribution
of
X = F ← (U )
has distribution function F .
The result above holds for all quantile functions, but it is easier and more explicit
just to work with the generalized inverse, as we have done.
Example 2.11.5. The exponential distribution with parameter λ > 0 has distribu-
tion function
F (x) = 1 − exp(−λx), x ≥ 0.
The equation
F (x) = 1 − exp(−λx) = y
is solved for y ∈ (0, 1) by
F^{−1}(y) = −(1/λ) log(1 − y).
Thus the simulation of exponentially distributed random variables can be based on the transformation
h(y) = −(1/λ) log(1 − y).
⋄
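The derivation translates directly into a simulation: transform uniform pseudo random numbers by F^{−1}. The sketch below checks the result against the mean 1/λ of the exponential distribution; λ and n are arbitrary.

  lambda <- 2; n <- 10000
  u <- runif(n)
  x <- -log(1 - u) / lambda      # F^{-1}(U) from Example 2.11.5
  c(mean(x), 1 / lambda)         # the two numbers should be close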
Exercises
Exercise 2.11.1. Write three R functions: pgumbel, dgumbel, and qgumbel taking
three arguments x,location,scale and returning the value (in x) of the distribution
function, the density, and the inverse distribution function (the quantile function)
respectively for the Gumbel distribution with location and scale parameters given by
location and scale.
Exercise 2.11.2. Write a fourth function, rgumbel, that takes one integer argument,
n, and the location,scale arguments, and returns a vector of length n, which is a
simulated realization of n independent and identically Gumbel distributed random
variables.
⋆ Exercise 2.11.3. Let X have the geometric distribution with success probability p.
Show that the distribution function for X is
F(x) = 1 − (1 − p)^{⌊x⌋+1},
where ⌊x⌋ ∈ Z is the integer fulfilling that x − 1 < ⌊x⌋ ≤ x for x ∈ R. Define likewise
⌈x⌉ ∈ Z as the integer fulfilling that x ≤ ⌈x⌉ < x + 1 for x ∈ R.
Argue that
⌊z⌋ ≥ x ⇔ z ≥ ⌈x⌉
for all x, z ∈ R and use this to show that
F←(y) = ⌈log(1 − y)/log(1 − p)⌉ − 1
is the generalized inverse for the distribution function for the geometric distribution.
Exercise 2.11.4. Write an R-function, my.rgeom, that takes two arguments, (n,p),
and returns a simulation of n iid geometrically distributed variables with success
parameter p. Note that the operations ⌊·⌋ and ⌈·⌉ are known as floor and ceiling.
2.12 Local alignment - a case study

As a case study of some of the concepts that have been developed up to this point
in the notes, we present in this section some results about the distribution of the
score of optimally locally aligned random amino acid sequences.
This is a classical problem and a core problem in biological sequence analysis whose
solution has to be found in the realms of probability theory and statistics. Moreover,
even though the problem may seem quite specific to biological sequence analysis it
actually holds many of the general issues related to the extraction of data from large
databases and how the extraction procedure may need a careful analysis for a correct
interpretation of the results.
The local alignment problem is essentially not a biological problem but a computer
science problem of finding substrings of two strings that match well. Given two
strings of letters can we compute two substrings of letters – one from each – that
maximize the length of matching letters? In the words ABBA and BACHELOR we can
find the substring BA in both but no common substring of length 3. The stringent requirement of exact matching is often too hard. If we try to match BACHELOR and COUNCILLOR
we could get an exact match of length 3 of LOR but allowing for two mismatches
we can match CHELOR and CILLOR of length 6. It may also be an idea to allow for
gaps in the matching. If we try to match COUNCILLOR with COUNSELOR we can match
COUN--CILLOR with COUNSE---LOR where we have introduced two gaps in the former
string and three in the latter.
When we introduce mismatches and gaps we will penalize their occurrences and
formally we end up with a combinatorial optimization problem where we optimize
a score over all selections of substrings. The score is a sum of (positive) match
contributions and (negative) mismatch and gap contributions. The computational
solution known as the Smith-Waterman algorithm is a so-called dynamic program-
ming algorithm, which can solve the optimization problem quite efficiently.
In the context of biological sequences, e.g. proteins regarded as a string of letters
from the amino acid alphabet, there are many implementations of the algorithm
and it is in fact not awfully complicated to write a simple implementation yourself.
The actual algorithm is, however, outside of the scope of these notes. In R there is
an implementation of several alignment algorithms, including the Smith-Waterman
algorithm for local alignment, in the pairwiseAlignment function from the package
Biostrings. The widely used BLAST program and friends offer a heuristic that
improves the search time tremendously and makes it feasible to locally align a protein
to the entire database of known protein sequences.
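As an indication of how such an alignment can be computed in R, the following sketch uses pairwiseAlignment from Biostrings; the two sequences are made-up toy strings, and the exact argument conventions (for instance the sign of the gap penalties) may differ slightly between versions of the package.

  library(Biostrings)
  data(BLOSUM50)                               # the BLOSUM50 scoring matrix
  s1 <- "HEAGAWGHEE"; s2 <- "PAWHEAE"           # toy amino acid sequences
  pairwiseAlignment(s1, s2, type = "local",
                    substitutionMatrix = BLOSUM50,
                    gapOpening = 10, gapExtension = 3)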
The purpose of the computation of the local alignment is to find parts of the protein
that share evolutionary or functional relations to other proteins in the database. The
algorithm will always produce a best optimal local alignment disregarding whether
the best alignment has any meaning or is just a coincidence due to the fact that a
database contains a large number of different sequences. Thus we need to be able to
tell if a local alignment with a given score is something we would expect to see by
chance or not.
Assume that X1 , . . . , Xn and Y1 , . . . , Ym are in total n + m random variables with
values in the 20 letter amino acid alphabet
E = {A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}.
We regard the first n letters X1 , . . . , Xn as one sequence (one protein or a protein
sequence database) and Y1 , . . . , Ym as another sequence (another protein or protein
Table 2.1: A selection of values for the λ and K parameters for different choices of
scoring schemes in the local alignment algorithm. The entries marked by − are scor-
ing schemes where the approximation (2.30) breaks down. The values presented here
are from Altschul, S.F. and Gish, W. Local Alignment Statistics. Methods in Enzy-
mology, vol 266, 460-480. They are computed (estimated) on the basis of alignments
of simulated random amino acid sequences using the Robinson-Robinson frequencies.
On Figure 2.20 we see the density for the location-scale transformed Gumbel distri-
butions, taking n = 1, 000 and m = 100, 000, that approximate the distribution of
the maximal local alignment score using either BLOSUM62 or BLOSUM50 together
with the affine gap penalty function. We observe that for the BLOSUM50 matrix
the distribution is substantially more spread out.
If n = n1 + n2, the approximation given by (2.30) implies that
P(Sn,m ≤ x) ≃ P(Sn1,m ≤ x) P(Sn2,m ≤ x).
Figure 2.20: The density for the Gumbel distribution that approximates the maximal
local alignment score with n = 1, 000 and m = 100, 000 using an affine gap penalty
function with gap open penalty 12 and gap extension penalty 2 together with a
BLOSUM62 (left) or BLOSUM50 (right) scoring matrix.
This (approximate) equality says that the distribution of Sn,m behaves (approxi-
mately) as if Sn,m = max{Sn1 ,m , Sn2 ,m } and that Sn1 ,m and Sn2 ,m are independent.
The justification for the approximation (2.30) is quite elaborate. There are theoret-
ical (mathematical) results justifying the approximation for ungapped local align-
ment. When it comes to gapped alignment, which is of greater practical interest,
there are some theoretical results suggesting that (2.30) is not entirely wrong, but
there is no really satisfactory theoretical underpinning. On the other hand, a number
of detailed simulation studies confirm that the approximation works well – also for
gapped local alignment. For gapped, as well as ungapped, local alignment there exist
additional corrections to (2.30) known as finite size corrections, which reflect the fact
that (2.30) works best if n and m are large – and to some extent of the same order.
Effectively, the corrections considered replace the product nm by a smaller number
n′ m′ where n′ < n and m′ < m. We will not give details here. Another issue is how
the parameters λ and K depend upon the scoring mechanism and the distribution
of the letters. This is not straightforward – even in the case of ungapped local
alignment, where we have analytic formulas for computing λ and K. In the gapped
case, the typical approach is to estimate values of λ and K from simulations.
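To give an impression of how the approximation is used in practice, the sketch below assumes that (2.30) has the usual Karlin-Altschul/Gumbel form P(Sn,m ≤ s) ≈ exp(−Knm e^{−λs}); the numbers used for λ, K and the score are placeholders and must in a real computation be taken from Table 2.1 (or estimated) for the scoring scheme actually used.

  p.align <- function(s, n, m, lambda, K) {
    # approximate probability of a maximal local alignment score of s or larger
    1 - exp(-K * n * m * exp(-lambda * s))
  }
  p.align(s = 70, n = 1000, m = 100000, lambda = 0.25, K = 0.05)  # illustrative only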
Exercises
Alice and Bob work in Professor A.K.’s laboratory on a putative serine protease:
Bob does a local alignment of the two proteins using the Smith-Waterman algo-
rithm with the BLOSUM50 scoring matrix and affine gap penalty function with gap
open penalty 10 and gap extension penalty 3. He uses an implementation called SSearch and finds the (edited) alignment:
Note that the output shows in the upper right corner that the length of one of the sequences is 480. The other one has length 360.
Alice is not so keen on Bob's theory and she does a database search against the set of Human proteins using BLAST with default parameter settings (BLOSUM62, gap open 11, gap extension 1; the Human protein database contains at the time approx. 13,000,000
amino acids). She finds the best alignment of the serine protease with the Human
serine peptidase far down the list
Score = 61
Identities = 26/95 (27%), Positives = 40/95 (42%), Gaps = 15/95 (15%)
Alice believes that this provides her with evidence that Bob's theory is wrong. Who
is A. K. going to believe?
Exercise 2.12.1. Compute the probability – using a suitable model – for each of
the local alignments above of getting a local alignment with a score as large or larger
than the given scores. Pinpoint the information you need to make the computations.
Discuss the assumptions upon which your computations rest. What is the conclusion?
2.13 Multivariate distributions

If X1, . . . , Xn are random variables taking values in the sample spaces E1, . . . , En, respectively, we can bundle them into a single variable X = (X1, . . . , Xn) taking values in the product space
E1 × . . . × En = {(x1, . . . , xn) | x1 ∈ E1, . . . , xn ∈ En}.
The product space is the set of n-tuples with the i’th coordinate belonging to Ei
for i = 1, . . . , n. Each of the variables can represent the outcome of an experiment,
and we want to consider all the experiments simultaneously. To do so, we need to
define the distribution of the bundled variable X that takes values in the product
space. Thus we need to define a probability measure on the product space. In this
case we talk about the joint distribution of the random variables X1 , . . . , Xn , and
we often call the distribution on the product space a multivariate distribution or a
multivariate probability measure. We do not get the joint distribution automatically
from the distribution of each of the random variables – something more is needed. We
need to capture how the variables interact, and for this we need the joint distribution.
If the joint distribution is P and A is an event having the product form
A = A1 × . . . × An = {(x1, . . . , xn) | x1 ∈ A1, . . . , xn ∈ An},
then we write
P(A) = P(X1 ∈ A1, . . . , Xn ∈ An).
The right hand side is particularly convenient, for if some set Ai equals the entire
sample space Ei , it is simply left out. Another convenient, though slightly technical
point, is that knowledge of the probability measure on product sets specifies the
measure completely. This is conceptually similar to the fact that the distribution
function uniquely specifies a probability measure on R.
Example 2.13.1. Let X1 , X2 , and X3 be three real valued random variables, that
is, n = 3 and E1 = E2 = E3 = R. Then the bundled variable X = (X1 , X2 , X3 ) is a
three dimensional vector taking values in R³. If A1 = A2 = [0, ∞) and A3 = R then
P(A) = P(X1 ≥ 0, X2 ≥ 0, X3 ∈ R) = P(X1 ≥ 0, X2 ≥ 0). ⋄
The i'th marginal distribution is the distribution of the coordinate Xi, given by Pi(A) = P(Xi ∈ A) for A ⊆ Ei.
If the sample spaces that enter into a bundling are discrete, so is the product space –
the sample space of the bundled variable. The distribution can therefore in principle
be defined by point probabilities.
Example 2.13.3. If two DNA-sequences that encode a protein (two genes) are
evolutionary related, then typically there is a pairing of each nucleotide from one
sequence with an identical nucleotide from the other with a few exceptions due
to mutational events (an alignment). We imagine in this example that the only
mutational event occurring is substitution of nucleic acids. That is, one nucleic acid
at the given position can mutate into another nucleic acid. The two sequences can
therefore be aligned in a letter by letter fashion without gaps, and we are going
to consider just a single aligned position in the two DNA-sequences. We want a
probabilistic model of the pair of letters occurring at that particular position. The
sample space is going to be the product space
E = {A, C, G, T} × {A, C, G, T},
and we let X and Y denote the random variables representing the two aligned
nucleic acids. To define the joint distribution of X and Y , we have to define point
probabilities p(x, y) for (x, y) ∈ E. It is convenient to organize the point probabilities
in a matrix (or array) instead of as a vector. Consider for instance the following
matrix
A C G T
A 0.1272 0.0063 0.0464 0.0051
C 0.0196 0.2008 0.0082 0.0726
G 0.0556 0.0145 0.2151 0.0071
T 0.0146 0.0685 0.0069 0.1315
As we can see the probabilities occurring in the diagonal are (relatively) large and
those outside the diagonal are small. If we let A = {(x, y) ∈ E | x = y} denote the
event that the two nucleic acids are identical, then
P(X = Y) = P(A) = Σ_{(x,y)∈A} p(x, y) = 0.1272 + 0.2008 + 0.2151 + 0.1315 = 0.6746.
This means that the probability of obtaining a pair of nucleic acids with a mutation is P(X ≠ Y) = P(Aᶜ) = 1 − P(A) = 0.3254. ⋄
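The same computation is easily done in R once the point probabilities are stored in a matrix; the sketch below simply reproduces the numbers of the example.

  p <- matrix(c(0.1272, 0.0063, 0.0464, 0.0051,
                0.0196, 0.2008, 0.0082, 0.0726,
                0.0556, 0.0145, 0.2151, 0.0071,
                0.0146, 0.0685, 0.0069, 0.1315),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "C", "G", "T"), c("A", "C", "G", "T")))
  sum(diag(p))       # P(X = Y)  = 0.6746
  1 - sum(diag(p))   # P(X != Y) = 0.3254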
Note that from the fact that P(X ∈ Rⁿ) = 1, a density f must fulfill that
∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x1, . . . , xn) dxn . . . dx1 = 1.
Remember also the convention about our formulations: Every time we talk about
random variables, we are in fact only interested in their distribution, which is a prob-
ability measure on the sample space. The content of the definition above is therefore
that the distribution – the probability measure – is given simply by specifying a
density. One of the deeper results in probability theory states that if a function f ,
defined on Rn and with values in [0, ∞), integrates to 1 as above, then it defines a
probability measure on Rn .
If you feel uncomfortable with the integration, remember that if Ai = [ai , bi ] with
ai, bi ∈ R for i = 1, . . . , n then
P(a1 ≤ X1 ≤ b1, . . . , an ≤ Xn ≤ bn) = ∫_{a1}^{b1} ··· ∫_{an}^{bn} f(x1, . . . , xn) dxn . . . dx1.
Consequently we can see that the marginal distribution of Xi also has a density that is given by the function
xi ↦ ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x1, . . . , xn) dxn . . . dx_{i+1} dx_{i−1} . . . dx1,
an (n − 1)-fold integral over all coordinates except xi.
In a similar manner, one can compute the density for the distribution of any subset of
coordinates of an n-dimensional random variable, whose distribution has a density,
by integrating out over the other coordinates.
Example 2.13.6 (Bivariate normal distribution). Consider the function
f(x, y) = (√(1 − ρ²)/(2π)) exp(−(x² − 2ρxy + y²)/2)
for ρ ∈ (−1, 1). We will first show that this is a density for a probability measure
on R2 , thus that it integrates to 1, and then find the marginal distributions. The
numerator in the exponent in the exponential function can be rewritten as
x² − 2ρxy + y² = (x − ρy)² + (1 − ρ²)y²,
hence
∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(−(x² − 2ρxy + y²)/2) dy dx
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(−(x − ρy)²/2 − (1 − ρ²)y²/2) dy dx
= ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} exp(−(x − ρy)²/2) dx ) exp(−(1 − ρ²)y²/2) dy.
Figure 2.21: Two examples of the density for the bivariate normal distribution as
considered in Example 2.13.6 with ρ = 0 (left) and ρ = 0.75 (right).
The inner integral can be computed for fixed y using substitution and knowledge
about the one-dimensional normal distribution. With z = x − ρy we have dz = dx,
hence
∫_{−∞}^{∞} exp(−(x − ρy)²/2) dx = ∫_{−∞}^{∞} exp(−z²/2) dz = √(2π),
where the last equality follows from the fact that the density for the normal distribution on R is
(1/√(2π)) exp(−x²/2),
which integrates to 1.
We see that the inner integral does not depend upon y and is constantly equal to √(2π), thus another substitution with z = y√(1 − ρ²) (using that ρ ∈ (−1, 1)) such that dz = √(1 − ρ²) dy gives
∫_{−∞}^{∞} ∫_{−∞}^{∞} exp(−(x² − 2ρxy + y²)/2) dy dx = √(2π) ∫_{−∞}^{∞} exp(−(1 − ρ²)y²/2) dy
= (√(2π)/√(1 − ρ²)) ∫_{−∞}^{∞} exp(−z²/2) dz
= 2π/√(1 − ρ²),
where we once more use that the last integral equals √(2π).
(positive) function f integrates to 1, and it is therefore a density for a probability
measure on R2 .
Suppose that the distribution of (X, Y ) has density f on R2 , then we have almost
computed the marginal distributions of X and Y by the integrations above. The
marginal distribution of Y has by Result 2.13.5 density
f2(y) = (√(1 − ρ²)/(2π)) ∫_{−∞}^{∞} exp(−(x² − 2ρxy + y²)/2) dx
= (1/√(2π(1 − ρ²)^{−1})) exp(−y²/(2(1 − ρ²)^{−1})).
Here we used, as argued above, that the integration over x gives √(2π), and then we
have rearranged the expression a little. From (2.20) in Example 2.9.9 we recognize
this density as the density for the normal distribution with scale parameter σ 2 =
(1 − ρ2 )−1 and location parameter µ = 0. Note that for ρ = 0 we have σ 2 = 1
and for ρ → ±1 we have σ 2 → ∞. The density is entirely symmetric in the x- and
y-variables, so the marginal distribution of X is also N (0, (1 − ρ2 )−1 ).
This probability measure is called a bivariate normal or Gaussian distribution. The
example given above contains only a single parameter, ρ, whereas the general bivari-
ate normal distribution is given in terms of five parameters. The general bivariate density is given as
f(x, y) = (√(λ1λ2 − ρ²)/(2π)) exp(−(1/2)(λ1(x − ξ1)² − 2ρ(x − ξ1)(y − ξ2) + λ2(y − ξ2)²))   (2.32)
where ξ1, ξ2 ∈ R, λ1, λ2 ∈ (0, ∞) and ρ ∈ (−√(λ1λ2), √(λ1λ2)). For this density it is possible to go through similar computations as above, showing that it integrates to 1. If the distribution of (X, Y) is given by f it is likewise possible to compute the marginal distributions, where X ∼ N(ξ1, λ2/(λ1λ2 − ρ²)) and Y ∼ N(ξ2, λ1/(λ1λ2 − ρ²)). ⋄
Math Box 2.13.1 (Multivariate normal distributions). We can define the family
of n-dimensional regular normal or Gaussian distributions via their densities on
Rn for n ≥ 1. The measures are characterized in terms of a vector ξ ∈ Rn and a
positive definite symmetric n × n matrix Λ. Positive definite means that for any
vector x ∈ Rⁿ we have xᵗΛx > 0. Consequently one can show that the positive function x ↦ exp(−(1/2)(x − ξ)ᵗΛ(x − ξ)) has a finite integral over Rⁿ and that
∫_{Rⁿ} exp(−(1/2)(x − ξ)ᵗΛ(x − ξ)) dx = √((2π)ⁿ/det Λ).
Here det Λ is the determinant of the matrix Λ. The density for the n-dimensional regular normal distribution with parameters ξ and Λ is then
f(x) = √(det Λ/(2π)ⁿ) exp(−(1/2)(x − ξ)ᵗΛ(x − ξ)).
One can then show that the i'th marginal distribution is the normal distribution with location parameter ξi and scale parameter Σii, where Σ = Λ^{−1}.
It is possible to define a multivariate normal distribution for positive semi-definite
matrices Λ that only fulfill that xt Λx ≥ 0, but it requires a little work. If we don’t
have strict inequality, there is no density, and the multivariate normal distribution
is called singular.
The product of the marginal densities, f1(x)f2(y), is seen to be equal to f(x, y) from Example 2.13.6 if and only if ρ = 0. Thus if
the distribution of (X, Y ) is given by the density f in Example 2.13.6, then X and
Y are independent if and only if the parameter ρ equals 0. ⋄
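The decomposition of the exponent used above also tells us how to simulate from the density in Example 2.13.6: Y has the N(0, (1 − ρ²)^{−1}) marginal distribution and, for given Y = y, X follows the N(ρy, 1) distribution. The sketch below uses this to check the marginal variance numerically; ρ and the number of simulations are arbitrary.

  rho <- 0.75; n <- 10000
  y <- rnorm(n, mean = 0, sd = sqrt(1 / (1 - rho^2)))  # marginal distribution of Y
  x <- rnorm(n, mean = rho * y, sd = 1)                # X given Y = y
  c(var(x), 1 / (1 - rho^2))                           # both close to (1 - rho^2)^(-1)
  cor(x, y)                                            # close to rho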
f(x, y) = (Γ(λ1 + λ2 + λ)/(Γ(λ1)Γ(λ2)Γ(λ))) x^{λ1−1} y^{λ2−1} (1 − x − y)^{λ−1}.
Figure 2.22: Two examples of the density for the bivariate Dirichlet distribution as
considered in Example 2.13.8 with λ1 = λ2 = λ = 4 (left) and λ1 = λ2 = 2 together
with λ = 6 (right).
with density
f(x1, . . . , xn) = (Γ(λ1 + . . . + λn + λ)/(Γ(λ1) ··· Γ(λn)Γ(λ))) x1^{λ1−1} x2^{λ2−1} ··· xn^{λn−1} (1 − x1 − . . . − xn)^{λ−1}
We see that the integral of the positive function f is 1, and f is the density for the
bivariate Dirichlet distribution on R2 . Since f is 0 outside the 2-dimensional unit
simplex U2 , the distribution is really a distribution living on the unit simplex. ⋄
Exercises
Exercise 2.13.1. Consider the pair of random variables (X, Y ) as in Example 2.13.3
in the notes. Thus (X, Y ) denotes the pair of aligned nucleotides in an alignment
of two DNA-sequences and the joint distribution of this pair is given by the point
probabilities
A C G T
A 0.12 0.03 0.04 0.01
C 0.02 0.27 0.02 0.06
G 0.02 0.01 0.17 0.02
T 0.05 0.03 0.01 0.12
If X and Y are real valued random variables with joint distribution P on R2 having
density f , then the probability of X being equal to x is 0 for all x ∈ R. Thus we
cannot use Definition 2.10.8 to find the conditional distribution of Y given X = x.
It is, however, possible to define a conditional distribution in a sensible way as long
as f1 (x) > 0 where f1 denotes the density for the marginal distribution of X. This
is even possible if we consider not just two real valued random variables, but in fact
if we consider two random variables X and Y with values in Rn and Rm such that
the joint distribution P on Rn+m has density f .
Definition 2.13.9 (Conditional densities). If f is the density for the joint distri-
bution of two random variables X and Y taking values in Rn and Rm , respectively,
then with
f1(x) = ∫_{Rᵐ} f(x, y) dy
denoting the density for the marginal distribution of X, the conditional density of Y given X = x is defined for f1(x) > 0 as
f(y|x) = f(x, y)/f1(x).
Equivalently, f(x, y) = f(y|x) f1(x), which reads that the density for the joint distribution is the product of the densities
for the marginal distribution of X and the conditional distribution of Y given X.
Note the analogy to (2.21) and especially (2.23) for point probabilities. It is necessary
to check that the definition really makes sense, that is, that f (y|x) actually defines
a density for all x ∈ Rⁿ with f1(x) > 0. It is positive by definition, and we also see that
∫_{Rᵐ} f(y|x) dy = (∫_{Rᵐ} f(x, y) dy)/f1(x) = 1
by (2.31).
If we let Px denote the probability measure with density y 7→ f (y|x) for given x ∈ Rn
with f1(x) > 0, i.e. the conditional distribution of Y given X = x, then for B ⊆ Rᵐ,
P(Y ∈ B | X = x) = Px(B) = ∫_B f(y|x) dy.
If moreover A ⊆ Rⁿ, then
P(X ∈ A, Y ∈ B) = P(A × B) = ∫_A ∫_B f(x, y) dy dx
= ∫_A ∫_B f(y|x) f1(x) dy dx
= ∫_A ( ∫_B f(y|x) dy ) f1(x) dx
= ∫_A Px(B) f1(x) dx.
Example 2.13.10 (Linear regression). Let X be a real valued random variable with
distribution N(0, σ1²). Define
Y = α + βX + ε   (2.34)
where α, β ∈ R and ε is independent of X with distribution N(0, σ2²). Then the conditional distribution of Y given X = x is N(α + βx, σ2²).
Exercises
Exercise 2.13.2. In this exercise we consider a dataset of amino acid pairs (x, y).
We think of the data as representing the outcome of a random variable (X, Y ). Here
X and Y represent an amino acid at a given position in two evolutionary related
proteins (same protein, but from two different species, say). The dataset may be
obtained from a (multiple) alignment of (fractions) of proteins. The only mutational
event is substitution of one amino acid for another. In this exercise you are allowed
to think of the different positions as independent, but X and Y are dependent.
• Load the aa dataset into R with data(aa) and use table to cross-tabulate
the data according to the values of x and y. Compute the matrix of relative
frequencies for the occurrence of all pairs (x, y) for (x, y) ∈ E0 × E0 where E0
denotes the amino acid alphabet.
Exercise 2.13.3. Continuing with the setup from the previous exercise we denote
by P the probability measure with point probabilities, p(x, y), being the relative
frequencies computed above. It is a probability measure on E0 × E0
• Compute the point probabilities, p1 (x) and p2 (y), for the marginal distributions
of P and show that X and Y are not independent.
• Compute the score matrix defined as
S_{x,y} = log( p(x, y) / (p1(x) p2(y)) ).
⋆⋆
Exercise 2.13.5. Let X and ε be independent real valued random variables with
distribution N (0, σ12 ) and N (0, σ22 ) respectively. Show that the density for their joint
distribution is
f(x, ε) = (1/(2πσ1σ2)) exp(−x²/(2σ1²) − ε²/(2σ2²)).
Define the transformation
h : R² → R²
by h(x, ε) = (x, α + βx + ε) and find the density for the distribution of h(X, ε).
Compare with Example 2.13.10.
Hint: First find h−1 (I × J) for I, J ⊆ R. You may think of these sets as intervals.
Then use the definitions of transformed distributions and densities to compute
Multivariate datasets from the sample space Rd with d ≥ 2 are more difficult to
visualize than a one-dimensional dataset. It is the same problem as we have with
tabulations. Two-dimensional tables are basically the limit for our comprehension.
Bivariate datasets (d = 2) can also be visualized for instance via scatter plots and
bivariate kernel density estimation, but for multivariate datasets we usually have to
rely on visualizing univariate or bivariate transformations.
If xi = (xi1, xi2) ∈ R² is bivariate, a simple plot of xi2 against xi1 for i = 1, . . . , n is called a scatter plot. It corresponds to a one-dimensional rug plot, but the two dimensions actually give us a better visualization of the data in this case. The scatter
plot may highlight some simple dependence structures in the data. For instance, if
the conditional distribution of Xi2 given Xi1 = xi1 is N (α + βxi1 , σ22 ) as in Example
2.13.10 with α, β ∈ R, then this will show up on the scatter plot (if β ≠ 0) as a
tendency for the points to lie around a line with slope β. How obvious this is depends
on how large σ22 is compared to β and how spread out the distribution of Xi1 is.
Example 2.14.1. We simulate n = 100 bivariate iid random variables (Xi1 , Xi2 )
for i = 1, . . . , 100 where the distribution of Xi1 ∼ N (0, 1) and the conditional dis-
tribution of Xi2 given Xi1 = xi1 is N (βxi1 , 1) with β = 0, 0.5, 1, 2. The resulting
scatter plots are shown in Figure 2.23. The dependence between Xi1 and Xi2 is de-
termined by β, and as shown in Exercise 2.13.4, the variables are independent if and
only if β = 0. From the figure we observe how the dependence through β influences
the scatter plot. As β becomes larger, the points clump around a line with slope β.
For small β, such as β = 0.5 here, the scatter plot is harder to distinguish from the scatter plot with β = 0 than for the larger values of β. Thus weak dependence is not so easy to spot by eye. ⋄
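The simulations in the example are easily reproduced; the following sketch generates one of the four scatter plots (here with β = 0.5), with the seed fixed only to make the figure reproducible.

  set.seed(1)
  n <- 100; beta <- 0.5
  x1 <- rnorm(n)                    # X_i1 ~ N(0, 1)
  x2 <- rnorm(n, mean = beta * x1)  # X_i2 given X_i1 = x_i1 is N(beta * x_i1, 1)
  plot(x1, x2)
  abline(0, beta)                   # the line with slope beta the points scatter around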
Example 2.14.2 (Iris data). A classical dataset that is found in numerous textbooks
contains characteristic measurements of the Iris flower for three different species, Iris
Figure 2.23: Scatter plot of 100 simulated variables as described in Example 2.14.1.
The variables have a linear dependence structure determined by the parameter β
that ranges from 0 to 2 in these plots.
Setosa, Iris Virginica and Iris Versicolor. The data were collected in 1935 by Edgar
Anderson, and are today available in R via data(iris). The data are organized in a data frame as follows.
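The printout of the data frame is not reproduced here; a minimal sketch for loading and inspecting the data, and for producing the pairwise scatter plots of Figure 2.24, is the following.

  data(iris)
  head(iris)                                    # the first rows of the data frame
  setosa <- subset(iris, Species == "setosa")
  plot(setosa[, 1:4])                           # pairwise scatter plots for Setosa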
There are 50 measurements for each species, thus a total of 150 rows in the data
frame. In Figure 2.24 we see two-dimensional scatter plots of the different measure-
ments against each other for the species Setosa. Restricting our attention to the
two variables Petal length and Petal width, we see in Figure 2.25 scatter plots, two-
Figure 2.24: Considering the species Setosa in the Iris dataset we see a bivariate
scatter plot of all the pair-wise combinations of the four variables in the dataset.
This is obtained in R simply by calling plot for a dataframe containing the relevant
variables.
dimensional kernel density estimates and corresponding contour curves. The kernel
density estimates are produced by kde2d from the MASS package. ⋄
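A sketch of how such a plot can be produced is given below; the bandwidth 0.8 in both directions is taken from the figure caption, while the grid size is arbitrary.

  library(MASS)
  setosa <- subset(iris, Species == "setosa")
  dens <- kde2d(setosa$Petal.Length, setosa$Petal.Width, h = c(0.8, 0.8), n = 50)
  plot(setosa$Petal.Length, setosa$Petal.Width,
       xlab = "Petal length", ylab = "Petal width")
  contour(dens, add = TRUE)                     # contour curves of the density estimate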
2.15 Transition probabilities

Recall that if we have two random variables X and Y taking values in the discrete sample space E we have the relation
P(X = x, Y = y) = P(X = x) P(Y = y | X = x).
Figure 2.25: For the Iris Setosa and Iris Versicolor we have estimated a two-
dimensional density for the simultaneous distribution of petal length and width.
In both cases we used a bivariate Gaussian kernel with bandwidth 0.8 in both di-
rections.
If we think about the two step interpretation of the joint distribution the relation
above says that the joint distribution is decomposed into a first step where X is
sampled according to its marginal distribution and then Y is sampled conditionally
on X. With such a dynamic interpretation of the sampling process it is sensible to
call P(Y = y|X = x) the transition probabilities. Note that the entire dependence
structure between X and Y is captured by the conditional distribution, and X and
Y are independent if P(Y = y|X = x) = P(Y = y) for all y ∈ E.
It is possible to specify whole families of probability measures on E × E where E is a
finite sample space, e.g. E = {A, C, G, T}, which are indexed by a time parameter t ≥ 0
Transition probabilities 113
and perhaps some additional parameters that capture the dependence structure.
This is done by specifying the transition probabilities via a system of differential
equations. We introduce Pt as the distribution of the pair of random variables (X, Y )
on E × E indexed by the time parameter t ≥ 0. We are going to assume that the
marginal distribution of X, which is given by the point probabilities
π(x) = Σ_{y∈E} Pt(x, y),
does not depend on t, and we are going to define the matrix P^t for each t ≥ 0 by
P^t(x, y) = Pt(x, y)/π(x).
That is, P t (x, y) is the conditional probability that x changes into y over the time
interval t (note that we may have x = y). There is a natural initial condition on P t
for t = 0, since for t = 0 we will assume that X = Y , thus
P⁰(x, y) = 1 if x = y, and P⁰(x, y) = 0 otherwise.
Thus to specify Pt for all t ≥ 0 by (2.36) we need only to specify a single matrix
of intensities. In a very short time interval, ∆, the interpretation of the differential
equation is that
P^{t+∆}(x, y) ≃ P^t(x, y) + ∆ Σ_{z∈E} P^t(x, z)λ(z, y).
Rearranging yields
P^{t+∆}(x, y) ≃ P^t(x, y)(1 + ∆λ(y, y)) + Σ_{z≠y} P^t(x, z)∆λ(z, y).
This equation reads that the conditional probability that x changes into y in the
time interval t + ∆ is given as a sum of two terms. The first term is the probability
P t (x, y) that x changes into y in the time interval t times the factor 1 + ∆λ(y, y).
This factor has the interpretation as the probability that y doesn’t mutate in the
short time interval ∆. The second term is a sum of terms where x changes into some
z ≠ y in the time interval t and then in the short time interval ∆ it changes from z to y with probability ∆λ(z, y). In other words, for z ≠ y the entries in the intensity
matrix have the interpretation that ∆λ(z, y) is approximately the probability that
z changes into y in the short time interval ∆, and 1 + ∆λ(y, y) is approximately the
probability that y doesn’t change in the short time interval ∆.
In Exercises 2.15.1 and 2.15.2 it is verified, as a consequence of λ(x, y) ≥ 0 for x ≠ y
and (2.37), that the solution to the system of differential equations (2.36) is indeed
a matrix of conditional probabilities, that is, all entries of P t are ≥ 0 and the row
sums are always equal to 1.
It is a consequence of the general theory for systems of linear differential equations
that there exists a unique solution to (2.36) with the given initial condition. In
general the solution to (2.36) can, however, not easily be given a simple analytic
representation, unless one accepts the exponential of a matrix as a simple analytic
expression; see Math Box 2.15.1. There we also show that the solution satisfies the
Chapman-Kolmogorov equations; for s, t ≥ 0 and x, y ∈ E
P^{t+s}(x, y) = Σ_{z∈E} P^t(x, z) P^s(z, y).   (2.38)
For some special models we are capable of obtaining closed form expressions for
the solution. This is for instance the case for the classical examples from molecular
evolution with E = {A, C, G, T} that lead to the Jukes-Cantor model and the Kimura
model introduced earlier via their conditional probabilities.
Example 2.15.1 (The Jukes-Cantor model). The Jukes-Cantor model is given by
assuming that
λ(x, y) = −3α if x = y, and λ(x, y) = α if x ≠ y.
That is, the matrix of intensities is
A C G T
A −3α α α α
C α −3α α α
G α α −3α α
T α α α −3α
The parameter α > 0 tells how many mutations occur per time unit. The solution to (2.36) is
P^t(x, x) = 0.25 + 0.75 exp(−4αt)
P^t(x, y) = 0.25 − 0.25 exp(−4αt),   x ≠ y.
Math Box 2.15.1. If we collect the intensities λ(x, y) into a matrix Λ and the conditional probabilities P^t(x, y) into a matrix P^t, then the set of differential equations defined by (2.36) can be expressed in matrix notation as
dP^t/dt = P^t Λ.
If Λ is a real number, this differential equation is solved by P t = exp(tΛ), and this
solution is also the correct solution for matrix Λ, although it requires a little more
work to understand how the exponential function works on matrices. One can take
as a definition of the exponential function the usual Taylor expansion
exp(tΛ) = Σ_{n=0}^{∞} tⁿΛⁿ/n!.
It is possible to verify that this infinite sum makes sense. Moreover, one can simply
differentiate w.r.t. the time parameter t term by term to obtain
(d/dt) exp(tΛ) = Σ_{n=1}^{∞} n tⁿ⁻¹Λⁿ/n!
= ( Σ_{n=1}^{∞} tⁿ⁻¹Λⁿ⁻¹/(n − 1)! ) Λ
= ( Σ_{n=0}^{∞} tⁿΛⁿ/n! ) Λ = exp(tΛ)Λ.
Many of the usual properties of the exponential function carry over to the expo-
nential of matrices. For instance, exp(tΛ + sΛ) = exp(tΛ) exp(sΛ), which shows
that
P t+s = P t P s .
This is the Chapman-Kolmogorov equations, (2.38), in their matrix version.
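As a small numerical check of the Math Box (a sketch, with illustrative values of α and t), the truncated Taylor series for exp(tΛ) can be compared with the closed form solution for the Jukes-Cantor model stated above.

  alpha <- 7.8e-4; t <- 13
  Lambda <- matrix(alpha, 4, 4); diag(Lambda) <- -3 * alpha   # Jukes-Cantor intensities
  P <- diag(4); term <- diag(4)
  for (n in 1:30) {                       # sum the first 30 terms of the Taylor series
    term <- term %*% (t * Lambda) / n
    P <- P + term
  }
  c(P[1, 1], 0.25 + 0.75 * exp(-4 * alpha * t))   # diagonal entries agree
  c(P[1, 2], 0.25 - 0.25 * exp(-4 * alpha * t))   # off-diagonal entries agree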
Example 2.15.2 (The Kimura model). Another slightly more complicated model
defined in terms of the differential equations (2.36), that admits a relatively nice
closed form solution, is the Kimura model. Here the intensity matrix is assumed to
be
A C G T
A −α − 2β β α β
C β −α − 2β β α
G α β −α − 2β β
T β α β −α − 2β
with α, β > 0. The interpretation is that a substitution of a purine with a purine or a
pyrimidine with a pyrimidine (a transition) is happening with another intensity than
a substitution of a purine with pyrimidine or pyrimidine with purine (a transversion).
We see that the intensity for transversions is λ(x, y) = β and the intensity for transitions is λ(x, y) = α. The solution is
P^t(x, x) = 0.25 + 0.25 exp(−4βt) + 0.5 exp(−2(α + β)t)
P^t(x, y) = 0.25 + 0.25 exp(−4βt) − 0.5 exp(−2(α + β)t)   if the change from x to y is a transition
P^t(x, y) = 0.25 − 0.25 exp(−4βt)   if the change from x to y is a transversion.
Exercises
⋆
Exercise 2.15.1. Let P^t(x, y) for x, y ∈ E satisfy the system of differential equations
dP^t(x, y)/dt = Σ_{z∈E} P^t(x, z)λ(z, y)
with λ(x, y) ≥ 0 for x ≠ y and λ(x, x) = −Σ_{y≠x} λ(x, y). Define
s_x(t) = Σ_y P^t(x, y)
as the “row sums” of P t for each x ∈ E. Show that dsx (t)/dt = 0 for t ≥ 0. Argue
that with the initial condition, P 0 (x, y) = 0, x 6= y and P 0 (x, x) = 1, then sx (0) = 1
and conclude that sx (t) = 1 for all t ≥ 0.
⋆⋆
Exercise 2.15.2. We consider the same system of differential equations as above with the same initial condition. Assume that λ(x, y) > 0 for all x ≠ y. Use the sign of the resulting derivative
dP⁰(x, y)/dt = λ(x, y)
at 0 to argue that for a sufficiently small ε > 0, P t (x, y) ≥ 0 for t ∈ [0, ε]. Use this fact
together with the Chapman-Kolmogorov equations, (2.38), to show that P t (x, y) ≥ 0
for all t ≥ 0.
3 Statistical models and inference
Given one or more realizations of an experiment with sample space E, which prob-
ability measure P on E models – or describes – the experiment adequately? This is
the crux of statistics – the inference of a suitable probability measure or aspects of
a probability measure from empirical data. What we consider in the present chapter
is the problem of estimation of one or more parameters that characterize the prob-
ability measure with the main focus on using the maximum likelihood estimator.
[Figure: histogram of the neuron interspike time data, shown on the density scale.]
Figure 3.2 shows a QQ-plot of the observations against the quantiles for the expo-
nential distribution, which seems to confirm that the exponential distribution is a
suitable choice.
A more careful analysis will, however, reveal that there is a small problem. The
problem can be observed by scrutinizing the rug plot on the histogram, which shows
that there is a clear gap from 0 to the smallest observations. For the exponential
distribution one will find a gap of roughly the same order of magnitude as the
Figure 3.2: The QQ-plot of the neuron interspike times against the quantiles for the
exponential distribution.
distances between the smallest observations – for the dataset the gap is considerably
larger.
It is possible to take this into account in the model by including an extra, unknown
parameter µ ≥ 0 such that Xi −µ is exponentially distributed with parameter λ > 0.
This means that Xi has a distribution with density
f(x) = λ exp(−λ(x − µ)),   x ≥ µ.
estimate should it be the case that such a bound exists. It is well documented in
the literature that after a spike there is a refractory period where the cell cannot
fire, which can explain why we see the gap between 0 and the smallest observations.
However, if we compute an estimate µ̂ > 0 based on the dataset, we obtain a resulting
model where a future observation of an interspike time smaller than µ̂ is in direct
conflict with the model. Unless we have physiological substantive knowledge that
supports an absolute lower bound we must be skeptical about an estimate. There is
in fact evidence for a lower bound, whose value is of the order of one millisecond – too
small to explain the gap seen in the data. In conclusion, it is desirable to construct a
refined model, which can explain the gap and the exponential-like behavior without
introducing an absolute lower bound. ⋄
(x1 , y1 ), . . . , (xn , yn )
We also need to specify the marginal distribution of the first x-nucleotide. We may
here simply take an arbitrary probability measure on {A, C, G, T} given by point
probabilities (p(A), p(C), p(G), p(T)). This specifies the full distribution of (X1, Y1) by
P(X1 = x, Y1 = y) = p(x) P^t(x, y).
The point probabilities using the Kimura model from Example 3.3.18 are given the
same way just by replacing the conditional probabilities by those from Example
3.3.18, which also introduces the extra parameter β > 0.
If we dwell for a moment on the interpretation of the Jukes-Cantor model, the
conditional probability P^t(x, x) is the probability that a nucleotide does not
change over the time period considered. For the hepatitis C virus dataset considered
in Example 1.2.4, Table 1.2 shows for instance that out of the 2610 nucleotides in
segment A there are 78 that have mutated over the period of 13 years leaving 2532
unchanged. With reference to the frequency interpretation we can try to estimate α
in the Jukes-Cantor model by equating the formula for P^t(x, x) to 2532/2610. This gives
α̂ = −log((2532/2610 − 1/4) · (4/3)) / (4 × 13) = 7.8 × 10⁻⁴.
⋄
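The estimate is a one-line computation in R:

  alphahat <- -log((2532/2610 - 1/4) * 4/3) / (4 * 13)
  alphahat    # approximately 7.8e-04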
where A denotes m possible repeat counts observable for TH01. The situation is
the same as with genes where we may have two alleles and we can observe only
three allele combinations. Here the number of different alleles is m and we can
observe m(m + 1)/2 different allele combinations. For TH01 we can take A =
{5, 6, 7, 8, 9, 9.3, 10, 11} for the purpose of this example. We note that 9.3 is also
a possibility, which corresponds to a special variant of the repeat pattern with 9 full
occurrences of the fundamental repeat pattern AATG but with the partial pattern
ATG between the 6th and 7th occurrence.
y
5 6 7 8 9 9.3 10 11
5 0 0 1 0 0 0 0 0
6 19 31 7 16 48 0 0
7 10 9 11 43 0 0
x 8 3 9 17 3 0
9 6 19 1 1
9.3 47 1 0
10 0 0
11 0
Table 3.1: Tabulation of the repeat counts for the short tandem repeat TH01 for the
Caucasian population included in the NIST STR dataset.
For the NIST dataset we can tabulate the occurrences of the different repeat counts
for TH01. Table 3.1 shows the tabulation for the Caucasian population in the dataset.
The dataset also includes data for Hispanics and African-Americans.
To build a model we regard the observations above as being realizations of iid random
variables all with the same distribution as a pair (X, Y ) that takes values in E. The
full model – sometimes called the saturated model – is given as the family of all
probability measures on E. The set of point probabilities
Θ = {(p(x, y))_{(x,y)∈E} | Σ_{x,y} p(x, y) = 1}
w
5 6 8 9 10 11 12 13
5 0 0 1 0 0 0 0 0
6 0 0 1 0 0 0 0
8 82 45 20 78 15 0
z 9 3 2 16 2 0
10 0 10 2 0
11 19 5 0
12 0 1
13 0
Table 3.2: Tabulation of the repeat counts for the short tandem repeat TPOX for
the Caucasian population included in the NIST STR dataset.
to the TPOX counts in the NIST dataset for the Caucasian population is found
in table 3.2. The question is whether we can assume that (X, Y ) and (Z, W ) are
independent? ⋄
x1,1 , . . . , x1,37
x2,1 , . . . , x2,42
X1,1 , . . . , X1,37
X2,1 , . . . , X2,42 .
However, we will not insist that all the variables have the same distribution. On the
contrary, the purpose is to figure out if there is a difference between the two groups.
We will, however, assume that within either of the two groups the variables do have
the same distribution.
The measurement of the gene expression level can be viewed as containing two com-
ponents. One component is the signal, which is the deterministic component of the
measurement, and the other component is the noise, which is a random component.
In the microarray setup we measure a light intensity as a distorted representation of
the true concentration of a given RNA-molecule in the cell. If we could get rid of all
experimental uncertainty and, on top of that, the biological variation what would
remain is the raw signal – an undistorted measurement of the expression level of a
particular RNA-molecule under the given experimental circumstances. Even if the
technical conditions were perfect giving rise to no noise at all, the biological vari-
ation would still remain. Indeed, this variation can also be regarded as part of the
insight into and understanding of the function of the biological cell. The bottom line
is that we have to deal with the signal as well as the noise. We are going to consider
two modeling paradigms for this signal and noise point of view; the additive and the
multiplicative noise model.
Let X denote a random variable representing the light intensity measurement of our
favorite gene. The additive noise model says that
X = µ + σε (3.1)
where µ ∈ R, σ > 0 and ε is a real valued random variable whose distribution has
mean value 0 and variance 1. This is simply a scale-location model and it follows from
Example 2.9.9 that the mean value of the distribution of X is µ and the variance
is σ². Thus µ is the mean expression of the gene – the signal – and σε captures
the noise that makes X fluctuate around µ. It is called the additive noise model
for the simple fact that the noise is added to the expected value µ of X. It is quite
common to assume in the additive noise model that the distribution of ε is N (0, 1).
The model of X is thus N (µ, σ 2 ), where µ and σ are the parameters. The parameters
(µ, σ) can take any value in the parameter space R × (0, ∞).
For the multiplicative noise model we assume instead that
X = µε (3.2)
with ε a positive random variable. We will in this case insist upon µ > 0 also. If
the distribution of ε has mean value 1 and variance σ 2 it follows from Example
2.9.9 that the distribution of X has mean value µ and variance µ2 σ 2 . If we specify
the distribution of ε this is again a two-parameter model just as the additive noise
model. The most important difference from the additive noise is that the standard
deviation of X is µσ, which scales with the mean value µ. In words, the noise gets
larger when the signal gets larger.
The formulation of the multiplicative noise model in terms of mean and variance
is not so convenient. It is, however, easy to transform from the multiplicative noise
model to the additive noise model by taking logarithms. Since X = µε then
log X = log µ + log ε,
which is an additive noise model for log X. If one finds a multiplicative noise model
most suitable it is quite common to transform the problem into an additive noise
model by taking logarithms. Formally, there is a minor coherence problem because
if ε has mean 1 then log ε does not have mean value 0 – it can be shown that it will
always have mean value smaller than 0. If we on the other hand assume that log ε
has mean value 0 it holds that ε will have mean value greater than 1. However, this
does not interfere with the fact that if the multiplicative noise model is a suitable
model for X then the additive noise model is suitable for log X and vice versa – one
should just remember that if the mean of X is µ then the mean of log X is not equal
to log µ but it is in fact smaller than log µ.
Returning to our concrete dataset we choose to take logarithms (base 2), we assume
an additive noise model for the logarithms with error distribution having mean 0 and
variance 1, and we compute the ad hoc estimates of µ (the mean of log X) and σ 2
(the variance of log X) within the groups as the empirical means and the empirical
variances.
group 1: µ̂1 = (1/37) Σ_{i=1}^{37} log x_{1,i} = 8.54,   σ̃1² = (1/37) Σ_{i=1}^{37} (log x_{1,i} − µ̂1)² = 0.659
group 2: µ̂2 = (1/42) Σ_{i=1}^{42} log x_{2,i} = 7.33,   σ̃2² = (1/42) Σ_{i=1}^{42} (log x_{2,i} − µ̂2)² = 0.404
The difference of the estimated means is µ̂1 − µ̂2 = 8.54 − 7.33 = 1.21. However, the size of this number does not in itself tell us anything. We have to interpret the number relative to the size of the noise as estimated by σ̃1² and σ̃2².
The formalities are pursued later.
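A sketch of these ad hoc computations in R is given below; x1 and x2 are hypothetical vectors that are assumed to hold the raw expression measurements for group 1 and group 2 (they are not part of these notes).

  l1 <- log2(x1); l2 <- log2(x2)            # logarithms base 2
  muhat1 <- mean(l1); muhat2 <- mean(l2)
  sig2tilde1 <- mean((l1 - muhat1)^2)       # empirical variance with denominator n
  sig2tilde2 <- mean((l2 - muhat2)^2)       # (note that var() in R divides by n - 1)
  muhat1 - muhat2                           # comparison on the log scale
  mean(x1) / mean(x2)                       # ratio of the raw means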
One remark is appropriate. If we don’t take logarithms the empirical means
µ̃1 = (1/37) Σ_{i=1}^{37} x_{1,i} = 426.6,   µ̃2 = (1/42) Σ_{i=1}^{42} x_{2,i} = 179.1
are reasonable estimates of the means (of X) within the two groups. For the multiplicative model it is best to compute the ratio
µ̃1/µ̃2 = 2.38
for comparing the mean values instead of the difference. Again any interpretation of
this number needs to be relative to the size of the noise. The downside of using ad hoc estimates and ad hoc approaches shows its face here: because µ̂1 ≠ log µ̃1 we have two procedures. Either we take logarithms, compute the means and compute their difference, or we compute the means, take logarithms and compute their difference, and we end up with two different results. Which is the most appropriate? The model based likelihood methods introduced later will resolve the problem. ⋄
All four examples above specify a family of probability distributions on the relevant
sample space by a set of parameters. For the exponential distribution used as a
model of the interspike times the parameter is the intensity parameter λ > 0. For
the additive model (or multiplicative in the log-transformed disguise) the parameters
for the gene expression level are µ ∈ R and σ > 0. For the model in the forensic
repeat counting example the parameters are the point probabilities on the finite
sample space E, or if we assume Hardy-Weinberg equilibrium the point probabilities
on the set A. For the Jukes-Cantor model the parameters are the marginal point
probabilities for the first nucleotide and then α > 0 and time t > 0 that determines
the (mutation) dependencies between the two nucleotides.
Three of the examples discussed above are all examples of phenomenological mod-
els. By this we mean models that attempt to describe the observed phenomena (the
empirical data) and perhaps relate observables. For the neuronal interspike times
we describe the distribution of interspike times. For the gene expression data we
relate the expression level to presence or absence of the BCR/ABL fusion gene, and
for the evolution of molecular sequences we relate the mutations to the time be-
tween observations. Good phenomenological modeling involves a interplay between
the modeling step, the data analysis step and the subject matter field. We do not,
however, attempt to explain or derive the models completely from fundamental the-
ory. It seems that once we step away from the most fundamental models of physics
there is either no derivable, complete theory relating observables of interest or it
is mathematically impossible to derive the exact model. The boundaries between
theory, approximations and phenomenology are, however, blurred.
Most statistical models are phenomenological of nature – after all, the whole purpose
of using a probabilistic model is to capture randomness, or uncontrollable variation,
which by nature is difficult to derive from theory. Often the need for a probabilistic
model is due to the fact that our knowledge about the quantities we model is limited.
A few classical probabilistic models are, however, derived from fundamental sampling
principles. We derive some of these models in the next section.
The classical probability models, the binomial, multinomial, geometric and hypergeometric distributions, that we consider in this section are all distributions on (vectors of) the non-negative integers that arise by sampling and counting, and which can be understood naturally in terms of transformations of an underlying probability measure.
Example 3.2.1 (Binomial distribution). Let X1, . . . , Xn denote n iid Bernoulli variables with success parameter p ∈ [0, 1]. The fundamental sample space is E0 = {0, 1}, the bundled variable X = (X1, . . . , Xn) takes values in E = {0, 1}^n, and the distribution of X is
$$P(X = x) = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i} = p^{\sum_{i=1}^{n} x_i}(1 - p)^{\,n - \sum_{i=1}^{n} x_i}.$$
Let the transformation
$$h : E \to E' = \{0, 1, \ldots, n\}$$
be given as
$$h(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i.$$
The distribution of
$$Y = h(X) = \sum_{i=1}^{n} X_i$$
is called the binomial distribution with probability parameter p and size parameter
n. We use the notation Y ∼ Bin(n, p) to denote that Y is binomially distributed with
size parameter n and probability parameter p. We find the point probabilities of the distribution of h(X) as follows: For any vector x = (x1, . . . , xn) with $\sum_{i=1}^{n} x_i = k$ it follows that
$$P(X = x) = p^k(1 - p)^{n-k}.$$
Thus all outcomes that result in the same value of h(X) are equally probable. So
$$P(h(X) = k) = \sum_{x : h(x) = k} P(X = x) = \binom{n}{k} p^k(1 - p)^{n-k}$$
where $\binom{n}{k}$ denotes the number of elements x ∈ E with $h(x) = \sum_{i=1}^{n} x_i = k$. From Section B.2 we get that
$$\binom{n}{k} = \frac{n!}{k!(n - k)!}.$$
⋄
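The binomial point probabilities are available in R as dbinom, and the construction of Y as a sum of Bernoulli variables is easy to mimic by simulation. A small sketch, where n = 10, p = 0.3 and k = 4 are chosen only for illustration:

    # Point probability P(Y = k) for Y ~ Bin(10, 0.3) with k = 4
    dbinom(4, size = 10, prob = 0.3)
    choose(10, 4) * 0.3^4 * 0.7^6          # the same number from the formula above

    # Simulating Y as a sum of 10 iid Bernoulli variables
    y <- replicate(10000, sum(rbinom(10, size = 1, prob = 0.3)))
    mean(y == 4)                            # relative frequency, close to dbinom(4, 10, 0.3)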
Example 3.2.2 (Multinomial distribution). If our fundamental sample space con-
tains m elements instead of just 2 we can derive an m-dimensional version of the bino-
mial distribution. We will assume that the fundamental sample space is {1, 2, . . . , m}
but the m elements could be labeled any way we like. Let X1 , . . . , Xn be n iid ran-
dom variables, let X = (X1 , . . . , Xn ) denote the bundled variable with outcome in
E = {1, 2, . . . , m}n , and consider the transformation
h : E → E ′ = {0, 1, 2, . . . , n}m
defined by
$$h(x) = (h_1(x), \ldots, h_m(x)), \qquad h_j(x) = \sum_{i=1}^{n} 1(x_i = j).$$
That is, hj(x) counts how many times the outcome j occurs among x1, . . . , xn. If P(Xi = j) = pj for j = 1, . . . , m, the distribution of h(X) is hence given by
$$P(h(X) = (k_1, \ldots, k_m)) = \sum_{x : h(x) = (k_1, \ldots, k_m)} P(X = x) = \binom{n}{k_1 \, \ldots \, k_m}\, p_1^{k_1} p_2^{k_2} \cdots p_m^{k_m},$$
where $\binom{n}{k_1 \, \ldots \, k_m}$ denotes the number of ways to label n elements such that k1 are labeled 1, k2 are labeled 2, etc. From Section B.2 we get that
$$\binom{n}{k_1 \, \ldots \, k_m} = \frac{n!}{k_1!\, k_2! \cdots k_m!}.$$
⋄
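In R the multinomial point probabilities are given by dmultinom. A short sketch with m = 3, n = 5 and illustrative probabilities:

    # P(h(X) = (k1, k2, k3)) for n = 5 observations on {1, 2, 3} with probabilities p
    p <- c(0.2, 0.5, 0.3)
    dmultinom(c(1, 3, 1), size = 5, prob = p)
    factorial(5) / (factorial(1) * factorial(3) * factorial(1)) * p[1]^1 * p[2]^3 * p[3]^1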
Example 3.2.3 (Geometric distribution). Let X denote a Bernoulli variable taking
values in E = {0, 1} such that P (X = 1) = p. We can think of X representing the
outcome of flipping a (skewed) coin with 1 representing heads (success). We then let
X1 , X2 , X3 , . . . be independent random variables each with the same distribution as
X corresponding to repeating coin flips independently and indefinitely. Let T denote
the first throw where we get heads, that is X1 = X2 = . . . = XT −1 = 0 and XT = 1.
Due to independence the probability of T = t is $p(1 - p)^{t-1}$. If we introduce the
random variable
Y = T − 1,
which is the number of tails (failures) we get before we get the first head (success)
we see that
$$P(Y = k) = p(1 - p)^k.$$
This distribution of Y with point probabilities $p(k) = p(1 - p)^k$, k ∈ N0, is called the geometric distribution with success probability p ∈ (0, 1]. It is a probability
distribution on the non-negative integers N0 . ⋄
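R's dgeom uses exactly this convention – the number of failures before the first success – so the point probabilities can be checked directly. A small sketch, where p = 0.3 is purely illustrative:

    # Geometric point probabilities P(Y = k) = p(1 - p)^k, k = 0, 1, 2, ...
    p <- 0.3
    dgeom(0:4, prob = p)
    p * (1 - p)^(0:4)                        # the same numbers from the formula

    # Simulating Y by flipping coins until the first head and counting the failures
    sim_Y <- function() { k <- 0; while (rbinom(1, 1, p) == 0) k <- k + 1; k }
    mean(replicate(10000, sim_Y()) == 2)     # close to dgeom(2, p)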
problem is, like for the binomial and multinomial distribution, a combinatorial prob-
lem. How many outcomes (x1 , . . . , xn ) in E n result in h(x1 , . . . , xn ) = (k1 , . . . , km )
where k1 ≤ N1, . . . , km ≤ Nm and k1 + . . . + km = n? As derived in Section B.2 there are
$$\binom{N_j}{k_j}$$
different ways to pick out kj different elements in E with property j. Thus there are
$$\binom{N_1}{k_1}\binom{N_2}{k_2}\cdots\binom{N_m}{k_m}$$
ways to pick out k1 elements with property 1, k2 with property 2 etc. There are,
moreover, n! different ways to order these variables, and we find that the point
probabilities for the hypergeometric distribution are given by
Here the second equality follows by the definition of the binomial coefficients and
the third by the definition of the multinomial coefficients.
The last formula above for the point probabilities looks similar to the formula for
the multinomial distribution. Indeed, for N1 , . . . , Nm sufficiently large compared to
n it holds that
$$\frac{N_1^{(k_1)} N_2^{(k_2)} \cdots N_m^{(k_m)}}{N^{(n)}} \simeq \left(\frac{N_1}{N}\right)^{k_1}\left(\frac{N_2}{N}\right)^{k_2}\cdots\left(\frac{N_m}{N}\right)^{k_m},$$
which are, up to the multinomial coefficient, the point probabilities of the multinomial distribution with $p_j = P(\text{property}(X_i) = j) = N_j/N$.
Exercises
fλ (x) = λ exp(−λx)
for x ≥ 0. With Θ = (0, ∞) the family (Pλ)λ∈Θ of probability measures is a statistical model on E = E0^n = (0, ∞)^n.
Note that we most often use other names than θ for the concrete parameters in
concrete models. For the exponential distribution we usually call the parameter λ.
⋄
As the two previous examples illustrate, the parameter space can take quite different shapes depending on the distributions that we want to model. It is, however, commonly the case that Θ ⊆ R^d for some d. In the examples d = m and d = 1
respectively.
As mentioned, we search for probability measures in the statistical model that fit
a dataset well. The purpose of estimation is to find a single ϑ̂ ∈ Θ, an estimate of
θ, such that Pϑ̂ is “the best” candidate for having produced the observed dataset.
We have up to this point encountered several ad hoc procedures for computing
estimates for the unknown parameter. We will in the following formalize the concept
of an estimator as the procedure for computing estimates and introduce a systematic
methodology – the maximum likelihood method. The maximum likelihood method
will in many cases provide quite reasonable estimates and it is straightforward to implement the method, since the likelihood function that we will maximize is defined
directly in terms of the statistical model. Moreover, the maximum likelihood method
is the de facto standard method for producing estimates in a wide range of models.
θ̂ : E → Θ.
ϑ̂ = θ̂(x),
$$f_\lambda(x) = \lambda^2\exp(-\lambda^2 x)$$
for λ ∈ R. We observe that λ and −λ give rise to the same probability measure, namely the exponential distribution with intensity parameter λ². ⋄
The example above illustrates the problem with non-identifiability of the parameter.
We will never be able to say whether λ or −λ is the “true” parameter since they
both give rise to the same probability measure. The example is on the other hand
a little stupid because we would probably not choose such a silly parametrization.
But in slightly more complicated examples it is not always so easy to tell whether
the parametrization we choose makes the parameter identifiable. In some cases the
natural way to parametrize the statistical model leads to non-identifiability.
Example 3.3.6 (Jukes-Cantor). If we consider the Jukes-Cantor model from Ex-
ample 3.1.2 and take both α > 0 and t > 0 as unknown parameters, then we have
non-identifiability. This is easy to see, as the two parameters always enter the con-
ditional probabilities via the product αt. So κα and t/κ give the same conditional probabilities for all κ > 0.
This does make good sense intuitively since t is the time parameter (calendar time)
and α is the mutation intensity (the rate by which mutations occur). A model with
large t and small α is of course equivalent to a model where we scale the time down
and the intensity up.
If we fix either of the parameters we get identifiability back, so there is a hope that
we can estimate α from a present, observable process where we know t, and then with
fixed α we may turn everything upside down and try to estimate t. This argument
is based on the “molecular clock”, that is, that the mutation rate is constant over
time. ⋄
Example 3.3.7 (ANOVA). Consider the additive noise model
Xi = µi + σi εi
for the vector X = (X1 , . . . , Xn ) of random variables where εi ∼ N (0, 1) are iid.
The sample space is in this case E = Rn . We regard the scale parameters σi > 0 as
known and fixed. The parameter space for this model is then Θ = Rn and
θ = (µ1 , . . . , µn ).
We introduce a factor
$$g : \{1, \ldots, n\} \to \{1, 2\}$$
such that g(i) = 1 denotes that the i'th observation belongs to group 1 and g(i) = 2 otherwise. The model we considered in Example 3.1.4 corresponds to saying that
$$\mu_i = \alpha_{g(i)}.$$
More information is, however, available in the dataset. It is for instance also known whether the individual is a male or a female.
If we define another factor
b : {1, . . . , n} → {1, 2}
such that b(i) = 1 if the i'th individual is a female and b(i) = 2 otherwise, we can consider the model
$$\mu_i = \alpha_{g(i)} + \beta_{b(i)}$$
where µi is broken down into the addition of an α_{g(i)}-component and a β_{b(i)}-component, where (α1, α2) ∈ R² as above and similarly (β1, β2) ∈ R². The parameter space is Θ = R⁴, but this parametrization is not identifiable, since for any κ ∈ R
$$\mu_i = \alpha_{g(i)} + \beta_{b(i)} = (\alpha_{g(i)} + \kappa) + (\beta_{b(i)} - \kappa).$$
One possibility is to impose the restriction
$$\beta_1 + \beta_2 = 0,$$
and then the parameter becomes identifiable. This restriction effectively reduces the number of free parameters by 1 and gives a model with only 3 free parameters. An alternative, identifiable parametrization is given by (α, β, γ) ∈ R³ and
$$\mu_i = \gamma + \alpha\, 1(g(i) = 1) + \beta\, 1(b(i) = 1).$$
The interpretation of this model is simple. There is a common mean value γ and
then if the fusion gene is present the level changes with α and if the individual is
female the level changes with β.
The additive model above should be compared with a (fully identifiable) model with
four free parameters:
µi = νg(i),b(i)
where the parameter is (ν1,1 , ν1,2 , ν2,1 , ν2,2 ) ∈ R4 . ⋄
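The identifiability issue can be made concrete in R with model.matrix. A minimal sketch, using two small hypothetical factors g and b (the data are invented for illustration): the over-parametrized additive coding gives a rank-deficient design matrix, while the default coding used by R has full rank and corresponds to an identifiable three-parameter additive model analogous to the (γ, α, β) version above.

    g <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))   # hypothetical grouping factor
    b <- factor(c(1, 2, 1, 2, 1, 2, 1, 2))   # hypothetical sex factor

    # One column per alpha- and per beta-component: 4 parameters but rank 3
    X <- cbind(a1 = as.numeric(g == "1"), a2 = as.numeric(g == "2"),
               b1 = as.numeric(b == "1"), b2 = as.numeric(b == "2"))
    qr(X)$rank                       # 3, not 4: the parametrization is not identifiable

    # R's default coding: intercept plus one indicator per factor, full rank
    model.matrix(~ g + b)
    qr(model.matrix(~ g + b))$rank   # 3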
The estimation of the unknown parameter for a given statistical model only makes
sense if the parameter is identifiable. If we try to estimate the parameter anyway and
come up with an estimate ϑ̂, then the estimate itself does not have any interpretation.
Only the corresponding probability measure Pϑ̂ can be interpreted as an approxima-
tion of the true probability measure. Without identifiability, we cannot really define
or discuss properties of the estimators directly in terms of the parameter. One will
need to either reparameterize the model or discuss only aspects/transformations of
the parameter that are identifiable. Identifiability is on the other hand often a messy
mathematical problem. In subsequent examples we will only discuss identifiability
issues if there are problems with the parameter being identifiable.
Though identifiability is important for interpretations, probability models with non-
identifiable parametrization and corresponding estimators are routinely studied and
applied to certain tasks. One example is neural networks, or more generally many so-called machine learning models and techniques. The primary purpose of such models is to work as approximation models for prediction purposes – often referred to as black box prediction. The estimated parameters that enter in
the models are not in themselves interpretable.
We have two different but similar definitions of the likelihood function depending
on whether E is a discrete or a continuous sample space. In fact, there is a more
abstract framework, measure and integration theory, where these two definitions are
special cases of a single unifying definition. It is on the other hand clear that despite
the two different definitions, Lx (θ) is for both definitions a quantification of how
likely the observed value of x is under the probability measure Pθ .
We will often work with the minus-log-likelihood function
$$l_x(\theta) = -\log L_x(\theta)$$
instead of the likelihood function itself. There are several reasons for this. For practical applications the most notable reason is that the likelihood function often turns out to be a product of very many small numbers, which is numerically problematic, whereas the minus-log-likelihood function is a sum.
The definition is tentative because there are a number of problems that we need to
take into account. In a perfect world Definition 3.3.9 would work, but in reality the
likelihood function Lx may not attain a global maximum (in which case θ̂(x) is not
defined) and there may be several θ’s at which Lx attains its global maximum (in
which case the choice of θ̂(x) is ambiguous). The problem with non-uniqueness of
the global maximum is not a real problem. The real problem is that there may be
several local maxima and when searching for the global maximum we are often only
able to justify that we have found a local maximum. The problem with non-existence
of a global maximum is also a real problem. In some situations there exists a unique θ̂(x) for all x ∈ E such that
$$L_x(\hat{\theta}(x)) = \max_{\theta \in \Theta} L_x(\theta),$$
but quite frequently there is only a subset A ⊆ E such that there exists a unique θ̂(x) for all x ∈ A with this property.
If P(X ∈ Ac ) is small we can with high probability maximize the likelihood func-
tion to obtain an estimate. We just have to remember that under some unusual
circumstances we cannot. When studying the properties of the resulting estimator
we therefore need to consider two things: (i) how θ̂(x) behaves on the set A where
it is defined, and (ii) how probable it is that θ̂(x) is defined.
If the parameter space is continuous we can use calculus to find the MLE analytically.
We differentiate the (minus-log-) likelihood function w.r.t. the parameter and try to
find stationary points, that is, θ’s where the derivative is zero. If Θ = R or an open
interval and lx is twice differentiable, it holds that θ̃ ∈ Θ is a local minimizer for lx if
$$\frac{dl_x}{d\theta}(\tilde{\theta}) = 0 \qquad\qquad (3.7)$$
and
$$\frac{d^2 l_x}{d\theta^2}(\tilde{\theta}) > 0. \qquad\qquad (3.8)$$
From this we can conclude that if there is a unique solution θ̃ ∈ Θ to (3.7) that
also fulfills (3.8) then θ̂(x) = θ̃. To see why, assume that there is a θ0 ∈ Θ such
that lx (θ0 ) < lx (θ̃). The second condition, (3.8), assures that the derivative is > 0
for θ ∈ (θ̃, θ1 ), say, and if θ0 > θ̃ there is a θ2 ∈ (θ1 , θ0 ) where the derivative is <
0. Somewhere in (θ1 , θ2 ) the derivative must take the value 0 – a local maximum –
which contradicts that θ̃ is the unique solution to (3.7).
The equation
dlx
(θ) = 0
dθ
is known as the likelihood equation, and if the MLE, θ̂(x), exists it must satisfy this
equation2 . So as a starting point one can always try to solve this equation to find
a candidate for the MLE. If Θ ⊆ Rd is a multidimensional parameter space there
exists a similar approach, see Math Box 3.3.1, in which case the likelihood equation
becomes a set of equations. Having found a candidate for θ̂(x) that solves the like-
lihood equation(s), there can, however, be substantial problems when d ≥ 2 with
ensuring that the solution is a global maximum. There are some special classes of
models where it is possible to show that when there is a solution to the likelihood
equation then it is the global minimizer for lx and hence the MLE. Typically such ar-
guments rely on convexity properties of lx. For many other models there is no such result and it is hardly ever possible to solve the likelihood equations analytically.
One therefore needs to rely on numerical methods. To this end there are several ways
to proceed. The Newton-Raphson algorithm solves the likelihood equation numeri-
cally whereas the gradient descent algorithm and its variants are direct numerical
minimization algorithms of lx . The Newton-Raphson algorithm requires that we can
compute the second derivative of lx whereas gradient descent requires only the first
derivative. See Numerical Optimization by Jorge Nocedal and Stephen J. Wright for an authoritative treatment of optimization methods.
These algorithms are general purpose optimization algorithms that do not take into
account that we try to maximize a likelihood function. In special cases some specific algorithms exist for maximizing the likelihood, most notably there are a number of
models where one will encounter the so-called EM-algorithm. It is possible that the
reader of these notes will never have to actually implement a numerical optimization
algorithm, and that is not the subject of these notes anyway. A fairly large and ever
growing number of models are available either in R or via alternative programs.
Multivariate numerical optimization is a specialist’s job! It is good to know, though,
²Except in cases where a global maximum is attained at the boundary of the parameter set Θ.
The Newton-Raphson algorithm is based on the first order Taylor approximation of the derivative of the minus-log-likelihood,
$$\frac{dl_x}{d\theta}(\theta) \simeq \frac{dl_x}{d\theta}(\theta_0) + \frac{d^2 l_x}{d\theta^2}(\theta_0)(\theta - \theta_0).$$
For a given θ0 ∈ Θ we solve the linear equation
$$\frac{dl_x}{d\theta}(\theta_0) + \frac{d^2 l_x}{d\theta^2}(\theta_0)(\theta - \theta_0) = 0$$
instead of the likelihood equation. The solution is
$$\theta_1 = \theta_0 - \left(\frac{d^2 l_x}{d\theta^2}(\theta_0)\right)^{-1}\frac{dl_x}{d\theta}(\theta_0).$$
Taylor expanding from θ1 instead and solving the corresponding linear equation leads to a θ2, and we can iteratively continue this procedure, which results in a sequence (θn)n≥0 defined by
$$\theta_n = \theta_{n-1} - \left(\frac{d^2 l_x}{d\theta^2}(\theta_{n-1})\right)^{-1}\frac{dl_x}{d\theta}(\theta_{n-1}).$$
If the initial guess θ0 is sufficiently close to ϑ̂ (the MLE) then θn converges rapidly towards ϑ̂. Note that a fixed point of the algorithm, that is, a θ such that if we put in θ in the formula above we get back θ, is a solution to the likelihood equation.
what kind of optimization is going on when one computes the MLE in practice,
and it is also good to know that all numerical algorithms, whether they are general
purpose or problem specific, typically rely on an initial guess θ0 of the parameter.
The algorithm will then iteratively “improve” upon the guess. It is good practice
to choose a number of different initial guesses or come up with qualified initial
guesses to prevent the algorithm from either diverging or converging towards a wrong local minimum close to a "bad" initial guess. Note that there is no way to assure in
general that an algorithm has found a global minimizer.
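To make the iteration concrete, here is a minimal R sketch of the one-dimensional Newton-Raphson iteration for the likelihood equation. The function names newton_raphson, dl and d2l are chosen for illustration, and the exponential model is used as a simple test case where the answer is known in closed form.

    # One-dimensional Newton-Raphson for the likelihood equation dl_x/dtheta = 0
    newton_raphson <- function(dl, d2l, theta0, tol = 1e-8, maxit = 100) {
      theta <- theta0
      for (i in 1:maxit) {
        step  <- dl(theta) / d2l(theta)
        theta <- theta - step
        if (abs(step) < tol) break          # stop when the update is negligible
      }
      theta
    }

    # Exponential model: l_x(lambda) = lambda * n * xbar - n * log(lambda)
    x <- rexp(100, rate = 2)                # simulated data
    n <- length(x); xbar <- mean(x)
    dl  <- function(l) n * xbar - n / l     # first derivative of the minus-log-likelihood
    d2l <- function(l) n / l^2              # second derivative
    newton_raphson(dl, d2l, theta0 = 1)     # converges to the MLE 1 / xbar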
fλ (x) = λ exp(−λx).
Dlx (θ) = 0.
For an initial guess θ0 it yields a sequence (θn )n≥0 that – hopefully – converges to
the MLE ϑ̂ = θ̂(x).
Figure 3.3: The minus-log-likelihood function (left) and a contour plot (right) for
the scale-location parameters (µ, σ 2 ) based on n = 100 simulations of iid N (0, 1)-
distributed variables. The MLE is µ̂ = 0.063 and σ̃ 2 = 0.816. The profile minus-
log-likelihood function of σ 2 is given by evaluating the minus-log-likelihood function
along the straight line, as shown on the contour plot, given by µ = µ̂.
The example above shows that the ad hoc estimate of λ considered in Example 3.1.1
for neuronal interspike data is in fact also the maximum likelihood estimate.
Example 3.3.11 (Normal distribution). Let X1 , . . . , Xn be n iid random variables
with the N (µ, σ 2 ) distribution. The statistical model is given by E = Rn , Θ =
R × (0, ∞), θ = (µ, σ 2 ) and Pθ the probability measure that has density
$$f_\theta(x_1, \ldots, x_n) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = \frac{1}{(\sqrt{2\pi\sigma^2})^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right).$$
$$\frac{d^2 l_x}{d\mu^2}(\mu, \sigma^2) = \frac{n}{\sigma^2} > 0.$$
Fixing µ = µ̂, the likelihood equation for σ² reads
$$\frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = \frac{n}{2\sigma^2}.$$
This is again a quite simple equation, and by multiplying with σ⁴ > 0 we can rearrange the equation into
$$\sigma^2 n = \sum_{i=1}^{n}(x_i - \hat{\mu})^2.$$
The solution is
$$\tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2.$$
Similarly we may show that the derivative $\frac{dl_x}{d\sigma^2}(\hat{\mu}, \sigma^2)$ is < 0 for σ² < σ̃² and > 0 for σ² > σ̃². Thus lx(µ̂, σ²) is monotonely decreasing up to σ̃² and then monotonely increasing thereafter. Thus the solution σ̃² is the unique global minimizer for lx(µ̂, σ²) and in conclusion (µ̂, σ̃²) is the unique global maximizer for the likelihood function.
⋄
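As a quick numerical check, the closed form estimators can be compared with a direct numerical minimization of the minus-log-likelihood. A minimal sketch with simulated N(0, 1) data, as in Figure 3.3; the use of optim here is only meant as an illustration of the general numerical approach.

    x <- rnorm(100)
    n <- length(x)
    mu_hat     <- mean(x)
    sigma2_til <- sum((x - mu_hat)^2) / n     # note: divides by n, not n - 1

    mloglik <- function(par) {                # par = c(mu, log(sigma^2)) keeps sigma^2 > 0
      mu <- par[1]; s2 <- exp(par[2])
      sum((x - mu)^2) / (2 * s2) + n / 2 * log(s2) + n * log(sqrt(2 * pi))
    }
    fit <- optim(c(0, 0), mloglik)
    c(mu_hat, sigma2_til)                     # closed form MLE
    c(fit$par[1], exp(fit$par[2]))            # numerical MLE, should be close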
Figure 3.4: Example of the profile minus-log-likelihood function for σ 2 with 100
simulated N (0, 1) variables.
For the statistical model above – the normal distribution with unknown mean and
variance – where we have two unknown parameters, we were able to derive the MLE
by reducing the two-dimensional optimization problem to successive one-dimensional
optimization problems. The technique is useful – analytically as well as numerically.
Instead of throwing ourselves into a difficult optimization problem of several vari-
ables, we may try to solve the problem one variable at a time. The likelihood as
a function of one parameter optimized over all other parameters is known as the
profile likelihood. For instance,
$$l_x(\hat{\mu}, \sigma^2) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 + \frac{n}{2}\log\sigma^2 + n\log\sqrt{2\pi}$$
for j = 1, . . . , m. We may note that the β parameter is in fact not identifiable though
the θ parameter is. It just turns out that the optimization in the β-parametrization
is simpler.
If we fix all parameters but βj we find by differentiation that βj must fulfill
$$\frac{n e^{\beta_j}}{\sum_{k=1}^{m} e^{\beta_k}} - n_j = 0$$
or that
$$p_j(\beta) = \frac{e^{\beta_j}}{\sum_{k=1}^{m} e^{\beta_k}} = \frac{n_j}{n},$$
which has a solution in βj if and only if nj > 0. This shows that in the θ-parametrization there is a unique solution to the likelihood equation, given by
$$\hat{p}_j = \frac{n_j}{n}$$
if n1, . . . , nm > 0. Strictly speaking we have shown that this solution is the only possible minimizer – we haven't shown that it actually is a minimizer³.
Using the reparameterization we have found that if n1 , . . . , nm > 0 the likelihood
function Lx (θ) attains a unique maximum over the set Θ in
$$\left(\frac{n_1}{n}, \ldots, \frac{n_m}{n}\right),$$
which is therefore the maximum likelihood estimate. The vector (n1, . . . , nm) is the realization of the random variable (N1, . . . , Nm) and the maximum likelihood estimator is
$$\hat{\theta} = (\hat{p}_1, \ldots, \hat{p}_m) = \left(\frac{N_1}{n}, \ldots, \frac{N_m}{n}\right).$$
This maximum likelihood estimator is very reasonable, since the estimator for pj,
$$\hat{p}_j = \frac{N_j}{n},$$
is the relative frequency of observations being equal to j. ⋄
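A minimal R sketch, with a small made-up vector of counts, showing that numerical minimization of the minus-log-likelihood in the β-parametrization recovers the relative frequencies; the counts used are purely illustrative.

    counts <- c(12, 7, 21, 10)                 # hypothetical counts n_1, ..., n_m
    n <- sum(counts)

    # Minus-log-likelihood in the (non-identifiable) beta-parametrization,
    # p_j = exp(beta_j) / sum(exp(beta)); constants not depending on beta are dropped.
    mloglik <- function(beta) {
      p <- exp(beta) / sum(exp(beta))
      -sum(counts * log(p))
    }
    fit <- optim(rep(0, length(counts)), mloglik)
    exp(fit$par) / sum(exp(fit$par))           # numerical MLE of (p_1, ..., p_m)
    counts / n                                 # closed form MLE: the relative frequencies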
The previous examples gave explicit expressions for the maximum likelihood esti-
mator. It is not always the case that we can come up with a nice analytic solution
to the optimization problem – not even by clever reparameterizations or using the
profile method. In fact, it is rather the exception than the rule that the maximum
³If one is familiar with convexity concepts it is possible to show that lx is convex in the β-parametrization, and convexity implies that a solution to the likelihood equation is a global minimizer.
Figure 3.5: The profile minus-log-likelihood function for ρ with 100 simulated Gum-
bel variables.
likelihood estimator is given by a closed form analytic expression. The next example
will show an analysis where we use almost all of the techniques considered previously
and though we cannot find a complete analytic solution we are able to derive precise
conditions for the existence of a unique maximum to the likelihood function and we
can derive a single equation in a one-dimensional parameter that we need to solve
numerically.
It is not so easy to find the minimum of this function analytically, let alone to show that it has a minimum. But there is a way around dealing with lx(µ, σ) directly, via the reparametrization in terms of (η, ρ).
The derivative of the minus-log-likelihood function with respect to η is
$$\frac{dl_x}{d\eta}(\eta, \rho) = \exp(\eta)\sum_{i=1}^{n}\exp(-\rho x_i) - n.$$
To find the minimum of the minus-log-likelihood function for any fixed ρ we equate
the derivative to 0. This gives the equation
$$\exp(\eta)\sum_{i=1}^{n}\exp(-\rho x_i) = n,$$
whose solution is
$$\hat{\eta}(\rho) = -\log\left(\frac{1}{n}\sum_{i=1}^{n}\exp(-\rho x_i)\right).$$
The second derivative is
$$\frac{d^2 l_x}{d\eta^2}(\eta, \rho) = \exp(\eta)\sum_{i=1}^{n}\exp(-\rho x_i) > 0,$$
which shows that not only does the minus-log-likelihood function attain a local
minimum at η̂(ρ) as a function of η for given ρ but actually a global minimum.
Figure 3.6: The minus-log-likelihood function (left) and a contour plot (right) for the
(η, ρ) reparameterization of the scale-location model based on n = 100 simulations of
iid Gumbel distributed variables. The MLE is η̂ = 0.025 and ρ̂ = 0.944. The profile
minus-log-likelihood function of ρ is given by evaluating the minus-log-likelihood
function along the curved line, as shown on the contour plot, given by the equation
η̂(ρ) = η.
The minimizer η̂(ρ) depends upon ρ, which makes things more complicated as com-
pared to the normal distribution where the minimizer of µ for fixed σ 2 does not
depend upon σ 2 . To get any further we plug the expression of η̂(ρ) back into the
minus-log-likelihood function giving the profile minus-log-likelihood function
$$l_x(\hat{\eta}(\rho), \rho) = \rho n \bar{x} + n + n\log\left(\frac{1}{n}\sum_{i=1}^{n}\exp(-\rho x_i)\right) - n\log\rho,$$
Since this shows that the derivative is strictly increasing there can be only one
solution to the equation above.
The conclusion of our analysis is, that if n ≥ 2 and at least two of the observations are
different there is precisely one solution to the equation above, hence there is a unique
global minimum for the profile minus-log-likelihood function, and consequently there
is a unique global minimizer for the full minus-log-likelihood function.
From a practical point of view we need to solve (numerically) the equation (3.10)
and then plug this solution into η̂(ρ). This gives the maximum likelihood estimate
of (η, ρ). The Newton-Raphson algorithm for solving the equation reads
$$\rho_n = \rho_{n-1} - \left(\frac{d^2 l_x}{d\rho^2}(\hat{\eta}(\rho_{n-1}), \rho_{n-1})\right)^{-1}\frac{dl_x}{d\rho}(\hat{\eta}(\rho_{n-1}), \rho_{n-1})$$
for n ≥ 1 and an initial guess ρ0 . ⋄
The previous example elaborated on the idea of using the profiled minus-log-likelihood
function. Combined with a reparameterization the profile method reduced the bi-
variate optimization to the solution of a univariate equation. In the next example
we show how a reparameterization can help simplify the computations considerably
though they can be carried out in the original parametrization if desired.
Example 3.3.15 (Evolutionary models). We consider the model from Example
3.1.2 with (X1 , Y1 ), . . . , (Xn , Yn ) being n iid random variables each taking values
in E0 × E0 with E0 = {A, C, G, T}. We take the distribution of (Xi , Yi ), when the
evolutionary distance in time is t, to be given by the point probabilities
$$P_{t,p,\theta}(x, y) = p(x)P_{\theta}^{t}(x, y).$$
Here p is a vector of point probabilities on E0 , so that p(x) is the probability of x,
and Pθt (x, y) is the conditional probability that x mutates into y in time t. These
conditional probabilities depend on the additional parameter θ ∈ Θ. The main focus
is on the estimation of θ. Having observed z = ((x1 , y1 ), . . . , (xn , yn )) ∈ E = (E0 ×
E0 )n the full likelihood function becomes
$$L_z(t, p, \theta) = \prod_{i=1}^{n} P_{t,p,\theta}(x_i, y_i) = \prod_{i=1}^{n} p(x_i)P_{\theta}^{t}(x_i, y_i),$$
and the corresponding minus-log-likelihood function is
$$l_z(t, p, \theta) = -\sum_{i=1}^{n}\log p(x_i) - \sum_{i=1}^{n}\log P_{\theta}^{t}(x_i, y_i).$$
We observe that the first term depends upon p only and the second term on (t, θ). We
can therefore optimize each term separately to find the maximum likelihood estima-
tor. In addition we see that the first term is simply the minus-log-likelihood function
for the multinomial distribution, and naturally the MLE of the marginal point prob-
ability p(x) is the relative frequency of x among the observations x1 , . . . , xn .
Turning to the second term we know from Example 3.3.6 that there may be a
problem with identifiability of t and the additional parameters, and we consider
here the situation where we know (or fix) t. Then the second term is
$$\tilde{l}_z(\theta) = -\sum_{i=1}^{n}\log P_{\theta}^{t}(x_i, y_i).$$
Introduce $n_z(x, y)$ as the number of indices i with $(x_i, y_i) = (x, y)$; thus nz(x, y) is the number of observed pairs where nucleotide x in the first sequence is matched with nucleotide y in the second. Then we can rewrite
$$\tilde{l}_z(\theta) = -\sum_{x,y} n_z(x, y)\log P_{\theta}^{t}(x, y)$$
since there are exactly nz(x, y) terms in the sum $\sum_{i=1}^{n}\log P_{\theta}^{t}(x_i, y_i)$ that equal $\log P_{\theta}^{t}(x, y)$.
For the rest of this example we will consider the special model, the Jukes-Cantor model, where
$$P_\alpha^t(x, x) = 0.25 + 0.75\exp(-4\alpha t) \quad\text{and}\quad P_\alpha^t(x, y) = 0.25 - 0.25\exp(-4\alpha t) \ \ \text{for } x \neq y,$$
for some (known) t > 0 and with α > 0 the unknown additional parameter. Introducing
$$n_1 = \sum_{x} n_z(x, x) \quad\text{and}\quad n_2 = \sum_{x \neq y} n_z(x, y)$$
we find that
$$\tilde{l}_z(\alpha) = -n_1\log(0.25 + 0.75\exp(-4\alpha t)) - n_2\log(0.25 - 0.25\exp(-4\alpha t)).$$
If we differentiate we obtain
$$\frac{d\tilde{l}_z}{d\alpha}(\alpha) = \frac{3n_1 t\exp(-4\alpha t)}{0.25 + 0.75\exp(-4\alpha t)} - \frac{n_2 t\exp(-4\alpha t)}{0.25 - 0.25\exp(-4\alpha t)} = 4t\exp(-4\alpha t)\left(\frac{3n_1}{1 + 3\exp(-4\alpha t)} - \frac{n_2}{1 - \exp(-4\alpha t)}\right)$$
and the likelihood equation $\frac{d\tilde{l}_z}{d\alpha}(\alpha) = 0$ is equivalent to the equation
$$\exp(-4\alpha t) = \frac{3n_1 - n_2}{3(n_1 + n_2)}.$$
This equation has a (unique) solution if and only if 3n1 > n2, in which case
$$\hat{\alpha} = \frac{1}{4t}\log\frac{3(n_1 + n_2)}{3n_1 - n_2} = \frac{1}{4t}\log\frac{3n}{3n_1 - n_2}$$
is the maximum likelihood estimator. Moreover, we see from the expression for the derivative of $\tilde{l}_z(\alpha)$ that (given 3n1 > n2) $\frac{d\tilde{l}_z}{d\alpha}(\alpha) < 0$ if α < α̂ and $\frac{d\tilde{l}_z}{d\alpha}(\alpha) > 0$ if α > α̂. This shows that $\tilde{l}_z(\alpha)$ is monotonely decreasing up to α̂ and monotonely increasing thereafter. Hence α̂ is the global minimizer of the minus-log-likelihood function.
Working with the minus-log-likelihood, and in particular differentiating it, in the α-parametrization is hideous. It is much easier to make a reparameterization by
$$\gamma = 0.25 - 0.25\exp(-4\alpha t)$$
such that
$$\alpha = \alpha(\gamma) = -\frac{1}{4t}\log(1 - 4\gamma)$$
for γ ∈ (0, 0.25). In the γ-parameter the minus-log-likelihood becomes
$$\tilde{l}_z(\gamma) = -n_1\log(1 - 3\gamma) - n_2\log\gamma.$$
Example 3.3.16. We turn to the data from the Hepatitis C virus evolution, as
considered in Example 1.2.4, and we want to estimate the α parameter in the Jukes-
Cantor model. The two quantities that enter in the estimator are n2 , the total
number of mutations, and n1 , the remaining number of non-mutated nucleotide
pairs. Observe that n1 = n − n2 . We will consider two situations. Either we pool
all of the three segments of the virus and make one estimate, or we estimate α
separately for the segments A, B, and C. The following table shows the result.
Figure 3.7: The (partial) minus-log-likelihood functions ˜lz (α) for the hepatitis C virus
data, as considered in Example 3.3.16, using the Jukes-Cantor model. We estimate
α separately for the three segments A, B, and C of the genome. The corresponding
maximum likelihood estimates are marked on the plots.
    Segment        A           B           C         A+B+C
    n1          2532        1259        1009          4800
    n2            78          25          20           123
    α̂      7.8 × 10⁻⁴   5.0 × 10⁻⁴   5.0 × 10⁻⁴   6.5 × 10⁻⁴
The time is measured in years so that t = 13 and the estimated mutation rates are
thus per year mutation rates. Seemingly, segment A shows a different mutation rate
than segments B and C do – but conclusions like this have to be based on knowledge
about the uncertainty of the estimates. We deal with this in Chapter ??. Plugging
the estimated mutation rate α̂A = 7.8 × 10⁻⁴ for segment A into the expression for $P_\alpha^t(x, y)$, we find for instance that for x ≠ y
$$P_{\hat{\alpha}_A}^{1}(x, y) = 0.25 - 0.25\exp(-4\hat{\alpha}_A) \approx 7.8 \times 10^{-4}.$$
This is an estimate of the chance that any specific single nucleotide in segment A will mutate within a year.
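The estimate is easy to reproduce in R from the counts in the table above; the short sketch below computes α̂ for segment A from the closed form expression derived earlier, with t = 13 years.

    # Closed form MLE of alpha in the Jukes-Cantor model, segment A counts
    n1 <- 2532; n2 <- 78          # non-mutated pairs and mutations for segment A
    n  <- n1 + n2
    t  <- 13                      # evolutionary distance in years
    alpha_hat <- log(3 * n / (3 * n1 - n2)) / (4 * t)
    alpha_hat                     # approximately 7.8e-04

    # Estimated probability that a given nucleotide mutates within one year
    0.25 - 0.25 * exp(-4 * alpha_hat * 1)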
Still dividing the data according to the three segments, the frequency vectors nA, nB, nC ∈ N0^4 and relative frequency vectors pA, pB, pC ∈ R^4 (MLEs of p) are
A C G T Total
nA 483 786 763 578 2610
pA 0.185 0.301 0.292 0.221 1
nB 257 398 350 279 1284
pB 0.200 0.310 0.273 0.217 1
nC 233 307 286 203 1029
pC 0.226 0.298 0.278 0.197 1
To investigate whether the model is actually a good model for the data, we reconsider
the table of observed mutations for segment A together with the expected number
of mutations. Since all mutations are equally probable for the Jukes-Cantor model –
the probability being 0.25 − 0.25 × exp(−4 × 13 × 7.8 × 10−4 ) = 0.0099 for segment
A – the expected number of mutations from nucleotide x is nA (x) × 0.0099.
              Observed (H90)                       Expected (H90)
             A     C     G     T                  A     C     G     T
        A    –     1    11     1             A    –   4.8   4.8   4.8
   H77  C    4     –     1    20        H77  C  7.8     –   7.8   7.8
        G   13     3     –     1             G  7.6   7.6     –   7.6
        T    3    19     1     –             T  5.7   5.7   5.7     –
Looking at these two tables we are suspicious about whether the Jukes-Cantor model
actually is adequate. There are many more transitions than transversions, where the
Jukes-Cantor model predicts the same number. ⋄
Thus we have observed 260 Bernoulli variables X1 , . . . , X260 (death = 1 and survival
= 0), which we can safely assume independent. But it would obviously be wrong to
assume that they have the same distribution. On the contrary, we are interested in
figuring out the effect of the concentration of dimethoat on the death rate of the
flies, that is, on the distribution of the Bernoulli variables.
We will parametrize the probability of death as a function of dose (concentration),
and we introduce
$$p(y) = \frac{\exp(\alpha + \beta y)}{1 + \exp(\alpha + \beta y)},$$
where p(y) ∈ (0, 1) and α, β ∈ R. The logistic regression model is then defined by
letting the probability of Xi = 1 given the dose yi be p(yi). The function y ↦ p(y) is known as the logistic function, which is why this model of the probability as a
function of dose level is known as logistic regression. It is common not to use the con-
centration directly as the dose level but instead use the log(concentration). Thus y =
log(concentration). The observation consists in general of a vector x = (x1 , . . . , xn )
of 0-1 variables, with n = 260 in the fly death example, and the statistical model of
logistic regression has sample space E = {0, 1}^n, parameter θ = (α, β) and parameter space Θ = R².
Observing that
$$1 - p(y) = \frac{1}{1 + \exp(\alpha + \beta y)}$$
we can rewrite to find
$$\log\frac{p(y)}{1 - p(y)} = \alpha + \beta y.$$
The left hand side is the logarithm of the odds that the fly dies, so the model says
that the log odds for dying depends linearly upon the dose level (log(concentration)
for the flies).
where
$$S = \sum_{i=1}^{n} x_i \quad\text{and}\quad SS = \sum_{i=1}^{n} y_i x_i.$$
Note that you can compute the likelihood function from the table above. You do not
need the observations, only the summary given in the table. It is actually sufficient
to know just S = 121, SS = −151.9 and the log(concentrations).
We fix α and differentiate w.r.t. β and find
$$\frac{dl_x}{d\beta}(\alpha, \beta) = \sum_{i=1}^{n}\frac{y_i\exp(\alpha + \beta y_i)}{1 + \exp(\alpha + \beta y_i)} - SS = \sum_{i=1}^{n} y_i\, p(y_i) - SS$$
and
$$\frac{d^2 l_x}{d\beta^2}(\alpha, \beta) = \sum_{i=1}^{n}\frac{y_i^2\exp(\alpha + \beta y_i)}{(1 + \exp(\alpha + \beta y_i))^2} = \sum_{i=1}^{n} y_i^2\, p(y_i)(1 - p(y_i)).$$
Since both of the second derivatives are > 0, we conclude that lx (α, β) as a function
of one of the parameters (and the other fixed) can have at most one local minimum,
which is then a global minimum. This does not prove the uniqueness of minima of
lx regarded as a function of two variables.
We can approach the bivariate optimization by general purpose optimization algo-
rithms, and use for instance the Newton-Raphson algorithm as discussed in Math
Box 3.3.1. As an alternative, which illustrates the use of the one-dimensional Newton-
Raphson algorithm along the coordinate axis, consider the following alternating
Figure 3.8: The minus-log-likelihood function (upper left) using the logistic regression
model for the fly data. It does not have a pronounced minimum, but rather a long
valley running diagonally in the α-β-parametrization. The contour plots (upper right
and bottom) show how two different algorithms converge towards the minimizer (the
MLE) (α̂, β̂) = (5.137, 2.706) for different starting points (α0 , β0 ) = (2, 2), (4, 3) and
(4, 1.5). The broken line is the bivariate Newton-Raphson algorithm and the black
line is the alternating Newton-Raphson algorithm, as discussed in Example 3.3.17,
which is slower and moves in zig-zag along the α and β axes.
Newton-Raphson algorithm. With initial guess (α0 , β0 ) we define the sequence (αn , βn )n≥0
by
$$\alpha_n = \alpha_{n-1} - \left(\sum_{i=1}^{n} p_{n-1}(y_i)(1 - p_{n-1}(y_i))\right)^{-1}\left(\sum_{i=1}^{n} p_{n-1}(y_i) - S\right)$$
and then
$$\beta_n = \beta_{n-1} - \left(\sum_{i=1}^{n} y_i^2\, p'_{n-1}(y_i)(1 - p'_{n-1}(y_i))\right)^{-1}\left(\sum_{i=1}^{n} y_i\, p'_{n-1}(y_i) - SS\right)$$
where
$$p_{n-1}(y) = \frac{\exp(\alpha_{n-1} + \beta_{n-1} y)}{1 + \exp(\alpha_{n-1} + \beta_{n-1} y)}, \qquad p'_{n-1}(y) = \frac{\exp(\alpha_{n} + \beta_{n-1} y)}{1 + \exp(\alpha_{n} + \beta_{n-1} y)}.$$
The algorithm amounts to making a one-dimensional Newton-Raphson step first
along the α axis, then along the β axis and so on and so forth. It is not a particularly fast algorithm – if the sequence of parameters converges, it does so slowly – but
curiously for this example the alternating Newton-Raphson algorithm is more stable
than the raw bivariate Newton-Raphson. This means that it will actually converge
for a large set of initial guesses. Convergence of the raw bivariate Newton-Raphson
is quite sensitive to making a good initial guess. There are ways to deal with such
problems – via moderations of the steps in the Newton-Raphson algorithm. But
that is beyond the scope of these notes. If any of the algorithms converge then the
resulting point is, for the logistic regression model, always a global minimizer. For
the flies the MLE is
(α̂, β̂) = (5.137, 2.706).
Figure 3.9: The logistic curve for the probability of fly death as a function of
log(concentration) of dimethoat using the MLE parameters (α̂, β̂) = (5.137, 2.706)
Sometimes the likelihood function does not attain a maximum, and neither algorithm
will converge. This happens if there are two concentrations, c0 < c1 , such that all
flies that received a dose below c0 survived and all flies that received a dose above
c1 died, and we have no observations with dose levels in between. This is not really
a problem with the model, but rather a result of a bad experimental design. ⋄
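In practice one would rarely implement the Newton-Raphson iterations by hand; logistic regression models are fitted in R with the glm function. A minimal sketch, assuming a data frame flies with a 0-1 column dead and a column logconc holding log(concentration) – both names are hypothetical:

    # Fitting the logistic regression model with glm
    fit <- glm(dead ~ logconc, family = binomial, data = flies)
    summary(fit)        # estimates of (alpha, beta) with standard errors
    coef(fit)           # the MLE (alpha-hat, beta-hat)

    # Estimated probability of death as a function of log(concentration)
    p_hat <- function(y) {
      eta <- coef(fit)[1] + coef(fit)[2] * y
      exp(eta) / (1 + exp(eta))
    }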
being the number of nucleotide pairs with no mutations, the number of transitions
and the number of transversions respectively, then the (partial) minus-log-likelihood
function becomes
Direct computations with the minus-log-likelihood function are even more hideous in
the (α, β)-parametrization for the Kimura model than for the Jukes-Cantor model.
A reparameterization is possible but we refrain from the further theoretical analysis
of the model. Instead we turn to the Hepatitis C virus data, which we analyzed
in Example 3.3.16 using the Jukes-Cantor model, we apply a standard multivariate
numerical optimization algorithm, for instance optim in R, for computing the max-
imum likelihood estimates of α and β. The results for this example are summarized
in the following table:
    Segment        A           B           C
    n1          2532        1259        1009
    n2            63          20          14
    n3            15           5           6
    α̂      1.9 × 10⁻³   1.2 × 10⁻³   1.2 × 10⁻³
    β̂      2.2 × 10⁻⁴   1.5 × 10⁻⁴   2.3 × 10⁻⁴
Figure 3.10: The contour plots of the minus-log-likelihood function for the hepatitis C
data using the Kimura model. The three plots represent the three different segments
A, B, and C (from left to right). The corresponding MLEs are found by numerical
optimization using optim in R.
The MLEs of the pA , pB and pC vectors do not change, and we can compute the
table of expected mutations
              Observed (H90)                       Expected (H90)
             A     C     G     T                  A     C     G     T
        A    –     1    11     1             A    –   1.4  11.7   1.4
   H77  C    4     –     1    20        H77  C  2.3     –   2.3  19.0
        G   13     3     –     1             G 18.4   2.2     –   2.2
        T    3    19     1     –             T  1.7  14.0   1.7     –
Exercises
$$p^{m_b}(1 - p)^{m_B}$$
where $m_b = \sum_{i=1}^{m} x_i$ is the number of males with the CB-allele and $m_B = m - m_b$ is the number of males without the CB-allele, find the likelihood function and the minus-log-likelihood function for p and show that the MLE equals
$$\hat{p} = \frac{m_b}{m}.$$
Assume in addition that we have observations from f randomly selected females,
y1 , . . . , yf ∈ {0, 1}, where yi = 1 if female number i is color blind, that is, if she has
two CB-alleles. We will assume that the allele distribution in the total population
satisfies the Hardy-Weinberg equilibrium, which means that the proportion of females
with 2 CB-alleles is p2 , the proportion with 1 is 2p(1 − p) and the proportion with 0
is (1 − p)2 . We assume that the observations are independent and also independent
of the male observations above.
Exercise 3.3.3. Argue that the probability that yi = 1 equals p² and the probability that yi = 0 equals 2p(1 − p) + (1 − p)² = (1 − p)(1 + p). Letting y = (y1, . . . , yf) argue that the probability of observing (x, y) equals
$$\lambda_i = e^{\beta y_i + \alpha}$$
Exercise 3.3.5. Fix β and show that for fixed β the minimum of the minus-log-likelihood function in α is attained at
$$\hat{\alpha}(\beta) = \log\frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} e^{\beta y_i}},$$
and that the minimizer of the profile minus-log-likelihood solves the equation
$$\frac{\sum_{i=1}^{n} y_i e^{\beta y_i}}{\sum_{i=1}^{n} e^{\beta y_i}} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i}.$$
Exercise 3.3.7. Implement the Newton-Raphson algorithm for solving the equation
in β above, and then implement a function for estimation of (α, β) for a given dataset
x1 , . . . , xn and y1 , . . . , yn .
β(θ) = Pθ (R).
The level of the test is defined as $\alpha = \max_{\theta \in \Theta_0} \beta(\theta)$, that is, as the maximal probability for rejecting the null-hypothesis over all possible choices of θ from the null-hypothesis. That is, α gives the largest probability of rejecting
the null-hypothesis by mistake. For θ ∈ Θ\Θ0 the power, β(θ), is the probability of
correctly rejecting the null-hypothesis under the specific alternative θ.
A good test has small level α and large power β(θ) for all θ ∈ Θ\Θ0 . However, these
two requirements are at odds with one another. If we enlarge the acceptance set,
say, the level as well as the power goes down and if we enlarge the rejection set, the
level as well as the power goes up.
In practice we always specify the acceptance and rejection regions via a test statistic.
A test statistic is a function h : E → R, such that with a given choice of threshold
c ∈ R the acceptance region is defined as
$$A = \{x \in E \mid h(x) \leq c\}.$$
This gives a one-sided test. For a two-sided test the acceptance region is of the form
$$A = \{x \in E \mid c_1 \leq h(x) \leq c_2\}$$
for c1 , c2 ∈ R. Often the two-sided tests are symmetric with c1 = −c2 in which case
we can just as well rephrase the test as a one-sided test with the test statistic |h|.
Note that for the one-sided test we have A = h⁻¹((−∞, c]), hence
$$P_\theta(A) = P_\theta(h^{-1}((-\infty, c])) = h(P_\theta)((-\infty, c])$$
by the definition of what the transformed probability measure h(Pθ) is. The point is that to compute the power and level of a test based on a test statistic h we need to know the transformed probability measure h(Pθ).
If we have a one-sided test statistic h we reject if h(x) is too large. If we have
settled for a level α-test we use the distribution of the test statistic to compute the
corresponding threshold, which is the (1−α)-quantile for the distribution, and which
we often refer to as the level α critical value or just the critical value for short. If we
reject a hypothesis we often say that the conclusion is statistically significant and
the level α is sometimes referred to as the significance level.
We develop in this section a classical test statistic, the two sample t-test, where we
use Example 3.1.4 as inspiration. In that example we consider gene expression of a
particular gene for two different groups of individuals. The sample space is R^79 and
the observations are the log-expression measurements.
The full model consists of specifying that the measurements are all assumed inde-
pendent and that the distribution of the log-expressions is N(µ1, σ1²) in group 1 and N(µ2, σ2²) in group 2. Equivalently we can specify the model by saying that we
observe the realization of Xi,j for i = 1 and j = 1, . . . , 37 or i = 2 and j = 1, . . . , 42,
and that
Xi,j = µi + σi εi,j
where εi,j are 79 iid random variables with the N (0, 1)-distribution.
In a slightly more general framework there are n observations in the first group
and m in the second and thus n + m observations in total. The full model has a
4-dimensional parameter vector θ = (µ1, σ1², µ2, σ2²) with parameter space Θ = R × (0, ∞) × R × (0, ∞). We consider the null-hypothesis
$$H_0 : \mu_1 = \mu_2,$$
that is,
$$\Theta_0 = \{(\mu_1, \sigma_1^2, \mu_2, \sigma_2^2) \in \Theta \mid \mu_1 = \mu_2\}.$$
The group-wise averages µ̂1 and µ̂2 and the group-wise empirical variances σ̃1² and σ̃2² are the natural estimators of the four parameters, and they are also the maximum-likelihood estimators in this model according to Example 3.3.11.
If we consider the difference of the estimators we get that
$$\hat{\mu}_1 - \hat{\mu}_2 = \frac{1}{n}\sum_{j=1}^{n} X_{1,j} - \frac{1}{m}\sum_{j=1}^{m} X_{2,j} = \frac{1}{n}\sum_{j=1}^{n}(\mu_1 + \sigma_1\varepsilon_{1,j}) - \frac{1}{m}\sum_{j=1}^{m}(\mu_2 + \sigma_2\varepsilon_{2,j}) = \mu_1 - \mu_2 + \frac{\sigma_1}{n}\sum_{j=1}^{n}\varepsilon_{1,j} - \frac{\sigma_2}{m}\sum_{j=1}^{m}\varepsilon_{2,j}.$$
Since all the εi,j's are assumed independent and N(0, 1)-distributed we can, using the result in Math Box 2.10.20, find that
$$\hat{\mu}_1 - \hat{\mu}_2 \sim N\left(\mu_1 - \mu_2,\; \frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}\right).$$
If we choose µ̂1 − µ̂2 as the test statistic – using a symmetric, two-sided test so that
we reject if |µ̂1 − µ̂2 | > c – the normal distribution above tells us how to compute
the power of the test and in particular the level. Under the null-hypothesis it holds
that µ1 − µ2 = 0, but the distribution of the test statistic still depends upon the
unknown parameters σ1² and σ2². If we thus want to choose the critical value c such
that the probability of wrongly rejecting the null-hypothesis is ≤ α we choose the
1 − α/2-quantile for the normal distribution with mean 0 and variance σ1²/n + σ2²/m. Since this is a scale transformation of the N(0, 1)-distribution we can compute the quantile by Example 2.9.10 as
$$c(\alpha) = \sqrt{\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}}\; z_{1-\alpha/2}$$
where z_{1−α/2} is the 1 − α/2-quantile for the N(0, 1)-distribution. For α = 0.05 the quantile is 1.96 and for α = 0.01 the quantile is 2.58.
We encounter a practical problem. Even if we have decided that a level 5% test
is the relevant test to use we cannot compute the corresponding threshold for the
given test statistic because it depends upon the unknown parameters σ1² and σ2². A
widespread solution is to use the plug-in principle and simply plug in the estimators
of the unknown parameters in the formula for the threshold. The resulting threshold
becomes
$$\tilde{c}_\alpha = \sqrt{\frac{\tilde{\sigma}_1^2}{n} + \frac{\tilde{\sigma}_2^2}{m}}\; z_{1-\alpha/2},$$
and we reject the null-hypothesis if |µ̂1 − µ̂2 | > c̃α .
Returning to the computations in Example 3.1.4 and taking α = 0.05 the threshold becomes
$$\tilde{c}_\alpha = \sqrt{\frac{0.659}{37} + \frac{0.404}{42}} \times 1.96 = 0.32,$$
and since the test statistic in this case equals 1.20 we reject the hypothesis that the group means are equal. Had we taken α = 0.01 the threshold would be 0.43 and we would still reject. Another way of phrasing this is that the difference in the
and we would still reject. Another way of phrasing this is that the difference in the
estimated means is large compared to the variance in the data – so large that it is
very unlikely that it will be this large by chance if the mean value parameters are
equal.
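The plug-in threshold is easy to compute in R; the sketch below reproduces the numbers used above (the variance estimates 0.659 and 0.404, the group sizes 37 and 42, and the group means are taken from Example 3.1.4 and R Box 3.4.1).

    # Plug-in threshold for the two-sided test based on the difference of the means
    s1 <- 0.659; s2 <- 0.404      # estimated variances in the two groups
    n  <- 37;    m  <- 42         # group sizes
    se <- sqrt(s1 / n + s2 / m)   # estimated standard error of the difference

    se * qnorm(0.975)             # threshold at level alpha = 0.05, approx. 0.32
    se * qnorm(0.995)             # threshold at level alpha = 0.01
    abs(8.536 - 7.334)            # the observed difference of the means, approx. 1.20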
Because we use estimated values for the unknown variance parameters the compu-
tations above are not exact. Though we aim for a level of the test being α it is not
necessarily precisely α. The problem is most pronounced when n and m are small,
2-5, say, in which case one can make a gross mistake.
We will show later that the empirical variance,
$$\tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2,$$
systematically underestimates the variance parameter, and that the modified estimator
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \hat{\mu})^2,$$
which is by far the most used estimator of the variance in practice, corrects for this.
instance the estimator computed by var in R. When n = 5 the new estimator is a
factor 5/4 = 1.25 larger – a considerable amount. For large n the difference between
the two estimators becomes negligible. Effectively, using the larger variance estimator
increases the threshold and consequently our conclusions will be more conservative –
we are less likely to reject the hypothesis that the mean value parameters are equal.
R Box 3.4.1 (T-test). The two sample t-test can be carried out using the
t.test function in R. There are two essentially different ways for using the
function. If the data for the two groups are stored in the two vectors x and
y you compute the t-test by
> t.test(x,y)
If the two vectors contain the gene expression measurements from the two
groups considered in Example 3.1.4 the output is
data: x and y
t = 7.1679, df = 67.921, p-value = 7.103e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.8678568 1.5374938
sample estimates:
mean of x mean of y
8.536492 7.333816
Alternatively, the data for the two groups may be stored in a data frame
myData with one column labeled value, say, containing the measure-
ments/observations and another column labeled group, say, which is a
factor with two levels. Then we can compute the t-test by
> t.test(value~group,data=myData)
The default setting for t.test is to compute the t-test without assuming
equal variances. We can specify equal variances by setting var.equal=TRUE.
There might, however, still be problems with using the approximation. In the at-
tempt to remedy the problems we can choose to consider the t-test statistic
$$T = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n} + \dfrac{\hat{\sigma}_2^2}{m}}}.$$
this test statistic. A common approximation of the distribution of the t-test statistic under the null-hypothesis that µ1 = µ2 is as a t-distribution whose degrees of freedom are estimated from the data. The quantiles for this t-distribution are somewhat larger than the quantiles for the normal distribution. For doing a single test the practical difference by using the normal distribution is minor for n ≥ 20.
To summarize the conclusions from the derivations above, we consider the null-
hypothesis
H 0 : µ1 = µ2 .
If we assume equal variances we compute the two-sample t-test
$$T = \sqrt{\frac{nm}{n + m}}\;\frac{\hat{\mu}_1 - \hat{\mu}_2}{\hat{\sigma}}, \qquad (3.13)$$
where σ̂ 2 is the pooled variance estimate given by (3.12) and we reject the hypothesis
if |T | > w1−α/2 where w1−α/2 is the 1 − α/2 quantile for the t-distribution with
n + m − 2 degrees of freedom.
If we do not assume equal variances we compute the two-sample t-test
$$T = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n} + \dfrac{\hat{\sigma}_2^2}{m}}} \qquad (3.14)$$
and we reject the hypothesis if |T | > w1−α/2 where w1−α/2 is the 1 − α/2 quantile
for the t-distribution with degrees of freedom given by (3.11). In this case where we
do not assume equal variances the test is often referred to as the Welch two-sample
t-test.
The relevant quantiles can be computed in R using the quantile function qt for the
t-distribution. However, the t-test can also be computed using the t.test function.
This function reports the conclusion of the test in terms of a p-value. If the computed
t-test statistic is equal to t for the concrete dataset then the p-value is
$$p = P(|T| \geq |t|)$$
where T has the t-distribution with df degrees of freedom. Alternatively, because the t-distribution is symmetric,
$$p = 2(1 - F_{df}(|t|))$$
where $F_{df}$ is the distribution function for the t-distribution with df degrees of freedom. The null-hypothesis is rejected at level α if and only if the p-value is ≤ α.
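In R the p-value can be computed directly from the distribution function pt. A minimal sketch, using the t-statistic 7.1679 and the 67.921 degrees of freedom reported by t.test in R Box 3.4.1:

    # p-value for the Welch two-sample t-test, computed from the t-distribution
    t_obs <- 7.1679
    df    <- 67.921
    2 * (1 - pt(abs(t_obs), df))   # matches the p-value reported by t.test

    # The corresponding critical value at level alpha = 0.05
    qt(0.975, df)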
There is one natural question left. Should we choose the Welch t-test or should
we assume equal variances? Technically there is no problem in using the Welch t-
test though the t-distribution used is not exact. If the estimated variances are close
to each other there will only be minor differences between the Welch t-test and the
equal variance t-test, and if they are not, the equal variance t-test is not appropriate.
Should we actually make a formal statistical test of the hypothesis that the variance
parameters σ1² and σ2² are equal? If we reject, then we use the Welch t-test, and
otherwise we use the equal variance t-test. Such a procedure has been criticized in
5 6 7 8 9 10 11
Figure 3.11: The densities for the normal distributions with estimated parameters
for the gene expression data for gene 1635_at. We reject the hypothesis that the
mean value parameters are equal.
The ratio
$$Q(x) = \frac{\max_{\theta \in \Theta_0} L_x(\theta)}{\max_{\theta \in \Theta} L_x(\theta)}$$
is called the likelihood ratio test statistic. Since Θ0 ⊆ Θ, we have Q(x) ∈ (0, 1]. Small values of Q(x) are critical.
To use the test statistic above we need to know its distribution under Pθ for θ ∈ Θ0 .
It is in general impossible to find this distribution, but in many situations of practical
importance we can find a useful approximation. We state this as a theorem, though
it is not precise in terms of the prerequisites required for the approximation to be
valid.
Result 3.4.3. If Θ is a d-dimensional parameter space and Θ0 is d0-dimensional, the distribution of
−2 log Q(X)
can be approximated by a χ2 -distribution with d − d0 degrees of freedom. Large values
of −2 log Q(x) are critical.
Remark 3.4.4. The “dimension” of Θ and Θ0 is a little too involved to define in
a precise mathematical sense. It essentially covers the more intuitive idea of “the
number of free parameters”. In practice, it is often easy to compute the dimension
drop d − d0 as this is simply the number of (different) restrictions that we put on the
parameters in Θ to get Θ0 . For instance, in the example above with the 2-sample t-
test the dimension of Θ is 3 (or 4), the dimension of Θ0 is 2 (or 3) and the dimension
drop is 1.
If θ̂(x) denotes the maximum-likelihood estimate and θ̂0 (x) the maximum-likelihood
estimate under the null hypothesis it follows that
−2 log Q(x) = 2(− log Lx (θ̂0 (x)) + log Lx (θ̂(x))) = 2(lx (θ̂0 (x)) − lx (θ̂(x)))
Having computed −2 log Q(x) we often report the test by computing a p-value. If Fdf
denotes the distribution function for the χ2 -distribution with df degrees of freedom
the p-value is
p = 1 − Fdf (−2 log Q(x)).
This is the probability for observing a value of −2 log Q(X) under the null-hypothesis
that is as large or larger than the observed value −2 log Q(x).
Example 3.4.5. For the Kimura model in Example 3.3.18 we observe that the
null-hypothesis
H0 : α = β
is equivalent to the Jukes-Cantor model. If we consider Segment A of the virus
genome we can compute ˜lx (α̂, β̂) = 399.2 and under the null-hypothesis ˜lx (α̂0 , α̂0 ) =
436.3. We find that
−2 log Q(x) = 74.2.
Under the null-hypothesis we make a single restriction on the parameters, and the
p-value using a χ2 -distribution with 1 degree of freedom is 7.0 × 10−18 , which is
effectively 0. This means that we will by all standards reject the null-hypothesis over
the alternative. That is to say, we reject that the Jukes-Cantor model is adequate
for modeling the molecular evolution of Segment A. ⋄
One word of warning. We computed ˜lx above instead of the full minus-log-likelihood.
We did so because the remaining part of the full minus-log-likelihood does not involve
the parameters α and β and is unaffected by the hypothesis. All terms that remain
constant under the full model and the hypothesis can always be disregarded as the
difference in the computation of −2 log Q is unaffected by these terms. However,
be careful always to disregard the same terms when computing lx (θ̂0 (x)) as when
computing lx (θ̂(x)).
Example 3.4.6. Continuing Example 3.3.18 we will investigate if the three segments
can be assumed to have the same parameters. Thus for the full model we have
six parameters αA , βA , αB , βB , αC , βC , two for each segment. We set up the null-
hypothesis
H0 : αA = αB = αC , βA = βB = βC .
We find that −2 log Q(x) = 6.619, and since the full model has 6 free parameters
and the model under the null-hypothesis has 2, we compute the p-value using the
χ2 -distribution with 4 degrees of freedom. The p-value is 0.157 and we do not reject
the hypothesis that all three segments have the same parameters. ⋄
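The χ²-based p-values in the two examples can be computed directly in R with pchisq; the numbers below are the −2 log Q values and dimension drops quoted in Examples 3.4.5 and 3.4.6.

    # p-values for the likelihood ratio tests in Examples 3.4.5 and 3.4.6
    1 - pchisq(74.2, df = 1)    # Jukes-Cantor vs. Kimura for segment A, effectively 0
    1 - pchisq(6.619, df = 4)   # equal parameters across the three segments, approx. 0.157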
The methodology of statistical testing is very rigid. Formally we have to set up the
hypothesis prior to considering the data, since testing a hypothesis that is formu-
lated based on the data almost automatically demolishes the assumptions that are
used to derive the distribution of the test statistics. This makes statistical testing
most appropriate for confirmatory analyses where we know what to expect prior to
the actual data analysis and want to confirm and document that our beliefs are cor-
rect. On the other hand, exploratory data analysis where we don’t know in advance
what to expect is an important part of applied statistics. Statistical testing is used
anyway as an exploratory tool to investigate a range of different hypotheses, and
there are numerous algorithms and ad hoc procedures for this purpose. The merits
of such procedures are extremely difficult to completely understand. It is, however,
always important to understand how to correctly interpret a hypothesis test and
the conclusions we can draw. Disregarding whether we use test statistics for a less
formal, exploratory analysis or a formal confirmatory analysis we have to remember
that if we accept a null-hypothesis we do in fact have little evidence that the hy-
pothesis is true. What we can conclude is that there is no evidence in the data for
concluding that the hypothesis is false, which is a considerably vaguer conclusion!
A hypothesis may be screamingly wrong even though we are unable to document
it. If the test we use has little power against a particular alternative, it will be very
difficult to detect such a deviation from the null-hypothesis. On the other hand,
if we reject a null-hypothesis we may be rather certain that the null-hypothesis is
false, but “how false” is it? If we have a large dataset we may be able to statistically
172 Statistical models and inference
detect small differences that are of little practical relevance. Statistical significance
provides documentation for rejecting a hypothesis, e.g. that there is a difference of
the mean values for two groups, but does not in itself document that the conclusion
is important or significant in the usual sense of the word, e.g. that the difference of
the mean values is of any importance.
One of the problems with formal statistical testing is that the more tests we do the
less reliable are our conclusions. If we make a statistical test at a 5%-level there
is a 5% chance that we by mistake reject the hypothesis even though it is true. This
is not a negligible probability but 5% has caught on in the literature as a suitable
rule-of-thumb level. The problem is that if we carry out 100 tests at a 5%-level then
we expect that 1 out of 20 tests, that is, 5 in total, reject the null-hypothesis even if
it is true in all the 100 situations. What is perhaps even worse is that the probability
of rejecting at least one of the hypothesis is in many cases rather large. If all the
tests are independent, the number of tests we reject follows a binomial distribution
with parameters n = 100 and p = 0.05, in which case the probability of rejecting at
least one hypothesis if they are all true is 1 − (1 − 0.05)^100 = 99.4%.
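These probabilities are simple binomial computations that can be checked directly in R; the calls below verify the numbers used in this discussion.

    # Probability of at least one false rejection among 100 independent level 5% tests
    1 - (1 - 0.05)^100                         # approx. 0.994

    # Probability of more than 10, respectively more than 20, rejections by chance
    1 - pbinom(10, size = 100, prob = 0.05)
    1 - pbinom(20, size = 100, prob = 0.05)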
If we carry out 100 two-sample t-tests on different⁴ datasets and find that at a
5% level we rejected in 4 cases the hypothesis that the means are equal, does this
support a conclusion that the means are actually different in those 4 cases? No, it
does not. If we reject in 30 out of the 100 cases we are on the other hand likely
to believe that for a fair part of the 30 cases the means are actually different. The
binomial probability of getting more than 10 rejections is 1.1% and getting more
than 20 rejections has probability 2.0 × 10^-8. But for how many and for which of
the 30 cases can we conclude that there is a difference? A natural thing is to order
(the absolute value of) the test statistics
|t(1) | ≤ . . . ≤ |t(100) |
⁴ Formally we need independent datasets for some of the subsequent quantitative computations to be justified, but qualitatively the arguments hold in a broader context.
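The binomial probabilities quoted above are easily checked in R:

> ## probability of at least one rejection among 100 independent level 5% tests
> 1 - (1 - 0.05)^100                                        # approximately 0.994
> ## probabilities of more than 10 and of more than 20 rejections
> pbinom(10, size = 100, prob = 0.05, lower.tail = FALSE)   # approximately 0.011
> pbinom(20, size = 100, prob = 0.05, lower.tail = FALSE)   # approximately 2e-08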
Figure 3.12: QQ-plot of the 12,625 t-test statistics for the ALL dataset (left) and
histogram of the corresponding p-values.
We find our earlier considered gene, 1635_at, as number three from the top on this list. Figure 3.12 shows the QQ-plot of the computed t-test statistics against the t-distribution with 77 degrees of freedom together with a histogram of the p-values. The QQ-plot bends in a way that indicates that there are too many large and too many small values in the sample. This is confirmed by the histogram of p-values, which shows that there are on the order of several hundred too many p-values in the range from 0 to 0.01. ⋄
The conclusion in the example above is that there are a number of cases where we should reject the hypothesis – even in the light of the fact that we carry out 12,625 tests. The real question is: if we continued down the list above, when should we stop? What should the threshold for the t-test statistic be in the light of the multiple tests carried out?
If we carry out n independent tests, each at level α, and want the probability of rejecting at least one true hypothesis to be 5%, we can solve 1 − (1 − α)^n = 0.05 for α, which gives

α = 1 − (1 − 0.05)^{1/n} = 1 − (0.95)^{1/n}.

With n = 12,625 as in the example this gives α = 4.1 × 10^-6. Thus each test has to be carried out at the level 4.1 × 10^-6 to be sure not to reject a true hypothesis.
This can be a very conservative procedure.
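For illustration, the computation in R:

> ## per-test level needed for 12625 independent tests to have
> ## family-wise error rate 5%
> 1 - 0.95^(1/12625)   # approximately 4.1e-06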
Current research suggests that for large, multiple testing problems focus should
change from the family wise error rate to other quantities such as the false discovery
rate, which is the relative number of falsely rejected hypotheses out of the total num-
ber of rejected hypotheses. The book Multiple testing procedures with applications
to genomics by Dudoit and van der Laan (Springer, 2008) treats this and a number
of other issues in relation to multiple testing problems. A very pragmatic viewpoint
is that the multiple testing problem is a simple matter of choosing a suitable thresh-
old to replace the critical value used for a single test. How to do this appropriately
and how to interpret the choice correctly can be a much more subtle problem, but
ordering the tests according to p-value is almost always a sensible thing to do.
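In R the p.adjust function implements several such threshold corrections; a minimal sketch using the Benjamini-Hochberg procedure for false discovery rate control, where pvals is a hypothetical vector holding the 12,625 p-values:

> ## Benjamini-Hochberg adjusted p-values; pvals is a hypothetical vector
> padj <- p.adjust(pvals, method = "BH")
> ## rejecting hypotheses with adjusted p-value below 0.05 controls the
> ## false discovery rate at 5%
> sum(padj < 0.05)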
Exercises
Exercise 3.4.1. Consider the setup for Exercise 3.3.4 and the null-hypothesis
H0 : β = 0
Interpret the hypothesis and compute a formula for −2 log Q(x) for testing this hy-
pothesis. What is the approximating distribution of this test statistic under the null-
hypothesis?
Exercise 3.4.2. Make a simulation study to investigate the distribution of the test
statistics −2 log Q(X) for the hypothesis considered in Example 3.4.5. That is, use the
estimated Jukes-Cantor model to simulate new datasets, 200 say, compute the corre-
sponding −2 log Q(x) statistics for each dataset and compare the resulting empirical
distribution with the χ2 -distribution with 1 degree of freedom.
3.5 Confidence intervals
The formal statistical test answers a question about the parameter in terms of the
data at hand. Is there evidence in the data for rejecting the given hypothesis about
the parameter or isn’t there? We can only deal with such a question in the light of
the uncertainty in the data even if the hypothesis is true. The distribution of the test
statistic captures this, and the test statistic needs to be sufficiently large compared
to its distribution before we reject the hypothesis.
There is another, dual way of dealing with the uncertainty in the data and thus
the uncertainty in the estimated parameters. If we consider a real valued parameter
then instead of formulating a specific hypothesis about the parameter we report an
interval, such that the values of the parameter in the interval are conceivable in the
light of the given dataset. We call the intervals confidence intervals.
If (Pθ )θ∈Θ is a parametrized family of probability measures on E, and if we have an
observation x ∈ E, then an estimator θ̂ : E → Θ produces an estimate θ̂(x) ∈ Θ. If
the observation came to be as a realization of an experiment that was governed by
one probability measure Pθ in our parametrized family (thus the true parameter is
θ), then in most cases θ̂(x) ≠ θ – but it is certainly the intention that the estimate
and the true value should not be too far apart. We attempt here to quantify how far
away from the estimate θ̂(x) it is conceivable that the true value of the parameter
is.
Example 3.5.1. Let X1 , . . . , Xn be iid Bernoulli distributed with success probabil-
ity p ∈ [0, 1]. Our parameter space is [0, 1] and the unknown parameter is the success
probability p. Our sample space is {0, 1}n and the observation is an n-dimensional
vector x = (x1 , . . . , xn ) of 0-1-variables. We will consider the estimator
p̂ = (1/n) ∑_{i=1}^n Xi,
which is the relative frequency of 1's (and the MLE as well). The distribution of np̂ = ∑_{i=1}^n Xi is a binomial distribution with parameters (n, p), which implicitly⁵ gives the distribution of p̂.
If we in this example take z(p) and w(p) to be the 0.025- and 0.975-quantiles for the
binomial distribution with parameters (n, p) we know that
Pp (z(p) ≤ np̂ ≤ w(p)) ≃ 0.95.
The reason that we don’t get exact equality above is that the binomial distribution
is discrete, so the distribution function has jumps and we may not be able to obtain
exact equality. If we now define
I(p̂) = {p ∈ [0, 1] | z(p) ≤ np̂ ≤ w(p)}
⁵ The distribution of p̂ is a distribution on {0, 1/n, 2/n, . . . , 1} – a set that changes with n – and the convention is to report the distribution in terms of np̂, which is a distribution on Z.
Figure 3.13: These figures show 95%-confidence intervals for the parameter p in the binomial distribution for different possible realizations of the estimator p̂ (on the y-axis) with n = 10 (left) and n = 30 (right). For a given estimate p̂(x) = y we can read off which p (those on the line) could produce such an estimate. Note the cigar shape.
we have Pp (p ∈ I(p̂)) ≃ 0.95. We can find the interval I(p̂) by reading it off from a
figure as illustrated in Figure 3.13. Note that the probability statement is a statement
about the random interval I(p̂). It says that this random interval will contain the
true parameter with probability 0.95 and we call I(p̂) a 95%-confidence interval.
It will be shown in a later chapter that the variance of np̂ is np(1 − p), and if we
approximate the binomial distribution B(n, p) with the N (np, np(1−p)) distribution,
which will also be justified later, we arrive at the following convenient approximation
z(p) ≃ n(p − 1.96·√(p̂(1 − p̂))/√n)   and   w(p) ≃ n(p + 1.96·√(p̂(1 − p̂))/√n).
If we plug this approximation into the formula for the confidence interval we get

I(p̂) = [p̂ − 1.96·√(p̂(1 − p̂))/√n , p̂ + 1.96·√(p̂(1 − p̂))/√n].
The approximation does not work well if n is too small or if the true p is too close
to 0 or 1. ⋄
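As a small numerical illustration in R, take a hypothetical observation of 14 successes in n = 30 Bernoulli trials; binom.test computes a related exact interval (obtained by inverting exact binomial tests) that can be compared with the approximation above:

> x <- 14; n <- 30              # hypothetical data
> phat <- x/n
> ## the normal approximation interval from above
> phat + c(-1, 1) * 1.96 * sqrt(phat * (1 - phat)/n)
> ## an exact interval for comparison
> binom.test(x, n)$conf.int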
For a single parameter as in Example 3.5.1 we report an interval. How large the interval is depends upon how certain – or confident – we want to be that the true parameter is contained in the interval.
Definition 3.5.2. A confidence set for the parameter θ given the observation x ∈ E
is a subset I(x) ⊆ Θ. If for each x ∈ E we have given a confidence set I(x) we say
that (I(x))x∈E are (1− α)-confidence sets for the unknown parameter if for all θ ∈ Θ
Pθ (θ ∈ I(X)) ≥ 1 − α. (3.15)
Example 3.5.3. We consider the statistical model for just group 1 in Example
3.1.4, which states that we observe
Xj = µ + σεj
for j = 1, . . . , 37 where the εj ’s are iid N (0, 1). Assume for the sake of simplification
that σ is known. The parameter is then just µ and the parameter space is R, the
sample space is R37 and we observe x = (x1 , . . . , x37 ).
As usual µ̂ = (1/37) ∑_{j=1}^{37} Xj and we introduce the statistic

h(X, µ0) = µ̂ − µ0 = (µ − µ0) + σ·(1/37) ∑_{j=1}^{37} εj.
If µ = µ0 the distribution of this statistic is N(0, σ²/37), and if we recall that 1.96 is the 0.975-quantile for the N(0, 1)-distribution,

Pµ0(|h(X, µ0)| ≤ 1.96·σ/√37) = 0.95.
We find that

I(x) = {µ0 ∈ R | |h(X, µ0)| ≤ 1.96·σ/√37}
     = {µ0 ∈ R | −1.96·σ/√37 ≤ µ̂ − µ0 ≤ 1.96·σ/√37}
     = [µ̂ − 1.96·σ/√37 , µ̂ + 1.96·σ/√37].
If we plug in the estimated value of the standard deviation we get the 95%-confidence
interval [8.27, 8.80]. Since the standard deviation is estimated and not known we
violate the assumptions for the derivations above. Because there is an approximation
involved we say that the confidence interval has nominal coverage 95% whereas the
actual coverage may be lower. If n is not too small the approximation error is minor
and the actual coverage is close to the nominal 95%. Below we treat a method that
is formally correct – also if the variance is estimated. ⋄
Note the similarity of the construction above with a hypothesis test. If we formulate the simple null-hypothesis

H0 : µ = µ0,

we can introduce the test statistic h(x, µ0) as above and reject the hypothesis if this test statistic is larger in absolute value than 1.96·σ/√37. The 95%-confidence interval consists precisely of those µ0 for which the two-sided level 5% test based on the test statistic h(x, µ0) is accepted.
This duality between confidence intervals and statistical tests of simple hypotheses is a completely general phenomenon. Consider a simple null-hypothesis

H0 : θ = θ0

and suppose that for each θ0 ∈ Θ we have a level α test with acceptance region A(θ0) ⊆ E, that is, Pθ0(X ∈ A(θ0)) ≥ 1 − α. If we define I(x) = {θ0 ∈ Θ | x ∈ A(θ0)} then θ0 ∈ I(x) if and only if x ∈ A(θ0), hence

Pθ0(θ0 ∈ I(X)) = Pθ0(X ∈ A(θ0)) ≥ 1 − α,

and the sets I(x) are (1 − α)-confidence sets. This equality also implies that if I(x) for x ∈ E form (1 − α)-confidence sets then the set

A(θ0) = {x ∈ E | θ0 ∈ I(x)}

forms an acceptance region for a level α test.
If Θ ⊆ R we naturally ask what general procedures we have available for producing
confidence intervals. We consider here intervals that are given by the test statistic
h(x, θ0 ) = θ̂(x) − θ0 ,
for any estimator θ̂ of the unknown parameter. With α ∈ (0, 1) the fundamental
procedure is to find formulas for zα (θ0 ) and wα (θ0 ), the α/2 and 1 − α/2 quantiles,
for the distribution of θ̂(x) − θ0 under the probability measure Pθ0 and define the
confidence set
I(x) = {θ0 ∈ Θ | zα (θ0 ) ≤ θ̂(x) − θ0 ≤ wα (θ0 )}.
This is what we did for the binomial and the normal distributions above. Unfortu-
nately these are special cases, and even if we know the quantiles as a function of θ0
it may not be practically possible to compute I(x) above. In general the set does not
even have to be an interval either! A computable alternative is obtained by plugging
in the estimate of θ in the formulas for the quantiles above, which gives
I(x) = [θ̂(x) − wα (θ̂(x)), θ̂(x) − zα (θ̂(x))].
Exact distributions and thus quantiles are hard to obtain, and in most cases we have
to rely on approximations. Below we present three of the most basic constructions
of approximate (1 − α)-confidence intervals.
• Suppose we have a formula se(θ0 ) for the standard deviation of θ̂ under Pθ0
– often referred to as the standard error of θ̂. Assume, furthermore, that the
distribution of θ̂(x)−θ0 can be approximated by the N (0, se(θ0 )2 )-distribution.
With zα the 1 − α/2 quantile for the N (0, 1)-distribution the general construc-
tion is
I(x) = {θ0 ∈ Θ | −se(θ0 )zα ≤ θ̂(x) − θ0 ≤ se(θ0 )zα }.
As above, it may be practically impossible to compute I(x) and it may not
even be an interval. If we plug in the estimate of θ in the formula se(θ0 ) we
arrive at the interval
I(x) = [θ̂(x) − se(θ̂(x))zα , θ̂(x) + se(θ̂(x))zα ].
• Estimates ẑα and ŵα of the quantiles zα(θ0) and wα(θ0), or an estimate, ŝe, of the standard error of θ̂, are found by simulations. This is known as bootstrapping and the technicalities will be pursued below. In any case, once the estimates have been computed one proceeds as above and computes either

I(x) = [θ̂(x) − ŵα , θ̂(x) − ẑα],

or

I(x) = [θ̂(x) − ŝe·zα , θ̂(x) + ŝe·zα].
where zα is the (1 − α/2)-quantile for the normal distribution N (0, 1). We find that
this interval is identical to the approximate interval considered in Example 3.5.1. For
the binomial distribution we will not use the observed Fisher information because
we have a formula for the Fisher information. The observed information is useful
when we don’t have such a formula. ⋄
Figure 3.14: These figures show 95%- and 99%-confidence intervals for the parameter of interest µ in the normal distribution N(µ, σ²) with n = 10 (left) or n = 30 (right) independent observations. On the figure σ² = 1. Reading the figure from the x-axis, the full lines give the 0.025- and 0.975-quantiles for the normal distribution with variance 1/n and mean value parameter µ, and the dashed lines give the 0.005- and 0.995-quantiles. For a given estimate µ̂(x) = y we can read the figure from the y-axis and read off which µ could produce such an estimate. This gives the confidence intervals.
In the discussion above we do in reality only treat the situation with a single, uni-
variate real parameter. In situations with more than one parameter we gave abstract
definitions of confidence sets but we did not provide any practical methods. Though
it is possible – also computationally and in practice – to work with confidence sets
for more than a univariate parameter, such sets are notoriously difficult to relate to.
When we have more than one unknown parameter we usually focus on univariate
parameter transformations – the parameters of interest.
In general, if τ : Θ → R is any map from the full parameter space into the real line,
and if we are really interested in τ = τ (θ) and not so much θ, we call τ the parameter
of interest. If θ̂ is an estimator of θ, then τ̂ = τ (θ̂) can be taken as an estimator of
τ . This is the plug-in principle.
This is not the plug-in estimator – the plug-in estimator requires that we have an
estimator, θ̂, of θ and then use τ (θ̂) = Pθ̂ (A) as an estimator of τ (θ). If necessary,
Pθ̂ (A) can be computed via simulations.
⋄
Example 3.5.8. The logistic regression model for X1 , . . . , Xn with y1 , . . . , yn fixed,
as considered in Example 3.3.17, is given by the point probabilities
P(Xi = 1) = exp(α + βyi)/(1 + exp(α + βyi))
With p(y) denoting the probability of death at log-concentration y, the log odds are

log(p(y)/(1 − p(y))) = α + βy.
The log odds equal 0 precisely when p(y) = 1/2 and this happens when y = −α/β.
The value of y where p(y) = 1/2 is called LD50 , which means the Lethal Dose for
50% of the subjects considered. In other words, the dose that kills half the flies. We
see that in terms of the parameters in our logistic regression model
LD50 = −α/β.
Figure 3.15: In the fly death experiment we estimate the parameter of interest, LD50 ,
to be −1.899. Thus a concentration of 0.1509 = exp(−1.899) is found to be lethal
for half the flies.
If we use the plug-in principle for estimation of the parameter of interest, we can
proceed and try to find the distribution of the statistic
τ̂ − τ0 = τ (θ̂) − τ (θ0 ).
We observe that the distribution of τ (θ̂) is the transformation via τ of the distribu-
tion of θ̂. If we are capable of finding this distribution, we can essentially use the constructions from the previous section.
Definition 3.5.9. A confidence set for a real valued parameter of interest τ = τ (θ)
given the observation x ∈ E is a subset I(x) ⊆ R. If we for each x ∈ E have a
given confidence set I(x) we say that (I(x))x∈E are (1 − α)-confidence sets for the
unknown parameter of interest if for all θ ∈ Θ

Pθ(τ(θ) ∈ I(X)) ≥ 1 − α.
For the practical construction of confidence intervals we can proceed in ways very
similar to those in the previous section. We summarize the practically applicable
methods below – noting that for the constructions below it is not in general an
assumption that the estimator of τ is the plug-in estimator, but for some of the
constructions we still need an estimator θ̂ of the full parameter.
• We have formulas zα (θ0 ) and wα (θ0 ) for the α/2 and 1 − α/2 quantiles for the
distribution of τ̂ − τ in which case we can compute the interval

I(x) = [τ̂(x) − wα(θ̂(x)) , τ̂(x) − zα(θ̂(x))].
• If we have a formula se(θ0) for the standard error of τ̂ under Pθ0 we can use the plug-in estimator ŝe = se(θ̂(x)). If θ̂ is the maximum-likelihood estimator and τ̂ = τ(θ̂) is the plug-in estimator, an estimator, ŝe, is obtainable in terms of the Fisher information, see Math Box 4.7.3.
• Estimates ẑα and ŵα of the quantiles zα(θ0) and wα(θ0), or an estimate, ŝe, of the standard error of τ̂, are found by bootstrapping and we compute

I(x) = [τ̂(x) − ŵα , τ̂(x) − ẑα],

or

I(x) = [τ̂(x) − ŝe·zα , τ̂(x) + ŝe·zα].
Example 3.5.10. Just as in Example 3.5.3 we consider the statistical model spec-
ified as
Xj = µ + σεj
The standard error of τ(θ̂) can be estimated in two ways. First, via the plug-in method

ŝe = √(Dτ(θ̂)^T I(θ̂)^{-1} Dτ(θ̂)),

which requires that we have a formula for I(θ). Alternatively, we can estimate the Fisher information I(θ) by the observed Fisher information D²lX(θ̂), in which case we get

ŝe = √(Dτ(θ̂)^T D²lX(θ̂)^{-1} Dτ(θ̂)).
for j = 1, . . . , n where the εj ’s are iid N (0, 1), but we do not assume that σ is
known. The parameter of interest is µ and we seek a (1 − α)-confidence interval. The parameter space is R × (0, ∞) and we use the MLE µ̂ = (1/n) ∑_{j=1}^n Xj and the usual modification

σ̂² = (1/(n − 1)) ∑_{j=1}^n (Xj − µ̂)².
A (1 − α)-confidence interval is then

I(x) = [µ̂ − σ̂·zα/√n , µ̂ + σ̂·zα/√n]

with zα the (1 − α/2)-quantile for the N(0, 1)-distribution. This interval is identical to the general approximation construction when we use σ̂/√n as an estimate of the standard error of µ̂. Replacing zα by the (1 − α/2)-quantile wα for the t-distribution with n − 1 degrees of freedom gives the interval

I(x) = [µ̂ − σ̂·wα/√n , µ̂ + σ̂·wα/√n].

These intervals are exact under the assumption of iid normally distributed observations, and in this case the actual coverage is (1 − α).
If we return to the gene expression data that we also considered in Example 3.5.3 we
found the approximate 95%-confidence interval [8.27, 8.80]. There are 37 observations
and using the t-distribution with 36 degrees of freedom instead of the approximat-
ing normal distribution the quantile changes from 1.96 to 2.03 and the confidence
interval changes to [8.26, 8.81]. Thus by using the more conservative t-distribution
the confidence interval is increased in length by roughly 3.5%. ⋄
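The two quantiles can be computed directly in R:

> qnorm(0.975)         # approximately 1.96
> qt(0.975, df = 36)   # approximately 2.03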
The second construction based on the t-statistic in Example 3.5.10 does not fit in
among the bullet points above. When τ is the parameter of interest the general
t-statistic is
t = t(x, τ) = (τ̂(x) − τ)/se(θ̂(x))
where se(θ0 ) is the standard error of τ̂ under Pθ0 and θ̂ is an estimator of θ. Using
this statistic requires first of all that we have a formula for the standard error. If we
approximate the distribution of the t-statistic by a N (0, 1)-distribution, the resulting
confidence intervals
I(x) = {τ ∈ R | |t(x, τ)| ≤ zα} = [τ̂(x) − se(θ̂(x))·zα , τ̂(x) + se(θ̂(x))·zα]
where zα is the 1 − α/2-quantile for the N (0, 1)-distribution are the same as we have
under the second bullet point. When the parameter of interest is the mean value
parameter µ for the normal distribution we found in Example 3.5.10 the formula
se(µ, σ²) = σ/√n
for the standard error. For this particular example – as we already discussed – it is
possible theoretically to find the exact distribution of the t-statistics. As stated, it is
a t-distribution with n − 1 degrees of freedom. The consequence is that the quantile
zα from the normal distribution is replaced by a quantile wα from the t-distribution.
It so happens that wα ≥ zα, but the quantile for the t-distribution approaches zα when n gets large. Using the exact t-distribution gives systematically wider con-
fidence intervals. It is not uncommon to encounter the standard confidence interval
construction as above but where the quantile from the normal distribution has been
replaced by the quantile from the t-distribution, also in situations where there is no
theoretical reason to believe that the t-distribution is a better approximation than
the normal distribution. It is difficult to say if such a practice is reasonable, but since
the resulting confidence intervals get systematically larger this way, we can regard
the procedure as being conservative.
Example 3.5.11. Consider the setup in Section 3.4.1 with two groups of independent normally distributed observations where the full parameter is (µ1, σ1², µ2, σ2²). If the parameter of interest is the fold change 2^δ with δ = µ1 − µ2, rather than the difference in means, we can obtain (nominal) 95%-confidence intervals as [2^0.874, 2^1.532] = [1.83, 2.89] using the quantile from the normal distribution and [2^0.868, 2^1.537] = [1.82, 2.90] using the t-distribution.
If we make the assumption of equal variances in the two groups, the t-distribution
becomes exact. The confidence interval based on the t-distribution with 37 + 42 −
2 = 77 degrees of freedom is [0.874, 1.531]. The general formula for the (1 − α)-confidence interval for the difference δ under the assumption of equal variances is

[µ̂1 − µ̂2 − σ̂·wα·√((n + m)/(nm)) , µ̂1 − µ̂2 + σ̂·wα·√((n + m)/(nm))]

where wα is the (1 − α/2)-quantile for the t-distribution with n + m − 2 degrees of freedom. ⋄
We have in this section focused on general confidence intervals based on the statistic
τ̂ − τ . The t-statistic is an example of another possible choice of statistic useful for
confidence interval constructions. There are numerous alternatives in the literature
for choosing a suitable statistic h(x, τ ), but we will not pursue these alternatives
here.
3.6 Regression
Regression models form the general class of models that is most important for sta-
tistical applications. It is primarily within the framework of regression models that
we treat relations among several observables. Sometimes these relations are known
up to a small number of parameters from a given scientific theory in the subject
matter field. In other cases we try to establish sensible relations from the data at
hand.
We specify the general regression model for a real valued observable X given y by
the scale-location model
X = gβ (y) + σε
where ε is a random variable with mean 0 and variance 1. Here y ∈ E for some
sample space E and
gβ : E → R
is a function parametrized by β. The full parameter is θ = (β, σ) ∈ Θ, where we
will assume that Θ ⊆ Rd × (0, ∞), thus the β-part of the parametrization is a d-
dimensional real vector. The variable y has many names. Sometimes it is called the
independent variable – as opposed to X which is then called the dependent variable
– in other situations y is called a covariate. We will call y the regressor to emphasize
that the observable X is regressed on the regressor y.
Xi = gβ (yi ) + σεi
Maximum-likelihood estimation, for the regression model at hand, can thus be boiled
down to a matter of minimizing the function above. We elaborate on this in the case
where the distribution of ε is assumed to be the normal distribution. In that case
lx(β, σ) = n log σ + (1/(2σ²)) ∑_{i=1}^n (xi − gβ(yi))² + n log √(2π),

where ∑_{i=1}^n (xi − gβ(yi))² is denoted RSS(β), the residual sum of squares.
Figure 3.16: Scatter plot of the relation between hardness and abrasion loss for tires
and the straight line estimated by least squares regression.
The least squares estimator and the MLE are identical for the normal distribution
but otherwise not.
The variance is on the other hand typically not estimated by the MLE above – not
even if the assumption of the normal distribution holds. Instead we use the estimator
σ̂² = RSS(β̂)/(n − d).
For linear regression we take

gβ(y) = β0 + β1 y.
Figure 3.17: The optical density in the ELISA calibration experiment as a function
of the known concentrations of DNase in four different runs of the experiment.
Under the very mild condition that the yi ’s are not all equal there is a unique
minimizer, and there is even an explicit, analytic formula for the minimizer, see
Math Box 3.6.1.
Example 3.6.1. We consider in this example the relation between the hardness
measured in Shore units of a rubber tire and the abrasion loss in gm/hr. Figure
3.16 shows the scatter plot. From the scatter plot we expect that there is a close
to straight line relation; the harder the tire is, the smaller the loss. There are
30 observations in the dataset and if Xi denotes the loss and yi the hardness for
i = 1, . . . , 30 we suggest the model
Xi = β0 + β1 yi + σεi
where ε1 , . . . , ε30 are iid N (0, 1). Figure 3.16 also shows the straight line estimated
by least squares estimation of the parameters (β0 , β1 ). ⋄
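The least squares estimates can be computed with lm, as shown later in this section, or directly from the closed-form formulas in Math Box 3.6.1; a small check in R using the Rubber data from the MASS package:

> library(MASS)
> data(Rubber)
> ## closed-form least squares estimates: x is the response, y the regressor
> x <- Rubber$loss; y <- Rubber$hard
> beta1 <- (sum(x * y) - length(x) * mean(x) * mean(y))/sum((y - mean(y))^2)
> beta0 <- mean(x) - mean(y) * beta1
> c(beta0, beta1)
> coef(lm(loss ~ hard, data = Rubber))   # agrees with the formulas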
in ng/ml, of the DNase protein was made, and each run, out of a total of 11 runs, of
the experiment consisted of measuring the optical density from the ELISA experi-
Figure 3.18: The logarithm of the optical density in the ELISA calibration experi-
ment as a function of the known logarithm of concentrations of DNase in four dif-
ferent runs of the experiment. Note that the transformations make the data points
lie more closely to a straight line than for the untransformed data.
ment with two measurements for each concentration. The first four runs are shown
on Figure 3.17.
From that figure it is not obvious that we can use a linear regression model to capture
the density as a function of concentration. However, taking a look at Figure 3.18 we
see that by applying the log-transformation to both quantities we get points that
approximately lie on a straight line. Thus considering only one run, the first, say,
then if Xi denotes the log-density and yi the log-concentration for i = 1, . . . , 16 we
can try the model
Xi = β0 + β1 yi + σεi .
The straight lines plotted on Figure 3.18 are estimated in this way – with a separate estimation for each of the four runs shown on the figure.
⋄
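A sketch of the corresponding computation in R, assuming the DNase data frame that ships with R (columns Run, conc and density); the data frame used for the figures may be organized slightly differently:

> ## linear regression of log-density on log-concentration for run 1
> dnaseLm <- lm(log(density) ~ log(conc), data = DNase, subset = Run == 1)
> summary(dnaseLm)
> plot(log(density) ~ log(conc), data = DNase, subset = Run == 1)
> abline(dnaseLm)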
Example 3.6.3 (Beaver body temperature). As with the DNase example above, it is a standard trick to find a suitable transformation of the observables such that a straight-line relation becomes plausible. We consider in this example
the body temperature for a beaver measured every 10 minutes for a period of 16
hours and 40 minutes, that is, 100 measurements in total. We let Xi denote the i’th
Figure 3.19: The body temperature of a beaver measured every 10 minutes together
with a curve estimated by least squares regression.
temperature measurement and ti the time in minutes since the first measurement
(that is t1 = 0, t2 = 10, . . .). We suggest that the body temperature will be periodic
with a period of 24 hours (1440 minutes) and be minimal at 8.30 in the morning.
As the first measurement is taken at 9.30 in the morning we introduce
yi = cos(2π(ti + 60)/1440)

and suggest the model Xi = β0 + β1 yi + σεi where ε1, . . . , ε100 are iid N(0, 1).
An important technique used in the two latter examples above is that ordinary linear
regression can be useful even in situations where there is no linear relation between
the observable and the regressor. The question is if we can transform the regressor
and/or the observable by some known transformation(s) such that linear regression
is applicable for the transformed values instead.
The most interesting hypothesis to test in relation to the linear regression model
is whether β1 = 0 because this is the hypothesis that the value of the regressor
does not influence the distribution of the observable. There are several formal test
statistics that can be computed for testing this hypothesis. From a summary of the
> library(MASS)
> data(Rubber)
> rubberLm <- lm(loss~hard,data=Rubber)
object returned by the lm function in R we can read off the value of the estimated standard error ŝe of the estimator for β1, the t-test statistic

t = β̂1/ŝe
for testing the hypothesis H0 : β1 = 0 together with a p-value computed based on the t-distribution with n − 2 degrees of freedom. The estimated standard error ŝe can also be used for constructing confidence intervals of the form

[β̂1 − ŝe·zα , β̂1 + ŝe·zα]
where zα is the (1−α/2)-quantile for the N (0, 1)-distribution. If the error distribution
is N (0, 1) then there is theoretical basis for replacing the quantile for the normal
distribution by the quantile for the t-distribution with n − 2 degrees of freedom.
With β̂0 and β̂1 the least squares estimates we introduce the fitted values x̂i = β̂0 + β̂1 yi and the residuals ei = xi − x̂i.
It is important to investigate if the model actually fits the data. That is, we need
methods that can tell if one or more of the fundamental model assumptions are
violated. The residuals should resemble the noise variables σε1 , . . . , σεn , which are iid.
Moreover, there is no relation between the yi ’s and the σεi ’s, and the variance of σεi
Math Box 3.6.1 (Linear regression). The theoretical solution to the minimization of RSS is most easily expressed geometrically. We denote by x ∈ R^n our vector of observations, by y ∈ R^n the vector of regressors and by 1 ∈ R^n a column vector of 1's. The quantity RSS(β) is the squared length of the vector x − β0 1 − β1 y in R^n. This length is minimized over β0 and β1 by the orthogonal projection of x onto the space in R^n spanned by y and 1.

One can find this projection if we know an orthonormal basis, and such a basis is found by setting a = (1/√n)·1 (so that a has length 1) and then replacing y by

b = (y − (y^T a)a) / √(∑_{i=1}^n (yi − ȳ)²).

Both a and b have unit length and they are orthogonal as a^T b = 0. Note that (y^T a)a = ȳ1 where ȳ = (1/n) ∑_{i=1}^n yi. We have to assume that at least two of the yi's differ or the sum in the denominator above is 0. If all the yi's are equal, the vectors y and 1 are linearly dependent, they span a space of dimension one, and β0 and β1 are not both identifiable. Otherwise the vectors span a space of dimension two.

The projection of x onto the space spanned by a and b is

(x^T a)a + (x^T b)b = (x̄ − ȳ·(x^T y − nȳx̄)/∑_{i=1}^n (yi − ȳ)²) 1 + ((x^T y − nȳx̄)/∑_{i=1}^n (yi − ȳ)²) y,

where the coefficient of 1 is β̂0 and the coefficient of y is β̂1, that is

β̂1 = (x^T y − nȳx̄)/∑_{i=1}^n (yi − ȳ)²   and   β̂0 = x̄ − ȳβ̂1.
does not depend upon the value yi . We typically investigate these issues by graphical
methods – diagnostic plots – based on the computed residuals. There is one caveat
though. The variance of the i’th residual does in fact depend upon the yi ’s, and it
can be shown to equal σ²(1 − hii) where

hii = 1/n + (yi − ȳ)²/∑_{j=1}^n (yj − ȳ)²

and ȳ = (1/n) ∑_{i=1}^n yi. The quantity hii is called the leverage of the i'th observation. The standardized residuals are defined as

ri = ei/(σ̂·√(1 − hii)).
The leverage quantifies how much the observation xi influences the fitted value
x̂i relative to the other observations. A large leverage (close to 1) tells that the
observation heavily influences the fitted value whereas a small value (close to 1/n)
tells that the observation has minimal influence.
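In R the leverages and standardized residuals of a fitted linear model are available through hatvalues and rstandard; a small sketch for the rubberLm fit from above:

> h <- hatvalues(rubberLm)    # the leverages h_ii
> r <- rstandard(rubberLm)    # standardized residuals e_i/(sigma-hat * sqrt(1 - h_ii))
> plot(fitted(rubberLm), r)   # residual plot with standardized residuals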
Math Box 3.6.2 (Fitted values). We continue here with the setup from Math Box 3.6.1. If Xi = β0 + β1 yi + σεi we find that the fitted value is

X̂i = (X^T a)ai + (X^T b)bi = β0 + β1 yi + σ ∑_{j=1}^n (aj ai + bj bi)εj.

The quantity usually denoted hii = ∑_{j=1}^n (aj ai + bj bi)² is known as the leverage, and if we use that a and b are orthonormal we find that

hii = ∑_{j=1}^n (aj ai + bj bi)² = ai²·a^T a + bi²·b^T b + 2ai bi·a^T b
    = ai² + bi² = 1/n + (yi − ȳ)²/∑_{j=1}^n (yj − ȳ)².
Plots of the residuals or the standardized residuals against the yi ’s or against the
fitted values x̂i ’s are used to check if the mean value specification β0 + β1 yi is ad-
equate. We expect to see an unsystematic distribution of points around 0 at the
y-axis and spread out over the range of the variable we plot against. Systematic
deviations from this, such as slopes or bends, in the point cloud indicate that the
model specification does not adequately capture how the mean of Xi is related to
yi . A plot of the standardized residuals, their absolute value – or the square root
of their absolute value, as the choice is in R – against the fitted values can be used
to diagnose if there are problems with the assumption of a constant variance. We
look for systematic patterns, in particular whether the spread of the distribution
of the points changes in a systematic way over the range of the fitted values. If so,
there may be problems with the constant variance assumption. Finally, we can also
compare the standardized residuals via a QQ-plot to the distribution of the εi ’s. A
QQ-plot of the empirical distribution of the standardized residuals against the normal distribution thus shows whether the assumption of normally distributed errors is reasonable.
Figure 3.20: The four standard diagnostic plots produced in R for a linear regression.
Call:
lm(formula = loss ~ hard, data = Rubber)
Residuals:
Min 1Q Median 3Q Max
-86.15 -46.77 -19.49 54.27 111.49
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 550.4151 65.7867 8.367 4.22e-09 ***
hard -5.3366 0.9229 -5.782 3.29e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this we read off the estimates β̂0 = 550.4 and β̂1 = −5.337, their standard
errors, the t-value for a test whether the parameter equals 0 and corresponding
p-value computed using the t-distribution with 28 degrees of freedom. As a visual
guidance for the eye, the stars are printed to highlight significance at different levels.
The conclusion is in this case that none of the parameters can be taken equal to 0.
The residual standard error is the estimate of σ.
To assess if the model is adequate we consider the four diagnostic plots shown in
Figure 3.20. These are the default plots produced by a call of plot(rubberLm). The
plot of the residuals against the fitted values shows that the residuals are scattered
randomly around 0 over the range of the fitted values. Thus the straight line seems
to capture the relation between loss and hardness well. The QQ-plot shows that the
distribution of the residuals is reasonably approximated by the normal distribution
– the deviations from a straight line are within the limits of what is conceivable
with only 30 observations. The plot of the square root of the absolute value of the
standardized residuals against the fitted values shows that the variance does not
seem to depend upon the mean. Moreover, there are no clear outliers. The fourth
and final plot shows the standardized residuals plotted against the leverage. We
should in particular be concerned with combinations of large leverage and large
standardized residual as this may indicate an outlier that has considerable influence
on the fitted model.
Confidence intervals for the two estimated parameters, β0 and β1 , can be computed
from the information in the summary table above. All we need is the standard error
and the relevant quantile from the t-distribution or the normal distribution. For
convenience, the R function confint can be used to carry out this computation.
> confint(rubberLm)
2.5 % 97.5 %
(Intercept) 415.657238 685.173020
hard -7.227115 -3.445991
The parameters β0 and β1 are, however, not the only parameters of interest in the
context of regression. If y ∈ R the predicted value

x̂(y) = β̂0 + β̂1 y

is of interest for the prediction of a future loss for a tire of hardness y. If we have a new data frame in R, called newdata, say, with a column named hard as our regressor, predicted values together with confidence and prediction intervals can be computed with the predict function.
Figure 3.21: Predicted values, 95%-confidence bands and 95%-prediction bands for
the linear regression model of tire abrasion loss as a function of tire hardness.
Figure 3.21 shows predicted values, 95%-confidence bands and 95%-prediction bands.
⋄
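A sketch of how the bands in Figure 3.21 can be computed with predict for the rubberLm fit; the grid of hardness values is chosen for illustration:

> newdata <- data.frame(hard = seq(40, 100, by = 1))
> ## pointwise 95%-confidence band for the mean value
> predict(rubberLm, newdata, interval = "confidence")
> ## 95%-prediction band for a new observation at each hardness
> predict(rubberLm, newdata, interval = "prediction")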
Figure 3.22: Residual plots for the linear regression model of log-density on log-
concentration of run 2 and 3 for the DNase ELISA experiment.
Example 3.6.5. The summary of the lm-object for one of the linear regressions for
the ELISA data considered in Example 3.6.2 reads
Call:
lm(formula = log(density) ~ log(conc), data = myDNase[myDNase[,1] == 2, ])
Residuals:
Min 1Q Median 3Q Max
-0.25254 -0.07480 0.01123 0.11977 0.17263
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.85572 0.03505 -24.41 7.09e-13 ***
log(conc) 0.69582 0.02025 34.36 6.39e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 3.22 shows the residual plot for two of the runs. Both plots are typical for
this dataset and show a lack of fit for the model. The residuals are positive for the
fitted values in the middle range and negative for the extreme fitted values. The
straight line does not seem to fully capture the relation between the log-density and
the log-concentration. We will return to a refined model in the next section.
The fact that the model does not fit the data does not mean that the linear regression
model is useless. After all, it captures the general relation between the log-density
and log-concentration up to some relatively small curvature, which on the other
hand appears very clearly on the residual plot. The systematic error of the model
for the mean value leads to a larger estimate of the variance than if the model was
more refined. Often this results in conservative conclusions such as wider confidence
intervals for parameters of interest and wider prediction intervals. Whether this is
tolerable must be decided on a case-by-case basis. ⋄
Example 3.6.6. Figure 3.23 shows the residual plot and the QQ-plot for the beaver
data considered in Example 3.6.3. The QQ-plot seems OK, but the residual plot
shows a problem. Around the middle of the residual plot the residuals show a sudden
change from being exclusively negative to being mostly positive. This indicates a
lack of fit in the model and thus that the temperature variation over time cannot be
ascribed to a 24 hour cycle alone.
In fact, there might be an explanation in the dataset for the change of the tem-
perature. In addition to the time and temperature it has also been registered if the
beaver is active or not. There is a 0-1 variable called activ in the dataset, which is
1 if the beaver is active. A refined analysis will include this variable so that we have
a three parameter model of the mean with
β0 + β1 y

if the beaver is inactive and

β0 + βactiv + β1 y

if the beaver is active. Here y = cos(2πt/1440) where t is time in minutes since 8.30
in the morning. The summary output from this analysis is:
Figure 3.23: Residual plot and QQ-plot for the beaver body temperature data.
Call:
lm(formula = temp ~ activ + cos(2 * pi * (time + 60)/1440), data = beaver2)
Residuals:
Min 1Q Median 3Q Max
-0.44970 -0.11881 -0.02208 0.17742 0.39847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.19976 0.04253 874.690 < 2e-16 ***
activ 0.53069 0.08431 6.294 8.95e-09 ***
cos(2 * pi * (time + 60)/1440) -0.24057 0.06421 -3.747 0.000304 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can read from the summary that the hypothesis that βactiv = 0 is clearly re-
jected. The estimate β̂activ = 0.531 suggests that the beaver being active accounts
for roughly half a degree of increase in the body temperature. Figure 3.24 shows the
fitted mean value and the residual plot. It might be possible to refine the model
even further – perhaps with a smoother transition between the inactive and active
state – but we will not pursue such refinements here. Instead we will focus on an-
Figure 3.24: Plot of the beaver temperature data and estimated curve (left) and
residual plot (right) when the additional variable activ is included in the analysis.
other potential problem. The observations of the temperature are taken every 10
minutes and this may give problems with the independence assumption. When we
deal with observations over time, this should always be taken into consideration. If
the observations are ordered according to observation time we can investigate the
assumption by a lag-plot of the residuals e1 , . . . , e100 , which is a plot of (ei−1 , ei )
for i = 2, . . . , 100. Figure 3.25 shows the lag-plot of the residuals. This plot should
look like a scatter plot of independent variables, but in this case it does not. On the
contrary, it shows a clear dependence between residual ei and the lagged residual
ei−1 .
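A minimal sketch of the lag-plot in R, assuming the fitted model from the summary above is stored in an object called beaverLm (the object name is hypothetical):

> e <- residuals(beaverLm)   # beaverLm is the hypothetical fitted lm object
> n <- length(e)
> plot(e[-n], e[-1], xlab = "Lagged residual", ylab = "Residual")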
Though the residuals show dependence and the model assumptions of independent
εi ’s are questionable, the estimated mean value function can still be used as a rea-
sonable estimate. The problem is in general that all estimated standard errors are
systematically wrong as they are based on the independence assumption, and they
are typically too small, which means that we tend to draw conclusions that are too
optimistic and provide confidence and prediction intervals that are too narrow. The
right framework for systematically correcting for this problem is time series analysis,
which is beyond the scope of these notes. ⋄
Figure 3.25: The temperature observations are ordered by time and this plot shows
the residuals plotted against the lagged residuals, that is, the residuals from the
previous time point.
Dividing regression into linear and non-linear regression is like dividing the animals
in the Zoo into elephants and non-elephants. In the world of non-linear regression
you can meet all sorts of beasts, and we can in this section only touch on a few
simple examples that you can find outside of the world of linear regression.
In this section we will also consider estimation using the least squares method only.
This is only the maximum-likelihood method if the εi -terms are normally distributed.
In the world of regression this is often referred to as non-linear least squares. We can
in general not expect to find closed form solutions to these minimization problems
and must rely on numerical optimization. A word of warning is appropriate here. We
cannot expect that the numerical optimization always goes as smoothly as desirable.
To find the correct set of parameters that globally minimizes the residual sum of
squares we may need to choose appropriate starting values for the algorithm to
converge, and it may very easily be the case that there are multiple local minima that we need to avoid.
For non-linear regression there are generalizations of many of the concepts from
linear regression. The fitted values are defined as x̂i = gβ̂(yi) and the residuals as ei = xi − x̂i.
To check the model we make residual plots of the residuals against either the re-
gressors yi or the fitted values x̂i and we look for systematic patterns that either
indicate that the model of the mean via the function gβ̂ is inadequate or that the
constant variance assumption is problematic. Moreover, we can compare the empir-
ical distribution of the residuals to the normal distribution via a QQ-plot. We don’t
have a simple leverage measure, nor do we have a formula for the variance of the
residuals. So even though the residuals may very well have different variances it is
not as easy to introduce standardized residuals that adjust for this. In the context
of non-linear regression the term standardized residual often refers to ei /σ̂ where we
simply divide by the estimated standard deviation.
All standard estimation procedures used will produce estimates of the standard error
for the β-parameters that enter in the model. These estimates are based on the as-
sumption that the εi ’s are normally distributed, but even under this assumption the
estimates will still be approximations and can be inaccurate. The estimated standard
errors can be used to construct confidence intervals for each of the coordinates in the
β-vector or alternatively to test if the parameter takes a particular value. In some
cases – but certainly not always – it is of particular interest to test if a parameter
equals 0, because it is generally a hypothesis about whether a simplified model is
adequate.
r = β1 y/(β2 + y). (3.20)
R = gβ (y) + σε
R Box 3.6.2 (Non-linear least squares regression). There are several dif-
ferent possibilities for fitting non-linear regression models in R. The nls
function does non-linear least squares estimation and is quite similar in use
to the lm function for ordinary least squares.
With ethConv a data frame with two columns named rate and conc we
can estimate the Michaelis-Menten curve from Example 3.6.7 by
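a call along the following lines (the starting values given here are illustrative guesses, not taken from the notes):

> ## starting values below are illustrative guesses
> ethConvNls <- nls(rate ~ beta1 * conc/(beta2 + conc),
+                   data = ethConv,
+                   start = c(beta1 = 0.3, beta2 = 0.03))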
Consult Example 3.6.7 on how to summarize the result of this call with the
summary function.
A main difference from lm is that in the formula specification
rate ~ beta1 * conc/(beta2 + conc) we need to explicitly include all
unknown parameters and we need to explicitly give an initial guess of the
parameters by setting the start argument. You can leave out the specifi-
cation of start in which case nls automatically starts with all parameters
equal to 1 – and gives a warning.
If you want to do things beyond what nls can do there are several solutions,
some are tied up with a particular area of applications or with particular
needs.
The drc package is developed for dose-response curve estimation but can
be useful for general non-linear regression problems. The main function,
multdrc, does non-linear least squares estimation but a particular advantage
is that it allows for the simultaneous fitting of multiple datasets. Doing so it
is possible to share some but not all parameters across the different curves.
Figure 3.26: The data and estimated Michaelis-Menten rate curve from Example
3.6.7
We can then call summary on the resulting object to get information on the parameter
estimates, estimated standard errors etc.
> summary(ethConvNls)
Parameters:
Estimate Std. Error t value Pr(>|t|)
beta1 0.309408 0.006420 48.19 5.35e-09 ***
beta2 0.029391 0.002503 11.74 2.30e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example 3.6.8 (Four parameter logistic model). One of the flexible and popular
non-linear regression models is known as the four parameter logistic model and is
given by

gβ(y) = (β2 − β1)/(1 + exp(β4(y − β3))) + β1.

Fitting this model with nls to the optical density as a function of log-concentration for run 1 of the DNase ELISA data gives the parameter estimates
Parameters:
Estimate Std. Error t value Pr(>|t|)
beta1 -0.007897 0.017200 -0.459 0.654
beta2 2.377241 0.109517 21.707 5.35e-11 ***
beta3 1.507405 0.102080 14.767 4.65e-09 ***
beta4 -0.941106 0.050480 -18.643 3.16e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The conclusion from this summary is that the parameter β1 can be taken equal to
0. This is sensible as we expect a 0 measurement if there is no DNase in the sample
(the concentration is 0). Taking β1 = 0 a reestimation of the model yields the new
estimates
Parameters:
Estimate Std. Error t value Pr(>|t|)
beta2 2.34518 0.07815 30.01 2.17e-13 ***
beta3 1.48309 0.08135 18.23 1.22e-10 ***
beta4 -0.96020 0.02975 -32.27 8.51e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 3.27: The data and estimated logistic model for the relation between optical
density and log-concentration for the DNase ELISA considered in Example 3.6.8.
For this model Figure 3.27 shows the data and the estimated mean value curve for
run 1 in the dataset together with the residual plot, which shows that the model is
adequate.
When y is the log-concentration an alternative formulation of the logistic model expressed directly in terms of the concentration is

gβ(conc) = (β2 − β1)/(1 + exp(β4(log(conc) − β3))) + β1 = (β2 − β1)/(1 + (conc/exp(β3))^β4) + β1.
Figure 3.28: Log-intensity (base 2) plotted against log-concentration (base 2) together with the estimated logistic curve.
We will use the logistic model as considered above to capture the relation between
log-concentration and log-intensity. The concentrations are in picoMolar.
Using nls in R the summary of the resulting object reads
Parameters:
Estimate Std. Error t value Pr(>|t|)
beta1 7.79500 0.06057 128.69 <2e-16 ***
beta2 12.68086 0.12176 104.15 <2e-16 ***
beta3 5.80367 0.09431 61.54 <2e-16 ***
beta4 -0.95374 0.08001 -11.92 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the summary we see that none of the parameters can be taken equal to 0. It is
most notable that β1 is significantly larger than 0 and in fact the estimate β̂1 = 7.795
is far from 0. It suggests that there is a considerable background signal even when
Figure 3.29: Residual plot (left) and QQ-plot of the residuals against the normal distribution (right) for the logistic model fitted in this example.
the concentration equals 0. Figure 3.28 shows the data and the estimated curve.
There are only 13 different concentrations but a large number of replications for each
concentration and the x-values have therefore been jittered to aid visualization. We
see from this plot that the logistic curve captures the overall behavior rather well, but
already from this plot it seems that there are problems with variance inhomogeneity
and a number of observations – in particular some of the small observations for the
high concentrations – seem problematic.
Figure 3.29 shows the residual plot and a QQ-plot of the residuals against the
normal distribution. It is clear from the residual plot that an assumption of the
same variance across the different concentrations does not hold. The normal QQ-plot
shows deviations from the normal distribution in both tails. This form of deviation
suggests that the distribution is left-skewed compared to the normal distribution
because there are too many small residuals and too few large residuals. However,
because the variance cannot be assumed constant this is in reality a mixture of
different distributions with different scales. ⋄
Exercises
For the next four exercises you are asked to consider the Rubber data from the MASS
library.
Exercise 3.6.1. In the dataset there is another variable, tens. Plot loss versus tens
and carry out a linear regression of loss on tens. This includes model diagnostics.
What is the conclusion?
Exercise 3.6.2. Compute the residuals when you regress loss on hard and plot the
residuals versus tens. Interpret the result.
Exercise 3.6.3. If we let yi denotes the hardness and zi the tension, consider the
extended model
Xi = β0 + β1 yi + β2 zi + σεi
where ε1 , . . . , ε30 are iid N (0, 1). Use lm (the formula should be loss~hard+tens) to
carry out a linear regression model including model diagnostics where you regress on
hardness as well as tension.
Exercise 3.6.4. Use the function predict to compute 95% prediction intervals for
the 30 observed values using
• the model where we only regress on hardness
• the model where we regress on hardness and tension.
Compare the prediction intervals. It can be useful to plot the prediction intervals
versus the fitted values.
Exercise 3.6.5. Show that the Michaelis-Menten rate equation 3.20 can be rephrased
as
1/r = (β2/β1)·(1/y) + 1/β1.
Argue that this formula suggests a linear regression model for the inverse of the rate
regressed on the inverse of the substrate concentration. Estimate the parameters using
this linear regression model and compare with Example 3.6.7.
3.7 Bootstrapping
The basic idea in bootstrapping for constructing confidence intervals for a parameter
of interest τ when we have an estimator τ̂ of τ = τ (θ) is to find an approximation
of the distribution of τ̂ − τ (θ0 ) – usually by doing simulations that depend upon the
observed dataset x ∈ E. What we attempt is to approximate the distribution of
τ̂ − τ (θ0 )
A minor modification of the algorithm is given by replacing the last two bullet
points by
• Compute the empirical mean τ̄ = (1/B) ∑_{i=1}^B τ̂(xi) and the empirical standard deviation

ŝe = √((1/(B − 1)) ∑_{i=1}^B (τ̂(xi) − τ̄)²).

• Define I(x) = [τ̂(x) − ŝe·zα , τ̂(x) + ŝe·zα] where zα is the 1 − α/2 quantile for the N(0, 1)-distribution.
The two most important and commonly encountered choices of Px are known as
parametric and non-parametric bootstrapping respectively. We consider parametric
bootstrapping here and non-parametric bootstrapping in the next section.
Parameters

                                       α              β              LD50
ŝe                                     0.713          0.358          0.080
[τ̂(x) − 1.96·ŝe , τ̂(x) + 1.96·ŝe]      [3.74, 6.53]   [2.00, 3.41]   [−2.06, −1.74]
[τ̂(x) − ŵ0.05 , τ̂(x) − ẑ0.05]          [3.47, 6.14]   [1.81, 3.19]   [−2.06, −1.74]
Note that the confidence interval(s) for LD50 are quite narrow and the estimated
standard deviation is likewise small – at least compared to the parameters α and β.
The parameter LD50 is simply much better determined than the two original param-
eters. Note also that the two different types of intervals differ for α and β but not for
LD50. This can be explained by a right-skewed distribution of the estimators for the two former parameters, which turns into a left translation of the second type of confidence interval as compared to the symmetric intervals based on ŝe. The distribution of the estimator for LD50 is much more symmetric around the estimated LD50. ⋄
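A sketch of how such a parametric bootstrap could be carried out in R; the data frame flyDeath with a 0-1 column dead and a column logconc, and the object names, are hypothetical stand-ins for the fly death data:

> flyGlm <- glm(dead ~ logconc, family = binomial, data = flyDeath)
> ld50hat <- -coef(flyGlm)[1]/coef(flyGlm)[2]
> B <- 1000
> ld50star <- replicate(B, {
+   ## simulate new responses from the estimated model
+   deadstar <- rbinom(nrow(flyDeath), 1, fitted(flyGlm))
+   fitstar <- glm(deadstar ~ flyDeath$logconc, family = binomial)
+   -coef(fitstar)[1]/coef(fitstar)[2]
+ })
> sd(ld50star)                                  # bootstrap estimate of the standard error
> zw <- quantile(ld50star - ld50hat, c(0.025, 0.975))
> c(ld50hat - zw[2], ld50hat - zw[1])           # interval of the second type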
The empirical measure is the collection of relative frequencies, εn (A), for all events
A ⊆ E, which we encountered when discussing the frequency interpretation in Sec-
tion 2.3. It is also the frequency interpretation that provides the rationale for con-
sidering the empirical measure.
To define non-parametric bootstrapping we need to assume that the dataset consid-
ered consists of realizations of iid random variables. Thus we assume that E = E0^n
and that the dataset x = (x1 , . . . , xn ) is a realization of n iid random variables
X1 , . . . , Xn . The empirical measure based on x1 , . . . , xn on E0 is denoted εn .
In non-parametric bootstrapping we take

Px = εn ⊗ . . . ⊗ εn.
That is, random variables X1 , . . . , Xn with distribution Px are iid each having the
empirical distribution εn .
Whereas the procedure for doing parametric bootstrapping is straightforward, the definition of non-parametric bootstrapping may seem a little more difficult – in fact, somewhat complicated. In practice it is really the other way around. How to do parametric bootstrap simulations from
Pθ̂(x) relies upon the concrete model considered, and sometimes it can be difficult
to actually simulate from Pθ̂(x) . It does as a minimum require a simulation algo-
rithm that is model dependent. To do non-parametric bootstrap simulations using
the empirical measure is on the contrary easy and completely model independent.
Simulating from Px is a matter of simulating (independently) from εn , which in turn
is a matter of sampling with replacement from the dataset x1 , . . . , xn . Result 3.7.5 below is an adaptation of the general Algorithm 2.11.1, which is suitable for simulat-
ing from any empirical measure. The recommended approach is to sample indices
from the dataset uniformly. This approach is efficient and completely generic. The
implementation requires no knowledge about the original sample space whatsoever.
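In R this is a single call to the sample function; a minimal sketch, assuming x is the vector holding the dataset x1 , . . . , xn :

> ## sample 1000 values with replacement from the empirical measure of x
> y <- sample(x, size = 1000, replace = TRUE)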
The vector y then contains 1000 simulations from the empirical measure. Note the
parameter replace which by default is FALSE.
Result 3.7.5. Let x1 , . . . , xn be a dataset with values in the sample space E and
corresponding empirical measure. If U is uniformly distributed on {1, . . . , n} then
the distribution of
X = xU
is the empirical measure εn .
Consider the map hx1 ,...,xn : {1, . . . , n} → E defined by

hx1 ,...,xn (i) = xi .

Then if U has the uniform distribution on {1, . . . , n} the result states that the transformed variable hx1 ,...,xn (U ) has the empirical distribution.
Remark 3.7.6. It follows from the result that if U1, . . . , UB are B iid uniformly distributed random variables taking values in {1, . . . , n} then X1, . . . , XB defined by Xi = xUi are B iid random variables, each having the empirical distribution εn.
Non-parametric bootstrapping a priori only makes sense if our observables are iid. In a regression setup we can therefore not use non-parametric bootstrapping directly. However, we can non-parametrically bootstrap the residuals, which gives a bootstrapping algorithm where

Xi = gβ̂(yi) + eUi

with U1, . . . , Un iid uniformly distributed on {1, . . . , n}. Thus we use the parametric estimate of the mean value relation between the observable and the regressor and then the non-parametric, empirical distribution of the residuals to sample new error variables.
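A sketch of how this can look in R for a simple linear regression; lm and the data frame d are stand-ins for the fitted mean value relation gβ̂ and the actual data, so this is an illustration rather than the notes' own implementation:

d <- data.frame(x = 1:20, y = 2 + 3 * (1:20) + rnorm(20))      # placeholder data
fit <- lm(y ~ x, data = d)
e <- residuals(fit)                                            # estimated residuals
B <- 1000
boot.coef <- replicate(B, {
  ystar <- fitted(fit) + sample(e, length(e), replace = TRUE)  # bootstrap responses
  coef(lm(ystar ~ d$x))                                        # refit on the bootstrap sample
})
# boot.coef is a 2 x B matrix of bootstrapped intercept and slope estimates.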
Math Box 3.7.1 (Percentile interval). Assume that there exists a strictly increas-
ing function ϕ : R → R such that for all θ ∈ Θ the distribution of ϕ(τ̂ ) is symmetric
around ϕ(τ ) under Pθ . This means that the distribution of ϕ(τ̂ ) − ϕ(τ ) equals the
distribution of ϕ(τ) − ϕ(τ̂), and in particular it follows that

z′α(τ) + w′α(τ) = 2ϕ(τ),

where z′α(τ) and w′α(τ) are the α/2 and 1 − α/2 quantiles for the distribution of ϕ(τ̂) under Pθ. The construction of standard (1 − α)-confidence intervals for ϕ(τ) gives

[2ϕ(τ̂(x)) − w′α(τ̂(x)), 2ϕ(τ̂(x)) − z′α(τ̂(x))] = [z′α(τ̂(x)), w′α(τ̂(x))].
Using that ϕ is strictly increasing allows us to take the inverse and find that the
corresponding quantiles for the distribution of τ̂ are zα = ϕ−1 (zα′ ) and wα =
ϕ−1 (wα′ ). The confidence interval obtained in this way for τ – by transforming
back and forth using ϕ – is therefore precisely the percentile interval [zα , wα ].
This argument in favor of the percentile method relies on the existence of the func-
tion ϕ, which introduces symmetry and justifies the interchange of the quantiles.
For any practical computation it is completely irrelevant to actually know ϕ. What
is really the crux of the matter is whether there exists such a ϕ that also makes
the distribution of ϕ(τ̂ ) − ϕ(τ ) largely independent of θ.
We may discard this construction by arguing that the interval [zα, wα] is simply a misunderstanding of the idea behind confidence intervals. Confidence intervals are intervals of parameters for which the observation is reasonably likely. The interval [zα, wα] is an interval where the estimator will take its value with high probability if θ = θ̂(x). This is in principle something entirely different. There are arguments, though, that justify the procedure. They are based on the existence of an implicit parameter transformation combined with a symmetry consideration; see Math Box 3.7.1. The interval [zα, wα] – and its refinements – is known in the literature as the percentile confidence interval or percentile interval for short.
Symmetry, or approximate symmetry, of the distribution of the estimators around τ
makes the percentile interval and the classical interval very similar in many practical
cases. But rigorous arguments that justify the use of the percentile intervals are even
more subtle; see Math Box 3.7.1. If the distribution of the estimator is skewed, the
percentile intervals may suffer from having actual coverage which is too small. There
are various ways to remedy this, known as the bias corrected percentile method
and the accelerated bias corrected percentile method, but we will not pursue these
matters here.
A further argument in favor of the percentile method is invariance under monotone parameter transformations. This means that if ϕ : R → R is an increasing function and τ′ = ϕ(τ) is our new parameter of interest, and if [zα, wα] is a (1 − α)-percentile confidence interval for τ, then [ϕ(zα), ϕ(wα)] is a (1 − α)-percentile confidence interval for τ′ = ϕ(τ).
Exercises
Exercise 3.7.1. Consider the Poisson model from Exercise 3.3.4. Implement para-
metric bootstrapping for computing confidence intervals for the two parameters α and
β. Compute for the Ceriodaphnia data 95% confidence intervals for the parameters.
The Poisson model can be estimated using glm in R:
glm(organisms~concentration,family=poisson,data=Ceriodaphnia)
Use summary on the result to compute estimates of the standard error and compare
the bootstrapped confidence intervals with standard confidence intervals based on the
estimated standard error.
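One possible way to organize the parametric bootstrap for this exercise is sketched below; it assumes the Ceriodaphnia data frame with columns organisms and concentration is available, and the percentile intervals shown are one of several reasonable choices:

fit <- glm(organisms ~ concentration, family = poisson, data = Ceriodaphnia)
B <- 1000
boot.coef <- replicate(B, {
  ystar <- rpois(nrow(Ceriodaphnia), fitted(fit))   # simulate from the fitted model
  coef(glm(ystar ~ Ceriodaphnia$concentration, family = poisson))
})
apply(boot.coef, 1, quantile, probs = c(0.025, 0.975))  # 95% bootstrap intervals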
Exercise 3.7.2. Consider the four parameter logistic model for the ELISA data from
Example 3.6.8. Implement a bootstrapping algorithm (parametric or non-parametric)
for the computation of confidence intervals for the predicted value gβ̂ (x). Use the al-
gorithm for computing 95%-confidence intervals for predicted values using the ELISA
data from Example 3.6.8 for different choices of x – make a plot of the confidence
bands. Compare with corresponding confidence bands computed using a linear re-
gression on the log-log-transformed data as in Example 3.6.2.
4
Mean and Variance
We have previously introduced the mean and the variance for probability measures
on R given by a density or given by point probabilities. For the further development
it is beneficial to put these definitions into a more general context. In this chapter
we deal with expectations of real valued random variables in general. We get several
convenient results about how to compute expectations (means) and variances, and
we get access to a better understanding of how the empirical versions approximate
the theoretical mean and variance. We also touch upon higher order moments and
we discuss the multivariate concept of covariance. As an illustration of how some
of these methods can be applied, we discuss Monte Carlo simulations as a general
method based on random simulation for computing integrals numerically and we
discuss aspects of asymptotic theory. The chapter ends with a brief treatment of
entropy.
4.1 Expectations
For the general development and computations of means and variances of real valued
random variables it is useful to introduce some notation. With reference to Sections
2.4 and 2.6 we define the expectation of a real valued random variable by one of the
two following definitions.
Definition 4.1.1. If X is a real valued random variable with density f then
EX = ∫_{−∞}^{∞} x f(x) dx.
If X1 and X2 are two real valued random variables with a joint distribution having
density f (x1 , x2 ) Result 2.13.5 shows that their marginal densities are
f_1(x_1) = ∫ f(x_1, x_2) dx_2   and   f_2(x_2) = ∫ f(x_1, x_2) dx_1.
Provided that the integrals make sense we can compute the expectation of X1 as
E X_1 = ∫ x_1 f_1(x_1) dx_1 = ∫∫ x_1 f(x_1, x_2) dx_2 dx_1.
The result in Result 4.1.3 is very useful since we may not be able to find an explicit
analytic expression for the density of the distribution of h(X) – the distribution may
not even have a density – but often the distribution of X is specified in terms of the
density f . The computation of the iterated integrals can, however, be a horrendous
task.
Taking h(x1 , x2 ) = x1 + x2 we find that
E(X_1 + X_2) = ∫∫ (x_1 + x_2) f(x_1, x_2) dx_1 dx_2
             = ∫∫ x_1 f(x_1, x_2) dx_2 dx_1 + ∫∫ x_2 f(x_1, x_2) dx_1 dx_2
             = E X_1 + E X_2.
The computations are sensible if X1 +X2 has finite expectation, but using the remark
above and noting that |x1 + x2 | ≤ |x1 | + |x2 | it follows that X1 + X2 indeed has finite
expectation if X1 and X2 have finite expectations. In conclusion, the expectation of
the sum is the sum of the expectations.
A result similar to Result 4.1.3 but for a discrete distribution can also be derived.
In fact, we will do that here. If X is a random variable with values in a discrete set
E, the random variable h(X) takes values in the discrete subset E ′ ⊆ R given by
E ′ = {h(x) | x ∈ E}.
For each z ∈ E ′ we let Az = {x ∈ E | h(x) = z} denote the set of all x’s in E, which
h maps to z. Note that each x ∈ E belongs to exactly one set Az . We say that the
sets Az , z ∈ E ′ , form a disjoint partition of E. The distribution of h(X) has point
probabilities (q(z))z∈E ′ given by
q(z) = P(A_z) = Σ_{x∈A_z} p(x)
by Definition 2.9.3, and using Result 4.1.2 the expectation of h(X) can be written
as

E h(X) = Σ_{z∈E′} z q(z) = Σ_{z∈E′} z Σ_{x∈A_z} p(x).
Since the sets Az , z ∈ E ′ , form a disjoint partition of the sample space E the sum
on the r.h.s. above is precisely a sum over all elements in E, hence
E h(X) = Σ_{x∈E} h(x) p(x).
Remark 4.1.6. Similar to the continuous case, h(X) has finite expectation if and
only if

Σ_{x∈E} |h(x)| p(x) < ∞.
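As a small illustration of the formula, the following R lines compute E h(X) for a discrete distribution; the fair die and h(x) = x² are arbitrary choices:

x <- 1:6                   # sample space of a fair die
p <- rep(1/6, 6)           # point probabilities
h <- function(x) x^2
sum(h(x) * p)              # E h(X) = 91/6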
Remark 4.1.7. If X is a Bernoulli variable with success probability p we find that
EX = 1 × P(X = 1) + 0 × P(X = 0) = p. (4.1)
In Section 4.2 we will develop the theoretical details for the assignment of an expec-
tation to a real valued random variable. In summary, a positive real valued random
variable X can be assigned an expectation EX ∈ [0, ∞] – but it may be equal to ∞.
A real valued random variable X with E|X| < ∞ – the expectation of the positive
real valued random variable |X| is finite – can be assigned an expectation EX ∈ R.
If E|X| < ∞ we say that the random variable X has finite expectation. The main
conclusion from Section 4.2 can be stated as the following result.
Result 4.1.8. If X and Y are two real valued random variables with finite expecta-
tion then X + Y has finite expectation and
E(X + Y ) = EX + EY.
Furthermore, if c ∈ R is a real valued constant then cX has finite expectation and
E(cX) = cEX.
Moreover, if X and Y are independent real valued random variables with finite ex-
pectation then
E(XY ) = EX EY.
Example 4.1.9. If X1 , . . . , Xn are iid Bernoulli variables with success probability
p then
X = X1 + . . . + Xn ∼ Bin(n, p).
We can find the expectation for the binomially distributed random variable X by
using Result 4.1.2
E X = Σ_{k=0}^{n} k (n choose k) p^k (1 − p)^{n−k},
but it requires a little work to compute this sum. It is much easier to use Result
4.1.8 together with (4.1) to obtain that
EX = EX1 + . . . + EXn = p + . . . + p = np.
⋄
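The identity E X = np is also easy to check numerically in R by computing the sum directly; n = 10 and p = 0.3 are arbitrary choices:

n <- 10; p <- 0.3
sum(0:n * dbinom(0:n, n, p))   # equals n * p = 3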
Example 4.1.10. Let X1 , . . . , Xn be iid random variables with values in the four
letter alphabet E = {A, C, G, T}. Let w = w1 w2 . . . wm denote a word from this
alphabet with m letters. Assume that m ≤ n (we may think of m ≪ n) and define
N = Σ_{i=1}^{n−m+1} 1(X_i X_{i+1} · · · X_{i+m−1} = w).
Thus N is the number of times word w occurs in the sequence. It follows using (4.1)
and Result 4.1.8 that the expectation of N is
E N = Σ_{i=1}^{n−m+1} E 1(X_i X_{i+1} · · · X_{i+m−1} = w)
    = Σ_{i=1}^{n−m+1} P(X_i X_{i+1} · · · X_{i+m−1} = w)
    = (n − m + 1) p(w_1) · · · p(w_m),

where (p(z))_{z∈E} denotes the point probabilities for the common distribution of the letters. ⋄
For k = 2 the central second moment is the variance, and will be treated in greater
detail below.
Occasionally we may also use the notation x̄ to denote the average of n real numbers.
Though the average is always a well defined quantity, we can only really interpret
this quantity if we regard the x’s as a realization of identically distributed random
variables. If those variables have distribution P and if X is a random variable with
distribution P , then we regard µ̂n as an estimate of µ = EX. In Section 4.5 we derive
some results that say more precisely how µ̂n behaves as an estimate of µ = EX.
In this section we will make one observation – namely that the average is also the
expectation of a random variable, whose distribution is the empirical distribution,
εn , given by x1 , . . . , xn . The empirical distribution on R, which is defined in terms of
x1 , . . . , xn , can be seen as a transformation of the uniform distribution on {1, . . . , n}
via
hx1 ,...,xn : {1, . . . , n} → R,
which is defined by
hx1 ,...,xn (i) = xi .
The expectation of X under the empirical measure is then E_{εn} X = Σ_{i=1}^{n} x_i p(i) = µ̂_n, where p(i) = 1/n are the point probabilities for the uniform distribution on {1, . . . , n}.
Notationally it may be useful to express what the distribution of X is when we
compute its expectation – if that is not clear from the context. Therefore we write
EP X for the expectation, EX, if the distribution of X is P . This allows us for
instance to write
µ̂_n = E_{εn} X.
Exercises
Exercise 4.1.1. Consider the setup from Example 4.1.10 with the word w = TATAAA.
Compute the expectation of N when n = 10000 and the random variables have the
uniform distribution on the alphabet.
4.2 More on expectations

This section contains a rather technical development of the results for computing
expectations based on some fundamental results from measure and integration theory
that is beyond the scope of these notes. The section can be skipped in a first reading.
Result 4.2.1 (Fact). There exists an expectation operator E that assigns to any
positive random variable X (i.e. X takes values in [0, ∞)) a number
EX ∈ [0, ∞]
such that:
One should note two things. First, the expectation operator assigns an expectation
to any random variable provided it is real valued and positive. Second, that the
expectation may be +∞. The theorem presented as a fact above gives little infor-
mation about how to compute the expectation of a random variable. For this, the
reader is referred to the previous section.
One of the consequences that we can observe right away is, that if X and Y are
two positive random variables such that X ≤ Y (meaning P(X ≤ Y ) = 1), then
Y − X ≥ 0 is a positive random variable with E(Y − X) ≥ 0. Consequently, if
X ≤ Y, then X + (Y − X) = Y and hence EX ≤ EX + E(Y − X) = EY; the expectation is in other words monotone.
Math Box 4.2.1 (The expectation operator). The expectation operator of a pos-
itive random variable X can be defined in the following way. First, for 0 = s0 <
s1 < . . . < sn some positive real numbers we form a subdivision of the positive half
line [0, ∞) into n + 1 disjoint intervals I_0, I_1, . . . , I_n given as

[0, ∞) = I_0 ∪ I_1 ∪ . . . ∪ I_{n−1} ∪ I_n = [0, s_1] ∪ (s_1, s_2] ∪ . . . ∪ (s_{n−1}, s_n] ∪ (s_n, ∞).
Then we can compute the average of the si ’s weighted by the probabilities that
X ∈ Ii :
ξ_n(X) := Σ_{i=0}^{n} s_i P(X ∈ I_i).
If the size of each of the intervals shrinks towards zero as n → ∞ and sn → ∞ then
it is possible to show that ξn (X) always converges to something. Either a positive
real number or +∞. This limit is called the expectation of X and we may write
EX = lim ξn (X).
n→∞
Note that ξn (X) is defined entirely in terms of the distribution of X and so is the
limit.
Moreover, we can also regard

X_n = Σ_{i=0}^{n} s_i 1(X ∈ I_i)

as a discrete approximation of X whose expectation equals ξ_n(X), which can be seen using for instance that 1(X ∈ I_i) is a Bernoulli variable. It is definitely beyond the
scope of these notes to show that ξn (X) converges let alone that Result 4.2.1 holds
for the limit. The mathematically inclined reader is referred to the literature on
measure and integration theory.
Moreover, note that if X is any random variable and A an event then 1(X ∈ A) is a Bernoulli random variable and by (4.3)

E 1(X ∈ A) = P(X ∈ A).
Example 4.1.10 in the previous section actually showed how the expectation of a
positive random variable could be computed from the few elementary rules in Result
4.2.1. One question arises. Why does Result 4.2.1 require that X is positive and not
just real valued? The answer lies in the following consideration: If X is a real valued
random variable we can define two positive real valued random variables by

X⁺ = max(X, 0)   and   X⁻ = max(−X, 0),

which are called the positive and negative part of X respectively. They are both
transformations of X, and one can get X back from these two variables by
X = X + − X −.
Definition 4.2.2. If X is a real valued random variable we say that it has finite
expectation if
E|X| < ∞.
In this case the expectation of X, EX, is well defined by (4.7), that is
EX = EX + − EX − .
Without too much trouble, one can now show that the properties listed above as
(4.4) and (4.5) for the expectation of a positive random variable carries over in an
appropriate form for the expectation of a general real valued variable. This provides
a proof of the fundamental Result 4.1.8.
The derivation of Result 4.1.8 from the properties of E for positive random variables
is mostly a matter of bookkeeping. First we note that
|X + Y | ≤ |X| + |Y |,
so if E|X| < ∞ and E|Y | < ∞ then by (4.6) E|X + Y | < ∞. So X + Y does have
finite expectation if X and Y do.
To show the additivity we need a little trick. We have that
(X + Y )+ − (X + Y )− = X + Y = X + − X − + Y + − Y − ,
(X + Y )+ + X − + Y − = X + + Y + + (X + Y )− ,
where all terms on both sides are positive random variables. Then we can apply
Result 4.2.1, (4.5), to obtain
E(X + Y)⁺ + EX⁻ + EY⁻ = E(X + Y)⁻ + EX⁺ + EY⁺,

and rearranging the terms gives E(X + Y) = E(X + Y)⁺ − E(X + Y)⁻ = EX + EY. For multiplication by a constant c we note that |cX| = |c||X|, and we see that cX has finite expectation if and only if |X| has finite expectation.
Moreover, if c ≥ 0
(cX)+ = cX + and (cX)− = cX − ,
thus E(cX) = E(cX)⁺ − E(cX)⁻ = c EX⁺ − c EX⁻ = c EX. If c < 0 then

(cX)⁺ = −cX⁻   and   (cX)⁻ = −cX⁺,

and E(cX) = E(cX)⁺ − E(cX)⁻ = −c EX⁻ + c EX⁺ = c EX.
where the third equality follows from independence of X and Y. Without indepen-
dence the conclusion is wrong. This is the starting point for a general proof – relying
on the construction discussed in Math Box 4.2.1.
At this point in the abstract development it is natural to note that the Definitions
4.1.1 and 4.1.2 of the expectation of course coincide with the abstract definition.
In fact, Definitions 4.1.1 and 4.1.2 can be viewed as computational techniques for
actually computing the expectation. As we showed in Example 4.1.10 we may be able
to derive the mean of a random variable without getting even close to understanding
the actual distribution of the random variable. The integer variable N introduced
in Example 4.1.10 has a far more complicated distribution than we can handle, but
its mean can be derived based on the simple computational rules for E. Returning
to Definitions 4.1.1 and 4.1.2 for computing the expectation should be regarded as
a last resort.
Example 4.2.3 (Mixtures). Most of the standard distributions like the normal
distribution are not suited for models of multimodal data. By this we mean data
that cluster around several different points in the sample space. This will show up
on a histogram as multiple modes (peaks with valleys in between). If we for instance
consider gene expressions for a single gene in a group of patients with a given disease,
multiple modes may occur as a result of, yet unknown, subdivisions of the disease on
the gene expression level. In general, we want to capture the phenomenon that there
is a subdivision of the observed variable according to an unobserved variable. This
is captured by a triple of independent variables (Y, Z, W ) where Y and Z are real
valued random variables and W is a Bernoulli variable with P(W = 1) = p. Then
we define
X = Y W + Z(1 − W ).
The interpretation is that the distribution of X is given as a mixture of the distri-
bution of Y and Z in the sense that either W = 1 (with probability p) in which
case X = Y or else W = 0 (with probability 1 − p) in which case X = Z. Since
|X| ≤ |Y |W + |Z|(1 − W ) ≤ |Y | + |Z| it follows by Result 4.1.8 that
E|X| ≤ E|Y | + E|Z|.
Thus X has finite expectation if Y and Z have finite expectation. Moreover, Result
4.1.8 implies, since Y and W are independent and Z and W are independent, that
EX = E(Y W ) + E(Z(1 − W )) = EY EW + EZ (1 − E(W )) = pEY + (1 − p)EZ.
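A small simulation sketch of the mixture construction; the choice of Y and Z as normal variables with means 5 and 0 and of p = 0.3 is arbitrary:

set.seed(1)
n <- 100000
p <- 0.3
W <- rbinom(n, 1, p)
Y <- rnorm(n, mean = 5)
Z <- rnorm(n, mean = 0)
X <- Y * W + Z * (1 - W)
c(mean(X), p * 5 + (1 - p) * 0)   # empirical mean versus p EY + (1 - p) EZ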
Math Box 4.2.2. If X is a positive real valued random variable with distribution function F then it has finite expectation if and only if ∫_0^∞ (1 − F(x)) dx < ∞, in which case

EX = ∫_0^∞ (1 − F(x)) dx.
This makes us in principle capable of computing the expectation of any real valued
random variable X. Both X + and X − are positive random variables, and with F +
and F − denoting their respective distribution functions we get that if
∫_0^∞ (1 − F⁺(x)) dx < ∞   and   ∫_0^∞ (1 − F⁻(x)) dx < ∞,

then

EX = EX⁺ − EX⁻ = ∫_0^∞ (1 − F⁺(x)) dx − ∫_0^∞ (1 − F⁻(x)) dx = ∫_0^∞ (F⁻(x) − F⁺(x)) dx.
but note that the distribution of X is far from being a normal distribution. ⋄
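The formula in Math Box 4.2.2 is easy to check numerically in R for a concrete distribution; the exponential distribution with rate 2, whose mean is 1/2, is an arbitrary choice:

integrate(function(x) 1 - pexp(x, rate = 2), 0, Inf)   # approximately 0.5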
4.3 Variance
Definition 4.3.1. If X is a real valued random variable with expectation EX, then
if X has finite second moment, that is, if X 2 has finite expectation, we define the
variance of X as
VX = E(X − EX)2 (4.8)
and the standard deviation is defined as √(VX).
The variance is the expectation of the squared difference between X and its expec-
tation EX. This is a natural way of measuring how variable X is.
Remark 4.3.2. Writing out (X − EX)2 = X 2 − 2XEX + (EX)2 and using Result
4.1.8 we obtain
VX = EX 2 − 2EXEX + (EX)2 = EX 2 − (EX)2 , (4.9)
which is a useful alternative way of computing the variance. The expectation of X 2 ,
EX 2 , is called the second moment of the distribution of X.
Remark 4.3.3. For any µ ∈ R we can write
(X − µ)2 = (X − EX + EX − µ)2
= (X − EX)2 + 2(X − EX)(EX − µ) + (EX − µ)2 ,
from which

E(X − µ)² = VX + (EX − µ)² ≥ VX

with equality if and only if EX = µ. The number E(X − µ)² is the expected squared
difference between µ and X, and as such a measure of how much the outcome
deviates from µ on average. We see that the expectation EX is the unique value of µ
that minimizes this measure of deviation. The expectation is therefore in this sense
the best constant approximation to any outcome of our experiment.
and

V(σX + µ) = E(σX + µ − µ)² = E(σ²X²) = σ² VX = σ².

Conversely, if X has mean µ and variance σ² we refer to (X − µ)/σ as the normalization of X, which has mean 0 and standard deviation 1. ⋄
The empirical variance of a dataset x_1, . . . , x_n is σ̃_n² = (1/n) Σ_{i=1}^{n} (x_i − µ̂_n)², where µ̂_n is the empirical mean. The empirical variance is an estimate of the variance
σ 2 . Like the empirical mean, the empirical variance is the variance of a random
variable having distribution εn . With the notation as in Section 4.1.1 and using
Result 4.1.5 the variance of X having distribution εn is

V_{εn} X = (1/n) Σ_{i=1}^{n} (x_i − µ̂_n)² = σ̃_n².
As for the expectation we use the subscript notation, VP X, to denote the variance,
VX, of X if the distribution of X is P . The square root of the empirical variance,
σ̃_n = √(σ̃_n²),
is called the sample standard deviation and is an estimate of the standard deviation,
σ, of the random variables.
Since σ̃n2 is the variance of a random variable X having distribution εn we can use
computational rules for variances. For instance, (4.9) can be used to obtain the
alternative formula
σ̃_n² = V_{εn} X = E_{εn} X² − (E_{εn} X)² = (1/n) Σ_{i=1}^{n} x_i² − µ̂_n².   (4.12)
It should be remarked that whereas (4.9) can be quite useful for theoretical com-
putations it may not be suitable for numerical computations. This is because both
(1/n) Σ_{i=1}^{n} x_i² and µ̂_n² can attain very large numerical values, and subtracting numeri-
cally large numbers can lead to a serious loss of precision.
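The loss of precision is easy to provoke in R by shifting data with a small variance far away from zero; the numbers below are arbitrary:

x <- rnorm(1e6) + 1e8        # variance close to 1, but a huge mean
mean(x^2) - mean(x)^2        # the formula from (4.12); numerically unreliable here
mean((x - mean(x))^2)        # the direct two-pass formula; close to 1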
Example 4.3.6 (Empirical normalization). We consider a dataset x1 , . . . , xn ∈ R
and let X be a random variable with distribution εn (the empirical distribution).
Then by definition

E_{εn} X = µ̂_n   and   V_{εn} X = σ̃_n².
The normalized dataset, {x′1 , . . . , x′n }, is defined by
x′_i = (x_i − µ̂_n)/σ̃_n
and the normalized empirical distribution, ε′n , is given by the normalized dataset. If
X has distribution εn the normalized random variable

X′ = (X − µ̂_n)/σ̃_n

has distribution ε′_n, and we find, referring to Example 4.3.5, that

E_{ε′_n} X′ = E_{εn}( (X − µ̂_n)/σ̃_n ) = 0

and

V_{ε′_n} X′ = V_{εn}( (X − µ̂_n)/σ̃_n ) = 1.
In other words, if we normalize the dataset using the empirical mean and sample
standard deviation, the resulting normalized dataset has empirical mean 0 and sam-
ple standard deviation 1. ⋄
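In R the normalization of Example 4.3.6 can be written as follows, with a small placeholder dataset:

x <- c(1.2, 3.4, 2.2, 5.1, 0.7)
sigma.tilde <- sqrt(mean((x - mean(x))^2))   # empirical standard deviation
xprime <- (x - mean(x)) / sigma.tilde
c(mean(xprime), mean(xprime^2))              # 0 and 1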
Example 4.3.7. In Example 3.1.4 we considered the two model paradigms – addi-
tive noise and multiplicative noise. The additive noise model for X was formulated
as
X = µ + σε (4.13)
where µ ∈ R, σ > 0 and ε is a real valued random variable with Eε = 0 and Vε = 1.
Using the rules for E developed in this chapter we find that
EX = µ + σEε = µ,
and
VX = σ 2 Vε = σ 2 .
The multiplicative noise model was given as
X = µε (4.14)
EX = µEε.
and
VX = µ2 Vε.
The notable property of the multiplicative noise model is that the standard deviation, √(VX) = µσ, scales with the mean. That is, the larger the expected value of X is the
larger is the noise. For the additive noise model the size of the noise is unrelated to
the mean value.
As discussed in Example 3.1.4 one often transforms the multiplicative noise model via the logarithm to get

log X = log µ + log ε,

which is an additive noise model. Logarithms and expectations are not interchangeable, though; by Jensen's inequality it actually holds that E(log X) ≤ log(EX).
4.4 Multivariate Distributions

If we consider two real valued random variables X and Y, the bundled variable
(X, Y ) takes values in R2 . The mean and variance of each of the variables X and Y
rely exclusively on the marginal distributions of X and Y . Thus they tell us nothing
about the joint distribution of X and Y . We introduce the covariance as a measure
of dependency between X and Y .
Definition 4.4.1. If XY has finite expectation the covariance of the random variables X and Y is defined as

V(X, Y) = E((X − EX)(Y − EY)),

and if, in addition, VX > 0 and VY > 0 the correlation is defined as

corr(X, Y) = V(X, Y)/√(VX VY).   (4.16)
The covariance is a measure of the covariation, that is, the dependency between the
two random variables X and Y . The correlation is a standardization of the covariance
by the variances of the coordinates X and Y .
We should note that the covariance is symmetric in X and Y:

V(X, Y) = V(Y, X).

Furthermore, if X = Y then

V(X, X) = E((X − EX)²) = VX.

Expanding the product in the definition and using Result 4.1.8 yields

V(X, Y) = E(XY) − EX EY,   (4.17)

which gives an alternative formula for computing the covariance. Using this last formula, it follows from Result 4.1.8 that if X and Y are independent then

V(X, Y) = 0.
On the other hand it is important to know that the covariance being equal to zero
does not imply independence. We also obtain the generally valid formula

V(X + Y) = VX + VY + 2V(X, Y),

using (4.9) again together with (4.17) for the last equality.
This leads to the following result.
Result 4.4.2. If X and Y are two random variables with finite variance then the sum X + Y has finite variance, and

V(X + Y) = VX + VY   (4.20)

holds if and only if V(X, Y) = 0, which in particular is the case if X and Y are independent. Note also that, combined with the formula V(X + Y) = VX + VY + 2V(X, Y), the result yields

−1 ≤ corr(X, Y) ≤ 1.

The inequality is a classical mathematical result, but the derivation is, in fact, quite elementary given the tools we already have at our disposal, so we provide it here for completeness: since 0 ≤ V( X/√(VX) ± Y/√(VY) ) = 2 ± 2 corr(X, Y), the inequality follows.
If X_1, . . . , X_n are n real valued random variables with finite variance, their covariances are collected in the covariance matrix Σ with entries

Σij = V(X_i, X_j).

That is,

Σ = [ VX_1         V(X_1, X_2)   · · ·   V(X_1, X_n)
      V(X_2, X_1)  VX_2          · · ·   V(X_2, X_n)
        ...          ...           ...     ...
      V(X_n, X_1)  V(X_n, X_2)   · · ·   VX_n        ].
Note that due to the symmetry of the covariance we have that the covariance matrix
Σ is symmetric:
Σij = Σji .
As a direct consequence of Result 4.4.2 we have the following result about the vari-
ance of the sum of n real valued random variables.
Result 4.4.5. If X1 , . . . , Xn are n real valued random variables with finite variance
and covariance matrix Σ then
V( Σ_{i=1}^{n} X_i ) = Σ_{i=1}^{n} Σ_{j=1}^{n} Σij = Σ_{i=1}^{n} VX_i + 2 Σ_{i<j} V(X_i, X_j).
Example 4.4.6. Continuing Example 4.1.10 we want to compute the variance of the
counting variable N . This is a quite complicated task, and to this end it is useful to
introduce some auxiliary variables. Recall that X1 , . . . , Xn are iid random variables
with values in E = {A, C, G, T} and w = w1 w2 . . . wm is an m-letter word. Define for
i = 1, . . . , n − m + 1
Yi = 1(Xi Xi+1 . . . Xi+m−1 = w)
to be the Bernoulli random variable that indicates whether the word w occurs with
starting position i in the sequence of random variables. Then
n−m+1
X
N= Yi .
i=1
To compute the variance of N we compute first the covariance matrix for the vari-
ables Y1 , . . . , Yn−m+1 . If i + m ≤ j then Yi and Yj are independent because the
two vectors (Xi , Xi+1 , . . . , Xi+m−1 ) and (Xj , Xj+1 , . . . , Xj+m−1 ) are independent.
By symmetry
Σij = V(Yi , Yj ) = 0
if |i − j| ≥ m. If |i − j| < m the situation is more complicated, because then the
variables are actually dependent. We may observe that since the X-variables are iid,
then for all i ≤ j with fixed j − i = k < m the variables
(Xi , Xi+1 , . . . , Xi+m−1 , Xj , Xj+1 , . . . , Xj+m−1 )
have the same distribution. Thus if we define
ρ(k) = V(Y_1, Y_{1+k})
for k = 0, . . . , m − 1 then, using symmetry again,
Σij = V(Yi , Yj ) = ρ(|i − j|)
for |i − j| < m. Thus for 2m − 1 ≤ n the covariance matrix Σ is a banded matrix with Σij = ρ(|i − j|) for |i − j| < m and Σij = 0 otherwise:

Σ = [ ρ(0)     ρ(1)     · · ·   ρ(m−1)   0        · · ·   0       0
      ρ(1)     ρ(0)     · · ·   ρ(m−2)   ρ(m−1)   · · ·   0       0
       ...      ...       ...     ...      ...      ...    ...     ...
      ρ(m−1)   ρ(m−2)   · · ·   ρ(0)     ρ(1)     · · ·   0       0
      0        ρ(m−1)   · · ·   ρ(1)     ρ(0)     · · ·   0       0
       ...      ...       ...     ...      ...      ...    ...     ...
      0        0        · · ·   0        0        · · ·   ρ(0)    ρ(1)
      0        0        · · ·   0        0        · · ·   ρ(1)    ρ(0)  ],
which is non-zero in a diagonal band. Using Result 4.4.5 we see that there are n−m+1
terms ρ(0) (in the diagonal) and 2(n − m + 1 − k) terms ρ(k) for k = 1, . . . , m − 1,
and therefore
V(N) = (n − m + 1) ρ(0) + 2 Σ_{k=1}^{m−1} (n − m + 1 − k) ρ(k).
Here we say that w has a k-shift overlap if w_1 . . . w_{m−k} = w_{k+1} . . . w_m, which means that the m − k prefix of the word equals the m − k suffix of the word. In that case

E(Y_1 Y_{1+k}) = P(Y_1 = 1, Y_{1+k} = 1) = p_w p(w_{m−k+1}) · · · p(w_m),

and therefore

ρ(k) = p_w p(w_{m−k+1}) · · · p(w_m) − p_w²

if w has a k-shift overlap; if it does not, then Y_1 Y_{1+k} = 0 and ρ(k) = −p_w². ⋄
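A simulation-based check of the formulas for E N and V(N) is straightforward in R; the word, the alphabet probabilities, the sequence length and the number of replications below are placeholders:

set.seed(1)
p <- rep(1/4, 4); names(p) <- c("A", "C", "G", "T")
w <- c("T", "A", "T", "A")                  # a word with a 2-shift overlap
n <- 1000; m <- length(w); B <- 2000
countN <- function() {
  x <- sample(names(p), n, replace = TRUE, prob = p)
  sum(vapply(1:(n - m + 1),
             function(i) all(x[i:(i + m - 1)] == w), logical(1)))
}
N <- replicate(B, countN())
c(mean = mean(N), var = var(N))             # compare with (n - m + 1) * prod(p[w]) etc.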
The empirical covariance of the i'th and j'th coordinates based on n observations is

σ̃_{ij,n} = (1/n) Σ_{l=1}^{n} (x_{il} − µ̂_{i,n})(x_{jl} − µ̂_{j,n}) = (1/n) Σ_{l=1}^{n} x_{il} x_{jl} − µ̂_{i,n} µ̂_{j,n},

where

µ̂_{i,n} = (1/n) Σ_{l=1}^{n} x_{il}.
As for the variance this is not a recommended formula to use for the actual compu-
tation of the empirical covariance.
The empirical covariance matrix Σ̃_n is given by

Σ̃_{ij,n} = σ̃_{ij,n}.
Let A be any event, then since 1(Xi ∈ A) is a Bernoulli variable we can use Result
4.2.1 to find that
E1(Xi ∈ A) = P (A)
so

E εn(A) = E( (1/n) Σ_{i=1}^{n} 1(X_i ∈ A) ) = (1/n) Σ_{i=1}^{n} E 1(X_i ∈ A) = (1/n) Σ_{i=1}^{n} P(A) = P(A).
We have derived the following result about the empirical probability measure.
Result 4.5.1. With εn the empirical probability measure, and A any event it holds
that
Eεn (A) = P (A) (4.23)
and

V εn(A) = (1/n) P(A)(1 − P(A)).   (4.24)
As for all other probability measures the collection of numbers εn (A) for all events
A ⊆ E is enormous even for a small, finite set E. If E is finite we will therefore
prefer the smaller collection of frequencies
εn(z) = (1/n) Σ_{i=1}^{n} 1(x_i = z)
for z ∈ E – which is also sufficient for completely determining the empirical measure
just like for any other probability measure on a discrete set. If P is given by the
point probabilities (p(z))_{z∈E}, Result 4.5.1 tells us that

E εn(z) = p(z)   (4.25)

and

V εn(z) = (1/n) p(z)(1 − p(z)).   (4.26)
R Box 4.5.1 (Mean and variance). If x is a numeric vector one can compute the (empirical) mean of x simply by

> mean(x)

and the variance estimate with the n − 1 denominator (cf. the discussion below) by

> var(x)
and finally

E σ̃_n² = E( (1/n) Σ_{i=1}^{n} X_i² − µ̂_n² ) = (1/n) Σ_{i=1}^{n} E X_i² − E µ̂_n²
        = E X² − V µ̂_n − (E µ̂_n)² = E X² − (1/n) VX − (EX)² = ((n − 1)/n) VX.
Thus we have the following result.
Result 4.5.2. Considering the empirical mean µ̂n and the empirical variance σ̃n2 as
estimators of the mean and variance respectively we have
E µ̂_n = E X   and   V µ̂_n = (1/n) VX   (4.27)

together with

E σ̃_n² = ((n − 1)/n) VX.   (4.28)
The theorem shows that the expected value of µ̂n equals the true expectation EX
and that the variance of µ̂n decreases as 1/n. Thus for large n the variance of µ̂n
becomes negligible and µ̂n will always be a very close approximation to EX. How
large n should be depends on the size of VX. Regarding the empirical variance
its expectation does not equal the true variance VX. The expected value is always
smaller than VX. The relative deviation is
(VX − E σ̃_n²)/VX = 1/n,
which becomes negligible when n becomes large. However, for n = 5, say, the em-
pirical variance undershoots the true variance by 20% on average. For this reason
the empirical variance is not the preferred estimator of the variance. Instead the
standard estimator is
σ̂_n² = (n/(n − 1)) σ̃_n² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − µ̂_n)².   (4.29)
It follows from Result 4.5.2 and linearity of the expectation operator that
Eσ̂n2 = V(X).
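In R, var computes exactly the estimator with the n − 1 denominator, so the empirical variance can be recovered by rescaling; a quick check with arbitrary data:

x <- rnorm(10)
n <- length(x)
sigma2.tilde <- mean((x - mean(x))^2)         # empirical variance
var(x)                                        # equals n/(n - 1) * sigma2.tilde
all.equal(var(x), n / (n - 1) * sigma2.tilde)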
The square root σ̂_n = √(σ̂_n²) naturally becomes the corresponding estimator of the standard deviation. Note, however, that the expectation argument doesn't carry over to the standard deviations. In fact, it is possible to show that

E σ̂_n < √(V(X)),
The variance of the empirical variance can also be computed, but the resulting formula, which involves the fourth moment of the distribution, is not particularly nice either. One can observe, though, that this
variance decreases approximately as 1/n, which shows that also the empirical vari-
ance becomes a good approximation of the true variance when n becomes large. But
regardless of whether we can compute the variance of the empirical variance, we can
compare the variance of σ̃n2 with the variance of σ̂n2 and find that
V(σ̂_n²) = (n/(n − 1))² V(σ̃_n²).
Hence the variance of σ̂n2 is larger than the variance of the empirical variance σ̃n2 .
This is not necessarily problematic, but it should be noticed that what we gain by
correcting the empirical variance so that the expectation becomes right is (partly)
lost by the increased variance.
If we consider an n-dimensional random variable X = (X1 , . . . , Xn ) we can also
derive a result about the expectation of the empirical covariance.
Using (4.21) yields

E σ̃_{ij,n} = (1/n) Σ_{l=1}^{n} E X_{il} X_{jl} − E( µ̂_{i,n} µ̂_{j,n} )
           = E X_i X_j − (1/n²) Σ_{l=1}^{n} Σ_{m=1}^{n} E X_{il} X_{jm}.
For l ≠ m the variables X_{il} and X_{jm} are independent, so E X_{il} X_{jm} = E X_i E X_j; there are n(n − 1) such terms in the last sum above. There are n terms, those with l = m, equaling E X_i X_j. This gives that

E σ̃_{ij,n} = E X_i X_j − (1/n) E X_i X_j − ((n − 1)/n) E X_i E X_j
           = ((n − 1)/n) ( E X_i X_j − E X_i E X_j ) = ((n − 1)/n) V(X_i, X_j).
Thus we have the result.
Result 4.5.3. Considering the empirical covariance as an estimator of the covari-
ance, its expectation is
n−1
Eσ̃ij,n = V(Xi , Xj ).
n
As we can see the empirical covariance also generally undershoots the true covariance
leading to the alternative estimate
σ̂_{ij,n} = (1/(n − 1)) Σ_{l=1}^{n} (x_{il} − µ̂_{i,n})(x_{jl} − µ̂_{j,n}).   (4.31)
Exercises
4.6 Monte Carlo Integration

Expectations have a special and classical role to play in probability theory and statis-
tics. There are quite a few examples of distributions on R where we can analytically
compute the expectation. However, there are also many many situations where we
have no chance of computing the expectation analytically. Remember that one of
the analytic tools we have at our disposal is the formula from Definition 4.1.1 or
more generally from Result 4.1.3
E h(X) = ∫_{−∞}^{∞} h(x) f(x) dx
when X is a real valued random variable, whose distribution has density f and
h : R → R such that h(X) has finite expectation. Thus the success of an analytic
computation of an expectation depends heavily on our ability to compute integrals.
The process can be reversed. Instead of computing the integral analytically we can
rely on the empirical mean as an approximation of the expectation and thus of the
integral. As discussed in Section 4.5 the more (independent) observations we have
the more precise is the approximation, so if we can get our hands on a large number
of iid random variables X1 , . . . , Xn , whose distribution is given by the density f , we
can approximate the integral by the (random) empirical mean
(1/n) Σ_{i=1}^{n} h(X_i).
According to Result 4.5.2 the variance of this random quantity decays like 1/n for
n tending to ∞, and thus for sufficiently large n the empirical mean is essentially
not random anymore. The empirical mean becomes a numerical approximation of
the theoretical integral, and with modern computers and simulation techniques it is
often easy to generate the large number of random variables needed. This technique
is called Monte Carlo integration referring to the fact that we use randomness to do
numerical integration.
Example 4.6.1. We know from Example 2.6.15 that the density for the Gumbel
distribution is
We should note that the latter formula is more suitable than the former for numerical
computations. The mean value for the Gumbel distribution can then be written as
∫_{−∞}^{∞} x exp(−x − exp(−x)) dx.
There is no easy way to compute this integral, but it can be computed numerically in R using the integrate function. It can also be computed using Monte-Carlo integration by simulating a large number of Gumbel distributed variables and computing their average.
Figure 4.1: The average for n simulated Gumbel variables (left) as a function of n
and the average of − log y1 , . . . , − log yn for n simulated exponential variables (right)
as a function of n.
We recognize this integral as the expectation of − log Y where Y has the exponential
distribution. This integral is of course as difficult as the former to compute explicitly,
but we can use Monte-Carlo integration where we take f (x) = exp(−x) the density
for the exponential distribution and h(x) = − log x.
Most likely the two Monte-Carlo algorithms turn out to be identical when it comes
to implementations. ⋄
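A sketch of the three computations in R; the inversion formula −log(−log(U)) for simulating Gumbel variables is assumed here:

integrate(function(x) x * exp(-x - exp(-x)), -Inf, Inf)  # numerical integration
n <- 10000
u <- runif(n)
mean(-log(-log(u)))                  # Monte Carlo with simulated Gumbel variables
y <- rexp(n)
mean(-log(y))                        # Monte Carlo with exponential variables
# All three are close to 0.5772 (the Euler-Mascheroni constant).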
Elaborating a little on the idea, we may start with an integral ∫_{−∞}^{∞} g(x) dx that we
would like to compute. Initially we have no reference to a probability measure, but
we may take f to be any density, which we for technical reasons assume to be strictly
positive everywhere, that is f (x) > 0 for all x ∈ R. Then
∫_{−∞}^{∞} g(x) dx = ∫_{−∞}^{∞} (g(x)/f(x)) f(x) dx,
so if we define h(x) = g(x)/f (x), which is well defined as f (x) > 0 for all x ∈ R, we
find that

∫_{−∞}^{∞} g(x) dx = E h(X) = E( g(X)/f(X) ),
where the distribution of X has density f . Simulating X1 , . . . , Xn as independent
random variables, whose distribution has density f , we get the empirical approxi-
mation
(1/n) Σ_{i=1}^{n} g(X_i)/f(X_i) ≃ ∫_{−∞}^{∞} g(x) dx,
which is valid for large n. In this particular setup the Monte Carlo integration is also
known as importance sampling. This is basically because we are free here to choose
f , and a good choice is in general one such that f (x) is large when g(x) is large.
The x's where g(x) is large (and g is a reasonably nice function, not oscillating too rapidly) contribute the most to the integral, and taking f(x) large when g(x) is large means that we put a large weight f(x) on the important points x where we get the largest contribution to the integral. This is a heuristic, not a precise
mathematical result.
Example 4.6.2. Take g(x) = 1_{[a,∞)}(x) f_0(x) where f_0(x) = (1/√(2π)) exp(−x²/2) is the density for the normal distribution with mean 0 and variance 1 and a > 0. We think of a as quite large, and thus if Y is a random variable with distribution having density f_0 we see that

P(Y ≥ a) = ∫_a^∞ f_0(x) dx = ∫_{−∞}^{∞} 1_{[a,∞)}(x) f_0(x) dx = ∫_{−∞}^{∞} g(x) dx.
The naive Monte Carlo estimate of this probability is the relative frequency of simulated Y's that are ≥ a, and if a is large, we may very well risk that no or only a few of the Y's are ≥ a. This
is a central problem in computing small probabilities by Monte Carlo integration.
Even though the absolute error will be small almost by definition (a small probability
is close to zero, so even if we get an empirical mean being 0 with high probability
it is in absolute values close to the true probability) the relative error will be very
large. Using importance sampling, taking

f(x) = (1/√(2π)) exp(−(x − a)²/2) = (1/√(2π)) exp(−(x² + a² − 2xa)/2),
which is the density for the normal distribution with mean a and variance 1, we find that simulating X_1, . . . , X_n from f and averaging the weights g(X_i)/f(X_i) = 1(X_i ≥ a) exp(a²/2 − aX_i) gives a much more accurate estimate of the small probability.
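A sketch of the two estimators in R, with a = 4 as an arbitrary choice:

a <- 4
n <- 10000
y <- rnorm(n)
naive <- mean(y >= a)                             # naive Monte Carlo; mostly zeros
x <- rnorm(n, mean = a)                           # simulate from the proposal N(a, 1)
is.est <- mean((x >= a) * exp(a^2 / 2 - a * x))   # importance sampling estimate
c(naive = naive, importance = is.est, exact = pnorm(a, lower.tail = FALSE))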
The previous example, though representing a problem of real interest, does not do
justice to the general applicability of Monte Carlo integration and the impact it
has had due to the rapid development of computer technology. One-dimensional
numerical integration of a known f does not in general pose a real challenge. The
real challenge is to do high-dimensional numerical integration. Using the full version
of Result 4.1.3 we have that

E h(X) = ∫ h(x) f(x) dx
to Bayesian principles. Prior to MCMC practical Bayesian data analysis was often
obstructed by the difficulties of computing the posterior distribution.
Even with a strictly frequentistic interpretation of probabilities the Bayesian method-
ology for high-dimensional parameter estimation has turned out to be useful. In
Chapter 3 the primary approach to estimation was through the minimization of
the minus-log-likelihood function. The density for the posterior distribution is in
effect the likelihood function multiplied by a penalization factor, and the resulting
minus-log becomes a penalized minus-log-likelihood function.
A Bayesian estimator for the parameter suggests itself. Instead of minimizing the
penalized minus-log-likelihood function we can compute the expectation of the poste-
rior as an estimator of the parameter. In principle we get rid of the difficult practical
problem of minimizing a high-dimensional function and replace it with a much sim-
pler problem of computing an average of some simulated random variables, but there
is a caveat. Problems with local minima of the penalized minus-log-likelihood func-
tion can lead to poor simulations and just as for the minimization problem one has
to pay careful attention to whether the simulation algorithm in reality got caught
in an area of the parameter space that is located around a local minimum.
Exercises
Exercise 4.6.1. Compute, using Monte-Carlo integration, the mean of the t-distribution with λ degrees of freedom for λ = 2, 4, 10, 100. Plot the average computed as a function of n. Try also to compute the mean by Monte-Carlo integration for λ = 1/2, 1 – what happens?
Exercise 4.6.2. Consider the following integral

∫_{R^100} (1/(2π)^{50}) exp( −(1/2) Σ_{i=1}^{100} x_i² − ρ Σ_{i=1}^{99} x_i x_{i+1} ) dx.
Compute it using Monte Carlo integration with ρ = 0.1. Provide an estimate for the
variance of the result. Can you do it for ρ = 0.2 also? What about ρ = 0.6?
Hint: You should recognize that this can be seen as an expectation of a function of
d = 100 iid N (0, 1)-distributed random variables. It may also be useful to plot the
running mean as a function of the number of simulations n.
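A possible starting point, following the hint, is sketched below; the standard error is estimated from the empirical standard deviation of the simulated terms:

rho <- 0.1
B <- 10000
h <- replicate(B, {
  x <- rnorm(100)
  exp(-rho * sum(x[-100] * x[-1]))   # the integrand as a function of 100 iid N(0,1)'s
})
c(estimate = mean(h), std.error = sd(h) / sqrt(B))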
4.7 Asymptotic Theory

The law of large numbers tells us that if µ̂_n is the empirical mean of n iid random variables with finite expectation µ, then

P(|µ̂_n − µ| > ε) → 0

for n → ∞ for all ε > 0. This defines a notion of convergence, which we call convergence in probability. We write

µ̂_n →^P µ
if P(|µ̂n − µ| > ε) → 0 for n → ∞ for all ε > 0. The much stronger result that we
will discuss in this section also gives us asymptotically the distribution of µ̂n and is
known as the central limit theorem or CLT for short.
Result 4.7.1 (CLT). If X1 , . . . , Xn are n iid real valued random variables with finite
variance σ 2 and expectation µ then for all x ∈ R
P( √n (µ̂_n − µ)/σ ≤ x ) → Φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−y²/2) dy

for n → ∞. We write

µ̂_n ∼as N(µ, (1/n) σ²)
and say that µ̂n asymptotically follows a normal distribution.
First note that what we are considering here is the distribution function of the normalization of the random variable µ̂_n. We say that √n (µ̂_n − µ)/σ converges in distribution to the standard normal distribution.
is the relative frequency of A-occurrences – the empirical mean of the iid Bernoulli
variables 1(X1 = A), . . . , 1(Xn = A). Since
Monte-Carlo integration is one situation where the use of the central limit theorem
is almost certainly justified because we want to run so many simulations that the
empirical average is very close to the theoretical average. This will typically ensure
that n is so large that the distribution of the average is extremely well approximated
by the normal distribution. It means that if X1 , . . . , Xn are iid having density f and
we consider the empirical average
µ̂_n = (1/n) Σ_{i=1}^{n} h(X_i) ≃ µ = ∫ h(x) f(x) dx
then µ̂_n ∼as N(µ, (1/n) σ²), where

σ² = V h(X_1) = ∫ (h(x) − µ)² f(x) dx.
In practice σ² is estimated by the empirical variance of h(X_1), . . . , h(X_n), and we plug this estimate into the standard 95%-confidence interval µ̂_n ± 1.96 σ̂/√n.
Example 4.7.3. If we return to the Monte-Carlo integration of the mean for the
Gumbel distribution as considered in Example 4.6.1 we find that the estimated mean
and standard deviation is
Figure 4.2: The average for n simulated Gumbel variables as a function of n including
a 95%-confidence band based on the asymptotic normal distribution of the average
for large n.
Here we base the estimates on 10000 simulations. Figure 4.2 shows the same plot
of the average as a function of n as Figure 4.1 but this time with 95%-confidence
bands. The resulting confidence interval is [0.537, 0.587]. ⋄
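A sketch of how such an interval can be computed in R, again using the inversion formula for Gumbel simulation:

set.seed(1)
n <- 10000
x <- -log(-log(runif(n)))            # simulated Gumbel variables
muhat <- mean(x)
se <- sd(x) / sqrt(n)                # estimated standard error of the average
c(estimate = muhat, lower = muhat - 1.96 * se, upper = muhat + 1.96 * se)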
Example 4.7.4. As in the Example above, but this time considering an m-length
word w as in Example 4.1.10, the relative frequency of w is
p̂_w = N/n = (1/n) Σ_{i=1}^{n−m+1} 1(X_i . . . X_{i+m−1} = w).
Result 4.7.5 (∆-method). If Z_n ∼as N(µ, (1/n) σ²) and h : R → R is differentiable in µ then

h(Z_n) ∼as N( h(µ), (1/n) h′(µ)² σ² ).
The ∆-method provides first of all a means for computing approximately the variance
of a non-linear transformation of a real valued random variable – provided that
the transformation is differentiable and that the random variable is asymptotically
normally distributed. If Zn is a real valued random variable with density fn , if
h : R → R, if h(Z) has finite variance, and if µh = Eh(Zn ) we know that
V(h(Z_n)) = ∫_{−∞}^{∞} (h(x) − µ_h)² f_n(x) dx.
If Zn also has finite variance and VZn = σ 2 /n, this variance is small for large n and
Zn does not fluctuate very much around its mean µ = EZn . Therefore we can make
the approximation µh ≃ h(µ). Moreover, if h is differentiable we can make a first
order Taylor expansion of h around µ to get
h(x) ≃ h(µ) + h′ (µ)(x − µ),
from which we find
V h(Z_n) = ∫_{−∞}^{∞} (h(x) − µ_h)² f_n(x) dx ≃ ∫_{−∞}^{∞} (h(µ) + h′(µ)(x − µ) − h(µ))² f_n(x) dx
         = h′(µ)² ∫_{−∞}^{∞} (x − µ)² f_n(x) dx = h′(µ)² V Z_n = (1/n) h′(µ)² σ².
Math Box 4.7.1. If X_1, . . . , X_n are iid real valued random variables with finite fourth order moment, and if the X-variables have mean µ and variance σ², then

(µ̂_n, σ̂_n) ∼as N( (µ, σ), (1/n) Σ ),

so the bivariate random variable consisting of the empirical mean and the empirical standard deviation converges in distribution to a bivariate normal distribution. The asymptotic covariance matrix is

Σ = σ² [ 1        µ_3/2
         µ_3/2    (µ_4 − 1)/4 ],

where µ_3 and µ_4 are the third and fourth moments of the normalized random variable

Y = (X_1 − µ)/σ.

Thus Y has mean 0 and variance 1 and

µ_3 = E Y³   and   µ_4 = E Y⁴.

Note that the values of µ_3 and µ_4 do not depend upon µ or σ². If the distribution of the X-variables is N(µ, σ²) it holds that µ_3 = 0 and µ_4 = 3.
Computing the true variance analytically is in most cases impossible, but differenti-
ating h and computing the approximation is most likely doable. But the ∆-method
actually provides much more information than just a variance approximation. It
also tells that the differentiable transformation preserves the asymptotic normality,
so that you actually have a way to approximate the entire distribution of h(Zn ) if
Zn asymptotically has a normal distribution.
Example 4.7.6. Continuing Examples 4.7.2 and 4.7.4 we consider the function h : (0, 1) → (0, ∞) given by

h(x) = x/(1 − x);

then for any event B we see that h(P(B)) = ξ(B) is the odds for that event.
and if Z_n ∼as N(ξ, (1/n) Σ) then

h(Z_n) ∼as N( h(ξ), (1/n) Dh(ξ)^T Σ Dh(ξ) ).
We see that h(µ, σ) is the coefficient of variation, and using the results in Math
Box 4.7.1 and the multivariate ∆-method above we find that
ĈV_n = σ̂_n/µ̂_n ∼as N( σ/µ, (1/n)( σ⁴/µ⁴ − σ³µ_3/µ³ + σ²(µ_4 − 1)/(4µ²) ) )
     = N( CV, (1/n)( CV⁴ − CV³ µ_3 + CV² (µ_4 − 1)/4 ) )
since Dh(µ, σ) = (−σ/µ², 1/µ)^T and therefore

Dh(µ, σ)^T Σ Dh(µ, σ) = σ² ( σ²/µ⁴ − σ µ_3/µ³ + (µ_4 − 1)/(4µ²) )
                      = σ⁴/µ⁴ − σ³µ_3/µ³ + σ²(µ_4 − 1)/(4µ²).
We find that

h′(x) = ((1 − x) + x)/(1 − x)² = 1/(1 − x)².
If we consider the iid Bernoulli random variables 1(X1 = A), . . . , 1(Xn = A) with
p(A) ∈ (0, 1) we find, using Example 4.7.2 and the ∆-method in Result 4.7.5, that
the empirical odds for the nucleotide A, ξ̂(A) = p̂(A)/(1 − p̂(A)), is asymptotically
normal,
ξ̂(A) = p̂(A)/(1 − p̂(A)) ∼as N( p(A)/(1 − p(A)), (1/n) p(A)/(1 − p(A))³ ) = N( ξ(A), (1/n) ξ(A)(1 + ξ(A))² ).
If w is a length m word that is not self overlapping and if we consider the (dependent)
Bernoulli random variables 1(X_1 . . . X_m = w), . . . , 1(X_{n−m+1} . . . X_n = w) we find
instead, using Example 4.7.4 and the ∆-method, that the empirical odds for the
word w, ξ̂(w) = p̂w /(1 − p̂w ), is asymptotically normal,
ξ̂(w) = p̂_w/(1 − p̂_w) ∼as N( p_w/(1 − p_w), (1/n) p_w(1 − (2m − 1)p_w)/(1 − p_w)⁴ ).
⋄
Exercises
Exercise 4.7.1. Consider the Monte-Carlo integration from Exercise 4.6.1 for the
computation of the mean in the t-distribution. Continue this exercise by computing
estimates for the variance and plot the estimates as a function of n. Try the range
of λ’s 12 , 1, 2, 10. What happens and how can you interpret the result? Choose n and
compute 95%-confidence intervals for the value of the mean for different choices of λ.
It is the exception, not the rule, that we analytically can find the distribution of an
estimator. On the contrary, it is often the case that estimators, like most maximum
likelihood estimators, do not even have an explicit analytic expression but are given
as solutions to equations or maximizers of a function. An alternative to knowing the
actual distribution of an estimator is to know a good and useful approximation.
For an estimator, θ̂, of a one-dimensional real parameter θ, we may be able to
compute the expectation, ξ(θ) = Eθ θ̂, and the variance, σ 2 (θ) = Vθ θ̂. A possi-
ble approximation of the distribution, which turns out to be quite well founded, is
N (ξ(θ), σ 2 (θ)) – the normal distribution with mean ξ(θ) and variance σ 2 (θ). We
present this as the following pseudo definition.
for x ∈ R. We write

θ̂ ∼approx N(ξ(θ), σ²(θ))
So from the fact that the integral of the density is 1 we obtain the identity that the expectation of the derivative of the minus-log-likelihood function is 0. A second differentiation yields
0 = ∫ ( (d² log f_θ(x)/dθ²) f_θ(x) + (d log f_θ(x)/dθ)² f_θ(x) ) dx
  = E_θ( d² log f_θ(X)/dθ² ) + V_θ( dl_X(θ)/dθ ).
Thus

i(θ) = −E_θ( d² log f_θ(X)/dθ² ) = V_θ( dl_X(θ)/dθ ).
These computations show that the parameters ξ(θ) = θ and σ²(θ) = i(θ)⁻¹ are correct
for the approximating normal distribution above. They also show that i(θ) ≥ 0,
since the number equals a variance. We show below for a specialized model setup
with iid observations that one can obtain the two approximations above from the
law of large numbers and the central limit theorem respectively.
In a final step we argue that since

θ̂ ≃ θ − i(θ)⁻¹ Z,

with the right hand side a scale-location transformation of the random variable Z ∼ N(0, i(θ)), then θ̂ ∼approx N(θ, i(θ)⁻¹).
Since the inverse of i(θ) appears as a variance in the approximating normal distribu-
tion, we can see that i(θ) quantifies how precisely θ is estimated using θ̂. The larger
i(θ) is the more precise is the estimator. This justifies the following definition.
Definition 4.7.10. The quantity i(θ) is called the Fisher information, the expected
information or just the information – after the English statistician R. A. Fisher.
The quantity

d²l_X(θ)/dθ²

is called the observed information.
The computations in Remark 4.7.9 suffered from two deficiencies. One was the abil-
ity to formalize a number of analytic approximations, which in fact boils down to
control of the error in the Taylor expansion used and thus detailed knowledge of the
“niceness” of the likelihood function. The other claim was that
d²l_X(θ)/dθ² ≃ i(θ)   and   dl_X(θ)/dθ ∼approx N(0, i(θ)).
This claim is not completely innocent, and does not always hold. But it will hold in
situations where we have sufficient replications – either in an iid setup or a regres-
sion setup. To show why, lets consider the situation where X = (X1 , . . . , Xn ) and
X1 , . . . , Xn are iid such that the minus-log-likelihood function is
l_X(θ) = −Σ_{i=1}^{n} log f_θ(X_i)
Likewise the first derivative of the minus-log-likelihood is a sum of iid random variables −d log f_θ(X_1)/dθ, . . . , −d log f_θ(X_n)/dθ, whose mean is 0 and whose variance is i_0(θ), and the central limit theorem, Result 4.7.1, gives that

(1/n) dl_X(θ)/dθ = −(1/n) Σ_{i=1}^{n} d log f_θ(X_i)/dθ ∼as N( 0, (1/n) i_0(θ) ).
(1/n) d²l_X(θ)/dθ² →^P i_0(θ)

for n → ∞ and

(1/n) dl_X(θ)/dθ ∼as N( 0, (1/n) i_0(θ) ).
Remark 4.7.12. If θ̂n denotes the MLE in the previous theorem the conclusion is
that

θ̂_n ∼as N( θ, 1/(n i_0(θ)) ).
To use this distribution in practice we can use the plug-in estimate i_0(θ̂_n) as a substitute for the unknown i_0(θ). This requires that we know a formula for i_0(θ). As an alternative one can take the observed information evaluated in the MLE,

(1/n) d²l_X(θ̂_n)/dθ²,

as an approximation to i_0(θ).
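A sketch of the plug-in idea in R for n iid Poisson observations, where the MLE is the average and i_0(θ) = 1/θ; optimHess is used here to compute the observed information numerically:

set.seed(1)
x <- rpois(100, lambda = 3)
n <- length(x)
mll <- function(theta) -sum(dpois(x, theta, log = TRUE))  # minus-log-likelihood
thetahat <- mean(x)                                       # MLE in the Poisson model
obs.info <- optimHess(thetahat, mll)                      # observed information
c(MLE = thetahat,
  se.observed = sqrt(1 / obs.info),
  se.expected = sqrt(thetahat / n))                       # 1/sqrt(n * i_0(thetahat))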
Clearly considering a one-dimensional parameter setup only is not sufficient for most
practical applications. The multivariate version of the Fisher information is a matrix
and the relevant approximations are described in Math Box 4.7.3.
4.8 Entropy
The entropy of a probability measure P on a discrete set E with point probabilities (p(x))_{x∈E} is defined as

H(P) = −Σ_{x∈E} p(x) log p(x).

We use the convention that 0 log 0 = 0 in the definition. One may also note from the definition that log p(x) ≤ 0, hence H(P) ≥ 0, but the sum can be divergent if E is infinite in which case H(P) = ∞. We may note that −log p(X) is a positive random variable and that

H(P) = E(−log p(X))

if X is a random variable with distribution P.
Figure 4.3: The entropy of a Bernoulli variable as a function of the success probability
p.
For a Bernoulli variable X with success probability p the entropy is H(X) = −p log p − (1 − p) log(1 − p), and differentiation gives

dH(X)/dp = −log p − 1 + log(1 − p) + 1 = log(1 − p) − log p = log( (1 − p)/p ).
The derivative is > 0 for p ∈ (0, 1/2), = 0 for p = 1/2, and < 0 for p ∈ (1/2, 1).
Thus H(X) is, as a function of p, monotonely increasing in the interval from 0 to
1/2 where it reaches its maximum. From 1/2 to 1 it decreases monotonely again.
This fits very well with the interpretation of H(X) as a measure of uncertainty. If p
is close to 0, X will quite certainly take the value 0, and likewise if p is close to 1,
X will quite certainly take the value 1. If p = 1/2 we have the greatest trouble to
tell what value X will take, as the two values are equally probable. ⋄
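The entropy curve of Figure 4.3 is easy to reproduce in R (natural logarithm assumed); the small helper below uses the convention 0 log 0 = 0:

entropy <- function(p) {
  q <- c(p, 1 - p)
  -sum(ifelse(q > 0, q * log(q), 0))
}
p <- seq(0, 1, by = 0.01)
plot(p, sapply(p, entropy), type = "l", xlab = "p", ylab = "Entropy")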
For y ∈ E_2 we let

H(X | Y = y) = −Σ_{x∈E_1} p(x|y) log p(x|y),

with (p(x|y))_{x∈E_1} being the point probabilities for the conditional distribution of X given Y = y, and we define the conditional entropy of X given Y to be

H(X | Y) = Σ_{y∈E_2} p(y) H(X | Y = y),

where (p(y))_{y∈E_2} are the point probabilities for the marginal distribution of Y.
The conditional entropy tells us about the average uncertainty about X given that
we know the value of Y . The gain in information about X, that is, the loss in
uncertainty as measured by entropy, that we get by observing Y tells us something
about how strong the dependency is between the variables X and Y . This leads to
the following definition.
Definition 4.8.5 (Mutual information). Let X and Y be two random variables
taking values in E1 and E2 respectively. The mutual information is defined as
I(X, Y ) = H(X) − H(X|Y ).
A value of I(X, Y ) close to 0 tells us that the variables are close to being independent
– not much knowledge about X is gained by observing Y . A value of I(X, Y ) close to
H(X) tells that the variables are strongly dependent. We therefore regard I(X, Y )
as a quantification of the dependence between X and Y .
Another way to view mutual information is through the relative entropy measure
between two probability measures.
If P and Q are two probability measures on the discrete set E with point probabilities (p(x))_{x∈E} and (q(x))_{x∈E} respectively, the relative entropy from P to Q is defined as

D(P | Q) = Σ_{x∈E} p(x) log( p(x)/q(x) ).
If q(x) = 0 for some x ∈ E with p(x) > 0 then D(P | Q) = ∞. If X and Y are two
random variables with distribution P and Q respectively we define
D(X | Y ) = D(P | Q)
as the relative entropy from X to Y .
or alternatively that

H(X) + D(X | Y) = −Σ_{x∈E} p(x) log q(x).   (4.32)
Result 4.8.7. If X and Y are two random variables with joint distribution R, if X has (marginal) distribution P and Y has (marginal) distribution Q, then

I(X, Y) = D(R | P ⊗ Q).
The program R is “GNU S”, a freely available environment for statistical computing
and graphics that runs on a variety of platforms including Mac OS X, Linux and
Windows. It is an implementation of the S language developed by John Chambers
and colleagues at the Bell Laboratories.
R consists of the base program with a number of standard packages and a large
and ever growing set of additional packages. The base program offers a Command
Line Interface (CLI), where you can interactively type in commands (R expressions).
One will (or should) quickly learn that the proper use of R is as a high level pro-
gramming language for writing R-scripts, R-functions and R-programs. One may
eventually want to extend the functionality of R by writing an entire R package
and/or implement various time-consuming algorithms in a lower level language like
C with an R-interface. Indeed, many of the base functions are implemented in C or
Fortran. Such advanced use of R is far beyond the scope of this appendix.
This appendix deals with a few fundamental questions that inevitably arise early
on when one wants to use R. Questions like how to obtain R, how to run R, what
is R all about, how to handle graphics, how to load packages and data, how to run
scripts, and similar problems. We also give directions for locating more information
on using and running R. The appendix can not stand alone, and you will for instance
need the manual An Introduction to R – see Section A.2.
The R program and all additional packages are available for download at the Compre-
hensive R Archive Network (CRAN) at http://cran.r-project.org/. The Dan-
ish mirror is http://mirrors.dotsrc.org/cran/. You can download binaries for
Linux, Windows, and Mac OS X, or you can download the source code and compile
it yourself if you want. R is available as Free Software under the terms of the Free
Software Foundation’s GNU General Public License.
You can also download the packages from CRAN, but once you have installed R, it
is quite easy to download and install packages from within R.
When R is properly installed¹ on your computer, you can run it. How you do that
and the appearance of R differ a little from Linux to Windows. On a Linux machine,
you simply type R in a terminal, the program starts and you are now running R.
Whatever you type in next is interpreted by R. Graphics will appear in separate
windows. If you start R in Windows (by locating it from e.g. the Start Menu), R
runs in a new window called the RGui, with a window inside called the R console
containing a command prompt. Don’t be fooled – the fact that it says RGui doesn’t
mean that the Windows version provides a graphical user interface for using R, just
that a few things like running R-scripts, loading and installing packages, and saving
graphics can be done from the menu. It is recommended that you learn how to do
that from the command line anyway.
R runs in a working directory. All interaction with the file system like reading or
writing data files, running scripts or saving graphics will take place relative to the
working directory unless you specify a complete alternative path instead of just a file-
name. You get the current working directory by the command getwd(). You change
the working directory to be path by setwd("path"). In the RGui on Windows, this
can be done from the File menu as well.
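For instance (the path below is of course just a hypothetical example):

> getwd()                           # print the current working directory
> setwd("~/statistics/project")     # change the working directory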
You quit R by typing quit() or simply q(). On a Windows machine you can also
close the program in the File menu (Exit). When you quit, you are always asked
if you want to save the workspace. Doing so, all objects and the command history
are stored in the working directory. When starting R it automatically loads the
workspace most recently stored in the default working directory.
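The workspace and the history can also be saved explicitly from the command line; a small sketch:

> save.image()       # save the workspace to the file .RData in the working directory
> savehistory()      # save the command history to the file .Rhistory
> q(save = "no")     # quit without being asked about the workspace again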
In general, we refer to the manuals that come with R for detailed information on
R and how to use R. The manuals in PDF-format are located in the subdirec-
tory doc/manual. For the Windows version you can also find them through the
Help menu, and you can always find the most recent version on the R homepage:
http://www.r-project.org/. The most important manual for ordinary use of R is
An Introduction to R.
You can also access the manuals in HTML-format by issuing the help.start()
command in R. The HTML-page that opens contains links to the manuals, links
to help pages grouped according to the package they belong to, as well as a search
engine for searching the help pages. Links to some additional material like FAQs are
also given. When running R on Windows you can find the HTML-help page in the
Help menu together with direct links to FAQs, the search engine, etc.
You can access the help pages from within R by the command help. For instance,
help(plot) will give you the help page for the plot function. A shortcut is ?plot.
You can also search for help pages containing the word “plot” by help.search("plot").
Note the quotation marks when using help.search. For a few functions like the
binary operator plus (+), you will need quotation marks when you call the help
function, i.e. help("+").
A.3 The R language, functions and scripts

You may find this section a little difficult if you are not that familiar with computer programming and programming languages. However, it may explain a little about why things are the way they are.
In principle, we can use the functions provided by R and the packages to perform
the computations we want or to produce the graphics we want by typing in function
calls one after the other. If we want, we can even define an entirely new function and
then use it – doing everything from the command line interface. Defining a function,
sq, that squares its argument can be done as follows
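> sq <- function(x) {     # a minimal definition of the function sq
+   x^2                   # the value of the last expression is returned
+ }
> sq(4)
[1] 16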
One will, however, very quickly find it to be tedious, boring and inconvenient to
type in one function call after the other and in particular to define functions using
the command line editor – even though the command line editor keeps a history of
your commands.
To avoid working with the command line interface you can use R together with any
text editor for writing a file containing R function calls, which can then be loaded
into R for sequential evaluation. You use source("foo.r") to load the file foo.r.
Note that you may need to include either the full path or the relative path (relative
to the working directory, cf. Section A.1) if foo.r is not in the working directory.
The usage of source ranges from writing simple scripts that basically collect a
number of function calls over implementations of new functions to entire R programs
that perform a number of different tasks. It is good practice to get used to working
with R this way, i.e. to write R-scripts – then you can always experiment by copy-pasting into the command line, if you are not sure that the whole script will run, or if you only need to run some parts of the script.
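As a small, hypothetical example, the file foo.r mentioned above could contain the lines

x <- rnorm(100)                             # 100 simulated standard normal observations
cat("mean:", mean(x), "sd:", sd(x), "\n")   # print two summary statistics

which are then evaluated, in order, by typing

> source("foo.r")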
You can in principle use any text editor. There is a quite extensive environment called
ESS (Emacs Speaks Statistics) for the family of Emacs editors. There is more infor-
mation on the homepage http://www.sciviews.org/_rgui/projects/Editors.html,
which gives links to a number of editors that support R script editing with features
such as syntax highlighting and indentation. The RGui for Windows provides in the
File menu its own simple editor for editing scripts, and you can also execute the
source command from the File menu.
A.4 Graphics

Table A.1 illustrates some uses of R for making graphics. Note that the symbol # produces comments in R.
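A few of the most basic graphics functions can be illustrated directly at the command line. The calls below are a minimal sketch using simulated data (the variable name and the file name figure.pdf are arbitrary):

> x <- rnorm(100)            # 100 simulated standard normal observations
> hist(x)                    # histogram of x in a graphics window
> plot(x, type = "l")        # the observations plotted as a line
> pdf("figure.pdf")          # open a pdf device ...
> plot(x)
> dev.off()                  # ... and close it again, writing the file figure.pdf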
A.5 Packages
> install.packages("ggplot2")
installs the package together with all packages that ggplot2 depends upon.
> library("ggplot2")
loads the package. Note that you need an Internet connection to install the packages.
One can also load a package by the command require. A call to require returns a logical value, which is TRUE if the package is available. This is useful in e.g. scripts for checking that needed packages have actually been loaded.
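In a script this check could, as a small sketch, look as follows:

> if (!require("ggplot2")) {
+   stop("The package ggplot2 is not installed")
+ }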
A.5.1 Bioconductor

Packages from the Bioconductor project, a large collection of R packages for the analysis of biological data, are installed by running
> source("http://www.bioconductor.org/biocLite.R")
> biocLite()
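which installs the core Bioconductor packages. Individual Bioconductor packages can then be installed by passing their names to biocLite; the package name below is only an example:

> biocLite("limma")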
A.6 Literature
How you are actually going to get R to do anything interesting is a longer story. The
present lecture notes contain information embedded in the text via R boxes that describe functions that are useful in the context in which they are presented. These boxes cannot entirely stand alone, but must be regarded as directions for further study. An
indispensable reference is the manual An Introduction to R as mentioned above and
the online help pages, whether you prefer the HTML-interface or the help function.
The homepage http://www.r-project.org/ contains a list of, at the time of writ-
ing, 94 books related to R. This author is particularly familiar with three of the
books. An introduction to statistics in R is given in
Peter Dalgaard. Introductory Statistics with R.
Springer, 2002. ISBN 0-387-95475-9,
which treats statistics more thoroughly than the manual. This book combined with
the manual An Introduction to R provides a great starting point for using R to do
statistics. When it comes to using R and S-Plus for more advanced statistical tasks
the bible is
William N. Venables and Brian D. Ripley. Modern Applied Statistics with S.
Fourth Edition. Springer, 2002. ISBN 0-387-95457-0.
A more in-depth book on the fundamentals of the S language is
William N. Venables and Brian D. Ripley. S Programming.
Springer, 2000. ISBN 0-387-98966-8.
There is also a book on using R and Bioconductor for Bioinformatics:
Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry and Sandrine Dudoit (Eds.).
Bioinformatics and Computational Biology Solutions Using R and Bioconductor.
Springer, 2005. ISBN 0-387-25146-4.
A.7 Other resources

The R user community is growing at an increasing rate. The language R has for some time been far more than an academic language for statistical experiments. Today, R is just as much a workhorse in practical data analysis and statistics – in business and in science. The expanding user community is also what drives R forward: users contribute packages at many different levels, and there is a growing number of R-related blogs and web-sites. When you want to find an answer to your question, Google is often your friend. In many cases questions have been asked on one of the R mailing lists or treated somewhere else, and you will find it if you search. Being new to R it may be difficult to know what to search for. Two recommended places to look for information are the R wiki
http://rwiki.sciviews.org/doku.php
and the list of contributed documents
http://cran.r-project.org/other-docs.html
The latter is a mixed collection of documents on specialized R topics and some beginners' guides that may be friendlier than the manual.
B
Mathematics
B.1 Sets
If A is a subset of a set E, the complement of A within E is

A^c = {x ∈ E | x ∉ A},

the set of elements in E that do not belong to A. The union and the intersection of two subsets A and B of E are

A ∪ B = {x ∈ E | x ∈ A or x ∈ B}

and

A ∩ B = {x ∈ E | x ∈ A and x ∈ B}.

We also define

A\B = A ∩ B^c,

which is the set of elements in A that do not belong to B.
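Finite sets can be represented in R as vectors, and the set operations above are then available as union, intersect and setdiff. A small illustration with two made-up sets:

> A <- c(1, 2, 3, 4)
> B <- c(3, 4, 5)
> union(A, B)        # the union: 1 2 3 4 5
> intersect(A, B)    # the intersection: 3 4
> setdiff(A, B)      # A\B: 1 2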
The integers Z, the non-negative integers N0 , the positive integers N (also called the
natural numbers), the rational numbers Q, and the real numbers R are all examples
of sets of numbers. We have the following chain of inclusions
N ⊆ N0 ⊆ Z ⊆ Q ⊆ R.
There is also the even larger set of complex numbers C. We find for instance that
N0 \N = {0},
and that Z\N0 is the set of negative integers. Note that this is the complement of N0
within Z. The complement of N0 within R is a larger set. The set R\Q (which is the
complement of the rational numbers within R) is often called the set of irrational
numbers.
B.2 Combinatorics
In the derivation of the point probabilities for the binomial distribution, Example 3.2.1, we encountered the combinatorial quantity \binom{n}{k}. This number is the number of ways we can pick k out of n elements disregarding the order, since this corresponds to the number of ways we can pick out k x_i's to be equal to 1 and the remaining n − k x_i's to equal 0 such that the sum x_1 + . . . + x_n = k. If we take the order into account there are n possibilities for picking out the first element, n − 1 for the second, n − 2 for the third and so on, hence there are n(n − 1)(n − 2) · · · (n − k + 1) ways of picking out k elements. We use the notation

n^(k) = n(n − 1)(n − 2) · · · (n − k + 1).

With k = n this argument reveals that there are k^(k) = k! orderings of a set of k elements. In particular, if we pick k elements in order there are k! ways of reordering the set, hence

\binom{n}{k} = n^(k)/k! = n!/(k!(n − k)!).
The numbers \binom{n}{k} are known as binomial coefficients. They are often encountered in combinatorial problems. One useful formula that relies on binomial coefficients is the following: For x, y ∈ R and n ∈ N

(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k},     (B.1)

which, with x = p and y = 1 − p, shows that the point probabilities for the binomial distribution indeed sum to one (as they necessarily must).
A simple continuation of the argument also gives a formula for the multinomial
coefficients

\binom{n}{k_1 · · · k_m}
encountered in the multinomial distribution in Example 3.2.2. As we argued above
there are n! orderings of the n elements. If we assign labels from the set {1, . . . , m}
by choosing one of the n! orderings and then assign a 1 to the k1 first elements, a
2 to the following k2 elements and so on and so forth we get n! ways of assigning
labels to the elements. However, for any ordering we can reorder within each group
and get the same labels. For a given ordering there are k1 !k2 ! · · · km ! other orderings
that result in the same labels. Hence
\binom{n}{k_1 · · · k_m} = n!/(k_1! k_2! · · · k_m!).
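Binomial and multinomial coefficients are easy to compute numerically in R with choose and factorial; the numbers below are arbitrary examples:

> choose(10, 3)                                                  # the binomial coefficient
[1] 120
> factorial(10) / (factorial(3) * factorial(3) * factorial(4))   # a multinomial coefficient
[1] 4200
> sum(dbinom(0:10, size = 10, prob = 0.3))                       # binomial point probabilities sum to one
[1] 1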
B.3 Limits and infinite sums

A sequence of real numbers, x_1, x_2, x_3, . . ., often written as (x_n)_{n∈N}, can have a limit, which is a value that x_n is close to for n large. We say that x_n converges to x if we, for all ε > 0, can find N ≥ 1 such that

|x_n − x| ≤ ε

for all n ≥ N. In that case we write

x_n → x for n → ∞,   or   lim_{n→∞} x_n = x.
A sequence is called increasing if

x_1 ≤ x_2 ≤ x_3 ≤ · · · .
An increasing sequence is either upper bounded, in which case there is a least upper
bound, and the sequence will approach this least upper bound, or the sequence is
unbounded, in which case the sequence grows towards +∞. An increasing sequence
is therefore always convergent if we allow the limit to be +∞. Likewise, a sequence
is decreasing if
x_1 ≥ x_2 ≥ x_3 ≥ · · · ,
and a decreasing sequence is always convergent if we allow the limit to be −∞.
Let (x_n)_{n∈N} be a sequence of non-negative reals, i.e. x_n ≥ 0, and define

s_n = \sum_{k=1}^{n} x_k = x_1 + x_2 + . . . + x_n,

then, since the x's are non-negative, the sequence (s_n)_{n∈N} is increasing, and it has a limit, which we denote

\sum_{n=1}^{∞} x_n = lim_{n→∞} s_n.

It may be +∞. We write

\sum_{n=1}^{∞} x_n < ∞

if the limit is not +∞.
If (x_n)_{n∈N} is any sequence of reals we define

x_n^+ = max{x_n, 0}   and   x_n^− = max{−x_n, 0}.

Then x_n = x_n^+ − x_n^−, and the sequences (x_n^+)_{n∈N} and (x_n^−)_{n∈N} are sequences of non-negative numbers. They are known as the positive and the negative part, respectively, of the sequence (x_n)_{n∈N}. If

s^+ = \sum_{n=1}^{∞} x_n^+ < ∞

and

s^− = \sum_{n=1}^{∞} x_n^− < ∞

then we define the infinite sum

\sum_{n=1}^{∞} x_n = s^+ − s^− = \sum_{n=1}^{∞} x_n^+ − \sum_{n=1}^{∞} x_n^−

and we say that the sum is convergent. If one of the sums, s^+ or s^−, is +∞ we say that the sum is divergent. We may also observe that

|x_n| = x_n^+ + x_n^−,

so the sum is convergent precisely when \sum_{n=1}^{∞} |x_n| < ∞.
A classical infinite sum is the geometric series with x_n = ρ^{n−1} for ρ ∈ (−1, 1), then

\sum_{n=1}^{∞} ρ^{n−1} = \sum_{n=0}^{∞} ρ^n = 1/(1 − ρ).     (B.2)
Another classical infinite sum is the exponential series

\sum_{n=0}^{∞} λ^n/n! = exp(λ),

valid for λ ∈ R.
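Such infinite sums are easy to check numerically in R by computing partial sums; the values of ρ and λ below are arbitrary:

> rho <- 0.5
> sum(rho^(0:50))                          # close to 1/(1 - rho) = 2
[1] 2
> lambda <- 2
> sum(lambda^(0:20) / factorial(0:20))     # close to exp(2) = 7.389056
[1] 7.389056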
B.4 Integration

If f : R → R is a continuous function we can, for a < b, compute the integral

\int_a^b f(x) dx,

which is the area under the graph of f from a to b, the area being computed with a sign. In some cases it makes sense to let a or b tend to −∞ or ∞, respectively. In that case we get that

\int_{−∞}^{∞} f(x) dx

is the entire area under the graph from −∞ to ∞. If f is a positive function, this “area” always makes sense, though it may be infinite.
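Definite integrals can be computed numerically in R with the function integrate. As a small illustration (the integrands below are just examples), the density of the standard normal distribution integrates to one over the whole real line:

> integrate(dnorm, -Inf, Inf)        # returns 1 (up to a small numerical error)
> integrate(function(x) x^2, 0, 1)   # returns 1/3 = 0.3333333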
If f(x) ≥ 0 for all x ∈ R, the sequence of numbers

I_n = \int_{−n}^{n} f(x) dx

is an increasing sequence, hence it has a limit, which may equal +∞, for n → ∞. We write

\int_{−∞}^{∞} f(x) dx = lim_{n→∞} I_n = lim_{n→∞} \int_{−n}^{n} f(x) dx.
For a general continuous function f : R → R we define, just as for sequences, the positive part f^+(x) = max{f(x), 0} and the negative part f^−(x) = max{−f(x), 0}, such that f(x) = f^+(x) − f^−(x) and |f(x)| = f^+(x) + f^−(x). We say that f is integrable over R if the two positive functions f^+ and f^− are integrable or equivalently if the positive function |f| is integrable. In this case we also have that

\int_{−∞}^{∞} f(x) dx = lim_{n→∞} \int_{−n}^{n} f(x) dx.
The integral

\int_0^{∞} x^{λ−1} exp(−x) dx

is finite for λ > 0. It is known as the Γ-integral (gamma integral), and we define the Γ-function by

Γ(λ) = \int_0^{∞} x^{λ−1} exp(−x) dx.     (B.4)
Using integration by parts one can show that

Γ(λ + 1) = λΓ(λ),     (B.5)

and since Γ(1) = 1 it follows that

Γ(n + 1) = n!

for n ∈ N0. For non-integer λ the Γ-function takes more special values. One of the peculiar results about the Γ-function that can give more insight into the values of Γ(λ) for non-integer λ is the reflection formula, which states that for λ ∈ (0, 1)

Γ(λ)Γ(1 − λ) = π / sin(πλ).
This can together with (B.5) be used to compute Γ(1/2 + n) for all n ∈ N0; taking λ = 1/2 in the reflection formula gives Γ(1/2) = √π. For instance,

Γ(3/2) = Γ(1/2 + 1) = Γ(1/2)/2 = √π/2.
The B-function (beta function) is related to the Γ-function by

B(λ_1, λ_2) = Γ(λ_1)Γ(λ_2) / Γ(λ_1 + λ_2).     (B.6)
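Both functions are directly available in R as gamma and beta, which can be used to check the formulas above numerically; the arguments below are arbitrary:

> gamma(5)          # equals 4! = 24
[1] 24
> gamma(0.5)^2      # equals pi, by the reflection formula with lambda = 1/2
[1] 3.141593
> beta(2, 3)        # equals gamma(2)*gamma(3)/gamma(5) = 1/12
[1] 0.08333333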
If f : R × R → R is a continuous function of two variables, we can for fixed x integrate the function y ↦ f(x, y) over the interval [a, b], and the resulting function

x ↦ \int_a^b f(x, y) dy

is also a continuous function from R to R. We can integrate this function over the interval [c, d] for c < d, c, d ∈ R and get the multiple integral

\int_c^d \int_a^b f(x, y) dy dx.

We can interpret the value of this integral as the volume under the function f over the rectangle [c, d] × [a, b] in the plane. It is possible to interchange the order of integration so that

\int_c^d \int_a^b f(x, y) dy dx = \int_a^b \int_c^d f(x, y) dx dy.
These constructions extend to continuous functions f : R^k → R of k variables, which we can integrate over k-dimensional boxes of the form

[a_1, b_1] × . . . × [a_k, b_k].

For a positive function the integral over all of R^k is again defined as a limit of integrals over the boxes [−n, n] × . . . × [−n, n], but the limit may be equal to +∞. For any continuous function f : R^k → R we have that if

\int_{−∞}^{∞} · · · \int_{−∞}^{∞} |f(x_1, . . . , x_k)| dx_k . . . dx_1 < ∞

then

\int_{−∞}^{∞} · · · \int_{−∞}^{∞} f(x_1, . . . , x_k) dx_k . . . dx_1 = lim_{n→∞} \int_{−n}^{n} · · · \int_{−n}^{n} f(x_1, . . . , x_k) dx_k . . . dx_1.
Index

randomization, 79
refractory period, 120
regression
  assay, 10
rejection region, 162
relative frequency, 18
reparameterization, 144
residual sum of squares, 189
residuals, 194
  non-linear regression, 205
Robinson-Robinson frequencies, 23
rug plot, 44
sample mean, 30
sample space, 13
  discrete, 21
sample standard deviation, 232
sample standard error of the mean, 242
sample variance, 31, 231
scale, 10
scatter plot, 109
significance level, 163
slope parameter, 190
standard curve, 10
standard deviation
  integer distributions, 29
standard error of the mean, 242
standardized residuals, 195
statistical test, 162
statistically significant, 163
stochastic variable, 62
structural equation, 78
success probability, 63
symmetric distribution, 65
t-distribution, 43
t-test
  one-sample, 186
tabulation, 27
test, 162
test statistic, 162
transformation, 63
transition probabilities, 112
uniform distribution
  continuous, 39
  discrete, 25
vector, 24
Weibull distribution, 42
Welch two-sample t-test, 168
with replacement, 129
without replacement, 129
Yahtzee, 1