Statistics for Applied Computer Science
Lecture Notes
September to December 2024

ENGR SONNY PIUS

Abstract
In the real world, lots of things are uncertain, including the proper way
to handle uncertainty.
Old Jungle Saying

Contents

1 Introduction
  1.1 Professional context of statistics in Computer Science
  1.2 Summary and related work
  1.3 Usage of probability concepts
  1.4 Course Guide

2 Schools of statistics
  2.0.1 *Kolmogorov's axioms
  2.1 Bayesian Inference
    2.1.1 Very short introduction to Bayesian inference
    2.1.2 Choosing between two alternatives
    2.1.3 Finite set of alternatives
    2.1.4 Infinite sets of alternatives and parameterized models
    2.1.5 Simple, parameterized and composite models
    2.1.6 Recursive inference
    2.1.7 *Exchangeability
    2.1.8 Dynamic inference
    2.1.9 Does Bayes give us the right answer?
    2.1.10 A small Bayesian example
    2.1.11 Bayesian decision theory and parameter estimation
    2.1.12 *Bayesian analysis of cross-validation
    2.1.13 Biases and analyzing protocols
    2.1.14 How to perform Bayesian inversion
  2.2 Test based inference
    2.2.1 A small hypothesis testing example
    2.2.2 Multiple testing considerations
    2.2.3 Finding p-values using Monte Carlo
  2.3 Example: Mendel's results on plant hybridization
  2.4 Discussion: Bayes versus frequentism
    2.4.1 Model checking in Bayesian analysis
  2.5 Evidence Theory and Dempster-Shafer structures
  2.6 Estimating a distribution, Decision and Maximum Entropy
  2.7 Uncertainty Management
  2.8 Beyond Bayes: PAC-learning and SVM
    2.8.1 The Kernel Trick
  2.9 Algorithmic conformal prediction and anomaly detection

3 Data models
  3.1 Univariate data
    3.1.1 Discrete distributions and the Dirichlet prior
    3.1.2 Estimators and Data probability of Dirichlet distribution
    3.1.3 The normal and t distributions
    3.1.4 Nonparametrics and mixtures
    3.1.5 Piecewise constant distribution
    3.1.6 Univariate Gaussian Mixture modeling - The EM and MCMC ways
  3.2 Multivariate and sequence data models
    3.2.1 Multivariate models
    3.2.2 Dependency tests for categorical variables: Dirichlet modeling
    3.2.3 The multivariate normal distribution
    3.2.4 Dynamical systems and Bayesian time series analysis
    3.2.5 Discrete states and observations: The hidden Markov model
  3.3 Graphical Models
    3.3.1 Causality and direction in graphical models
    3.3.2 Simpson's paradox
    3.3.3 Segmentation and clustering
    3.3.4 Graphical model analysis - overview
    3.3.5 Graphical model choice - local analysis
    3.3.6 Graphical model choice - global analysis
    3.3.7 Categorical, ordinal and Gaussian variables
  3.4 Missing values and errors in data matrix
    3.4.1 Decision trees
  3.5 Clustering and Mixture Modelling
  3.6 Markov Random Fields and Image segmentation
  3.7 Regression and the linear model
  3.8 Bayesian regression

4 Approximate analysis with Metropolis-Hastings simulation
  4.1 Why MCMC and why does it work?
    4.1.1 MCMC burn-in and mixing
    4.1.2 The particle filter
  4.2 Basic classification with categorical attributes

5 Two cases
  5.1 Case 1: Individualized media distribution - significance of profiles
  5.2 Case 2: Schizophrenia research - hunting for causal relationships in complex systems

6 Conclusions

∗ email: [email protected]; mail: CSC/NADA, KTH, SE-100 44 Stockholm, Sweden

1 Introduction
Computers have since their invention been described as devices performing fast
and precise computations. For this reason it might be easy to conclude that
probability and statistics have little to do with computer science, as quite a
few of our students have indeed done. This attitude may have had some
justification before computers became more than a prestigious and expensive
piece of equipment. Today, computers are involved everywhere, and the
ubiquitous uncertainty characterizing the real non-computer world can no
longer be separated from computing. Automatically recorded data are collected
in vast amounts, and the kind of interpretation of sounds, images and other
recordings that was earlier made by humans, or under close supervision by
humans, must now inevitably be performed to a larger and larger extent by
computers, unaided by humans. If we emphasize the human performance of
interpretation as a gold standard, we may call the methodology used in such
applications artificial intelligence. If we emphasize that we must make
machines learn from experience, we may call it machine learning. Other popular
names of the emerging technologies are data mining, emphasizing the large
amounts of data and the importance of finding business opportunities in them,
or perhaps uncertainty management, emphasizing that computer applications
must be made more and more aware of the basic uncertainties in the real
world. The terms pattern recognition and information fusion have also been
used. Knowledge discovery is another term that evokes emotions, particularly
in the natural sciences. These notes were made for a course in statistical
methods in computer engineering, where the theme is to unify the methodology
in all the above fields with probabilistic and statistical tools. In this way
you may miss some lore of a specific field, but probably it will not matter
much.

1.1 Professional context of statistics in Computer Science


Data acquired for analysis can have many different formats. We will describe
the analysis of measurements that can be thought of as samples drawn from a
larger population, and the conclusions will be phrased in terms of this larger
population. I will focus on very simple models. My experience is that it is
too easy to start working with complex models without understanding the ba-
sics, which can lead to all sorts of erroneous conclusions. As the investigator's
understanding of a problem area improves, the statistical models tend to be-
come complex. Some examples of such areas are genetic linkage studies[24],
ecosystem studies[68] and functional MR investigations[109], where the signals
extracted from measurements are very weak but potentially extremely useful
for the application area. As an example, analysis of fMRI experiments has
complex models for the signal distortion, noise, movement artifacts, variation of
anatomy and function among different subjects, variations within the individual
and even drift in a single experiment. All these disturbances can be modeled us-
ing basic physics, knowledge of anatomy and statistical summaries on inter-brain
variation and the total model is overwhelmingly complex[69]. Experiments are
typically analyzed using a combination of visualization, Bayesian analysis and
conventional test and confidence based statistics. When it becomes necessary
to develop more sophisticated models, it is vital that the analyst communicates
the developments to the non-statistical members of the research team. In engi-
neering and commercial applications of data mining, the goal is not normally to
arrive at eternal truths, but to support decisions in design and business. Nev-
ertheless, because of the competitive nature of these activities, one can expect
well founded analysis methods and understandable models to provide more use-
ful answers than ad hoc ones. In any case, even with short deadlines it seems
important that the methodological discussion in a project contains more than
the question of which buttons to press in commercial data analysis packages.
If each sample point contains measurements of two variables, it is easy to
produce a scatter plot in two dimensions where sometimes a conclusion is imme-
diate. The human eye is very good at seeing structure in such plots. However,
sometimes it is too good, and constructs structure that simply does not exist.
If one makes a scatter plot with points drawn uniformly inside a polygon, the
point cloud will typically be deemed dense in some areas and sparse in others, in
other words one sees structure that does not exist in the underlying distribution,
as exemplified in figure 1.
There are several other reasons to complement visualization methods with
more sophisticated statistical methods: Most data sets will contain many more
than two variables, and this leads to a dimensionality problem. Producing
pairwise plots means that many plots are produced. It is difficult to examine all
of them carefully, and some of them will inevitably accidentally contain structure
that would be deemed real in a single plot examination but not as an extreme
among many plots. It would also mean that co-variation of several variables that
cannot be explained as a number of two-variable co-variations cannot be found.
Nevertheless, scatter plots are among the most useful explanatory devices once
a structure in the data has been verified to be ’probably real’.
This text emphasizes characterization of data and the population from which
it is drawn with its statistical properties. Nonetheless, the application owners
typically have very different concerns: they want to understand, they want to
be able to predict and ultimately to control their objects of study. This means
that the statistical investigation is a first phase, which must be accompanied by
interpretation, activities extracting meaning from the data. There is relatively
little theory on these later activities, and it is probably fair to say that their
outcome depends mostly on the intellectual climate in the team of which the
analyst is only one part.

1.2 Summary and related work
Our purpose is to explain some advantages of the Bayesian approach and to show
how probability models can capture the information or knowledge we are after in
an application. It is also our intention to give a full account of the computations
required. It can serve as a survey of the area, although it focuses on techniques
being investigated in present projects in medical informatics, defense science and
customer segmentation. Several of the computations we describe have been an-
alyzed at length, although not exactly in the way and with the same conclusions
as found here. The contribution here is a systematic treatment that is mostly
confined to pure Bayesian analysis and puts several established data mining
methods in a joint Bayesian framework. We will see that, although many compu-
tations of Bayesian data-mining are straightforward, one soon reaches problems
where difficult integrals have to be evaluated, and presently only Markov Chain
Monte Carlo (MCMC) methods - which are computationally demanding - and
variational Bayes methods - which are approximate - are available. There are
several recent books describing the Bayesian method from a theoretical[16],
an ideological[95] and an application-oriented[19] perspective. Particularly Ed
Jaynes's unfinished lecture notes[61] have provided inspiration for me and
numerous students using them all over the world.
ods, which can solve many complex evaluations required in advanced Bayesian
modeling, can be found in the book[48]. Books explaining theory and use of
graphical models are Lauritzen[64] and Cox and Wermuth[26]. A tutorial on
Bayesian network approaches to data mining is found in (Heckermann[56]) and
they are thoroughly covered in (Jensen[62]). We will describe data mining in
a relational data structure with discrete data (discrete data matrix) and the
simplest generalizations to numerical (floating point) data. A recent book with
good coverage of visualization integrated with models and methods is the sec-
ond edition of Gelman et al[45]. Good tables of probability distributions can
be found in [79, 16, 45], and a fascinating book on probability theory is Feller’s
[38].
These lecture notes were written for a Data Mining PhD course given an-
nually 1996-2006. A significant update was made when it was decided to give
the course at the advanced (Masters) level. The contents of these notes have
been influenced by the research undertaken by me and by colleagues and PhD
students I worked with. A short version of these notes was published in Wang[2,
Ch 1].

1.3 Usage of probability concepts


We will stick to the conventions of applied mathematics using probability density
functions, which can also be generalized in the sense of having point masses
like Dirac's δ function. This function from real to real has the property
that ∫_I δ(x)dx is one if the interval I contains zero; otherwise the
integral value is zero. This gives the useful formula
∫ f(x)δ(x − y)dx = f(y) for a smooth function f : R → R. The Dirac function
is thus not an ordinary function, but a generalized function, a
measure-theoretic construction which can be thought of as a limit of a
sequence of functions whose supports are intervals around 0 with decreasing
length going to zero, and such that each member integrates to one over the
real line. We will not consider measure-theoretic complications or
detailed handling of non-uniform state spaces. Thus, when we call, e.g., q(z|x)
a probability distribution, we mean that we have a probability distribution over
the space of z that depends on x. The space of z is Euclidean n-dimensional
space or a discrete finite set. In the latter case the distribution is simply a set of
non-negative (probability) numbers summing to one. In the first case we may
think of an integrable continuous function, possibly with discontinuities along
(n − 1)-dimensional sub-manifolds, and a set of singular 'peaks', 'ridges', etc., that
can be modeled using the Dirac δ function. The case of state spaces consisting
of unions of dissimilar spaces poses no problem except the need to visualize and
understand. Such applications with variable dimension state (parameter) space
are rapidly emerging, prominent examples being target tracking systems for
multiple targets, and intelligent cruise control systems keeping track of nearby
vehicles and obstacles.

1.4 Course Guide


Chapters and exercises marked * are either more demanding or more abstract
than the main text which is easy (but long). Exercises marked + will be given
as homework and their solutions are not supplied here. It is important to have
access to a computer with Matlab, Octave, or R. Many of the methods described
here can be found in publicly available code libraries (instead of indexing this
moving target, I suggest you Google up the codes you need, or go through
the Matlab Central File Exchange, where the codes are quality reviewed by other
users).
The most popular supplementary readings have been Jaynes[61], (particu-
larly chapters 1, 2, 4 and 5) for the philosophically minded, and [45] for those
wanting thorough competence in practical application of Bayesian statistical
tools and methods.

2 Schools of statistics
Statistical inference has a long history, and one should not assume that all scien-
tists and engineers analyzing data have the same expertise and would reach the
same type of conclusion using the objectively ’right’ method in the analysis of a
given data set. On the contrary, analysis proceeds by a trial-and-error process
influenced by developments in the particular science or application community
as well as in basic statistical sciences. A lot of science could once be developed
using very little statistical theory, but this is no longer true. Probability theory
is the basis of statistics, and it links a probability model to an outcome. But this
linking can be achieved by a number of different principles. A pure mathemati-
cian interested in mathematical probability would only consider abstract spaces
equipped with a probability measure. Whatever is obtained by analyzing such
mathematical structures has no immediate bearing on how we should interpret
a given data set collected to give us knowledge about the world – it is given that
bearing only through interpretation linking it to human experience.
When it comes to inference about real-world phenomena, there are at least
two different and complementary views on probability that have competed for
the position as ’the’ statistical method. With both views, we consider models
that tell how data are generated in terms of probability. The models used for
analysis reflect our - or the application owners' - understanding of the problem
area. In a sense they are hypotheses and in inference a hypothesis is often more
or less equated with a probability model. With probability theory we can find
the probability of observing data with specified characteristics under a prob-
ability model. But inference is concerned with saying something about which
probability model generated our data - for this reason inference was sometimes
called inverse probability[29]. The name Bayesian, referring to the 19th
century style of inverse probability inference, is however more recent[39].

2.0.1 *Kolmogorov’s axioms


The Kolmogorov axioms are often referred to in a ceremonial way in parts of
the engineering literature. They are not based on probability density functions
which are mostly used in applied work. Instead, the concept of probability space
is defined as a triple (Ω, F, P ), where Ω is the outcome space, F is a σ-algebra of
subsets (events) of Ω, and P is a real-valued function on F satisfying the three
Kolmogorov axioms:

• P has non-negative values on F ;


• P (Ω) = 1;
• For every finite or countable sequence of disjoint sets (events) Ei ∈ F we
must have Σ_i P(Ei) = P(∪_i Ei).

The σ-algebra mentioned above is a set F of subsets of Ω such that Ω ∈ F, if
f ∈ F then Ω − f ∈ F, and if Ei is a finite or countable sequence of elements
in F, then ∪_i Ei is also in F. It is not difficult to see that a smooth
probability density function defined on, e.g., R^n, will give rise to a
probability function on smooth
subsets (obtained by integration) that satisfies the Kolmogorov axioms. This
more sophisticated approach to probability is sounder from a pure mathematics
point of view. Engineers have a tendency to work with generalized probability
density functions, which also give right results as long as they are manipulated
in a standard fashion. In these notes we use the abbreviation pdf meaning
probability density function (for continuous domains) or probability distribution
(for discrete domains).
Consider, e.g., a variable that with probability 1/2 is uniformly distributed
over the unit interval, and with probability 1/2 has the exact value 1/4. The
appropriate probability function P for this situation assigns to a subset r of the
unit interval the probability that is obtained by taking the interval length |r|
and, if 1/4 is in r, adding 1, and then halving the result. An engineer might
describe this situation with a density function p(x) = 0.5(1 + δ(x − 0.25)) on
0 ≤ x ≤ 1, where δ is the Dirac function. Since the integral of δ(x − 0.25) over a
set is one if the set contains 0.25 and otherwise zero, these two definition methods
give the same probability for all 'nice' sets. The reason that Kolmogorov's
axioms define P only on the σ-algebra is that there are weird subsets of
uncountable sets, like the real unit interval, whose probability cannot be
determined: they are non-measurable. Such sets do not pop up in normal
applications, and constructing one requires the axiom of choice. A well-known
example of a set that looks alarming but is in fact measurable is the Cantor
set: Start out with the closed unit interval, remove the middle third open
interval, which results in two closed intervals. Recursively perform the same
operation on all closed intervals appearing during this construction process.
The resulting Cantor set contains no interval and is non-denumerable, yet it
is measurable with measure zero.
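
Returning to the mixed distribution above, its probability function P can be
evaluated on intervals directly, without the δ formalism; a minimal sketch:

    % Probability of an interval [a,b] under p(x) = 0.5*(1 + delta(x - 0.25)).
    P = @(a, b) 0.5 * ((b - a) + (a <= 0.25 && 0.25 <= b));
    P(0, 0.5)     % 0.5*(0.5 + 1) = 0.75
    P(0.5, 1)     % 0.5*(0.5 + 0) = 0.25
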


Figure 1: A sample of points distributed uniformly in the unit square. It is easy
to find areas with apparently lower density of points. But there is no structure
in the underlying distribution.

Figure 2: Florence Nightingale (1820-1910) is perhaps best known as the founder
of the medical nursing profession. She was also an able administrator and gifted
mathematician, and pioneered the use of graphical statistical summaries to iden-
tify areas for improvement in military and civilian health care, and public health.
This is an early example of systematic data mining and visualization used to
influence decision makers to improve operations. She is recognized as inventor
of circular statistical diagrams (visualizing, e.g., seasonal variations) and pie
charts.

Figure 3: Andrei Kolmogorov(1903-1987) is the mathematician best known
for shaping probability theory into a modern axiomatized theory. His axioms
of probability tell how probability measures are defined, also on infinite and
infinite-dimensional event spaces. He did not work on statistical inference about
the real world, but many of the analytical tools he developed are useful for such
work.

2.1 Bayesian Inference

Figure 4: Thomas Bayes (1701?-1762) was a Presbyterian clergyman and am-
ateur mathematician, but he was also a member of the Royal Society. It is
believed that he was admitted because he participated in a discussion on the
rationality of the new, Newton type, mathematics when it was attacked by the
bishop Berkeley[14]. Bayes was also the first to analyze a way to infer a proba-
bility model from observations, although his essay is technically weak by modern
standards. It is not verified that this contemporary portrait of a priest actually
is a likeness of Thomas Bayes, but it is often used as such.

2.1.1 Very short introduction to Bayesian inference


The first applications of inference used Bayesian analysis[29], where we can di-
rectly talk about the probability that a hypothesis in the form of a probability
model generated our observed data. The basic mechanism is easy: There is an
observation space of possible observations, and a state space of possible states
(of the world, or of significant aspects of the world). Let our models be defined
by their data generating probabilities, i.e., for every state there is a probabil-
ity distribution over the possible observations. The likelihood is obtained by
regarding the probability of a given observation as a function of the state. The
likelihood is not itself a probability density function, since it is not normalized.
It is sometimes convenient to think of the likelihood (after normalization) as
a probability distribution. The prior is a probability distribution over state,
representing belief in different possible states. When an observation has been
obtained, it will change our belief in the system’s state from the prior to a new
probability distribution, the posterior.
As a first example, let us look at the simplest case where both spaces have
exactly two elements. This is quite common in many cases where an observa-
tion, e.g., the reaction of a test kit (outcomes are often called positive and
negative), is used to assess a person's state with respect to a disease. These
tests are made in such a way that the test outcome is normally correct:
positive if you have the disease and negative when you do not have the
disease. But there are possibilities of both false negatives (where you
actually have the disease in spite of a negative test) and false positives,
where the test gives a 'false alarm'. Assume the probabilities of having and
not having the disease before taking the test are (pd, pnd), where
pd + pnd = 1. Also, for test outcome o, the probabilities of getting outcome
o with the disease and without the disease are (pod, pond). We form the
vector (pd·pod, pnd·pond). This is not a probability vector, but it becomes
one, (p′d, p′nd), after normalization, i.e., by dividing each component by
pd·pod + pnd·pond. The Bayesian interpretation (which we will shortly
motivate in more detail) is that the prior probability pd of having the
disease is transformed to the posterior probability of having the disease,
p′d, after obtaining the observation o.
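
To make this concrete, here is a minimal Matlab sketch of the two-outcome
update (using, for illustration, the numbers of Exercise 2 below: 0.1%
prevalence, 5% false negatives, 10% false positives); the last two lines are
exactly the multiply-and-normalize computation just described:

    % Discrete Bayes update for a two-outcome diagnostic test.
    prior = [0.001 0.999];      % (pd, pnd): prevalence of the disease
    like  = [0.95  0.10];       % P(positive|disease), P(positive|no disease)
    tmp = like .* prior;        % the vector (pd*pod, pnd*pond)
    posterior = tmp / sum(tmp)  % posterior (p'd, p'nd) after a positive test
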
In a slightly more complex case the state space is infinite, e.g., the set
of real numbers. Here the prior is a probability density f(x), a non-negative
function that integrates to one over the real line. For a continuous
observation space, like the real numbers, we have instead a density function
f(x, o) where for each fixed x, f(x, o) is a density over the reals. In this
case we come from the prior, f(x), to the posterior, f′(x), by multiplying
functions pointwise in x, f′(x) = c·f(x)f(x, o) for observation o. Here c is
a normalization constant chosen so that f′(x) integrates to one over the
real line. If now the state x is the true length of an object and o is the
estimated length of the object, then the prior f(x) can be a normal
distribution with mean µ and variance σ², and the measurement error can be
another normal distribution with mean 0 and variance σm², thus with the
density cm·exp(−(x − o)²/(2σm²)), where the normalization constant cm is
1/√(2πσm²). Multiplying this with the prior c·exp(−(x − µ)²/(2σ²)) and
normalizing, we get another normal distribution with mean
(σm²µ + σ²o)/(σm² + σ²) and variance ((σm²)⁻¹ + (σ²)⁻¹)⁻¹. The mean of the
posterior is thus a weighted average of the prior mean and the measurement.
If the measurement has small variance, this mean is close to the
measurement; if the variance of the prior is small, then it is close to the
prior mean. If another, independent measurement is made, our new belief in
the object's length is handled similarly, with the old posterior taking the
place of the new prior. Since multiplication of functions is associative and
commutative, it does not matter in which order a set of such independent
measurements is obtained or grouped, as long as each measurement is included
exactly once. Handling of discrete and continuous state spaces is thus
analogous in Bayesian inference, and we will sometimes use the same notation
in both cases. In this example, if two measurements are made giving o1 and
o2, the posterior is f(x|o1, o2) ∝ f(o1|x)f(o2|x)f(x) under the assumption
that the measurements are independent. One significant source of dependence
is a systematic error affecting both o1 and o2.
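
A minimal numerical sketch of this normal-normal update, assuming
illustrative values for the prior and the measurement noise; running the loop
over the two measurements in either order gives the same posterior:

    % Sequential normal-normal Bayesian update (illustrative values).
    mu = 10; s2 = 4;            % prior mean and variance (assumed)
    s2m = 1;                    % measurement noise variance (assumed)
    for o = [10.8 9.5]          % two independent measurements o1, o2
      mu = (s2m*mu + s2*o) / (s2m + s2);  % weighted average of mean and datum
      s2 = 1 / (1/s2m + 1/s2);            % posterior variance
    end
    disp([mu s2])               % posterior mean and variance after both updates
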

The notation used below is quite compact and may give a simple picture,
but you should observe that when actual probability distributions are substi-
tuted into the formulas one usually has a quite demanding analysis to make,
analytical, numeric, or stochastic. Examples are given in exercises and in Chap-
ter 3.

2.1.2 Choosing between two alternatives


With two models H1 and H2 , the probability of seeing data D is P (D|H1 ) and
P (D|H2 ), respectively. Using the standard definition of conditional probability,
P (A|B) = P (AB)/P (B), we can invert the conditional of the model:

P (H1 |D) = P (D|H1 )P (H1 )/P (D).

The data probability P(D) regardless of generating model is, not unexpectedly,
difficult to assess and cannot be used. For this reason, Bayesian analysis does not
normally consider a single hypothesis, but is used to compare different models
in such a way that the data probability regardless of model is not needed. Using
the same equation for H2 , we can eliminate the data probability:

P(H1|D)/P(H2|D) = (P(D|H1)/P(D|H2)) · (P(H1)/P(H2))    (1)
This rule says that the odds we assign to the choice between H1 and H2 , the
prior odds P(H1)/P(H2), are changed to the posterior odds P(H1|D)/P(H2|D),
the odds after also seeing the data, by multiplication with the Bayes factor
P (D|H1 )/P (D|H2 ). In other words, the Bayes factor contains all information
provided by the data relevant for choosing between the two hypotheses. The
rule assumes that we have subjective probability, dependent on information the
observer holds, e.g., by having seen the outcome D of an experiment.
If we assume that exactly one of the hypotheses is true, we can normalize
probabilities by P (H1 |D) + P (H2 |D) = 1 and find the unknown data proba-
bility P (D) = P (D|H1 )P (H1 ) + P (D|H2 )P (H2 ). This probability has however
no obvious meaning in an application, it is merely an inverse normalization
constant.

2.1.3 Finite set of alternatives


If we have more than two hypotheses {Hi }i∈I for finite set I, a similar calcu-
lation leads to a formula defining a posterior probability distribution over the
hypotheses that depends on the prior distribution:

P(Hi|D) = P(D|Hi)P(Hi) / Σ_{j∈I} P(D|Hj)P(Hj)    (2)

The assumption behind the equation is that exactly one of the hypotheses
is true. Then the inverse normalization constant and the data probability is
P(D) = Σ_{j∈I} P(D|Hj)P(Hj).
When the number of hypotheses |I| goes to infinity, we can intuitively see the
prior and the resulting posterior as parameterized probability density functions.
Conversely, one way to solve the continuous inference problem is to approximate
it by a finite problem using equation (2).
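
As a sketch of this finite approximation, assume a Bernoulli model with s
successes in n trials and a uniform prior on a grid of success probabilities
(anticipating the coin example of Ch 2.1.10):

    % Grid approximation of a continuous posterior via equation (2).
    p = linspace(0.001, 0.999, 999);    % finite grid over the parameter space
    prior = ones(size(p)) / numel(p);   % uniform prior over the grid points
    s = 3; n = 12;                      % observed successes and trials
    like = p.^s .* (1-p).^(n-s);        % likelihood at each grid point
    post = like .* prior;
    post = post / sum(post);            % equation (2); Beta(s+1, n-s+1) shape
    plot(p, post)
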

2.1.4 Infinite sets of alternatives and parameterized models
If we have a parameterized model Hλ , the data probability is also dependent
on the value of the parameter λ which is taken from some parameter space
or possible world space Λ. In Bayesian analysis the parameter space can be
overwhelmingly complex, such as in the case of image reconstruction where Λ is
the set of possible images represented, e.g., by voxel vectors of significant length
in 3D images. But the parameter space can also be simple, like a postulated
’true value’ of a measured real valued quantity. We can also let the parameter
space be a finite set and so can regard equation (2) as a special case of equation
(3) below.
There are also interesting cases where the parameter space has complex
structure. Typically, a parameter value in such cases is a vector whose dimension
is not fixed. Such models can describe a multiple target problem in defense
applications, where one wants to make inference both about the number of
incoming targets and the state (position and velocity) of each, or in genetic
studies where one assumes that a disease is determined by combinations of an
unknown number of gene variants, or simply in an effort to describe a probability
distribution as a mixture of an unknown number of normal distributions. In
the latter case, the parameter can be the number of such normal components
followed by a sequence of weights, mean values and variances of the components,
the sequence (λ1, µ1, σ1², λ2, . . .) of equation (25).
Consideration of parameterized models leads to the parameter inference rule,
which can be explained as a limit form of equation (2):

f(λ|D) ∝ P(D|Hλ, λ)f(λ),    (3)

where f (λ) is the prior density and f (λ|D) is the posterior density, and ∝
is a sign that indicates that a normalization constant (independent of λ) has
been omitted. For posteriors of parameter values the concept of credible set is
important. A credible set is a set of parameter values in which the parameter
has a high probability of lying, according to the posterior distribution. For a
real-valued parameter, a credible set which is a (possibly open-ended) interval is
called a credible interval. Likewise, a credible interval with posterior probability
0.95 is called a 95% credible interval. There are typically many 95% credible
intervals for a given posterior. It may seem natural to choose that interval for
which the posterior probability density is larger in the interval than outside it.
This interval is not robust to rescaling, however. A robust credible interval is
one that is defined by the quantiles at its end-points. It is usually centered in
the sense that the probability is the same for the parameter to lie above as
below the interval. In the case of a 95% credible interval, each of these two
probabilities is thus 2.5%, and the end-points are the 2.5% and 97.5%
quantiles.
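
A minimal sketch of such a quantile-based interval, computed from a posterior
tabulated on a finite grid (here the Beta posterior of the coin example in
Ch 2.1.10, with s = 3 and f = 9):

    % Central 95% credible interval from a gridded Beta posterior.
    p = linspace(0.001, 0.999, 999);
    post = p.^3 .* (1-p).^9;
    post = post / sum(post);            % posterior probabilities on the grid
    cdf = cumsum(post);                 % discrete posterior cdf
    lo = p(find(cdf >= 0.025, 1));      % 2.5% posterior quantile
    hi = p(find(cdf >= 0.975, 1));      % 97.5% posterior quantile
    fprintf('central 95%% credible interval: [%.2f, %.2f]\n', lo, hi)
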
There are many situations where the ’data’ part of the analysis is not pre-
cisely known, but only a probability distribution is available, perhaps obtained
as a posterior in a previous inference. In this case the inference rule will be

f(λ|f(x)) = ∫ f(λ|x)f(x)dx.    (4)

where f (λ|x) ∝ f (x|λ)f (λ), and f (x) is a probability density function over
the observation space, a fuzzy observation.

2.1.5 Simple, parameterized and composite models
It is common in statistical work to refer to probability distributions generat-
ing experimental results as models. We can distinguish three types of models
throughout this course:
• A simple model is a model like P (D|H). Its data generating distribution is
fixed. Simple models can be used in inference work as stated in equations
(1,2).
• A parameterized model is a probability distribution that depends on one or
more parameters like P (D|Hλ , λ) above. They are used to make inference
on one or more of the parameters as in equation (3). This equation gives
a posterior for all parameters. If the parameter is composite (consisting of
several values like mean and variance for a normal distribution), we can get
a posterior for one of them by integrating out the others. So if inference
about a normal distribution parameter gives us the posterior as a joint
pdf f(µ, σ²), and we are only interested in the mean, then the posterior
of the mean is the so-called marginal distribution f(µ) = ∫ f(µ, σ²)dσ².
This process of eliminating uninteresting parameters in Bayesian inference
is called integrating out nuisance parameters.
• A composite model is obtained from a parameterized model P (D|Hλ , λ)
by specifying a prior distribution f (λ) for the parameter and integrating
out the parameter. The resulting composite hypothesis Hc has the data
probability distribution P(D|Hc) = ∫ P(D|Hλ, λ)f(λ)dλ.
An example of a composite model is given in section 2.1.10 to describe an
unbalanced coin. Since we do not know exactly how the coin is unbalanced, we
average over all possible unbalanced coins, assuming a uniform ’probability of
heads’ distribution.

2.1.6 Recursive inference


Bayesian analysis for many observed quantities can be formulated recursively,
because the posterior obtained after one observation can be used as a prior for
the next observation, so called recursive inference:

f (λ|Dt ) ∝ f (dt |λ)f (λ|Dt−1 ) (5)


f (λ|D0 ) = f (λ)

Here Dt = (d1, . . . , dt) is the sequence of observations obtained at differ-
ent times, and we have assumed they are independently generated,
f(Dt|λ) = Π_{i=1}^t f(di|λ).
The analysis of a sequence of data items, each obtained by an identical
procedure, is fundamental in statistics. If one wants to be very careful, one
uses the concept of exchangeability: see below.
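
On a finite parameter grid, the recursion (5) is one multiply-and-normalize
step per observation; a minimal sketch for Bernoulli data, whose final
posterior agrees with the batch computation of equation (3):

    % Recursive inference, equation (5), on a parameter grid.
    p = linspace(0.001, 0.999, 999);
    post = ones(size(p)) / numel(p);    % f(lambda|D_0) = prior
    D = [0 0 0 1 1 0 0 0 0 0 0 1];      % observation sequence
    for d = D
      like = p.^d .* (1-p).^(1-d);      % f(d_t|lambda)
      post = like .* post;              % equation (5), unnormalized
      post = post / sum(post);
    end
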

Exercise 1 Show that (5) follows from (3) if the di are independently drawn
from a common distribution.

2.1.7 *Exchangeability
A term occurring frequently in theoretically oriented statistical papers is ex-
changeability. It originates in work by de Finetti, but has been reinvented many
times. Assume the data obtained from an experiment is a sequence of items, all
of the same type. If the only thing we know about its distribution is that all
permutations of the sequence have the same probability (or probability density),
then the distribution is exchangeable. This seems to be the weakest condition
resulting in what is colloquially referred to as a sequence of independent and
identically distributed (iid) observations. Exchangeability is often considered
a weaker assumption than iid, but the difference is subtle. The representation
theorem, due originally to de Finetti([30]), states that an exchangeable sequence
is always describable as a sequence of independent variables generated according
to some, typically unknown, distribution. The 0-1 representation theorem of de
Finetti says that an exchangeable distribution over 0-1 (i.e., binary) sequences
has a probability distribution obtainable by first specifying a ’success probabil-
ity’ p from some distribution over probability (i.e., reals from 0 to 1) and then
considering the sequence as generated by coin tossing with success probability
p.
For a fuller account of representation theorems and their proofs, see, e.g.,
[16].
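
The 0-1 representation theorem translates directly into a two-stage
simulation; a minimal sketch:

    % Generating an exchangeable binary sequence per de Finetti's theorem.
    p = rand();              % success probability drawn from a uniform prior
    x = rand(1, 20) < p;     % conditionally iid tosses with success probability p
    % Marginally, all permutations of x have the same probability.
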

2.1.8 Dynamic inference


The recursive formulation (5) is the basis for the Chapman-Kolmogorov ap-
proach where the task is to track a system state that changes with time, having
state λt at discrete time t (t is thus an integer). This case is called dynamic
inference, since it tries to estimate the state of a moving target at times indexed
from 1 upwards. In this case the sequence of observations is not exchangeable.
The Chapman-Kolmogoroff equation is:


f(λt|Dt) ∝ f(dt|λt) ∫ f(λt|λt−1)f(λt−1|Dt−1)dλt−1    (6)

f(λ0|D0) = f(λ0),

where f (λt |λt−1 ) is the maneuvering (process innovation) noise assumed, often
called the transition kernel. The latter is a pdf over state λt dependent on state
at the previous time-step, λt−1 . This distribution may depend on t, but often a
stationary process is assumed where the distribution of λt depends on λt−1 but
not explicitly on t. As an example, if we happen to know that the state does
not change, we would use Dirac’s delta δ(λt − λt−1 ) for the transition kernel and
equation (6) will simplify to (5). If we constrain the model to be linear with
Gaussian process and measurement noise with known covariance, equation (6)
simplifies to a linear algebra equation, whose solution is the classical Kalman
filter (this is pedagogically explained in [13]).
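
For a finite state space, the integral in (6) becomes a sum and one update
step is plain matrix-vector arithmetic. A minimal sketch, assuming an
illustrative two-state system with known transition matrix and observation
likelihoods:

    % One step of discrete-state dynamic inference, equation (6).
    T = [0.9 0.1; 0.2 0.8];     % T(i,j) = f(lambda_t = j | lambda_{t-1} = i)
    prior = [0.5; 0.5];         % f(lambda_{t-1} | D_{t-1})
    like = [0.7; 0.1];          % f(d_t | lambda_t) for the current observation
    pred = T' * prior;          % prediction: sum over previous states
    post = like .* pred;
    post = post / sum(post)     % normalized f(lambda_t | D_t)
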
In many applications, additional information can be obtained by recom-
puting the past trajectory of the system. This is because later observations
can give information about the earlier behavior. This is known as retrodiction.
When the system has been observed over T steps, the posterior of the trajectory
λ = (λ0 , λ1 , . . . , λT ) is given by:

f(λ|DT) ∝ f(λ0) Π_{t=1}^T f(dt|λt)f(λt|λt−1)    (7)

Dynamic inference is a quite common case of inference, examples being target


tracking in civil and military vehicle surveillance, spread of epidemics, climate,
and more.
A continuous-time version of dynamic inference leads to study of the Fokker-
Planck (or Fokker-Planck-Kolmogorov) equation, of which the Chapman-Kolmogorov
equation is a discretization.

2.1.9 Does Bayes give us the right answer?


Above we have only explained how the Bayesian crank works. It would of course
be nice to know also that we will get the right answers.
The Bayes factor (1) estimates the support given by the data to the hypothe-
ses. Inevitably, random variation can give support to the ’wrong’ hypothesis.
A useful rule to apply when choosing between two hypotheses as in (1) is the
following: If the Bayes factor is k in favor of H1 , then the probability of getting
this factor or larger from an experiment where H2 was the true hypothesis is
less than 1/k. For many specific hypothesis pairs, the bound is much better[87].
There is also a nice characterization of long run properties of the equa-
tion (5) that has an accessible proof in [45]: If the observations are generated
by a common distribution g(d) and we try to find the parameter λ by using
equation (5), then as the number of observations tends to infinity, the poste-
rior f (λ|Dn ) will concentrate around a value λ̂ that minimizes the Kullback-
Leibler distance between the true and the estimated  distribution of observations,
λ̂ = argminλ KL(g(·), f (·|λ)), where KL(g, f ) = f (x) log(f (x)/g(x))dx. It is
not difficult to see that the KL distance is minimized and zero when the two
distributions f (d|λ) and g(d) are equal. So if the real distribution g(x) is equal
to the considered distribution f (x, λ), then the dynamic Bayesian inference will
asymptotically give the right answer. Convergence rate can however be very
slow, as discussed in [45].
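
To see the concentration target numerically, one can tabulate the KL distance
over a grid of parameter values; a sketch with a deliberately misspecified
family (illustrative choice: true distribution N(1,1), models N(λ,4)), for
which the KL minimizer is still the true mean:

    % Numerical KL(g, f(.|lambda)) over a grid of candidate parameters.
    x = linspace(-6, 6, 1201); dx = x(2) - x(1);
    g = exp(-(x-1).^2/2) / sqrt(2*pi);          % true density N(1,1)
    lams = linspace(-2, 3, 101);                % candidate means for N(lam,4)
    kl = zeros(size(lams));
    for i = 1:numel(lams)
      f = exp(-(x-lams(i)).^2/8) / sqrt(8*pi);  % model density N(lam,4)
      kl(i) = sum(g .* log(g ./ f)) * dx;       % KL(g, f), Riemann sum
    end
    [~, imin] = min(kl);
    lams(imin)                                  % close to the true mean 1
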

2.1.10 A small Bayesian example.


We will see how Bayes’ method works with a small example, in fact very similar
to the example used by Thomas Bayes. Assume we have found a coin among the
belongings of a notorious gambling shark. Is this coin fair or unfair? The data
we can obtain is a sequence of outcomes in a tossing experiment, represented
as a binary string D. Let one hypothesis be that the coin is fair, Hr. Then
P(D|Hr) = 2^(−n), where n = |D| is the number of tosses made. Since the
tosses are assumed independent, the number of ones, s, and the number of
zeros, f, completely characterize an experiment. Since every outcome has the
same probability, we cannot evaluate the experiment with respect to only Hr,
but we introduce another hypothesis that can fit better or worse to an
outcome. Bayes
used a parameterized model where the parameter is the unknown probability,
p, of getting a one in a toss. For this model Hp , we have P (D|Hp ) = ps (1 − p)f ,
and the probability of an outcome is clearly a function of p. We can consider
Hp a whole family of models for 0 ≤ p ≤ 1. If we assume, with Bayes, that the

18
prior distribution of p is uniform in the interval from 0 to 1, we get a posterior
distribution equal to the normalized likelihood, f(p|D) = c·p^s(1 − p)^f, a Beta
distribution where the normalization constant is (as can be found from a table
of probability distributions or proved by double induction) c = (n + 1)!/(s!f !).
This function has a maximum at the observed frequency s/n. We cannot say
that the coin is unfair just because we have s = f since the normal variation
makes inequality very much more likely than equality if we made a large number
of tosses, even if the coin is fair.
If we want to decide between fairness and unfairness we can introduce a
composite hypothesis/model by specifying a probability distribution for the pa-
rameter p in Hp . A conventional choice is again the uniform distribution. Let
Hu be the hypothesis of unfairness, expressed as Hp with a uniform
distribution on the parameter p. By integration we find
P(D|Hu) = ∫_0^1 P(D|Hp)dp = ∫_0^1 p^s(1 − p)^f dp = s!f!/(n + 1)!. Suppose
now that we toss the coin twelve times and obtain the sequence 000110000001,
three successes and nine failures. The probability of this outcome D under
the fairness hypothesis Hr is P(D|Hr) = 2^(−12), and under the unfairness
hypothesis Hu we have P(D|Hu) = 3!9!/13!.
The Bayes factor in favor of unfairness will be

P(D|Hu)/P(D|Hr) = 2^n·s!f!/(n + 1)!

Inserting n = 12, f = 9 and s = 3 we obtain the Bayes factor
2^12·3!9!/13! ≈ 1.4, slightly in favor of unfairness. But this is too small a
value to be of interest.
Values above 3 are worth mentioning, above 30 significant, and factors above
300 would give strong support to the first hypothesis, whereas values below
1/3, 1/30 and 1/300 give similar support to the second hypothesis. But we are
only comparing two hypotheses – this scheme cannot tell us that none of the
alternatives is plausible.
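
The numbers above are easy to reproduce numerically; a minimal sketch
(factorial is exact for counts this small):

    % Bayes factor for the unfair (Hu) versus fair (Hr) coin.
    s = 3; f = 9; n = s + f;    % three successes, nine failures
    P_Hr = 2^(-n);                                      % P(D|Hr)
    P_Hu = factorial(s)*factorial(f)/factorial(n+1);    % P(D|Hu)
    BF = P_Hu / P_Hr            % about 1.4, weakly in favor of unfairness
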
The above is an analytical way to solve Bayesian inference, and it relies on
having priors and likelihoods such that their product can be integrated analyt-
ically. This is often not the case since several functions do not have a primitive
function that can be defined in terms of functions we have named. We can ob-
tain a simple approximate solution by approximating the continuous parameter
space with a finite set, e.g., probability values spaced uniformly from 0 to 1.
We then use the formula (2) as an approximation to (3). This
works well for a one-dimensional parameter space, but with many parameters
the curse of dimensionality sets in and we will have to use MCMC methods.
The application of the Bayes factor for the case of a composite and a simple
model has been criticized for the case where the simple model is a special case of
the parameterized model from which the composite is obtained. An alternative,
proposed in [45], uses a high probability symmetric credible interval and chooses
the simple model if the corresponding parameter value falls in the interval.
Another alternative is to find the posterior probability that the parameter value
is larger than the value of the simple model. If this probability is not close to
one or zero, this indicates that the simple model is sufficient. This is not a
pure Bayesian approach but has the flavor of hypothesis testing and p-values,
as we will describe in Ch 2.2. The method is a good alternative to the Bayes
factor in the case of real valued parameters like the success probability for
a coin being 0.5 or a regression coefficient being zero (Ch 3.7). In one of our
important applications of the Bayes factor, the dependency test of Ch 3.2.2, two
composite models are compared, and despite the fact that the simpler (coarser)
model (modeling independence) is a special case of the finer one, there is no easy
way to define a real valued measure which can be tested for zero. In this case the
Bayes factor approach is the only reasonable alternative with high credibility.
We have here compared two data generating mechanisms, out of a potentially
unbounded set of possibilities. We have simply assumed that the tosses are
independent and identically distributed in the sense that the success probability
is constant over the experiment. It is perfectly possible to assume that there is
autocorrelation in the experiment, or that the success probability drifts during
the experiment. In order to investigate these possibilities we need different data
generating models (and longer experiments, because there are more parameters
to consider). Once credible models of those aspects of an experiment we want
to consider are available, the analysis follows the same steps as those above.

Exercise 2 A cheap test is available for a serious disease. Assume that this
test is cheap in the sense of having 5% probability of giving negative result if
you have the disease, and 10% probability of giving positive result if you do not
have it. Moreover, in a population of individuals similar to you, 0.1% have the
disease. How would you compute your probability of having the disease when the
test gives
a) positive result?
b) negative result?
c) Can the precision be improved by repeating the test? What assumptions
are reasonable for answering this question?
d) Do there seem to be systematic errors in your analysis when applied in a
practical setting?

Exercise 3 +The assumption that an unfair coin has a uniform distribution


for p is not very convincing if the coin is physical and we have had a chance
to look at it without actually tossing it. Assume that we instead assign a prior
distribution for p by saying that the coin is expected to behave as if balanced
in a series of 2k trials. This can be formalized as stating the prior to be
proportional to p^k(1 − p)^k in the unbalanced case. How would such an
assumption change
the posterior and interpretation of the example (with three successes and nine
failures)? Find the posterior probability of p > 0.5 in the parameterized model
for the two cases uniform and Beta(k+1,k+1) distributed prior, for some suitable
values of k, like 5 and 50!

Exercise 4 Assume that we use equation (1) to choose between two models
stating respectively probability a and b for heads in a coin tossing series. Assume
also that the true probability of heads is c. For which values of c will the Bayes
factor in favor of a against b go to zero when the number of tosses increases?

2.1.11 Bayesian decision theory and parameter estimation


The posterior odds in equation (1) gives a numerical measure of belief in the
two hypotheses compared. Suppose our task is to decide by choosing one of
them. If the Bayes factor is greater than one, H1 is more likely than H2,
assuming no prior preference for either. But this does not necessarily mean
that H1 is true, since the data can be misleading by natural random
fluctuation. Two types of error are possible: choosing H1 when H2 is true,
and choosing H2 when H1 is true. Our choice must take the consequences of
the two types of error into account. If it is much worse to act on H1 when
H2 is true than vice versa, the choice of H1 must be penalized in some way,
whereas if the consequences of both error types are similar we should indeed
choose the alternative of largest posterior probability. Statistical
Bayesian decision theory resolves the problem by introducing a cost for each
pair of choice and true hypothesis. For finite hypothesis sets, the table of
these costs forms a matrix, the cost matrix, which is column indexed by true
hypothesis and row indexed by our choice. The recipe for choosing is to make
the choice with smallest expected cost[11]. This rule is applicable also
when simultaneously making many model comparisons. The term utility is
equally common in statistical decision making. It differs from cost only by
a sign, and we thus strive to maximize it. The general Bayesian decision
making paradigm can be captured in the prescription:

a* = argmax_{a∈A} ∫ u(a, λ)f(λ|D)dλ,

where u(a, λ) is the utility of executing action a from a set of actions A
in world state λ.

Figure 5: Pierre-Simon de Laplace was a court mathematician (and later
Napoleon's minister of the interior) who contributed to a great many
problems in mathematics, fluid mechanics and astronomy. He rediscovered
Bayes' method and developed it in a more mathematically mature way. He used
the same uniform prior for an unknown probability as Bayes did, and used it
to compute the probability that the sun will rise tomorrow, given that it
has risen N times without exception. The answer, (N + 1)/(N + 2), is
obtained by using the posterior mean value estimator, which in a discrete
setting is known as the Laplace estimator: for a discrete distribution over
d outcomes, where ni observations of outcome i were observed, the Laplace
estimate of the probability distribution (p1, . . . , pd) is
pi = (ni + 1)/(n + d), the relative frequencies after one more observation
of each outcome has been added. In this course we can also use Laplace's
parallel combination: this term is sometimes used to describe the
computation (5) for a finite Λ: take two vectors indexed by state, multiply
them component-wise and normalize. In Matlab:
tmp=likelihood.*prior; posterior=tmp/sum(tmp);.
When making inference for the parameter value of a parameterized model,
equation (3) gives only a distribution over the parameter value. If we want a
point estimate λ̂ of the parameter value, we should also use Bayesian decision
theory. We want to minimize the loss incurred by stating the estimate λ̂ when
the true value is λ. Let this loss be given by a loss function¹ L(λ̂, λ). But we
do not know λ, we only know its posterior distribution. As with a discrete set
of decision alternatives we minimize the expected loss over the posterior for
λ, ∫ L(λ̂, λ)f(λ|D)dλ. If λ is real valued and the loss function is the
squared error,
the optimal estimator is the mean of f (λ|D); if the loss is the absolute value of
the error, the optimal estimator is the median; with a discrete parameter space,
minimizing the probability of an error gives the Maximum A Posteriori (MAP)
estimate. When the parameter is continuous, MAP is the argument of highest
probability density for the parameter. The estimate is often called the mode of
the posterior distribution. Certainly, the probability of an error in this case is
usually one, but the estimate gets closer and closer to the mode if we postulate
a sequence of decreasing error bounds. As an example, when tossing a coin gives
s heads and f tails, the posterior with a uniform prior is f(p|s, f) = c·p^s(1 − p)^f,
the MAP estimate for p is the observed frequency s/(s + f ), the mean estimate
is the Laplace estimator (s+1)/(s+f +2) and the median is a fairly complicated
quantity expressible, when s and f are known, as the solution to an algebraic
equation of high degree. The Laplace estimator is often preferable to the simple
observed frequency estimator, see figure 5.
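
A small sketch comparing the three point estimates for the coin posterior
with s = 3 heads and f = 9 tails, reading the median off a gridded posterior
cdf:

    % MAP, posterior mean (Laplace) and median estimates for a Beta posterior.
    s = 3; f = 9;
    map_est  = s/(s+f);                   % mode of c*p^s*(1-p)^f
    mean_est = (s+1)/(s+f+2);             % Laplace estimator
    p = linspace(0.001, 0.999, 999);      % grid for the median
    post = p.^s .* (1-p).^f;
    post = post / sum(post);
    med_est = p(find(cumsum(post) >= 0.5, 1));   % posterior median
    disp([map_est mean_est med_est])
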
To see that minimizing expected squared error leads to the mean value
estimate, consider the squared error loss function L(λ, λ̂) = (λ − λ̂)².
Minimizing the expected loss wrt λ̂ leads to finding a zero of
d/dλ̂ ∫_Λ L(λ, λ̂)f(λ|x)dλ = −∫_Λ 2(λ − λ̂)f(λ|x)dλ = −2λ̄ + 2λ̂. The
solution is λ̂ = λ̄, and it is easily verified to give a minimum, since the
second derivative with respect to λ̂ is 2.
¹ Of course, it is not necessary to introduce both utility and loss since they
differ only in sign. But both are used in different communities.

Exercise 5 (+) (Sivia’s lighthouse example[95], modified) Light pulses are emit-
ted horizontally from a lighthouse at angles uniformly (and iid) distributed. The
lighthouse is placed at distance d from a straight coastline, and the normal from
the lighthouse hits the coastline at coordinate x0 . Consider only light rays that
actually hit the coastline (and forget those sent away from the coastline). So
the emitting angles (αi ) are uniformly distributed on [−π/2, π/2] and the points
where they hit the coastline have coordinates xi = d tan(αi ). (i) Derive the prob-
ability distribution of the points (xi ) where the coastline is hit, and verify that
it is in the family of Cauchy distributions.
(ii) Consider estimating x0 from the xi when d is known to be 1, using sample
mean, sample median and maximum likelihood estimators. Implement the esti-
mators and check them on samples of size 10, 100, 1000 and 10000. Some of
them seem to behave unexpectedly badly. Explain why, and assess the adequacy of
your estimators.
(iii) Similar problem, but design and analyze a maximum likelihood estimator
for d when x0 is known to be 0. Optional: what would mean and median esti-
mators look like? How good are they? (Hint: You will probably have to do this
numerically)

Exercise 6 Verify the optimal estimator of λ to be

a) (+) the median, when the loss function is L(λ, λ̂) = |λ − λ̂|;
b) (+) the mode, when the loss function is L(λ, λ̂) = −δ(λ − λ̂), where δ is
Dirac's delta function.
c) Which of the three estimators (mean, median, MAP) are scale-invariant
when applied to a real valued variable? (Scale invariance: if the variable is trans-
formed by u = g(x), and thus the distribution to f(x)/g′(x), then û = g(x̂),
where the hat denotes one of the three estimators. Mode: estimate with highest
probability or probability density.)

Exercise 7 Find the optimal estimator from the posterior when the estimated
quantity is a probability distribution and we know that this distribution will only
be used for maximum expected utility decision making.

2.1.12 *Bayesian analysis of cross-validation
The current state-of-the-art in Bayesian (and also in test-based) inference is that
non-parametric models are avoided because we do not fully understand them
when applied to complex multi-variate data sets. So we are coerced into using
models that are not universal, and then Bayesian inference gives us a posterior
for the parameters that may differ substantially from some ’true’ distribution
whose existence, but not form, we may postulate or which may become obvious
in the future. The Bayesian model selection framework we have earlier described
may be called the model-closed perspective. In that perspective we assume that
there is a true model somewhere in the set of models we consider. If we consider
the possibility of a true model which is not part of the set we analyze, we call it
the model-open perspective. A thorough analysis of this problem area is given in
[16, Ch 6.1]. We shall look at a selected part of this analysis, which is concerned
with analyzing an exchangeable sequence for the purpose of making a decision.
In [16, Ch 6.1] the authors also consider an in-between case, a model-completed
perspective, where a true model is known, but not in the set of models Mi . The
reason for this perspective may be that although we know the true model, its
form is too complicated to be used in decision making.
Suppose that we wish to consider a set of simple or composite models
{Mi }i∈I , having made a sequence of independent observations x = x1 , . . . , xn
from a process. The notation suggests that I can be a discrete set, an interval of
real numbers, a set of vectors in Euclidean space filling a part of it, or a more
complex object such as the set of all distributions, even if we would be hard
pressed to actually use the latter. We can handle both model selection (select
a value for i) and model averaging (select some probability distribution over I)
by, in the latter case, extending the model set to the set of weighted averages of
the original models in I, thus possibly getting a larger set of actual models than
the original {Mi}i∈I. Such a set is the convex closure of the original set. Suppose
we must now choose one i ∈ I and use it as a model, and then make a decision
based on i, which is followed by some type of payoff. This payoff can either be
based on the accuracy of a prediction of a previously unknown value, or on the
accuracy of a parameter estimate or predicted distribution. The situation can
be described with a utility function u(mi , ai , ω), where mi is a model selected,
ai is an action performed based on i, and ω is the response of nature - a future
observation or a distribution. The objective is to select mi and ai in a way
that optimizes the expected utility $\int u(m_i, a_i, \omega)\, f(\omega|x)\, d\omega$. However, when ai is selected from some set Ai
after mi has been chosen, we must assume that Mi is the true model. The
second selection is thus easy, $a_i = \arg\max_{a\in A_i} \int u(m_i, a, \omega)\, f_i(\omega|x)\, d\omega$, where
$f_i(\omega|x) = P(\omega|M_i, x)$.
In the case of simple models Mi , the function fi (ω|x) is independent of x.
A similar procedure solves the combined model selection and decision making
problem in the model-closed perspective:

$$(m_i, a_i) = \arg\max_{m^*_i \in M,\; a^*_i \in A_i} \int u(m^*_i, a^*_i, \omega)\, f_i(\omega)\, d\omega. \qquad (8)$$

This is little more than the formulation of inference as decision making we
saw in section 2.1.11. Another way to express the solution is that the expected
utility of choosing model mi is $\int u(m_i, a_i, \omega) f(\omega|x)\, d\omega$, where f(ω|x)
is the posterior of ω given the observed data x in the model-closed perspective.
In the model-open perspective we have no reason to put great confidence
in the posterior, and we must use some other method to estimate f(ω|x).
However, we can remove one item xj from x and use the rest of the sequence
x−j = (x1 , . . . , xj−1 , xj+1 , . . .) as a possible observation sequence and then use
xj as an approximation for ω. For a long sequence we can thus get many ex-
amples of an initial sequence followed by a new observation, and this yields a
possible method to select models outside the standard Bayesian framework.
As an example, with the decision problem of predicting the next observation
with quadratic loss, the state ω will be the next observation y, the actions ai will
be predicted values for y, and the expected utility will be $\int u(m_i, a_i, y) f(y|x)\, dy = -\int (y - a_i)^2 f(y|x)\, dy$.
By cross-validation we can estimate this quantity (remember,
we do not know f(y|x)) by $-\frac{1}{k}\sum_{j=1}^{k} (x_j - a_i)^2$. This is a kind of Monte
Carlo estimation of the utility of model i, and in Bayesian cross-validation one
tries a number of models and selects one with best estimated utility. For other
types of decisions, like the classical machine learning task of predicting a future
variable given partial information about it, the procedure can be easily adapted.
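A minimal sketch of this cross-validated utility estimate, comparing two simple predictors (the data and the candidate predictors are hypothetical):

% Leave-one-out estimate of negative quadratic loss for two predictors.
x = [2.1 1.9 2.4 8.0 2.2 1.8 2.3];     % hypothetical observed sequence
n = numel(x); u = zeros(2,1);
for j = 1:n
  xrest = x([1:j-1, j+1:n]);           % remove item x(j) from the sequence
  a = [mean(xrest); median(xrest)];    % the two candidate predictions of x(j)
  u = u - (x(j) - a).^2;               % accumulate negative quadratic loss
end
u = u/n                                % estimated expected utility per predictor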
We have not here gone into detail concerning reliability estimates. Such esti-
mates are rather infrequent in the literature, and one is usually content with
experimental validations of the method selected for a particular application;
the underlying problem is that the model-open perspective admits any
model as possible.

2.1.13 Biases and analyzing protocols


We will see in the chapter on inference by testing (frequentism), exemplified by
the coin-tossing example, that a frequentist analysis must consider the protocol
of an experiment, not just the outcome. However, there are also important
factors of experimental protocols that can influence both a Bayesian and a
frequentist analysis.
In investigations based on questionnaires, the detailed formulation of ques-
tions and the administration of the questionnaire have a significant effect on what
subjects answer. If such investigations touch very sensitive parts of society or
the individual, they have little reliability unless their protocols are very well
documented, designed and validated. When asked about ’antisocial’ and polit-
ically incorrect or even illegal habits, subjects’ actual trust in the anonymity
offered can be crucial, and work in unexpected ways. When asked about circum-
stances that can lead to changes in legislation or public surveillance, subjects
can also give the answer that tends to turn legislation their way. The intents of
the actors creep back into the protocol when they are regarded as subjects.
In medical ’meta-analyses’, many different investigations are pooled from
publications to give better statistical strength. Pitfalls in this might be that the
investigations are not quite comparable, and that investigations not leading to
significant conclusions might not even be published. Such investigations may
also be published in less prestigious ways and fail to be found by the meta-
analyst. This phenomenon is known as publication bias.
In clinical trials, where a new treatment is compared with the currently stan-
dard one, biases can be avoided by randomizing the two groups (i.e., patients
are not given a choice) and if possible hiding the choice even for health care
personnel involved. This can be shown to eliminate completely the commonly
observed biases like those described below in section 3.3.2, but there are still
possible problems in the procedure by which patients are included and excluded
from the study, particularly in that all subjects (participants) in a trial must
nowadays give informed consent. This can for example bias the population in a
study with potential connection to psychiatric conditions.

Exercise 8 In questionnaires with sensitive questions it is difficult to assess
to what extent the answers are true, maybe because subjects fear that their
anonymity will not be respected. Suggest some reasons why subjects may give de-
liberately wrong answers in investigations of their attitudes. One way to tackle
this is to ask subjects to give wrong answers with some probability (governed,
e.g., by throwing coins or dice while answering). Assume we want to assess the
frequency of some legally or morally inappropriate behaviour, and expect that a
'yes' answer is, with probability p0, not obtained when it should be, but 'no' answers
are correct. Assume also that we ask subjects to give the wrong answer with
probability p1. When does this procedure give more accurate estimates, assum-
ing the answers will be truthful in the second case? (Hint: use a MATLAB
simulation.)

Exercise 9 In a famous TV quiz show, the contestant has to guess in which of


three boxes a valuable prize can be found and obtained. After the contestant’s
first selection of a box, the host sometimes opens one unselected box and shows
that it is empty. The contestant is given the opportunity to switch his guess to
the unselected remaining box. It is a popular quiz in probability courses to prove
that this switch is indeed advantageous and the offer should be accepted.
Analyze and advise on the decision under the following assumptions:
i) The host is forced by the protocol to show one empty box and allow the
contestant to change selection.
ii) (+) The opening is voluntary for the host, and he will try to make you
fail.
iii) (+) The opening is voluntary for the host, and he will try to make you
win.
iv) (+) The opening is voluntary for the host, and you do not know the
desire, if any, of the host. (Hint: You should assume that the host is as clever as
you are. Is there a randomized optimal choice which performs well regardless
of the intent of the host?).

2.1.14 How to perform Bayesian inversion


The development of Bayesian analysis took off only when desk-top computing
power became available. Before computers were standard tools, Bayesian
analysis was completely dependent on formulating the inference problem using
equation (1), or (2) with a small number of alternatives, and various analytical
models that happened to be tractable because of conjugacy properties for the
equations (3), (5) and (6). We have already mentioned the Kalman filter, where
assuming all distributions Gaussian transforms the solution of (6) into a linear
algebra problem. Several such special cases exist, and the corresponding families
of functions with their properties are listed, e.g., in [16, App A] and [45]. In
our analysis of graphical models (also known as influence diagrams or Bayesian
Networks) we will thoroughly analyze one such example, the use of Dirichlet
distributions to make inference about discrete probability distributions. For
inference about low-dimensional parameter spaces, it is possible to discretize
parameter space and reduce the problems (3, 5, 6) to the discrete problem (2),
either using some sophistication with numerical analysis tools or just assuming
that both likelihoods and priors are piecewise constant within rectilinear boxes.
Then we work exclusively with probabilities that parameters fall into specific
boxes. We can thus work with distributions over discrete (but large) sets.
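A sketch of the box discretization for a one-dimensional parameter, here a success probability with hypothetical counts (the grid size and counts are mine):

% Discretized posterior for a success probability p over 100 boxes.
p = linspace(0.005, 0.995, 100);       % box midpoints covering (0,1)
prior = ones(size(p))/numel(p);        % piecewise constant (uniform) prior
s = 7; f = 5;                          % hypothetical counts
like = p.^s .* (1-p).^f;               % likelihood evaluated in each box
post = prior.*like; post = post/sum(post);  % discrete Bayes as in equation (2)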

Figure 6: Andrei Markov (1856-1922) was a Russian mathematician develop-


ing, among other things, the theory of Markov processes. Markov assumptions
are common assumptions in statistical modeling saying that a variable’s distri-
bution, conditional on the other variables' values, depends effectively only on a
subset of those variables. Examples of Markov assumptions are found in Markov
chains, where only the current state and not the full history of the process de-
cides the probability distribution of the next state, and in graphical models,
where the immediate ancestors (neighbors for undirected models) in a graph
give all available information (among ancestors in the case of directed models)
about a variable.

In general, however, the parameter and observation spaces of interest are


very high-dimensional (in genetic linkage studies and imaging, the parameter
dimension can be a couple of thousand and many millions, respectively). In these
cases Markov Chain Monte Carlo (MCMC) is the only presently feasible and
readily implementable technique (in certain key applications, however, other
very specialized numerical procedures have been implemented, as in military
tracking systems). The MCMC method assumes that a probability density
function on a continuous or discrete space is given in unnormalized form, π(x) =
cf (x), where we can evaluate f but not c or π. A Markov chain is constructed
in such a way that it has a unique stationary distribution which is π. This
means that a long chain x1 , . . . , xN can be used as an approximation to the
distribution, and the average of a function g(x) over π can be approximated
by the sum $\sum_{i=1}^{N} g(x_i)/N$. The details will be given in Ch. 4. The chain is
designed with a proposal distribution q(z|x) which is used to generate a next
proposed state z when the last state of the chain is x. So the chain x1 , . . . , xt is
extended by drawing a proposed new state z from the distribution q(z|xt ), and
this proposal is evaluated by computing the ratio

$$r = \frac{f(z)\, q(x_t|z)}{f(x_t)\, q(z|x_t)}. \qquad (9)$$
If r > 1, the proposal is accepted. Otherwise the proposal is accepted with
probability r, and rejected otherwise. If the proposal is accepted, xt+1 is set
to z, otherwise xt+1 is set to xt. This can be implemented by comparing a
standard uniformly distributed pseudorandom number u with r and accepting
the proposed state if r > u. Equation (9) shows that a move is preferred to the
extent that it would increase the probability of the state, but also if the proposal
distribution makes it easy to return to the old state. A proposal distribution
with q(z|x) = q(x|z) is called a symmetric proposal distribution. Symmetric
proposals are the most common, and they simplify (9) a lot.
We can illustrate the MCMC method with a small example. Assume a
distribution consisting of two different normal distributions, so that with prob-
ability about 0.5 each is selected as the distribution of a new item (data point).
Figure 7 shows a distribution with two high-probability regions (top frame).
Then three traces with different proposal distributions are shown. In the first
case with a small proposal step, the trace has difficulty switching between the
two high-probability regions, and therefore the estimate of their relative weights
will be of low accuracy. The second trace mixes well and gives an excellent esti-
mate. In the last trace with a large step size, most of the proposals are rejected
because they fall in low probability regions. Therefore the estimate becomes
inaccurate - the trace has too many repeated values and looks like an estimate
of a distribution over a finite set of values.
It is a significant problem in MCMC to determine the proposal distribution
so that good mixing is guaranteed. Often, experiments are required before a
good proposal distribution is found.
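As a sketch, a random-walk Metropolis sampler for a two-peak target like that in figure 7 (the proposal is a symmetric Gaussian step, so the q-factors in (9) cancel; step is the quantity one has to tune, and the numbers are illustrative):

% Random-walk Metropolis for a mixture of two normals.
np = @(x,m,s) exp(-0.5*((x-m)/s).^2);      % normal density up to a constant
f  = @(x) np(x,0.2,0.03) + np(x,0.7,0.03); % two-peak target, unnormalized
step = 0.1; N = 50000; x = zeros(N,1); x(1) = 0.2;
for t = 1:N-1
  z = x(t) + step*randn;                   % symmetric proposal
  if rand < f(z)/f(x(t))                   % accept with probability min(1,r)
    x(t+1) = z;
  else
    x(t+1) = x(t);
  end
end
mean(x > 0.45)                             % estimated weight of the right peak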
The variational Bayes method is gaining attention as an alternative to the
computationally intensive MCMC method. Instead of computing an intractable
posterior over many parameters, one tries to find factored posteriors where
each factor depends on only one or a few parameters[7].

Exercise 10 The file x33.mat contains three traces with different proposal dis-
tributions (similar to those in figure 7). Assume that the target distribution is
a mixture of two normal distributions. Estimate the mixing coefficients, mean
and variance for the two distributions using each of the three traces.


Figure 7: MCMC computation. From top: a) Density function is a mixture of


two normals of equal weight. b) MCMC trace with small proposal size - too
few jumps between peaks. c) Well chosen proposal. Mixes well. d) Too large
proposal step: most proposals rejected.

Figure 8: The MCMC method has been developed to solve dynamic inference
problems as defined by equation (6). This method is often called a particle filter.
In this example, from the Thesis of Hedvig Kjellström[94], a person walking in
a cluttered scene is tracked by a model built from cylinders. In each video
frame, several copies of the current best matching models are forwarded one
time step using the innovation process. Then all these forwarded models are
scored by matching their contours with edges in the next video frame, forming
a weighted sample. By resampling, a new unweighted sample is obtained, and
the tracking process starts anew. In the example, the innovation process was
not wide enough to prevent the subject from escaping the tracker. More on this
method in section 4.1.2.

2.2 Test based inference
It is completely feasible to reject the idea of subjective probability, in fact, this
is the more common approach in many areas of science. If a coin is balanced,
the probability of heads is 0.5, and in the long run it will land heads about half
the number of times it was tossed. However, it might not; even then one
can claim that the probability is still the same. In other words, we can claim
that the updating of prior to posterior through the likelihood has nothing to do
with the ’objective’ probability of the coin landing heads.

Figure 9: Antoine Augustin Cournot (1801–1877). He formulated 'Cournot's


bridge’ or Cournot’s principle as a link between mathematical probability and
the real world: An event, picked in advance, with very small or zero probability,
will not happen. If we observe such an event, we can conclude that one (or
more) of the assumptions made to find its probability is incorrect. In hypothesis
testing we apply Cournot’s bridge to a case where the null hypothesis is the only
questionable assumption we made in computing the p-value – thus we can reject
the null hypothesis if we observe a low p-value. The principle was considered of
fundamental importance by several pioneers of probability and inference, but it
does not figure prominently in current historic surveys. It was recently brought
to attention (for another purpose than ours) by Glenn Shafer[93].

The perceived irrelevance of long run properties of hypothesis probabili-


ties made one school of statistics reject subjective probability altogether. This
school works with what is usually known as objective probability or frequentist
statistics. Data is generated in repeatable experiments with a fixed distribu-
tion of the outcome. Since we cannot talk about the objective probability of a
hypothesis, we cannot use such probabilities to express our belief in them.

Figure 10: R A Fisher (1890–1962) was a leading statistician as well as geneticist.
In his paper Inverse probability[40], he rejected Bayesian analysis on grounds
of its dependency on priors and scaling. He launched an alternative concept,
'fiducial analysis'. Although this concept was not developed after Fisher's time,
possibly because of erroneous or at least highly questionable statements in his
second fiducial paper[41], the standard definition of confidence intervals (devel-
oped by J Neyman) has a similar flavor. The fiducial argument was apparently
the starting point for Dempster in developing evidence theory[32].

In the development of probability, many intriguing ideas emerged that are
now not visible in standard engineering statistics and probability; a very
readable account can be found in the recent book [93]. In particular, Cournot
hypothesized that the connection between statistics and the real world is that
an event (picked in advance) with probability zero will not occur. In the world
of finite time and resources, this translates to the thesis that an event (also
picked in advance of testing) with small probability will usually not happen.
This is the basis for the modern hypothesis-testing framework that is standard
in many applied disciplines, for example, in clinical testing.
One device used by a practitioner of objective probability is testing. For a
single hypothesis H, a test statistic is designed as a mapping t of the possible
outcomes to an ordered space, normally the real numbers. The data probability
function P (D|H) will now induce a distribution of the test statistic on the real
line. I continue by defining a rejection region, an interval with low probability,
typically 5% or 1%. Next the experiment is performed or the data D is obtained,
and if the test statistic t(D) falls in the rejection region, the hypothesis H is
rejected. For a parameterized hypothesis, rejection depends on the value of
the parameter. In objective probability inference about real valued parameters
we use the concept of a confidence interval, whose definition is unfortunately
rather awkward and is omitted. It is discussed in all elementary statistics texts,
but often the students leave introductory statistics courses believing that the
confidence interval is a credible interval. Fortunately, this does not matter
a lot, since numerically they are practically the same. The central idea in
testing is thus to reject models. If we have a problem of choosing between
two hypotheses, one is singled out as the tested hypothesis, the null hypothesis,
and this one is tested. The test statistic is chosen to make the probability of
rejecting the null hypothesis maximal if data is in fact obtained through the
alternative hypothesis. Unfortunately, there is no strong reason to accept the
null hypothesis just because it could not be rejected, and there is no strong
reason to accept the alternative just because the null was rejected. But this is
how testing is usually applied.
The p-value is an important concept in testing. This value is used when the
rejection region is an infinite interval on one side of the real line, say the left one.
The p-value is the probability of obtaining a test statistic not larger than the one
obtained, under the null hypothesis, so that a p-value less than 0.01 allows one
to reject the null hypothesis on the 1% level – the observed value is less than it
could plausibly be. We can define and compute p-values also for ordered sets of
rejection regions that are not half-infinite intervals on the real line, but caution
must be exercised. For the observation space D we must, before obtaining or at
least before looking at the data to be tested, define a dense or discrete ordered
set of subsets Rt of D such that Rs ⊂ Rt whenever t < s. The p-value for
an observation d with respect to the null hypothesis and the rejection sets is
now sup{P (d ∈ Rt ) : d ∈ Rt }. In other words, if we compute the probability
under the null hypothesis for the observation to fall in Rt for all t and call this
probability pt , then the p-value is the upper bound of pt for the sets Rt which
contain the observation.

2.2.1 A small hypothesis testing example
Let us analyze coin tossing again. We have again the two hypotheses Hf and
Hu. Choose Hf, the coin is fair, as the null hypothesis. Choose the number of
successes as test statistic. Under the null hypothesis we can easily compute the
p-value, the probability of obtaining nine or more failures with a fair coin tossed
twelve times, $\sum_{i=9}^{12} \binom{12}{i} 2^{-12} = .073$. This is 7.3%, so the experiment does not
allow us to reject fairness at the 5% level. On the other hand, if the testing plan
was to toss the coin until three successes have been seen, the p-value should
be computed as the probability of seeing nine or more failures before the third
success: $\sum_{i=9}^{\infty} \binom{i+2}{i} 2^{-(i+3)} = .0327$. Since this is about 3.3%, we can now reject the
fairness hypothesis at the 5% level. This is a feature of using test statistics: the
result depends not only on the choice of null hypothesis and significance level,
but also on the experimental design, i.e. on data we did not see but that could
have been seen.
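Both p-values are short MATLAB computations (the binomial tail for the fixed design, the negative binomial tail for the sequential one):

% Fixed design: probability of nine or more failures in twelve tosses.
i = 9:12;
p1 = sum(arrayfun(@(k) nchoosek(12,k), i))*2^-12              % = 0.0730
% Sequential design: nine or more failures before the third success.
i = 0:8;
p2 = 1 - sum(arrayfun(@(k) nchoosek(k+2,k), i).*2.^(-(i+3)))  % = 0.0327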
If we chose Hu as the null hypothesis we would put the rejection region in
the center, around f = s, since in this area Hu has smaller probability than Hf .
In many cases we will be able to reject either both of Hf and Hu , or neither of
them. This shows the importance of the choice of null hypothesis, and also a
certain amount of subjectivity in inference by testing.

2.2.2 Multiple testing considerations


Modern analyses of large data sets involve making many investigations, and
many tests if a testing approach is chosen. This means that some null hypotheses
will be rejected despite being ’true’. If we test 10000 null hypotheses on the 5%
level, 500 will be rejected on the average even if they are all true. Correcting for
multiple testing is necessary. This correction is an analog of the bias introduced
in Bayesian model choice when the two types of error have different cost.
There are basically two types of error control proposed for this type of anal-
ysis: In family-wise error control (FWE[57]), one estimates the probability of
finding at least one erroneous rejection if there is one, whereas in the recently
analyzed false discovery rate control (FDR[9]) one is concerned with the fraction
of the rejected hypotheses that is erroneously rejected.
A Bonferroni correction divides the desired significance, say 5%, with the
number of tests made. So in the example of 10000 tests, the rejection level should
be decreased from 5% to 0.0005%. Naturally, this means a very significant loss
in power, and most of the ’false’ null hypotheses will probably not be rejected.
It was shown by Hochberg[57] that one could equally well truncate the rejection
list at element k where $k = \max\{i : p_i \le q/(m-i+1)\}$, where $(p_i)_{i=1}^{m}$ is the increasing
list of p-values and q is the desired FWE rate.
A recent proposal is the control of the false discovery rate (FDR). Here we are
only concerned that the rate (fraction) of false rejections is below a given level.
If this rate is set to 5%, it means that of the rejected null hypotheses, on the
average no more than 5% are falsely rejected. It was shown in [9] that if the m
tests are independent, one should truncate the rejection list of m sorted p-values
at element k where k = max{i : pi ≤ qi/m}. The correct interpretation of the k
so found is: The expected number of the k tests giving p-values p1 , . . . , pk that
are falsely rejected is kq.
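On a sorted list of p-values, the three cut-offs discussed here are one line each; a sketch with hypothetical p-values:

% Multiple testing cut-offs on a sorted p-value list (hypothetical data).
p = sort(rand(1000,1).^2); m = numel(p); q = 0.05; i = (1:m)';
n_bonf = sum(p <= q/m);                    % Bonferroni rejection count
k_hoch = find(p <= q./(m-i+1), 1, 'last'); % Hochberg step-up (FWE)
k_bh   = find(p <= q*i/m, 1, 'last');      % Benjamini-Hochberg (FDR)
% reject the null hypotheses with the k_bh smallest p-values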
If we do not know how the tests are correlated, it was shown in [10] that the
cut-off value is safe for FDR control regardless of dependencies if it is changed
from $qi/m$ to $qi/(mH_m)$, where $H_m = \sum_{i=1}^{m} 1/i$ is the mth harmonic number.
A more surprising (and more difficult to prove) result is that, for many types
of positively correlated tests, the correction factor 1/Hm is not necessary [10].
People practicing FDR often seem to have concluded that the
correction factor is too drastic for all but the most exotic and unlikely cases.
This may yet turn out to be an example of wishful thinking, however, since it
is difficult to test.
These new developments in multiple testing analyses will have a significant
effect on the practice of testing. As an example, from the project described in
section 5.2, in a 4000-test example, the number of p-values below 5% was 350,
but only 9 would be accepted by the Bonferroni correction and 257 with FDR
controlled at 5%. This means that the scientists involved can analyze 257 effects
instead of 9, with a 5% risk that one of the 9 was wrong and only 13 of the 257
expected to be wrong.
FDR control is a rapidly developing technique, motivated by the need for
acceptable alternatives to the Bonferroni correction which has very little power
when many tests have been performed. Such applications are rapidly developing
in medical research, where fMRI (functional Magnetic Resonance Imaging, a
technique that shows brain activity by a contrast between oxygenated and de-
oxygenated blood which suggests energy consumption by neurons and their
appendages, dendrites and axons) and micro-array investigations (a technique
for genotyping and measuring gene activity in many points or genes on the
genome simultaneously) produce very large numbers of very weak signals.
It is possible to produce an even stronger multiple testing procedure which
only tells us that with high confidence there is at least one false null hypothesis.
Such a test would build on either the Kolmogorov-Smirnov test that the p-values
are uniformly distributed, or a graphical/visual test like that proposed to test
for constant intensity in section 3.1.5. The drawback would be that we then
know only that at least one of possibly quite many null hypotheses is false.
The identification of such a case could, however, motivate a larger study.

Exercise 11 (+) Five different gene variants were determined for a population
of 200 normal subjects and 300 subjects diagnosed with schizophrenia. Each
subject is classified with a genotype of five components, one for each gene, and
the component is 11, 12 or 22, showing which variant the subject has in his/her
two genes of this type. Using a standard test to decide if there is an associa-
tion between gene and diagnosis, five p-values were computed related to the null
hypothesis of no association (when the two subject classes have the same distri-
bution over gene variants). These were 1.1%, 1.1%, 2.3%, 2.4% and 20%. With
the Bonferroni correction it is not possible to reject any null hypothesis on the
5% level. Make an FDR analysis of the situation, finding a set of hypotheses of
which on the average 95% are true.

2.2.3 Finding p-values using Monte Carlo


In Statistics there is an overwhelming battery of test statistics produced for
various purposes, and each has its own method for determining p-values. We
will shortly describe a general method to compute an approximate p-value for
any real valued test statistic. It is applicable whenever we are able to generate
samples from the distribution taken as the null hypothesis. We start by gener-
ating a large sample of N points, and compute the test statistic ti for each point
i, i = 1, . . . , N , after renumbering the sequence so that it is increasing (this is
simply done by sorting the sequence). Now the value ti for the test statistic is
given the p-value i/N for a left rejection interval, and 1 − i/N for right-sided re-
jection. When the test statistic of the data obtained in an experiment has been
computed, its approximate p-value is found by interpolation in the sequence.
The statistical error in this procedure can be estimated, but typically a value of
N giving enough precision is easy to find. The use of this method is illustrated
in the next section.
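As a generic sketch (the function and handle names are mine; sampler draws one data set under the null hypothesis and tstat maps a data set to a real number):

function p = mcpvalue(tstat, sampler, tobs, N)
% Approximate left-tail p-value of the observed statistic tobs:
% the fraction of simulated null statistics at or below tobs.
t = zeros(N,1);
for i = 1:N
  t(i) = tstat(sampler());   % statistic of one data set drawn under the null
end
p = sum(t <= tobs)/N;        % use sum(t >= tobs)/N for a right-sided test

For the coin example above one could call, e.g., mcpvalue(@sum, @() rand(12,1) < 0.5, 3, 10000).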

2.3 Example: Mendel’s results on plant hybridization


Gregor Mendel[70] made important discoveries on inheritance of traits in plants.
In one of his experiments, he examined the self-hybridization of plants (actu-
ally a kind of peas) that were heterozygotes (hybrids) with respect to a trait
such as seed shape, seed color, stem length. He found that on average three
out of four offspring would get the dominant trait, which can be explained by
assuming a hybrid plant has one dominant and one recessive gene, denoted Aa,
and assuming that the genes are randomly selected for the offspring, each with
probability 1/2, giving equal probability to the four outcomes AA Aa aA and
aa, only the last giving offspring with the recessive trait. Mendel's statistics
were as summarized in the following MATLAB table:
N1=[5474,1850];%seed shape
N2=[6022,2001];%seed color
N3=[705,224];%seed-coat color
N4=[882,299];%pod shape
N5=[428,152];%pod color
N6=[651,207];%flower position
N7=[787,277];%stem length

When Fisher got hold of these results, he felt that the proportions were
closer to Mendel’s stipulated 3:1 law than one would expect if the process had
been a true Bernoulli process with success probability 3/4. Fisher used the
χ2 test to find that the square deviations were smaller than expected. Today
we can directly simulate Mendel’s stipulated process and check the variability
of the results. The necessary Matlab code to compute the p-values for a test
with rejection region around the 3:1 point can be written and tested out in no
time (figure 13). As test statistics we take the sums of the relative squared
deviations from the mean for each trait. In other words, consider the first
row N1=[5474,1850]; above, standing for an experiment where 7324 seeds ob-
tained by self-fertilization from a hybrid parent gave 5474 peas with dominant
trait and 1850 with the recessive. The mean, under Mendel’s law, is the ex-
act proportions 3:1, namely 5493 and 1831. The relative squared difference is
(1850 − 1831)2 /1831 + (5474 − 5493)2/5493, and this is the test statistic for the
first row. For a test on the whole 7-trait experiment, the 7 test statistics are
just added to get the test statistic for the whole experiment.

In figure 11 we get the distribution of the test statistic and the outcome for
the seven traits. In figure 12 we see that the joint outcome has a p-value of
0.043, a bit on the low side. But if two experiments with the same outcome are
added, or if the ’best’ of two experiments were chosen, the hypothetical p-value
would have been completely normal. Of course, it is somewhat questionable to
do this analysis at all, since Cournot’s principle only applies to events picked
in advance. So Fisher probably made the error of using the observations to
pick a set of rejection intervals containing zero, where he could equally well
have chosen intervals containing infinity or a sequence of intervals containing
the mode (the value where the pdf under the null hypothesis of success
probability 3/4 is largest). Doubling the counts as we did in the lower part of figure 12 was
only an illustration, of course. Doing this in a test where we want to strengthen
a real hypothesis would be completely forbidden. We are also not allowed to
repeat an experiment until we get data allowing rejection of the null hypothesis. If
we repeat any experiment 20 times there is a good chance (probability more
than 1/2) that at least one of them rejects a true null hypothesis at level 5%. When
our investigation forces us to perform many tests, we must use the theory of
multiple testing (section 2.2.2).

[Plot panels omitted; the per-trait p-values shown were 0.37, 0.091, 0.445, 0.181, 0.472, 0.424 and 0.557.]

Figure 11: Stochastic simulation of Mendel’s experiment, assuming that his 3:1
law holds. The p-values shown are for rejecting Mendel’s law because of too
little square sum deviation from the expectation values in each of the 7 trials.
These values do not support rejection.

Exercise 12 (+) A not so reputable person claims to have discovered an event


with exact probability 0.5, and he wants to support his claim with an experiment
where the event happened in exactly 40 of 80 occasions. Do you think he cheated?
By cheating we mean reporting a score too balanced even under the assumption
that the event has indeed probability 0.5. Quantify and motivate your judgment!
What if it happened in 4000 of 8000 occasions?

[Plot panels omitted; p = 0.043 for Mendel's actual counts and p = 0.242 for the doubled counts.]

Figure 12: Putting together the 7 trials in one gives us a small p-value, 0.043,
that allows rejection on the 5% level. To test the robustness of this conclusion,
we can make the same computation with doubled counts (a hypothetical, twice
as large, experiment). This gives a totally normal p-value.

2.4 Discussion: Bayes versus frequentism


Considering that both types of analysis are used heavily in practical and serious
applications and by the most competent analysts, it would be somewhat opti-
mistic to think that one of these approaches could be shown right and
the other wrong. Philosophically, Bayesianism has a strong normative claim
in the sense that every method that is not equivalent to Bayesianism can give
results that are irrational in some circumstances, for example if one insists that
inference should give a numerical measure of belief in hypotheses that can be
translated to fair betting odds[30, 89], or if one insists that this numerical mea-
sure be consistent under propositional logic operations[27]. Among stated prob-
lems with Bayesian analysis the most important is probably a non-robustness
sometimes observed with respect to choice of prior. This has been countered
by introduction of families of priors in robust Bayesian analysis[12]. Another is
the need for priors in general, because there are situations where it is claimed
that priors do not exist. However, this is a weaker criticism since the subjective
probability philosophy does not recognize probabilities as existing: they are as-
sessed, and can always be assessed[30, Preface]. Assessment of priors introduces
an aspect of non-objectivity in inference according to some critics. However,
objective probability should not be identified with objective science: good sci-
entific practice means that all assumptions made, like model choice, significance
levels, choice of experiment, as well as choice of priors, are openly described and
discussed.
One can also ask if Bayesian and testing based methods give the same result
in some sense. Under many different assumptions it can be proven that frequen-
tist confidence intervals and Bayesian credible intervals will, for many observations,
asymptotically converge.
% Mendel's data to table NM1
NM1=[5474,1850;6022,2001;705,224;882,299;428,152;651,207;787,277];
NN=1000;      % number of trials simulated
y=(1:NN)/NN;  % scale y-axis to p-value
% Simulation of the experiment and of a doubled experiment
for k=1:2
  NM=NM1*k;   % NM holds the observed counts or twice these counts
  chi2=zeros(size(NM,1),NN);  % allocate space for chi2-values
  chi2s=zeros(size(NM,1),1);
  for iN=1:size(NM,1)  % consider each trait tabled in row iN
    Ni=NM(iN,:);       % select the right row
    N=sum(Ni);         % total count for trait
    etr=0.75*N;        % expected number of plants/peas with dominant trait
    emiss=0.25*N;      % expected number of plants/peas with recessive trait
    R=rand(N,NN)<0.75;
    tr=sum(R);         % simulated result (dominant) for each sample
    miss=N-tr;         % and recessive
    % Compute chi2-values for the NN simulated trials
    chi2(iN,:)=(tr-etr).^2/etr+(miss-emiss).^2/emiss;
    % and for Mendel's table
    chi2s(iN)=(Ni(1)-etr).^2/etr+(Ni(2)-emiss).^2/emiss;
    if k==1
      % p-value: fraction of simulated trials with smaller deviation
      p=sum(chi2(iN,:)<chi2s(iN))/NN;
      % Plot the subtrials
      subplot(4,2,iN);
      ss=sort(chi2(iN,:));
      cut=round(0.8*NN); % cut tails
      plot(ss(1:cut)',y(1:cut),'b-', ...
        [chi2s(iN),chi2s(iN)]',[0,1]','r-');
      xlabel(['p=',num2str(p)]);
    end
  end;
  if k==1
    print figur1
  end;
  subplot(2,1,k);
  schi2=sum(chi2);
  schi2s=sum(chi2s);
  schi2=sort(schi2');
  p=sum(schi2<schi2s)/NN;
  plot(schi2,y,'b-',[schi2s,schi2s]',[0,1]','r-');
  xlabel(['p=',num2str(p)]);
end
print figure

Figure 13: MATLAB program for Monte Carlo simulation of p-values.

Figure 14: Gregor Mendel (1822-1884) was a monk and the first to discover
important statistical properties of inheritance in plants. He did not get a lot
of attention during his lifetime, but after three botanists independently redis-
covered his laws around 1900, Mendel became one of the best known amateur
scientists.


2.4.1 Model checking in Bayesian analysis


It is possible to combine the Bayesian and hypothesis-testing frameworks[45].
The parameter inference (3) gives a posterior over the parameter space, and a
reasonable interpretation is that data is generated by the predictive distribution
$P(D^*) = \int_\Lambda P(D^*|\lambda) f(\lambda|D)\, d\lambda$. Now select a test statistic t(D). By obtaining
the test statistic of many predictive data sets D∗ by this distribution we can
obtain an approximate distribution of the test statistic, and if the test statistic
of the real data D (from which the posterior was obtained) falls far out in the
tails of the approximate distribution it is an indication that the parameterized
model does not capture the data set obtained. It can be seen as a test of the
hypothesis that the data was actually generated by the parameterized model
used in the derivation of the posterior. The technique is suggested by Gelman
also for ’informal’, visual, testing. In figure 23 we can see an example of this
method applied.
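A sketch of such a check for the coin model (the counts and test statistic are illustrative; the posterior is Beta(s+1, f+1) as before):

% Posterior predictive check for a coin model.
s = 3; f = 9; n = s + f;
tobs = max(s,f);                        % test statistic: the larger count
trep = zeros(1000,1);
for i = 1:1000
  lam  = betaincinv(rand, s+1, f+1);    % draw a parameter from the posterior
  srep = sum(rand(n,1) < lam);          % replicate data set of the same size
  trep(i) = max(srep, n-srep);
end
ppred = mean(trep >= tobs)              % predictive p-value for the tail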

2.5 Evidence Theory and Dempster-Shafer structures


An important development in uncertainty management is the Dempster-Shafer
or evidence theory. Originally formulated by Dempster[32], the idea was to
handle the case where several sources of information having slightly different
frames of reference were to be combined. Later, it was popularized by Shafer,
who also decoupled the method from statistics. Philippe Smets[96] and many
others have developed the method, which now exists in quite many versions.
The basic concept in DS theory is the DS-structure, which is simply a prob-
ability distribution over the power-set 2Λ of the possible world set Λ, and with
zero probability for the empty set. When the possible world set is conveniently
modeled as a Euclidean space of some dimension, the power set is replaced
by the set of boxes defined by upper and lower coordinates in each coordinate
direction.
We will assume here that Λ is a finite set. The DS-structure is represented
as a function, often called a mass assignment, $m : 2^\Lambda \to [0,1]$, with $\sum_e m(e) = 1$
and $m(\emptyset) = 0$. Typically, many values of a mass assignment are zero, and the
subsets of Λ with a non-zero mass are called focal elements of the assignment.
Singleton subsets are sometimes called atoms. How should the DS-structure be
regarded as evidence? Even if it is often denied to be essential, we can note
that most tutorials in DS-theory start out with an example of assignment of
probabilities to events, where the set e ⊂ Λ represents the union of its members,
and m(e) is said to be the amount of belief/probability that can be attributed
to the event e but not to any of its proper subsets e′: e′ ⊂ e, e′ ≠ e. We can
think of the evidence as a set of possible probability distributions over Λ. Such
sets are called imprecise probabilities, defined by upper and lower envelopes. We
will call a set of pdfs obtainable from a DS-structure a capacity.
Example: In the very simple case of two states, Λ = {A, B}, the capacity
of the mass assignment m(A) = 0.8, m(B) = 0.1, m({A, B}) = 0.1 is the set of
pdfs giving A a probability in the interval between 0.8 and 0.9. For larger state
sets, the capacities can be described as polytopes, convex sets spanned by a
finite number of corner points.

Figure 15: Philippe Smets (1938-2005) developed the Transferable Belief Model,
based on Dempster-Shafer’s evidence theory. He makes a distinction between
credal belief which is best represented by DS structures and two such beliefs
are combined using Dempster’s rule; and pignistic belief (used in the context of
betting and decision making), represented by a probability distribution. From a
credal belief, a DS-structure, the corresponding pignistic belief is obtained with
the pignistic transformation.

In Dempster’s view, a DS-structure gives an upper and a lower bound for the
probability of each event e ⊂ Λ, called its belief (lower bound) and plausibility
(upper bound), respectively:
$$\sum_{e' \subseteq e} m(e') \;\le\; P(e) \;\le\; \sum_{\{e' : e \cap e' \ne \emptyset\}} m(e') \qquad (10)$$

In evidence theory, the specification of priors is usually ignored. The DS-


structure is thought of as evidence (prior, likelihood, or something else)
about the system state. Two central ideas are pignistic transformations which
take a DS-structure to a probability distribution over Λ, and combination rules,
which say how two evidences can be combined to form a new one. The pignistic
transformation estimates a precise probability distribution from a DS-structure
by distributing evenly the probability masses of non-singleton focal elements to
their singleton members. The relative plausibility transformation, on the other
hand, consists of the normalized plausibility values for the singletons of 2Λ . The
pignistic and relative plausibility transformations are given by

$$P(w) = \sum_{\{e : w \in e\}} m(e)/|e|, \quad \text{all } w \in \Lambda,$$

and

$$P(w) \propto \sum_{\{e : \{w\} \cap e \ne \emptyset\}} m(e), \quad w \in \Lambda,$$

respectively.
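Both transformations become short matrix expressions if the focal elements are encoded as 0/1 rows over Λ; a sketch using the two-state example above (the encoding is mine):

% Pignistic and relative plausibility transformations.
E = [1 0; 0 1; 1 1];            % focal elements {A}, {B}, {A,B} as 0/1 rows
m = [0.8; 0.1; 0.1];            % their masses
pig = (m./sum(E,2))'*E          % masses split evenly over members: [0.85 0.15]
rpl = m'*E; rpl = rpl/sum(rpl)  % normalized plausibilities: approx. [0.818 0.182]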
Dempster's combination rule combines two DS-structures into a new
one using a random set operation: the random set intersection of the operands
(regarded as random sets) conditioned on being non-empty. An alternative
combination rule, the MDS rule, was recently proposed by Fixsen and Mahler
[42].
Whereas Dempster’s combination rule can be expressed as

$$m_{DS}(e) \propto \sum_{e = e_1 \cap e_2} m_1(e_1)\, m_2(e_2), \quad e \ne \emptyset, \qquad (11)$$

the MDS rule is

$$m_{MDS}(e) \propto \sum_{e = e_1 \cap e_2} m_1(e_1)\, m_2(e_2)\, \frac{|e|}{|e_1||e_2|}, \quad e \ne \emptyset. \qquad (12)$$
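With the same 0/1 encoding of focal elements, Dempster's rule (11) is a double loop over the operands; a sketch (the function name and encoding are mine):

function [Eo,mo] = dscombine(E1,m1,E2,m2)
% Dempster's rule: intersect focal elements, discard the empty set, renormalize.
Eo = zeros(0,size(E1,2)); mo = zeros(0,1);
for i = 1:numel(m1)
  for j = 1:numel(m2)
    e = min(E1(i,:),E2(j,:));           % intersection of the two subsets
    if any(e)                           % condition on being non-empty
      [hit,k] = ismember(e,Eo,'rows');
      if hit, mo(k) = mo(k) + m1(i)*m2(j);
      else    Eo = [Eo; e]; mo = [mo; m1(i)*m2(j)];
      end
    end
  end
end
mo = mo/sum(mo);                        % renormalize as in (11)

For the MDS rule (12), each product would additionally be weighted by the factor sum(e)/(sum(E1(i,:))*sum(E2(j,:))) before the renormalization.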

It is illuminating to see how the pignistic and relative plausibility transfor-


mations emerge from a precise Bayesian inference: The observation space can in
this case be considered to be 2Λ , since this represents the only distinction among
observation sets surviving from the likelihoods. The likelihood will be a function
l : 2Λ × Λ → [0, 1], the probability of seeing evidence e ⊂ Λ given world state
λ ∈ Λ. Given a precise e ∈ 2Λ as observation and a uniform prior, the inference
over Λ would be f (λ|e) ∝ l(e, λ), but since we in this case have a probability
distribution over the observation space, we should use equation (4), weighting
the likelihoods by the masses of the DS-structures. Applying the indifference
principle, l(e, λ) should be constant for λ varying over the members of e, for
each e. The other likelihood values (λ ∉ e) will be zero. Two natural choices
of likelihood are l1 (e, λ) ∝ 1 and l2 (e, λ) ∝ 1/|e|, for λ ∈ e. Amazingly, these
two choices lead to the relative plausibility transformation and to the pignistic
transformation, respectively:

$$f_i(\lambda|m) \propto \sum_{\{e : \lambda \in e\}} m(e)\, l_i(e, \lambda) \qquad (13)$$
$$= \begin{cases} \sum_{\{e : \lambda \in e\}} m(e) \,\big/\, \sum_{e} |e|\, m(e), & i = 1 \\ \sum_{\{e : \lambda \in e\}} m(e)/|e|, & i = 2 \end{cases}$$

It is also possible to combine two pieces of fuzzy (DS) evidence in the form of
two DS-structures m1 and m2. We face the task of combining the two likelihoods
$\sum_e m_1(e) l(e,\lambda)$ and $\sum_e m_2(e) l(e,\lambda)$ using Laplace's parallel composition as in
equation (4) over Λ, giving

$$f(\lambda) \propto \sum_{e_1, e_2} m_1(e_1)\, m_2(e_2)\, l_i(e_1, \lambda)\, l_i(e_2, \lambda).$$

For the choice i = 1, this gives the relative plausibility of the result of fusing
the evidences with Dempster's rule; for the likelihood l2 associated with the
pignistic transformation, we get $\sum_{e_1,e_2 : \lambda \in e_1 \cap e_2} m_1(e_1)\, m_2(e_2)/(|e_1||e_2|)$. This is
the pignistic transformation of the result of combining m1 and m2 using the
MDS rule. There is some current debate on which estimation (pignistic or
relative plausibility) and combination (Dempster's DS or Fixsen/Mahler's MDS)
operations should be used. I hope the above derivations show convincingly
that the choice of such operators depends on the statistical model chosen for
the application. In other words, none of the possible choices is intrinsically
correct.
For a discussion of the relationships between standard and robust Bayesian
analysis and evidence theory, see, e.g., [6]. An alternative interpretation of
DS-structures exists, and was first argued for in [91].

Exercise 13 The US Air Force has several target classification systems that
give their output as a DS-structure. One such system outputs at some time its
belief that an incoming target is either an Attack (Fighter), Bomber or Civilian
aircraft. The parameter set is thus {A, B, C}. The output on one occasion is:
m(A) = 0.2, m(B) = 0.050, m(C) = 0.083, m({A, C}) = 0.022, m({B, C}) =
0.534, m({A, B, C}) = 0.111.
Represent the capacity corresponding to m as a polygon in the plane contain-
ing possible combinations of P (A) and P (B).

2.6 Estimating a distribution, Decision and Maximum Entropy
In robust Bayesian analysis one considers convex sets of probability distribu-
tions like the capacities of DS-structures. For decision making one uses either
expected utility maximax or maximin criteria, or estimates a precise probability
distribution to decide from. Examples of the latter are the pignistic and relative
plausibility transformations. An example of a decision-theoretically motivated
estimate is the maximum entropy estimate, often used in robust probability ap-
plications [61]. This choice can be given a decision-theoretic motivation since
it minimizes a game-theoretic loss function, and can also be generalized to a
range of loss functions [53]. Specifically, a decision maker must select a dis-
tribution q while Nature selects a distribution p from a convex set Γ.
Figure 16: Ed Jaynes (1922-1998) championed the Maximum Entropy method,
originally in physics and then in other application areas.

Nature selects an outcome x according to its chosen distribution p, and the decision
maker's loss is − log q(x). This makes the decision maker's expected loss equal to
$E_p\{-\log q(X)\}$. The minimum (over q) of the maximum (over p) expected loss
is then obtained when q is chosen to be the maximum entropy distribution in Γ.
It is thus, if this loss function is accepted, optimal to use the maximum entropy
transformation for decision making. An intuitive explanation of this lies in the
structure of the payoff, − log q(x). It is thus bad if we choose a q giving low
probability (because then the negative logarithm is large) to the x chosen by Nature.
But x can only be chosen if Nature selects a p giving it a large probability.
Consequently, we should avoid giving x a small probability if Nature can give it
a large one. Putting the result in context, the MaxEnt estimate is appropriate
when we can assume that Nature knows which distribution q the decision maker
chose when it selects its distribution p. Of course, this is a pessimistic case in
general, and most likely there are many cases where a better strategy can be
found, maybe the pignistic estimate, or the center of smallest enclosing sphere
of the polygon.
The maximum entropy principle differs significantly from the relative plau-
sibility and pignistic transformations, since it tends to select a point on the
boundary of a set of distributions (if the set does not contain the uniform dis-
tribution), whereas the pignistic transformation selects an interior point.
The pignistic and relative plausibility transformations are linear estimators,
by which we mean that they are obtained by normalization of a linear function
of the masses in the DS-structure. If we buy the concept of a DS-structure as a
set of possible probability distributions, it would be natural to require that as
estimate we choose a possible distribution, and then the pignistic transformation
of Smets gets the edge – it is not difficult to prove the following:
Proposition 1 The pignistic transformation is the only linear estimator of a
probability distribution from a DS-structure that is symmetric over Λ and always
returns a distribution in the capacity represented by the DS-structure.
Although we have no theorem to this effect, it seems as if the pignistic
transformation is also a reasonable decision-oriented estimator approximately
minimizing the maximum Euclidean norm of difference between the chosen dis-
tribution and the possible distributions, and better than the relative plausibility
transformation as well as the maximum entropy estimate for this objective func-
tion. The estimator minimizing this maximum norm is the center of the smallest
enclosing sphere. It will not be linear in m, but can be computed with some
effort using methods presented, e.g., in [44]. The centroid is sometimes proposed
as an estimator, but it does not correspond exactly to any known robust loss
function – it is rather based on the assumption that the probability vector is
uniformly distributed over the imprecision polytope.
The standard expected utility decision rule in precise probability translates
in imprecise probability to producing an expected utility interval for each de-
cision alternative, the utility of an action a being given by the interval
$I_a = \bigcup_{f \in F} \int u(a,\lambda) f(\lambda|x)\, d\lambda$. In a refinement proposed by Voorbraak [104],
decision alternatives are compared for each pdf in the set of possible pdfs:
$I_a^f = \int u(a,\lambda) f(\lambda|x)\, d\lambda$, for $f \in F$. Decision a is now better than decision b if
$I_a^f > I_b^f$ for all $f \in F$.
Some decision alternatives will drop out as unfavorable because they are
dominated in utility by others, but in general several possible decisions with
overlapping utility intervals will remain. In principle, if no more information
exists, any of these decisions can be considered right. But they are characterized
by larger or smaller risk and opportunity.

Exercise 14 (+) Consider again the USAF classifier of the previous exercise
and the mass assignment given there.
i) Find the relative plausibility, pignistic and Maximum Entropy estimates
for the target class from m, as three probability distributions over the parameter
set.
ii) Draw the points from i) in the polygon from the previous exercise. Con-
clusion or comment?

2.7 Uncertainty Management


Interpretation of observations is fundamental for many engineering applications,
and is studied under the heading of uncertainty management. Designers have
often found statistical methods unsatisfactory for such applications, and in-
vented a considerable battery of alternative methods claimed to be better in
some or all applications. This has caused significant problems in applications
like tracking in command and control, where different tracking systems with
different types of uncertainty management cannot easily be integrated to make
optimal use of the available plots and bearings. Among alternative uncertainty
management methods are Dempster-Shafer theory[91] and many types of non-
monotonic reasoning. These methods can, to some extent, be interpreted as
robust Bayesian analysis, where the analyst need not give precise priors and
likelihoods but can specify convex sets of priors and likelihoods[12, 3], and as
Bayesian analysis with infinitesimal probabilities, respectively[108, 8]. We have
also generalized the analyses by de Finetti, Savage and Cox, showing that under
slightly weaker assumptions than theirs, uncertainty management where belief
is expressed with families of probability distributions that can contain infinites-
imal probabilities is the most general method satisfying compelling criteria on
rationality[5, 4]. Other alternative methods, like fuzzy logic and rough set the-
ory, neural networks and case-based reasoning should rather be seen as a search
for more useful model families than those used traditionally in statistics. These
methods can in principle be described in the Bayesian framework[110, 76, 83, 59],
although frequently one makes heuristic validation tests rather than statistical
ones. Moreover, since these methods were also developed with the intention of
avoiding statistics, there is a certain reluctance to start viewing them as statis-
tical models, besides the feeling that such analyses might be infeasible. A few
statistical methods can be applied even to rather unorthodox statistical models.
If my favorite method tells me that there is a certain structure in the data,
how can I verify this? The standard way is to obtain a new set of observations
and see if the new set shows the same structure. But new observations are in
many cases unobtainable for cost or time reasons. By resampling techniques
new observation sets can be created, by taking a part of the original data set. If
the same structure can be seen in all or most parts of the original data set, this
gives confidence that the structure seen is not only random noise. Likewise, a
method's tendency to see things in data that have no real meaning can be checked
by testing it on random data. By taking a real data set consisting of a number
of cases, each with a set of variables, and randomly shuffling each variable among
the cases, one gets a data set where no serious method should find significant
structure. Likewise, complex prediction methods can be tested by randomly
partitioning the available data set into one training set from which the predic-
tor is obtained and another one from which its performance is estimated. These
methods should also be used when analyzing data using conventional statisti-
cal methods, since they provide good tests against programming and handling
errors.
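The column-shuffling check just described is easy to set up. The following is a minimal Python sketch (our own illustration, assuming numpy is available; my_structure_score is a hypothetical stand-in for whatever structure-finding method is being validated):

    import numpy as np

    def shuffled_copy(X, rng):
        # Shuffle each column (variable) independently among the cases,
        # destroying dependence between variables while keeping each
        # marginal distribution intact.
        Xs = X.copy()
        for j in range(Xs.shape[1]):
            rng.shuffle(Xs[:, j])
        return Xs

    def shuffle_check(X, my_structure_score, n_rounds=100, seed=0):
        # Fraction of shuffled data sets scoring at least as high as the
        # real data: a crude permutation p-value for the found structure.
        rng = np.random.default_rng(seed)
        real = my_structure_score(X)
        hits = sum(my_structure_score(shuffled_copy(X, rng)) >= real
                   for _ in range(n_rounds))
        return (hits + 1) / (n_rounds + 1)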

2.8 Beyond Bayes: PAC-learning and SVM


Whereas Bayesian analysis is based on the assumption that priors and likeli-
hoods are precisely known, there is an interesting approach that is probabilistic
although it makes no assumptions on the probability distributions involved.
This paradigm was originated by Chervonenkis and Vapnik[103], and was re-
cently developed in computational learning theory [101] and even more recently
in the Support Vector Machine (SVM) method[28].
How can a method be probabilistic and at the same time be independent
of the actual probability distributions involved? First of all, we consider only
the problem of predicting an unknown quantity y from a known quantity x,
and using a set of training samples (xi , yi )i∈I . Here the xi are vectors with real
valued components drawn from a feature space Rd , and the yi are either binary
class indicators (the classification problem) or real numbers (for a regression
problem). In the first case we want to use the training set to predict the class
y (−1 or 1) of a new feature vector x; in the second case we want a real
valued prediction y from x.
If more than two classes are to be distinguished, or if more than one real
number is to be predicted, there are a number of ways to organize this using the
basic binary classification or single variable prediction method we will describe.
In order to obtain distribution independence we must inevitably assume that
there is some type of link between the training sample and the quantities that we
try to predict. This assumption is that the training sample has been generated
by the same probability distribution as those examples that we want to predict.
Such a distribution can be described as a joint probability distribution p(x, y) or
as a family of conditional probability distributions p(y|x) and a distribution over
the p(x). If we ignore the latter, we run the risk of having a training sample that
is unrepresentative for future use, if we ignore the former we run the risk that the
relation between x and y is different during training from what it is during use of
the predictor. Unfortunately, even when the assumption of common distribution
during training and testing is fulfilled, we run the risk of getting a bad predictor
because the training sample was accidentally unrepresentative. An assumption
in PAC and other machine learning is exchangeability. Since applications usually
take examples in a time sequence, this assumption is seldom strictly fulfilled (see
Hand [55]).
With the concept of PAC-learning, we consider a particular classifier (re-
stricting ourselves temporarily to the binary classification case) C(x) : x →
{−1, 1}. The classifier error is said to be the probability that C(x) is different
from y when (x, y) is drawn according to the distribution p(x, y). The empirical
error is the fraction of sample points where C(xi) ≠ yi. If the classifier is
constructed from a random sample (xi, yi)i∈I drawn according to p(x, y) and
the probability is less than δ of obtaining a classifier error larger than ε, then
we say that we can obtain a (δ, ε) classifier for the distribution p. If there is an
ε = ε(n, δ) such that this is true regardless of the distribution p(x, y), then we have
a PAC bound for the classification problem. This formulation has the flavor of
statistical testing in the sense that it asserts that the probability of obtaining
misleading data is small, regardless of the underlying distribution. On the neg-
ative side, it is obvious that bounds obtained over any distribution will usually
be pessimistic compared to cases where some knowledge about the plausibility
of possible distributions is encoded into a Bayesian analysis.
The basic form of the classifier in SVM stems from Rosenblatt’s perceptron
[86]. This was a hardware device taking a number of real valued inputs with
correct binary classification as training data, used to set weights that then linearly
separate further real-valued inputs. Let the input be a real valued vector x =
(x1, . . . , xn) and the class y be −1 or +1. The classifier decides the class of x by
the rule C(x) = sign(x · w + b), where the dot · is the scalar product and the vector w
and scalar b are the weights of the perceptron. A good deal of research has gone
into finding good ways to set the weights using repeated scanning of the training
set. However, if the classes of negative and positive examples can be linearly
separated, finding a separating hyperplane is equivalent to finding a feasible
solution of a linear program and is in principle easy[23]. A key empirical finding
has been that the classifier gets better if the hyperplane is placed at maximum
distance from all training points, giving a wide margin classifier. Such a classifier
can be found using a further development of Lagrange’s method with multipliers,
developed by Karush, Kuhn and Tucker.
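To make the weight-setting concrete, here is a minimal sketch of Rosenblatt's update rule (our own illustration of the plain perceptron, not the wide margin version; it terminates only if the classes are linearly separable):

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        # On each misclassified example, move the weights toward it.
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(max_epochs):
            errors = 0
            for xi, yi in zip(X, y):
                if yi * (xi @ w + b) <= 0:
                    w += yi * xi
                    b += yi
                    errors += 1
            if errors == 0:    # all training points correctly classified
                break
        return w, b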
A similar technique can be used for the regression problem. Here we want
to predict y by C(x) = x · w + b, and the idea is to choose the weights such
that the maximum error |y − C(x)| is minimized. This method is similar to the
Adaline, an early neural network due to Widrow and Hoff[107].
Dynamic prediction problems can be handled with the SVM regression ap-
proach. If we have a time series (xi , yi ), trying to predict yi from the lagged
vector (xi−k , . . . , xi−1 ) for a suitable choice of k is a standard way to solve
the dynamic prediction problem. It is here important to choose the time step
(by sub-sampling the original time series) and lag vector dimension (the value
of k) to obtain good performance. In practice the most critical aspect is the
stationarity (time-independent dynamics) of the analyzed system.
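Constructing the lagged training set is mechanical; a small sketch (names are our own):

    import numpy as np

    def lagged_design(series, k):
        # x_i = (y_{i-k}, ..., y_{i-1}) is used to predict t_i = y_i.
        y = np.asarray(series, dtype=float)
        X = np.stack([y[i - k:i] for i in range(k, len(y))])
        t = y[k:]
        return X, t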
The perceptron and the Adaline were designed as learning machines with
a built-in assumption of linearity. This was heavily criticized by Minsky and
Papert[71], whose book on the perceptron more or less blocked further research
on neural networks for a long time.
The SVM builds on the concept of the Perceptron and Adaline. However,
the specification of a wide-margin separator or smallest error hyperplane means
that the PAC learning analysis can be performed, resulting in distribution-
independent error bounds. Specifically, Cristianini and Shawe-Taylor[28, Ch
4] prove a large number of PAC bounds. The derivation of these bounds is
somewhat lengthy; we will just give an example to show how they usually appear:

Proposition 2 Given a set of n examples with binary classification, such that
the support of the distribution of x lies in a ball of radius R. Fix γ. With
probability 1 − δ, any two parallel hyperplanes 2γ apart and separating the positive
and negative examples (thus with margin γ) have error no more than

ε(n, δ, γ) = (2/n)((64R^2/γ^2) log(enγ/(8R^2)) log(32n/γ^2) + log(4/δ)),   (14)

provided n > 2/ε and 64R^2/γ^2 < n.

Similarly for regression (where y is a real number), if all training points lie
within two parallel hyperplanes 2(Θ − γ) apart, the corresponding predictor has
residual greater than Θ with probability at most ε(n, δ, γ) as defined in (14),
again provided n > 2/ε and 64R^2/γ^2 < n.
As is obvious, the common distribution p(x, y) may well prevent accurate
prediction, particularly if it factors into a distribution for x and another for y,
p(x, y) = p(x)p(y). It is thus important in the formulation above that in such
cases we are unlikely to obtain a single classifier or regressor with the required
performance on the training set. We are also not allowed to create samples
repeatedly until we find one on which a good predictor or classifier exists. It
is also important to note that the margin γ in the bound is fixed before we have seen the
examples, although there are ways around this. Classification is often useful
even on distributions where we are unlikely to obtain a separable training set
because the real distribution of examples contains some outliers. With few outliers we can
still find (ε, δ) bounds, and a number of related bounds are given in [28, Ch 4].
An interesting property of the bound (14) is that it is independent of the
dimension (number of components) of x. In the perceptron discussion [71], it was
soon noted that many important examples exist where a nonlinear mapping to a
higher dimensional space makes classes linearly separable which are not linearly
separable in the original space. As an example, if x is two-dimensional, the
map Φ : (x1, x2) → (x1, x2, x1^2, x1x2, x2^2) has the interesting property that linear
separators in the target space R5 correspond to conic section separators in the
original space R2 . This can be seen by considering the R5 hyperplane defined
by w = (w1 , w2 , w3 , w4 , w5 ) and b. In R2 the R5 plane w · x + b = 0 is inverse
mapped to:

w1x1 + w2x2 + w3x1^2 + w4x1x2 + w5x2^2 + b = 0,
which is the general equation for a (possibly degenerate) conic section. Conic
sections are, e.g., ellipses, hyperbolas and parabolas, but also two parallel lines
(separating the set between the lines from those outside the lines on both sides)
and two crossing lines (separating the sections opposite each other from the other
two opposite sections). The insensitivity of the bound (14) to the dimension of
the feature space means that we can expect good generalization performance
even after mapping the original examples to a high- or even infinite-dimensional
space. This is accomplished in a general fashion with the Kernel trick (next
section).
We can illustrate the SVM with a few examples visualizable in 2D. For
the classification problem we can see in Figure 17 how positive and negative
examples were uniformly generated with the red line demarcating the class.
The wide margin classifier decides using the slightly different line between the
two blue lines. In the generic 2D case, one blue line will usually be fixed by
two support points, the other by one (case (a)). But it is also possible to have
one support point on each side, in which case the blue lines are perpendicular
to their connecting line (case (b)). In higher dimensions the picture is similar,
but there are more cases for how the two sides of the margin are determined
by support points. A midpoint and the normal vector are obtained from the
multipliers of the support points.
In the regression case, xi is 1D and extended with yi to get a 2D zi (Figure
18). Here the examples were generated around the red dotted line and the SVM
finds the smallest enclosing parallel lines. Here we also have two generic cases
for how support points interact with the narrowest possible 'corridor', and more
in higher dimensions.
It is of course somewhat risky to demand full separation of classes and full
compliance with the ’corridor’, given the unavoidable prevalence of measurement
and classification errors in real-world data sets. In many applications one uses
’soft margins’ and ’soft corridors’. The detailed usage of this facility is explained
in most SVM software packages.
We will not go into the detailed algorithm for finding wide margin separators
and narrow margin approximators, for details see [28, Ch 5]. It is however quite
important to understand one feature of the SVM method, namely the support
vectors and corresponding Lagrange multipliers:
Consider the primal optimization problem involving functions f , gi , with
i = 1, . . . , k defined on n-dimensional space:

minimise f (w), w ∈ Rn ,
subject to gi (w) ≤ 0, i = 1, . . . , k,

For both the classification and the regression version we ask for two parallel
hyperplanes with extremal (maximum or minimum) distance that fit to data
samples. In the classification version we work with a space having the dimension
d of the feature space given by the xi , whereas in regression we work with
dimension d + 1 given by the xi extended by corresponding yi . We call these
extended vectors zi . A hyperplane is defined by {x : w · x + b = 0}, and the pair
of hyperplanes we are looking for will be

{x+ : w · x+ + b = +1}, {x− : w · x− + b = −1},

For the classification problem where |yi | = 1 we ask for hyperplanes where
positive examples are above the first one, and negative below the second one.
This is obtained with the constraints yi(w · xi + b) ≥ 1. Since we want to
maximize the distance 2/||w|| between the two parallel hyperplanes, we
minimize ||w||^2 = w · w.
For the regression problem, yi is the last dimension of zi = (xi, yi) ∈ R^{d+1};
we want all points to lie between two hyperplanes that themselves are placed
as close to each other as possible. The constraints are

w · zi + b ≤ +1,
w · zi + b ≥ −1.

We want the borders as close to each other as possible, so here we maximize
||w||^2 = w · w (the distance between the two hyperplanes is 2/||w||).


Figure 17: A sample of points distributed uniformly in the shown square. Those
above the red line are positive. Blue lines indicate the margin of the wide
margin classifier on these examples. With growing training set, the blue lines
will approach the red one. (a): three generic support points; (b): two generic
support points.

A development of Lagrange's fundamental method, due to Kuhn and Tucker [63],
is based on:

Proposition 3 A necessary condition for a point w* to be an extremal point
of f(w) subject to constraints gi(w) ≤ 0, i = 1, . . . , m, where f is convex
and the gi are linear (affine), is

∂L(w*, α*)/∂w = 0,
αi* gi(w*) = 0, i = 1, . . . , m,
αi* ≥ 0, i = 1, . . . , m,

where

L(w, α) = f(w) + Σ_{i=1}^m αi gi(w).

In the SVM (classification) problem, the Lagrangian is

L(w, b, α) = (w · w)/2 − Σ_i αi (yi (w · xi + b) − 1).

We look for the optimum, which is a solution to

0 = ∂L(w*, b, α*)/∂w = w − Σ_i αi yi xi,
0 = ∂L(w*, b, α*)/∂b = Σ_i αi yi.

We can now eliminate w and b from the Lagrangian and express it in terms of
the αi and inputs. The optimal solution is obtained from the following quadratic
programming problem:
Maximize

L = Σ_i αi − (1/2) Σ_{i,j} yi yj αi αj (xi · xj)

under the constraints

Σ_i yi αi = 0,
αi ≥ 0.

The quadratic programming problem has been extensively studied, and effi-
cient codes form the computational engine of most SVM packages.
Three fundamental observations: (i) The formulation involves the examples
xi only in the form of scalar products, and this is essential for the ’kernel trick’
to go through; (ii) the resulting multipliers αi are nonzero only for support
vectors, i.e., points lying on one of the two hyperplanes; (iii) At the stationary
point, L(w∗ , α∗ ) = f (w∗ ), since for each i either αi or gi (w∗ ) is zero. Once the
optimal w and b are obtained from the αi , a new example x will be classified

as C(x) = sign(w · x + b), and a new regression example will be predicted as
C(x) = y such that w · (x, y) + b = 0, or, writing w = (w_{−(d+1)}, w_{d+1})
with w_{−(d+1)} = (w1, . . . , wd),

y = −(w_{−(d+1)} · x + b)/w_{d+1}.
Lagrange’s method was developed for analytical mechanics, and there the
multipliers represent forces between bodies that are in contact (the force is zero
if the constraint is not tightly satisfied), and thus it is natural to regard the
support vectors as outliers and the αi as a non-conformance measure: the higher
the value of αi, the greater the non-conformance. This is an essential property for
the Vovk/Gammerman hedged prediction scheme (section 2.9).
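For concreteness, a hedged sketch using scikit-learn (an assumption on our part; any SVM package exposes similar quantities) of how the support vectors and multipliers come out of a fitted linear SVM:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.uniform(size=(200, 2))
    y = np.where(X[:, 1] > X[:, 0], 1, -1)       # linearly separable classes

    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin
    alphas = np.abs(clf.dual_coef_[0])           # multipliers for the support vectors
    print(len(clf.support_), "support vectors")
    print(clf.coef_[0], clf.intercept_[0])       # w and b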
It is possible to modify the method using soft constraints, by also stating
maximum values for the αi . Then the solution will admit a limited number
of outliers in the training sample that may get incorrect classifications. This
modification is available as an option in most SVM program packages.
The stringency and elegance of the distribution-independent framework have
perhaps promoted its use more than is justified. It is not difficult to guess that the
bounds obtained will in many cases be quite large, even larger than one (which
is rather damaging for error bounds on probabilities). The popularity of the
SVM approach stems of course mainly from the experience that the results it
gives are ’useful in practice’ despite the depressingly large strict error bounds.
Alternative analyses of similar methods are given in a new book by Shafer,
Gammerman and Vovk [92], and a very short introduction will be given next
(section 2.9). The Relevance Vector Machine of Tipping [100] uses a Bayesian
approach and realizes a prediction and classification structure that is in some
respects more attractive than the SVM.

2.8.1 The Kernel Trick


As stated above, the SVM algorithms work using the example vectors only
through scalar products in the feature space. Thus, if we map the features xi to
a higher dimensional space using the mapping Φ, we will have to compute scalar
products Φ(x) · Φ(z). For the quadratic map Φ : (x1, x2) → (x1, x2, x1^2, x1x2, x2^2),
where subscripts denote vector components, we get Φ(x) · Φ(z) = x1z1 + x2z2 +
x1^2z1^2 + x1x2z1z2 + x2^2z2^2. This can almost be expressed as (x · z + 1)^2, the difference
being that there is a constant term and that the different terms are re-weighted
by constant multipliers; it corresponds to a slightly different map with the
same 'power' to model surfaces as inverse images in R2 of hyperplanes in R5:
Φ'(x) = (√2 x1, √2 x2, x1^2, √2 x1x2, x2^2).
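One can verify numerically that Φ' realizes this kernel up to the constant feature, i.e., Φ'(x) · Φ'(z) = (x · z + 1)^2 − 1 (a small sketch of our own):

    import numpy as np

    def phi_prime(x):
        # The rescaled quadratic map from above.
        return np.array([np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                         x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([0.3, -1.2])
    z = np.array([2.0, 0.5])
    assert np.isclose(phi_prime(x) @ phi_prime(z), (x @ z + 1.0) ** 2 - 1.0)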
The kernel technique is built on Mercer’s theorem, which says that certain
types of functions K(x, y) of two feature vectors will certainly correspond to
some mapping Φ in the sense that Φ(x) · Φ(z) = K(x, z). In this case we do
not need to explicitly map the example vectors to high-dimensional space: The
SVM algorithm works by only using vectors through scalar products, and these
can be directly evaluated from the kernel K(x, z).
Mercer’s theorem is part of the toolbox in mathematical physics [25] and is
quite interesting: it says under what conditions the eigenfunctions of an integral
equation have the property that the scalar product of the eigenfunction vectors

at two points is equal to the kernel at the two points. This allows one to use
mappings into infinite-dimensional space, since the eigenfunction sets are often
infinite. The background is Mercer’s condition on a symmetric kernel K(x, y)
of an integral equation:

∫_C K(x, y)f(x)dx = λf(y).

This equation can be thought of as a simple eigenvector problem, but in
an infinite-dimensional Hilbert space. There is a sequence of eigenvalues λi de-
creasing in magnitude, and a corresponding sequence of real valued (normalized)
eigenfunctions Ψi : C → R. We are interested in cases where all eigenvalues are
positive and where the scaled eigenfunctions Φi = √λi Ψi satisfy our requirement

K(x, y) = Σ_i Φi(x) Φi(y).

If this is the case, the kernel trick works for the map Φ(x) = (Φ1 (x), Φ2 (x), . . .)
and we can run the margin algorithm without actually mapping the feature vec-
tors to the possibly infinite-dimensional space. Mercer’s theorem says that this
will be the case for a symmetric kernel K(x, y) whenever Mercer’s condition is
satisfied:

∫_C ∫_C K(x, y)g(x)g(y) dx dy ≥ 0,

for every square integrable function g(x), and if C is a compact subset of
some space Rd (the original feature space).
We can now generalize our conic section separator (x · z + 1)^2 to K(x, z) =
(x · z + 1)^d. This kernel corresponds to mapping a feature vector x to a high-dimensional
vector whose components are all monomials of degree not greater
than d.
There are large numbers of kernels suitable for different types of modeling
problems, see, e.g., [28, 88]. The kernel trick is not only applicable in the SVM
context: Whenever the computation required can be described in terms of scalar
products of feature vectors the method is applicable. Several examples are given,
e.g., in [18]. The polynomial kernels are easy to comprehend, but sometimes
Gaussian kernels are preferable. The Gaussian kernel is defined by:

K(x, z) = exp(−||x − z||^2/σ^2).


This kernel is related to the sigmoid functions popular in neural network modeling
and is itself popular because of its flexibility.
In Figure 19 we show an example of using Gaussian kernels in a classifier
for a checker-board training set. The positive and negative examples lie in
alternate squares of a checker-board (say, black squares are negative, white
squares positive). Positive and negative examples are marked + and × . The
examples are mapped to a high-dimensional space using the Gaussian kernel,
and the figure shows the inverse projection of the wide margin separator in the
10D space (levels -1, 0 and 1). The support vectors touch the 1 and -1 contour
lines (marked by black dots).
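In the same spirit, a small sketch (scikit-learn assumed; the parameter values are arbitrary choices of ours) of fitting a Gaussian-kernel SVM to checker-board data:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X = rng.uniform(-4, 4, size=(500, 2))
    # class +1 on 'white' squares and -1 on 'black' squares
    y = np.where((np.floor(X[:, 0]) + np.floor(X[:, 1])) % 2 == 0, 1, -1)

    clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
    print("training accuracy:", clf.score(X, y))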


Figure 18: Support vector regression. Points lie around the red dotted line.
Blue lines indicate the walls of the section in which points lie. Three generic
support points, two to the right and one to the left.

[Plot: 'Learning Data and Margin' - contours of the separator at levels −1, 0 and 1 over axes x1 and x2; see Figure 19.]
Figure 19: Gaussian kernel applied to the checker-board problem. Output of
SVM-KM classification example [17].

2.9 Algorithmic conformal prediction and anomaly detection
A new (with some predecessors, see [43, Discussion]) approach to online learning
and prediction was developed by Vovk and Gammerman and is pedagogically
explained in [92]. The main advantage is that it gives well-founded confidence
and credibility measures of individual predictions. The main practical idea is to
base the analysis on a non-conformance measure. We can think of two situations
where the examples (z1, . . . , zi, . . .) are fed into our method; a prediction or
anomaly measure is made for each new zi based on its relation to the batch of
previously seen examples. A prediction is as in the SVM scheme: each zi is a pair
(xi , yi ) where the yi are either from a finite set (classification) or a vector of reals
(regression case). This method centers around the concept of non-conformance.
A non-conformance measure can, in analogy with a test statistic, be chosen quite
freely, but a proper choice is also a prerequisite for non-disappointing behavior.
The non-conformance measure measures the difference of an example from a set
of previous examples, an example based on linear regression thinking being:

c({zi}_{i∈I}, z) = (y − β*x)^2,  where β* = argmin_β Σ_{i∈I} (yi − βxi)^2;

the non-conformance of the new item z = (x, y) is thus measured by the
prediction error arising from a least-squares regression predictor fitted to the
set of old items. The formula is equally applicable when x, the xi and β are
vectors instead of scalars.
It is required that the value of c(Z, z) is invariant to permutations of Z
(actually, we defined Z as a set, so this is implicit in our notation). It is now
possible to define p-values for the members of the sequence (zi)_{i=1}^n: The p-value
for zi in (zi)_{i=1}^n is

1 − (ri − 1)/(n − 1),   (15)

where ri is the rank of c({z1, . . . , z_{i−1}, z_{i+1}, . . . , zn}, zi) among the values
c({z1, . . . , z_{j−1}, z_{j+1}, . . . , zn}, zj), 1 ≤ j ≤ n. The p-value of a new item z
with respect to {zi} is 1 − (r − 1)/n, where r is the rank of c({z1, . . . , zn}, z)
among the values c({z1, . . . , z_{j−1}, z, z_{j+1}, . . . , zn}, zj), 1 ≤ j ≤ n, with the
former added (thus the rank among n + 1 values). The p-value is thus low if most
zi are less different from {z1, . . . , z_{i−1}, z_{i+1}, . . . , zn} than z is from {z1, . . . , zn}.
Given a non-conformance measure, a set {z1 , . . . , zl } and a new item x, a
predictor of confidence level α for the corresponding y is the set of y-values
giving p-value larger than 1 − α to z = (x, y).
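The definition translates into code directly. A sketch of our own, computing the standard conformal p-value (equivalent to the form above up to the rank convention), with a user-supplied non-conformance function c(bag, z), where bag is a list of old items:

    def conformal_p(bag, z, c):
        # Score each old item against the bag with z swapped in for it,
        # and z against the original bag; the p-value is the fraction of
        # the n+1 scores at least as large as the score of z.
        n = len(bag)
        s_new = c(bag, z)
        scores = [c(bag[:j] + bag[j + 1:] + [z], bag[j]) for j in range(n)]
        rank_ge = 1 + sum(s >= s_new for s in scores)   # z counts itself
        return rank_ge / (n + 1)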
Vovk and Gammerman go on to define confidence predictors, mapping a
non-conformance measure, a training set, a new observation x, and a confidence
level 1 − ε to a set Γ^ε of predicted values, namely the set of values Y such that
the p-value of the last element in ((x1, y1), . . . , (xN, yN), (x, Y)) is larger than ε.
For the classification problem, we will find a finite set of p-values for the
possible classifications of the new item. We predict the one with highest p-value,
and call this p-value our credibility - if it is high (particularly if it is one, its
maximum possible value) it is completely plausible that we gave the right class.
A low credibility tells us that the new instance is not well covered by the training
set - every class is somewhat not in conformance with it. However, even with

high credibility there is a possibility that another class is almost equally plausible
(typically in cases with an insufficient training set or non-informative features,
or when the new instance is somewhat of a border-line case). The second highest
p-value, subtracted from one, is the confidence of the prediction. This is 1 minus
the largest ε such that the predictor set Γ^ε is a singleton, just the predicted value.
A high confidence value shows that all alternatives are implausible, whereas a
low value shows that one or more alternatives are completely plausible. With
both confidence and credibility high, we can be relatively certain that the class
predicted is correct, as always provided the exchangeability assumption is satisfied.
The interesting result giving strong support to this methodology is the fol-
lowing: It is valid for smoothed p-values, obtained by probabilistically (with
probability 1/2) counting the non-conformance values that are ties with that of
the last one (i.e. l + 1), instead of counting them to zero as in (15):
The smoothed p-value is obtained as follows: Suppose αi is the non-conformance
measure of (xi, yi) in the (l + 1)-sequence, for i = 1, . . . , l, and α_{l+1} is that of
(x_{l+1}, Y). Then the smoothed p-value for Y is

pY = (|{i : αi > α_{l+1}}| + η |{i : αi = α_{l+1}}|) / (l + 1),   (16)

where η is a standard random variable uniformly distributed in [0, 1]. (This
means that every time a pY is computed, η is obtained as a new standard
random number.)
For an exchangeable sequence, with smoothed p-values and an iterated prediction
sequence (where a predictor set Y at significance level ε is obtained for y = y_{l+1}
from ((x1, y1), . . . , (xl, yl)) and x_{l+1}, for l = 1, 2, . . .), the frequency of y ∉ Y will
be ε. Likewise, for a classification sequence, the erroneous classifications will be
independent with probability one minus the maximum confidence corresponding
to a singleton predictor set.
Since the smoothed p-values are larger than the non-smoothed ones, the
predictor based on non-smoothed p-values will perform better (on the average)
than indicated by the confidence.
For the case of prediction, one must typically be content with ε-confident sets
for the predicted value, because there are typically infinitely many possible values. It is
also not obvious how, in the general case, these sets shall be computed. Indeed,
each conformance measure needs analysis of how to compute or approximate
these sets efficiently.
A generally good way to apply the hedged prediction methodology is to use
the Lagrange multipliers (αi ) in SVM to compute (smoothed) p-values.
A statistician is usually annoyed by application owners’ frequently expressed
desires to find anomalies in an ongoing process. In the application, anomalies
may be signs of some previously unknown phenomenon that, when investigated
and understood, will drive development of the field, or be a warning sign for an
impending disaster. In the framework of conformal prediction, the concept of
anomaly is easy to define: it is a new item with a low p-value, and anomaly is
with respect to the chosen non-conformance measure.
It is also possible to select anomalies based on p-values using the FDR
method. This allows us to state that a high proportion of the detected anomalies
will be 'real'.
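A sketch of the standard Benjamini-Hochberg selection applied to such p-values (our own illustration, numpy assumed):

    import numpy as np

    def bh_select(pvals, q=0.1):
        # Return indices of detected anomalies with the false discovery
        # rate controlled at about q.
        p = np.asarray(pvals, dtype=float)
        order = np.argsort(p)
        m = len(p)
        passed = np.nonzero(p[order] <= q * np.arange(1, m + 1) / m)[0]
        if passed.size == 0:
            return np.array([], dtype=int)
        return order[:passed.max() + 1]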

3 Data models
The formulas in Chapter 2.1 always contain a data probability distribution term,
in parameterized models as a function of (conditioned on) parameters. We will
now go through a number of such distributions often used in Bayesian analyses
and uncertainty management. You have probably seen most of it in standard
statistics courses, but the Bayesian angle of these notes makes the emphasis
different. We start out from simple data types and work towards complex ones.

3.1 Univariate data


Univariate data consist of real numbers, integers, ordinal or categorical values.
The latter two are usually coded as integers, but the magnitudes of these have
no significance – the values can be thought of as members of a set. This set is
ordered for an ordinal variable but not for a categorical one. Ordinal variables
are common in investigations of attitudes of persons. In a hotel room you may
find a form where you are asked among other things if the service provided is
very bad, bad, good or very good - your answer is sometimes considered an
ordinal variable (but more often a numeric score which is then averaged over
customers).

3.1.1 Discrete distributions and the Dirichlet prior


We have already seen the coin-tossing example (section 2.1.10). There the data
are binary categorical (heads or tails), and the parameter is the probability of
heads (or, by symmetry, tails). The inference is about this parameter, and the
priors and posteriors in section 2.1.10 are distributions for the parameter. The
distributions chosen are called Beta distributions, and many of them (specifi-
cally, those with integer parameters, since the number of successes and failures
in our experiment must be integers) are generated from the uniform distribution
by multiplication with likelihoods and normalization.
This can be generalized to general discrete distributions over values for a
categorical variable taking d different values. The distribution in this case is a
probability vector (with the constraints that the component values range from
0 to 1 as probabilities do, and that the components sum to one).
For a discrete distribution over d values, the outcome is a number from 1 to
d. The parameter set is a sequence of probabilities x = (x1 , . . . xd ), (in some
textbooks the last parameter xd is omitted - it is determined by the first d − 1
ones), constrained to lie in the polytope Ld:

Ld = {(x1, . . . , xd) | 0 ≤ xi for i = 1, . . . , d, and Σ_{i=1}^d xi = 1}.

The likelihood after observing a set of outcomes where outcome i occurred
ni times is

Π_i xi^{ni}.

Normalizing this quantity so that its integral over Ld becomes one, we have an
example of the Dirichlet distribution.
The Dirichlet distribution with parameter set α is

Di(x|α) = (Γ(Σ_i αi)/Π_i Γ(αi)) Π_i xi^{αi−1} = c_α Π_i xi^{αi−1},   (17)

where Γ(n + 1) = n! for natural numbers n. The normalizing constant
c_α = Γ(Σ_i αi)/Π_i Γ(αi), although by no means obvious, can be verified
using multiple induction (Exercise 15). It gives a useful mnemonic for integrating
any monomial, like Π_i xi^{αi−1}, over the (d − 1)-dimensional hyperplane segment
Ld. It also gives the normalization constant Γ(d) for the uniform distribution
over Ld (which is often used as a prior). It is very convenient to use Dirichlet
priors, for the posterior is also a Dirichlet distribution: After having obtained
data with frequency count n = (n1 , . . . , nd ) we just add this vector to the prior
parameter vector α to get the posterior parameter vector α + n.
With no specific prior information for x, it is necessary from symmetry
considerations to assume all Dirichlet parameters equal, αi = α. A convenient
prior is the uniform prior with αi = 1. This is, e.g., the prior used by Laplace to
derive the rule of succession, see Ch 18 of [60] or [31]. Other priors have been
used, but there are no strong reasons for them except where they correspond
to prior knowledge. Such knowledge can in principle be fed into the analysis in
the form of an equivalent prior sample.

3.1.2 Estimators and Data probability of Dirichlet distribution


The normalization constant

c_α = Γ(Σ_i αi)/Π_i Γ(αi)

of the Dirichlet distribution makes it usually unnecessary to perform explicit
integrals over Ld. As an example, the Laplace estimator for the probability is the
mean of the distribution, and its ith component is

∫_{Ld} xi Di(x|α) dx = c_α/c_{α+1_i} = αi/Σ_j αj,

where 1_i is the vector of zeros except for a one in the ith component. For
inference of a probability from occurrence counts with uniform prior we have
a posterior Di(x|n + 1) and thus the mean estimator is the relative occurrence
counts after 1 has been added to each outcome. Using Lagrange multipliers
it is straightforward to show that the MAP estimator is just the (unmodified)
observed relative frequencies. In many practical applications it has been found
important to use the Laplace (mean) estimator instead of the MAP estimator.
Particularly, the MAP probability estimate for an event that has not happened
is exactly zero, and this is a much too drastic estimate if the total number of
occurrences is small.
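A small numeric illustration of the difference between the two estimators, assuming the uniform prior αi = 1:

    import numpy as np

    counts = np.array([3, 0, 1])               # one outcome never observed
    alpha_post = counts + 1                    # posterior Di(x | n + 1)

    mean_est = alpha_post / alpha_post.sum()   # Laplace (mean) estimator
    map_est = counts / counts.sum()            # MAP: raw relative frequencies
    print(mean_est)                            # [0.571... 0.142... 0.285...]
    print(map_est)                             # [0.75 0.   0.25]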
We will now find the data probability for a general discrete distribution
(x1, . . . , xd), given that ni occurrences of the ith outcome were observed, i =
1, . . . , d. Let n = Σ_i ni. The probability of observing a particular sequence
with counts ni is Π_i xi^{ni}, and to obtain the probability of the counts we should
multiply by the multinomial coefficient n!/(n1! · · · nd!). Integrating out the xi with
the prior gives the probability of the data given model M (M is characterized
by a parameterized probability distribution over the xi and a prior Di(x|α) on
its parameters):


p(n|M) = ∫_{Ld} p(n|x)p(x) dx
       = ∫_{Ld} (n!/(n1! · · · nd!)) Π_i xi^{ni} · c_α Π_i xi^{αi−1} dx
       = (n!/(n1! · · · nd!)) c_α/c_{α+n}
       = (Γ(n + 1)Γ(α.)/(Π_i Γ(αi) Γ(n + α.))) Π_i (Γ(ni + αi)/Γ(ni + 1)),   (18)

where α. = Σ_i αi.

As is 'easily' seen, the uniform prior (αi = 1 for all i) gives a probability for
each sample size that is independent of the actual data:

p(n|M) = Γ(n + 1)Γ(d)/Γ(n + d).   (19)

The probability p'(n|M) of a particular sequence of outcomes with counts n is of course
obtained by dividing by the corresponding multinomial coefficient:

p'(n|M) = Γ(n + 1)Γ(d)/(Γ(n + d) · n!/(n1! · · · nd!)) = Γ(d) Π_i Γ(ni + 1)/Γ(n + d).   (20)
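Equations (18) and (19) are easy to evaluate in log space to avoid overflow (a sketch of our own, scipy assumed):

    import numpy as np
    from scipy.special import gammaln

    def log_p_counts(n, alpha):
        # Log of the data probability (18) for counts n under prior Di(alpha).
        n = np.asarray(n, dtype=float)
        alpha = np.asarray(alpha, dtype=float)
        N, A = n.sum(), alpha.sum()
        return (gammaln(N + 1) + gammaln(A) - gammaln(N + A)
                + np.sum(gammaln(n + alpha) - gammaln(alpha) - gammaln(n + 1)))

    # With the uniform prior the value depends only on the sample size, cf. (19):
    print(np.exp(log_p_counts([5, 0, 0], [1, 1, 1])),
          np.exp(log_p_counts([2, 2, 1], [1, 1, 1])))   # both 0.0476...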

Exercise 15 Derive the normalization constant for the Dirichlet distribution


(i) with αi = 1, all i;
(ii) with integer parameters αi .

3.1.3 The normal and t distributions
The most common assumption about a real valued variable is that it has the
normal distribution. This assumption has some motivation as there are several
theorems saying that under certain assumptions a quantity will be normally
distributed. However, in practice most data sets analyzed do not have the
normal distribution. Usually, the tails of empirical distributions are longer than
the very short tails of the normal distribution. So whether or not the normal
assumption is appropriate depends a lot on what kinds of data you have, and
on the purpose of the analysis.
The normal distribution has parameters µ and σ^2, the mean and variance.
The distribution is

f(x|µ, σ^2) = N(x|µ, σ^2) = (1/(√(2π)σ)) exp(−(x − µ)^2/(2σ^2)).
There is an alternative parameterization using the precision λ instead of vari-
ance, with λ = 1/σ 2 . It is particularly popular in Bayesian statistics[16].
So let us analyze a sequence of real valued measurements (xi )ni=1 , assumed
to be independently drawn from some normal distribution whose parameters we
want to make inference about. The likelihood function is then

f((xi)|µ, σ^2) = (1/(√(2π)σ)^n) exp(−(1/(2σ^2)) Σ_i (xi − µ)^2),

which, applying the abbreviations x̄ = Σ_i xi/n and s^2 = Σ_i (xi − x̄)^2/n,
simplifies to

f((xi)|µ, σ^2) = (1/(√(2π)σ)^n) exp(−(1/(2σ^2))(ns^2 + n(µ − x̄)^2)).   (21)
Since the likelihood depends on the data only through the summaries x̄
and s^2, these are so-called sufficient statistics - we do not need to retain the
individual values for inference under the normality assumption. However, the
individual values can well be needed for other purposes, e.g., for testing whether
or not the assumption of normality is plausible.
In order to apply (3), we must have priors on µ and σ 2 . But we could also
assume that one of these is known, and that the purpose of analysis is to make
inference on the other. This allows for a more pedagogical progression, but is
also prototypical for several real applications. We could for example assume that
σ 2 is known, like in the case where we make repeated measurements of a fixed
quantity in order to improve precision. We may have access to a statement
of measuring accuracy that can be translated to a known value for σ 2 when
making inference about µ. Or, we may want to make inference on measurement
accuracy by repeatedly measuring a fixed quantity where the ’true’ value is
known, like measuring the length of the archive meter, when making inference
about σ 2 .
Assume we know the value of σ 2 . A natural choice of uninformative prior for
µ is the uniform distribution. However, since µ can vary over the whole real line,
the uniform distribution does not exist as a (normalized) probability distribution
(since the integral ∫_{−∞}^{∞} dµ for the normalization constant diverges). It is however
possible to use priors that cannot be normalized to probability distributions,

and it is standard for the normal distribution. Such non-normalizable priors are
called improper priors.
The case of fixed σ^2 and improper uniform prior allows us to rewrite the
likelihood (21) as

f(µ|(xi), σ^2) ∝ exp(−(1/(2σ^2))(n(µ − x̄)^2 + C)),

where the quantity C = ns^2 does not contain the parameter µ and disappears
after normalization. This is, regarded as an unnormalized posterior for µ, a
normal distribution with mean x̄ and variance σ^2/n. So
despite the fact that the prior is improper, the posterior, the product of the
improper prior 1 and the likelihood, is a proper distribution for all non-empty
data sets (n > 0), essentially because the likelihood has a convergent integral
with respect to µ over the real line.
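As a small numeric sketch of this posterior (our own; all values arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    sigma = 2.0                                   # known standard deviation
    x = rng.normal(loc=1.5, scale=sigma, size=50)

    post_mean = x.mean()                          # posterior for mu is
    post_var = sigma ** 2 / len(x)                # N(post_mean, post_var)
    print(post_mean, post_var)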
One standard prior assumption for the normal distribution is that (µ, log σ)
are uniformly distributed over the whole plane, and thus that σ has a prior
marginal distribution proportional to 1/σ. This is also an improper prior since
the integral ∫ (1/σ)dσ diverges, both at 0 and at +∞. Again, the posterior is
a proper distribution, since the normalization constant is the inverse of the
convergent integral ∫_0^∞ σ^{−n−1} exp(−(1/(2σ^2))(n(µ − x̄)^2 + C))dσ. You may not have
seen this distribution, but looking in a table such as [79] or [16, App.A], it is
readily identified as a Gamma distribution (see Exercise 17).
Finally, we may want to make inference on both µ and σ. Their joint pos-
terior under the standard improper prior assumption is given by equation (21)
multiplied by the improper prior σ −1 , which is maybe somewhat unintuitive.
It would be nice to find the marginal probabilities of the two parameters, like
in the case where we are only interested in the mean µ. This is obtained by
integrating out the σ, an instance of a common procedure known as integrat-
ing out nuisance parameters. Somewhat surprisingly, the marginal distribution
of µ is no longer a normal distribution but the somewhat more long-tailed t-
distribution. The reason is that averaging over a long tail of the distribution
over σ makes the posterior a continuous mixture of normal distributions with
varying variances. Indeed,

f(µ|(xi)) ∝ ∫_0^∞ σ^{−n−1} exp(−(1/(2σ^2))(n(x̄ − µ)^2 + ns^2)) dσ   (22)
         ∝ (1 + n(µ − x̄)^2/(ns^2))^{−n/2},   (23)

where the last line was obtained after the change of variable z = (ns^2 + n(µ −
x̄)^2)/(2σ^2). This is an instance of the t distribution, which is usually given the

parametric form with parameters called mean (µ), scale (σ) and degrees of
freedom (ν):

f(x; µ, σ^2, ν) = (Γ((ν + 1)/2)/(Γ(ν/2)√(νπ)σ)) (1 + (x − µ)^2/(νσ^2))^{−(ν+1)/2}.   (24)
Exercise 16 A sample y1 , . . . , yn of real numbers has been obtained. It is known
to consist of independent variables with a common normal distribution. This
distribution has known variance σ 2 and the mean is known to be either 1 or 2
(hypotheses H1 and H2 , both cases considered equally plausible).

(i) What is the data probability function P(D|Hi)?
(ii) Describe a reasonable Bayesian method for deciding the mean value.
(iii) Characterize the power of the suggested procedure as a function of σ 2 , as-
suming that the sample consists of a single point.

Exercise 17 Show that the posterior for inference of normal distribution vari-
ance under the standard assumption is a gamma distribution, and find its pa-
rameters.

3.1.4 Nonparametrics and mixtures


A common desire in analysis of univariate (and multivariate) data is to make
inference on the distribution of a sample without assuming anything about the
analytical form of the distribution. In other words we do not want to make infer-
ence about parameters of a schoolbook distribution, but about the distribution
itself. This field of investigation is called non-parametric inference. In principle,
the equation (3) is applicable also in non-parametric analysis. In this case the
parameter set is the set of all distributions we want to consider, and we must
naturally, because this set is overwhelmingly large, have a suitable prior for this
parameter. We can also expect that posterior computation will be completely
different from the simple estimation of a few parameters we saw in the previous
examples. The set of all distributions includes also functions with strange sets
of singularities.
In applying non-parametric Bayesian analysis we feed an observation set
into (3). Here we have the first clue to non-parametrics: if the sample is small,
the points will be far apart and we can never get a good handle on the small
scale behavior of the distribution, so we must content ourselves with inference
among a set of fairly regular (smooth) functions. We will consider a couple
of ways to accomplish this. The first method consists in discretization: divide
the range of the variable into bins and assume a general discrete distribution
for the probabilities of the variable falling into each bin. From such a general
distribution, a pdf that is piece-wise constant over each bin can be used as an
estimate for the actual distribution.
The second example shows how a distribution can be modeled as a mixture of
simpler distributions. If the components are normal distributions, the mixture
can be expressed as
f(x) = Σ_{i=1}^n λi N(x|µi, σi^2).   (25)

Here, the mixing coefficients λi form a probability distribution, i.e., they are
non-negative and sum to one. We can think of drawing a value of the mixture
as first drawing the index i according to the distribution defined by the mixing
coefficients and then drawing the variable according to N (x|µi , σi2 ). In this case
we cannot know for certain which component generated a particular data item,
since the support of the distributions overlap.
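Drawing from (25) proceeds exactly as described: first the component index, then the value (a small sketch of our own):

    import numpy as np

    rng = np.random.default_rng(4)
    lam = np.array([0.5, 0.3, 0.2])                 # mixing coefficients
    mu = np.array([-2.0, 0.0, 3.0])
    sigma = np.array([0.5, 1.0, 0.8])

    comp = rng.choice(len(lam), size=1000, p=lam)   # which component
    x = rng.normal(mu[comp], sigma[comp])           # value from that component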

3.1.5 Piecewise constant distribution


In this section we use the method of section 3.1.1 above. We design an MCMC
trace that approaches the posterior distribution of the breakpoints delineating
the bins of constant intensity. Since different 'binnings' of the variable give
different resolutions of the estimate, we can approach the scale selection problem
by trying different bin sizes and obtaining a posterior estimate for the appropri-
ate number of bins, using (1) to compare the description of an interval either as
one or as two bins in each case assuming that the probability density of values
is constant over each bin.
Consider now a sample of real valued variables (xi ) and a division of the real
line into bins, so that bin i consists of the interval from bi to bi+1 . It does not
matter much to which bin we assign the dividing points bi – these points form a
set of probability zero in this model. Each division into bins can be considered
a composite model, and we compare the posterior probabilities of these models
to find a probability distribution over models. As a preview of the MCMC
method, we will describe an iterative procedure to sample from this posterior
of bin boundaries and bin probabilities with respect to the observed data. We
will also describe an advanced technique for keeping down the dimension of the
state-space, Rao-Blackwellization. This sampling is organized as a sequence of
model visits, where we visit the models in such a way that the chain, taken as
a sample, is a draw from the posterior. When we are positioned at one model,
we consider visiting other models produced either by splitting a bin into two, or
by merging two adjacent bins into one. We are thus in each step comparing two
models, the finer having d bins. We always consider a uniform Dirichlet prior
for the probabilities (x1 , . . . , xd ) of a variable falling into each of the d bins. Bin
i has width wi. The data probability is then Π_i xi^{ni}. The normalization constant
in (17) tells us that, with the xi uniformly distributed, the data probability will be

Π_i Γ(ni + 1)/Γ(Σ_i (ni + 1)).   (26)


Consider the adjacent model where bins j and j + 1 have been merged. We
now assume in the model that the probability density is constant over bins j
and j + 1, so the probability xj in the coarser model should be split into αxj
and (1 − α)xj , which probabilities generated the counts nj and nj+1 in the finer
model. The proportion α is the fraction of the combined bin length belonging
to the first constituent, α = wj/(wj + wj+1). The data probability for the
coarser model will now be the product x1^{n1} · · · (αxj)^{n_j} ((1 − α)xj)^{n_{j+1}} x_{j+2}^{n_{j+2}} · · · xd^{nd}.
Integrating out the xi (now only d − 1 variables) leads to the posterior probability

α^{n_j} (1 − α)^{n_{j+1}} Γ(n_j + n_{j+1} + 1) Π_{i∉{j,j+1}} Γ(ni + 1) / Γ(Σ_i (ni + 1) − 1).   (27)
We must have some idea of how many bins there should be. A conventional
assumption is that change-points are events with constant intensity, so the num-
ber of change points should follow some Poisson distribution and the probability
of d change points is exp(−λ)λd /d!, with a hyper-parameter λ saying how many
change-points we expect. The Bayes factor in favor of the finer model is thus:

Γ(nj + 1)Γ(nj+1 + 1) λ
α−nj (1 − α)−nj+1 ,
(n + d − 1)Γ(nj + nj+1 + 1) (d + 1)
where the last factor comes from the Poisson prior and the first is obtained
by dividing (26) by (27). The technique used above to integrate out the xi
is known as Rao-Blackwellization[20]. It can improve MCMC-computations
tremendously.
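The Bayes factor above is straightforward to evaluate in log space (a sketch of our own; the argument names are ours, scipy assumed):

    import numpy as np
    from scipy.special import gammaln

    def log_bf_split(nj, nj1, n, d, frac, lam):
        # Log Bayes factor in favor of the finer model (bin j split in two);
        # frac = w_j / (w_j + w_{j+1}), d = number of bins in the finer model.
        return (gammaln(nj + 1) + gammaln(nj1 + 1)
                - np.log(n + d - 1) - gammaln(nj + nj1 + 1)
                - nj * np.log(frac) - nj1 * np.log(1 - frac)
                + np.log(lam) - np.log(d + 1))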

While this method of non-parametric inference is very easy to apply for a uni-
variate data inference problem, it does not really generalize to high-dimensional
problems. The assumption of constant probability density over each bin is some-
times appropriate, as in making inference on the intensity of an event over time,
in which case it tries to identify a set of ’change points’. A classical example
of change-point analysis is a data set over times at which coal-mining disasters
occurred in England[80]. After having identified probable change points of the
disaster intensity, one can for example try to identify changes in coal-mining
practice associated with the change-points in intensity. A result from the pop-
ular coal-mining disaster data set is shown in figure 20. This data set has been
analyzed with similar objectives in several papers, e.g., in [51]. Compared to
the analysis there, which uses a slightly different prior on the function space,
we get quite similar change-points and similar density estimates. It is typical
in MCMC-approaches that subtle differences between models that seem equally
plausible give different results. Since programming is involved, it is also useful
to check the model by running it on a similar set of points generated with a
uniform distribution, and check that it proposes a more even density of change
points than the actual coal-mining disaster data does. One random example is
shown in figure 21. The graph can be judged with experience to be 'more random'
for the synthetic data than for the real data, since it has a much more diffuse
distribution of change-points, and the probability of one change-point is quite low
(ca 7%). In order to convince ourselves more thoroughly that the coal-mining
disaster data cannot reasonably be a sample from a uniform distribution, we can
generate a large number of random data sets with matching number of points
(simulated accidents uniformly distributed) and compare their cumulative plots
with those of the coal mining data. The result is shown in figure 23. This is a
popular and time-efficient way to do tests with visual inspection of a suitable
plot. If you want a ’real’ p-value you can define a test statistic that is the
number of accidents in the first half of the interval. The p-value is thus approx-
imated by the proportion of simulated plots that lie below the real data plot
in the midpoint (ca 1906). This number is 0, and even with considerably more
simulated data it would be 0, and we can state that the true value is clearly
below any significance level used in practice (0.1% is the lowest in common use).
It may be interesting to know what might have caused the change around
1888: according to the analysis in [80], this time period was characterized by a
fairly steep decline in productivity (yield per hour worked). This was apparently
not caused by state regulation of safety measures in mining, but by the build-up
of trade unions in the mining industry.
In other applications, the assumption of piece-wise constant densities may
be less appropriate. An example is when one estimates a distribution in order
to find clusters in the underlying generation mechanism. For example, we may
be interested in finding possible subspecies in a population where each popula-
tion member has a quantity (weight, height etc) that is explained as a normal
distribution with sub-species specific parameters µ and σ 2 . In this case we can
model the distribution as a mixture of normals (next section 3.1.6).
Exercise 18 (+) We tested above (figure 23) the hypothesis that the coal-mining
disaster density is in fact constant. Is it possible to also test the hypothesis that
the intensity is piece-wise constant? Can we test the hypothesis that there is
exactly one breakpoint, and at 1888? How?

Figure 20: Coal mining disaster data: plots of occurrence times of disasters,
MC estimate of the density of change points, five overlaid density estimates, and
the distribution of the observed number of change points. The second plot shows
what is known as the probability hypothesis density (PHD) in the more complex
case of multi-target tracking[1], expected to be used, e.g., in future intelligent
cruise controls to keep track of other vehicles and obstacles in road traffic.


Figure 21: Checking the model: Similar data set with uniformly distributed
points.


Figure 22: The trace of change-points for the real (left) and random test data.
The real data trace shows an obvious concentration around 1888, and a less
obvious one around 1944. The latter might be ascribed to the end of a world
war, but it is not at all obviously significant. The creation of mine workers’
unions seems to have had a more concrete effect than WW II. For the random
test data, we can not, as expected, see any obvious structure (it is possible to
compare it with the graph of figure 1, which also has no significant structure).


Figure 23: Coal mining data against 100 sets of uniformly distributed random
data. The coordinates are cumulative count and year. The lowest curve comes
from the real data and is considerably below those from the 100 random data
sets. The coal-mining disaster data thus has significant non-uniform structure.

3.1.6 Univariate Gaussian Mixture modeling - The EM and MCMC ways
Consider the problem of deciding, for a set of real numbers, the most plausible
decompositions of the distribution as a weighted sum (mixture) of a number
of distributions each being a univariate normal (Gaussian) distribution. This
problem has significance when we try to find ’discrete’ circumstances behind a
measured variable which is also influenced by various chance fluctuations. Note
that a single column of discrete data is not decomposable in this way because a
mixture of discrete distributions is again a discrete distribution. But a mixture
of normal distributions is not itself a normal distribution. In the frequently
used Enzyme problem[82], the discovered components, if any, could correspond
to discrete genetic factors in a population. There are quite many approaches to
solve this problem, and many carry over to the more general problem of mod-
eling a matrix of reals as coming from a mixture of multivariate Gaussians[58].
The MCMC method is quite straightforward but somewhat tedious to imple-
ment. For a number of components in the mixture, the algorithm keeps a trace
that contains the current number of components, the weight, mean and variance
of each component, and also the assignment of each data point to one of the
components. A proposal will be the reassignment of a point to another com-
ponent, a change in the mean or variance of a component, a change in weights
of components, the deletion of a component (merging two components) or ad-
dition of a new component (splitting a component into two). In order to keep
the identity of components in the trace, one should not allow the mean of one
to change outside the interval between its neighbors. In order to speed up the
computation, one should keep the sum and square sum of points in each compo-
nent updated. The MCMC method can be rather slow for large data sets, but
it gives a full picture of the posterior: a pdf of the number of components and
for each component an empirical distribution of its weight and parameters. A
detailed analysis of this method can be found in [74], and practical tips in [45].
A more sophisticated way to account for the ’label-switching’ problem is found
in [90].
It is much more common in practice to use Expectation Maximization for
this problem. For a fixed number of components, the data points are assigned
to components and then two steps alternate until convergence:
1: compute the values of the weights, means and variances that maximize the
likelihood of the current assignment of points to components;
2: compute the assignment of points to components that maximizes the likelihood,
i.e., each data point is assigned to the component having the highest density
at that point.
The method will converge rapidly in most cases, but it does not give a good
picture of the uncertainty of the MAP estimate. It is possible to get a picture
of the sensitivity using resampling, but this is not equivalent to posterior esti-
mation. A useful package in Matlab was published by Patrick Tsui (Mathworks
file exchange). It covers also multivariate Gaussian mixtures.
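A minimal sketch of the hard-assignment iteration described above (our own; x is a 1D numpy array, and empty components and other degeneracies are not handled):

    import numpy as np

    def hard_em(x, K, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        z = rng.integers(K, size=len(x))      # random initial assignment
        for _ in range(iters):
            # Step 1: weights, means and variances maximizing the
            # likelihood of the current assignment.
            lam = np.array([(z == k).mean() for k in range(K)])
            mu = np.array([x[z == k].mean() for k in range(K)])
            var = np.array([x[z == k].var() + 1e-9 for k in range(K)])
            # Step 2: reassign each point to the component with the
            # highest weighted density at that point.
            logd = (np.log(lam) - 0.5 * np.log(2 * np.pi * var)
                    - (x[:, None] - mu) ** 2 / (2 * var))
            z_new = logd.argmax(axis=1)
            if np.array_equal(z, z_new):      # converged
                break
            z = z_new
        return lam, mu, var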
For a mixture modelling using MCMC and univariate normal distributions
it can be advantageous to make a Rao-Blackwellization by integrating out the
variances (but not the means) of the component normal distributions. The data
probability for a given component will then be (where summation is over the
data points in the component) obtained by integration over the variance (from
0 to ∞) of the data probability (a product of Gaussians) times the standard
prior (1/σ) (see Exercise 19).

Exercise 19 Show that the data probability for a component in a mixture is

Γ(n/2)/(2πs)^{n/2},

where n is the number of points and s = Σ xi^2 − 2µ Σ xi + nµ^2, the sums being
taken over the xi of the component. Hint: with the substitution t = 1/σ^2, the
integral can be pattern-matched with a constant times an integral of the gamma
distribution. The integral can be solved using the normalization constant for the
gamma distribution obtained from a table of distributions.

3.2 Multivariate and sequence data models


In principle, a univariate data item or random variable can be generalized by
two recursive operations, forming sets and sequences, respectively: a (multi-
variate) set is formed as a finite collection, and a sequence as an indexed
collection. Since
quences of sets, and many more complicated structures, which we will avoid
here. Sequences can be thought of as time series, but they are also used to
describe, e.g., genes and proteins. For time series we may want to predict their
future behavior, which is typically done using various filter operations that can
be collectively described with equation (6). But for other sequences it may be
more important to make inference about a hidden ’latent structure’, using the
retrodiction equation (7).
Sets of variables are typically analyzed using either a graphical dependency
model (common for categorical data), or with various types of normality as-
sumptions, like factor analysis and regression. Also here it may be useful to
make inferences about hidden structure in the set, typically by decomposing
the data set into classes that each have a simpler statistical structure than the
total set.

3.2.1 Multivariate models


Consider a data matrix where rows are cases and columns are variables. In a
medical research application, the row is associated with a person or an investi-
gation (patient and date). In an internet use application the case could be an
interaction session. The columns describe a large number of variables that could
be recorded, such as background data (occupation, sex, age, etc), and numbers
extracted from investigations made, like sizes of brain regions, receptor densities
and blood flow by region, etc. Categorical data can be equipped with a confi-
dence (probability that the recorded datum is correct), and numerical data with
an error bar. Every datum can be recorded as missing, and the reason for miss-
ing data can be related to the patient's condition or to external factors (like
equipment unavailability or time and cost constraints). Only the latter type of
missing data is (at least approximately) unrelated to the domain of investigation.
The former should be coded in a separate column of categorical (binary) data.
If the data do not satisfy the assumptions of the chosen model (e.g., normality
for a real variable), they may do so after suitable transformation and/or
segmentation. Another approach is
to ignore the distribution over the real line and regard a numerical attribute as
an ordinal one, i.e., considering only the ordering between values. Such ordinal
data also appear naturally in applications where subjects are asked to grade a
quantity, like their appreciation of a phenomenon in organized society or their
valuation of their own emotions.
In many applications the definition of the data matrix is not obvious. For
example, in text mining applications, the character sequences of information
items are not directly of interest, but a complex coding of their meaning must
be done, taking into account the (natural) language used and the purpose of
the application.

3.2.2 Dependency tests for categorical variables: Dirichlet modeling


Assume that we observe pairs $(a, b)$ of discrete variables. How can we find out
whether or not they are dependent? A large number of procedures have been
proposed to answer this question, but one is particularly elegant and intuitive
in Bayesian analysis. We want to choose between two parameterized models,
one that captures independence and one that captures dependency. The first
one, $M_I$, is characterized by two discrete distributions, one for A assumed to
have $d_A$ outcomes and one for B assumed to have $d_B$ outcomes. Their probability
vectors are $(x_1^A, \ldots, x_{d_A}^A)$ and $(x_1^B, \ldots, x_{d_B}^B)$, respectively. This model
gives the probability $x_i^A x_j^B$ for outcome $(i, j)$. The second model, $M_D$, says that
the two variables are generated jointly by one discrete distribution having $d_A d_B$
outcomes, each outcome defining both an outcome for A and one for B. This
distribution is characterized by the probability vector $(x_1^{AB}, \ldots, x_{d_A d_B}^{AB})$.
In order to get composite hypotheses we must equip our models with priors
over the parameter spaces. Let us choose uniform priors for all three partici-
pating distributions.
A set of outcomes for a sequence of discrete variables can be conveniently
summarized in a contingency table. In our case this is a matrix n with element
nij giving the number of occurrences of (i, j) in the sample. We can now find
the probabilities of an outcome n for the two models by integrating out the
parameters from the product of likelihood and prior.
As we saw in the derivation of equation (19), the uniform prior ($\alpha_i = 1$ for all
$i$) gives a probability for each sample size that is independent of the actual data:
$$p(n|M) = \frac{\Gamma(n+1)\Gamma(d)}{\Gamma(n+d)}. \qquad (28)$$
Consider now the data matrix over A and B. Let $n_{ij}$ be the number of rows
with value $i$ for A and value $j$ for B. Let $n_{.j}$ and $n_{i.}$ be the marginal counts
where we have summed over the 'dotted' index, and let $n = n_{..} = \sum_{ij} n_{ij}$. The
probability of the data given $M_D$ is obtained by replacing $d$ by $d_A d_B$ in
equation (19):
$$p(n|M_D) = \frac{\Gamma(n+1)\Gamma(d_A d_B)}{\Gamma(n+d_A d_B)}. \qquad (29)$$
We now consider the model of independence, $M_I$. Assuming parameters $x^A$
and $x^B$ for the two distributions, a row with values $i$ for A and $j$ for B will have
probability $x_i^A x_j^B$. For discrete distribution parameters $x^A$, $x^B$, the probability
of the data matrix $n$ will be:
$$p(n|x^A, x^B) = \binom{n}{n_{11}, \ldots, n_{d_A d_B}} \prod_{i,j=1}^{d_A, d_B} (x_i^A x_j^B)^{n_{ij}}
= \binom{n}{n_{11}, \ldots, n_{d_A d_B}} \prod_{i=1}^{d_A} (x_i^A)^{n_{i.}} \prod_{j=1}^{d_B} (x_j^B)^{n_{.j}}.$$

Integration over the uniform priors for A and B, with densities $\Gamma(d_A)$ and
$\Gamma(d_B)$, gives the data probability given the composite model $M_I$:
$$p(n|M_I) = \int_{L_{d_A}} \int_{L_{d_B}} p(n|x^A, x^B) p(x^A) p(x^B) \, dx^A dx^B$$
$$= \int_{L_{d_A}} \int_{L_{d_B}} \binom{n}{n_{11}, \ldots, n_{d_A d_B}} \prod_{i=1}^{d_A} (x_i^A)^{n_{i.}} \prod_{j=1}^{d_B} (x_j^B)^{n_{.j}} \, \Gamma(d_A)\Gamma(d_B) \, dx^A dx^B$$
$$= \frac{\Gamma(n+1)\Gamma(d_A)\Gamma(d_B) \prod_i \Gamma(n_{i.}+1) \prod_j \Gamma(n_{.j}+1)}{\Gamma(n+d_A)\Gamma(n+d_B) \prod_{ij} \Gamma(n_{ij}+1)}.$$

From the above and equation (29) we obtain the Bayes factor in favor of
independence:
$$\frac{p(n|M_I)}{p(n|M_D)} = \frac{\Gamma(n+d_A d_B)\Gamma(d_A)\Gamma(d_B) \prod_j \Gamma(n_{.j}+1) \prod_i \Gamma(n_{i.}+1)}{\Gamma(n+d_A)\Gamma(n+d_B)\Gamma(d_A d_B) \prod_{ij} \Gamma(n_{ij}+1)}. \qquad (30)$$

The analysis presented above will be used in, and continued in, section 3.3 on
graphical models. The gamma functions have very large values, and expressions
involving them should normally be evaluated as logarithms to prevent numeric
overflow. For example, Matlab has a useful function gammaln that evaluates the
logarithm of the gamma function directly, as a sum of logarithms for integer
arguments.
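As a sketch of how this is done in practice (our own illustration, using scipy's gammaln as the counterpart of the Matlab function), equation (30) can be evaluated on a contingency table in a few lines; a positive log Bayes factor favors independence.

import numpy as np
from scipy.special import gammaln

def log_bf_independence(n):
    """Log of equation (30), the Bayes factor p(n|M_I)/p(n|M_D) for a
    contingency table n (2-D array of counts), with uniform priors."""
    n = np.asarray(n)
    dA, dB = n.shape
    N = n.sum()
    ni, nj = n.sum(axis=1), n.sum(axis=0)    # marginals n_i. and n_.j
    return (gammaln(N + dA * dB) + gammaln(dA) + gammaln(dB)
            + gammaln(nj + 1).sum() + gammaln(ni + 1).sum()
            - gammaln(N + dA) - gammaln(N + dB) - gammaln(dA * dB)
            - gammaln(n + 1).sum())

# A clearly associated table gives a negative value, i.e., evidence
# against independence:
print(log_bf_independence([[30, 10], [10, 30]]))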

Exercise 20 For deriving the data probability in the independence model it was
crucial to observe that
$$\prod_{i,j=1}^{d_A, d_B} (x_i^A x_j^B)^{n_{ij}} = \prod_{i=1}^{d_A} (x_i^A)^{n_{i.}} \prod_{j=1}^{d_B} (x_j^B)^{n_{.j}}.$$
Convince yourself that this is true.

Exercise 21 Design a method to decide whether or not a set of data triplets
is generated as three independent variables. For domain sizes $d_A$, $d_B$, $d_C$ and
contingency table $n_{ijk}$, introduce notation similar to that of equation (30) above.
Exercise 22 (+) In Pediatrics 2005;116;1506-1512, a study on hospital orga-
nization is reported where in one case the mortality of admitted patients was 39
out of 1394, and in another case 36 out of 548. Can this result be explained by
random variation, the underlying mortality rate being the same in both cases?
Quantify your conclusion!

3.2.3 The multivariate normal distribution


The multivariate normal distribution is ubiquitous in statistical modeling and
has an abundance of interesting properties. We will describe it only briefly
and indicate some central analysis methods based on it. The distribution is a
distribution over points in a d-dimensional space, and one way to get a multivari-
ate normal distribution is to multiply together a number of univariate normal
distributions, one over each dimension xi . Some of these may have variance
zero, which effectively is a Dirac δ-function and constrains the density to a sub-
space (such distributions are called degenerate). This gives a distribution whose
equidensity surfaces form an ellipsoid in d-space, possibly but not necessarily
with different lengths on the principal axes which are parallel to the coordinate
axes. In case several principal axes have the same length, it is not possible to
tell the orientation of them, we only know that they are mutually orthogonal
and span a subspace. So, e.g., the ball used in American football has only one
well-defined principal direction while the ball used in soccer has none. We have
now almost the most general form of the multivariate normal distribution, with
one important addition: any distribution obtained by rotating the frame
of reference is also a multivariate normal distribution. The analytic form of the
density function is

$$p(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right), \qquad (31)$$
where x varies over Rd , µ is the mean and Σ is the variance (often called co-
variance) matrix, a symmetric positive definite d by d matrix. The inverse of
the variance matrix is sometimes called the precision matrix. The eigenvectors
of Σ are indeed the principal directions and its eigenvector with the largest
eigenvalue shows the direction of largest variation of the distribution.
Models based on the multivariate normal distribution may be oriented towards
making inferences about the orientation of the axes, particularly in principal
component analysis, factor analysis and regression, or towards expressing an
empirical data set as a mixture of multivariate distributions. The latter can be used both
to explain structure of the data and as a non-parametric inference method for
multivariate data, a generalization of the method described in section 3.1.6.
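A small sketch (our own, not from the notes) of how the density (31) is evaluated and how the principal directions are read off the variance matrix:

import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log of the density (31), for one or more points x (rows)."""
    d = len(mu)
    diff = np.atleast_2d(x) - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# The eigenvectors of Sigma are the principal directions; the largest
# eigenvalue marks the direction of largest variation.
Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
evals, evecs = np.linalg.eigh(Sigma)
print(evecs[:, np.argmax(evals)], mvn_logpdf([0.0, 0.0], np.zeros(2), Sigma))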

Exercise 23 The multivariate normal distribution has the following nice recur-
sive characterization: A distribution over d-dimensional space is normal if and
only if either, d = 1 and it is the univariate normal distribution, or, d > 1 and
both all marginals and all conditionals of it are multivariate normal distributions
over d − 1-dimensional space. Show that this is true! Hint: A conditional of a
multivariate distribution is obtained by conditioning on one of its variables. A
marginal is obtained by integrating out one of its variables.

3.2.4 Dynamical systems and Bayesian time series analysis
The analysis of time series is one of the central tasks in statistics. We will here
only describe a number of useful but perhaps less well-known methods.
For the case of a series with known likelihoods connecting observations to
system state and known system state dynamics, the task will normally be to
predict the next or a future state of the system. The standard solution method
is conveniently formulated in the Chapman-Kolmogorov equation (6), which can
be solved with a Kalman filter for known linear dynamics and known Gaussian
process noise. For non-linear, non-Gaussian systems, the sequential MCMC or
particle filter method described in section 4.1.2 can be used. There are tricky
cases if the state and observation spaces have non-uniform structure, such as
in the multiple target tracking problem, where the task is to find the probable
number of targets (and their states) from cluttered radar readings. This problem
has an established solution in the FISST[50] method.
Bayesian methods can also be used to make inference, not only about mea-
surement and process noise, but also on the system state space dimension. In
this case we consider a system with a not directly visible state s(t) at time t,
and where some information about the state is obtained by observations x(t)
at time t. For such systems, a basic theorem is due to Takens[97], considering a
system governed by a first-order differential equation and an observation process,
contaminated by process and measurement noise $\chi(t)$ and $\psi(t)$:

$$s'(t) = G(s(t)) + \chi(t) \qquad (32)$$
$$x(t) = F(s(t)) + \psi(t).$$

Here the state and observation spaces are assumed to be of finite dimension
(but not finite!), although a major interest in the physical interpretation of (32)
is that the finite-dimensional state space of physical systems is often embed-
ded in an infinite-dimensional state space, as in some laminar flow problems.
The theorem of Takens says that under suitable but general assumptions, the
space of lag-vectors (x(t), x(t − T ), . . . , x(t − (d − 1)T )) is diffeomorphic to the
state space for d large enough, and its interest stems from the possibility of
constructing an approximate state space from this principle, i.e., by estimating
the lowest value of $d+1$ that gives a degenerate set of lag-vectors obtained
from experimental data. The exact formulation of the theorem has a number
of regularity assumptions and it is formulated in asymptotic form assuming no
noise. However, it is in many cases possible to construct non-linear predictors
for such systems with extremely good performance. The idea is to find suitable
values for the dimension of a state space and time step, collect a large number of
trajectories, and to find approximations of the functions F and G using numeri-
cal approximation methods. The main problems with finding such predictors are
that the process and measurement noise may be too large for reliable identifica-
tion and that the system may be drifting so that the functions F and G change
with time. An analysis of such methods used to solve challenge problems in
time series prediction in the 1991 Santa Fe competition can be found in [46]. A
more rigorous analysis based on Bayesian analysis and where the functions F
and G of (32) are represented as special types of neural networks can be found
in [102]. The functions F and G are represented parametrically as:

$$F(s) = B \tanh(As + a) + b$$
$$G(s) = s + D \tanh(Cs + c) + d,$$

where the number of columns of B and rows of A is the number of hidden cells
in the observation function neural network and the number of columns of D and
rows of C is the number of hidden cells in the transition kernel ANN. The com-
ponents of matrices and vectors are the weights of the networks. The method
puts independent Gaussian priors on all weights. In their Matlab implementa-
tion, Valpola and Karhunen do not use MCMC but a version of the variational
Bayes method they claim to be significantly faster than MCMC.
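As a rough illustration of the lag-vector idea (entirely our own sketch, with a toy series; no claim to match the Santa Fe or Valpola-Karhunen setups), one can build the lag space and use a nearest-neighbour rule as a primitive non-linear predictor:

import numpy as np

def lag_vectors(x, d, T):
    """Rows are the lag vectors (x(t), x(t-T), ..., x(t-(d-1)T))."""
    return np.column_stack([x[(d - 1) * T - k * T: len(x) - k * T]
                            for k in range(d)])

x = np.sin(0.3 * np.arange(2000)) + 0.01 * np.random.randn(2000)  # toy series
d, T = 4, 2
V = lag_vectors(x[:-1], d, T)          # lag vectors up to time len-2
y = x[(d - 1) * T + 1:]                # the value following each lag vector
# predict the last point from the successor of its nearest neighbour
dists = np.linalg.norm(V[:-1] - V[-1], axis=1)
print("predicted:", y[dists.argmin()], "actual:", x[-1])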

3.2.5 Discrete states and observations: The hidden Markov model


This special case occurs naturally in some applications like speech analysis after
some signal processing (the number of phonemes in speech is finite, around 27),
and is intrinsic in bio-informatics analysis of genes and proteins.
Consider a finite sequence of observations $(x_1, x_2, \ldots, x_n)$. Assume that this
sequence was obtained from an unknown Markov chain on a finite state space:
$(s_1, s_2, \ldots, s_n)$, with state transitions and observations defined by discrete
probability processes $P(s_t|s_{t-1})$ and $P(x_t|s_t)$. This is the same setup as in
equation (6). The sizes of the state space and sometimes the observation space
are, however, often not known, but can be guessed, or made subject to inference
based on the retrodiction equation (7). Assume they are fixed to $d_s$ and $d_o$,
respectively. Now the unknowns are the state transition probabilities, a
$d_s \times d_s$ matrix of $d_s$ discrete probability distributions over $d_s$ values, and one
$d_o \times d_s$ observation distribution matrix with one discrete observation
distribution for each system state. If no particular application knowledge about
the parameters is available, it is natural to assume uniform Dirichlet priors
for the probability distributions, and the missing observations can be either
assumed uniformly distributed or, if they are few, distributed as the non-missing
observations.
An MCMC procedure for making inferences about the two ($d_s \times d_s$ and
$d_o \times d_s$) matrices goes as follows. The variables are the state vector
$(s_1, s_2, \ldots, s_n)$ and the two matrices. We can also accommodate a number of
missing observations as variables. The discrete distributions, the state vector
and the missing values are initialized to essentially arbitrary values. In each
step of the chain we propose a change in one of the $2d_s$ distributions, in a
state $s_i$, or in one or more of the missing observations. By considering the
change in probability (7) and the symmetry of the proposal distribution we
compute the acceptance probability for the move, draw a standard uniform
random number, and accept or reject the proposal depending on whether or not
the random number is less than the acceptance probability.
The probability is the product of all transition and observation probabilities
connecting a state with its successor state in the sequence, or a state $s_t$ with
the corresponding observation. Only a few of these probabilities actually change by the
proposal and it is important to take advantage of this in the computation. Eval-
uation of the run must be made by examination of summaries of a trace of the
computation. Typically, peaked posteriors are a sign that some real structure
of the system has been identified whereas diffuse posteriors leave the possibility
that there is no structure in the series (high noise level or even only noise in
relation to the length of the series) still plausible.
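A fragment of such a sampler, as a sketch under our own conventions (we take A[s', s] = P(s_t = s' | s_{t-1} = s) and B[x, s] = P(x_t = x | s_t = s); function names are illustrative): only the factors of (7) touching the changed state need recomputing.

import numpy as np

def local_logp(s, x, i, A, B):
    """Log of the factors involving state s[i]: the transition into it,
    out of it, and the observation emitted at time i."""
    lp = np.log(B[x[i], s[i]])
    if i > 0:
        lp += np.log(A[s[i], s[i - 1]])
    if i + 1 < len(s):
        lp += np.log(A[s[i + 1], s[i]])
    return lp

def metropolis_state_step(s, x, i, A, B, rng):
    """Propose a new value for state s[i] with a symmetric uniform proposal
    and accept or reject by the Metropolis rule."""
    old = s[i]
    lp_old = local_logp(s, x, i, A, B)
    s[i] = rng.integers(A.shape[0])
    lp_new = local_logp(s, x, i, A, B)
    if np.log(rng.random()) >= lp_new - lp_old:
        s[i] = old    # reject: restore the old state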

3.3 Graphical Models


Given a data matrix, the first question that arises concerns the relationships
between its variables (columns). Could some pairs of variables be considered
independent, or do the data indicate that there is a connection between them -
either directly causal, mediated through another variable, or introduced through
sampling bias? These questions are analyzed using graphical models, directed
or decomposable[67]. As an example, in figure 24 M1 indicates a model where
A and B are dependent, whereas they are independent in model M2 . These are
thus graphical representations of the models called MD and MI , respectively,
in section 3.2.2. In figure 25, we describe a directed graphical model M4' indicating
that variables A and B are independently determined, but the value of C will be
dependent on the values for A and B. The similar decomposable model M4 indicates
that the dependence of A and B is completely explained by the mediation of
variable C. We could think of the data generation process as determining A,
then C dependent on A, and last B dependent on C, or equivalently, determining
first C and then A dependent on C and B dependent on C. Directed graphical
models of the kind described here are also known as Bayesian networks, and
undirected ones as Markov Random Fields (MRF). Our analysis in this section is oriented towards graphs with
relatively simple structure (tree-like). For such models it is possible to make
inference about edges in the graph. Another family of mathematically identical
statistical models but with a much denser, grid-like, neighborhood structure is
used for example in image processing and spatial statistics. For such models
it is infeasible to make inference about the neighborhood structure, and meth-
ods originating in statistical mechanics can be applied. In this case we call the
models Markov Random Fields (see section 3.6). Technically, our undirected
model family is also a Markov Random Field model family, but it is usually not
described as such.
Bayesian analysis of graphical models involves selecting all or some graphs
on the variables, depending on prior information, and comparing their posterior
probabilities with respect to the data matrix. A set of highest posterior proba-
bility models usually gives many clues to the data dependencies[66, 67], although
one must - as always in statistics - constantly remember that dependencies are
not necessarily causalities.

[Figure 24: Graphical models, dependence or independence? The panels show
models M1 and M1', with a link between A and B, and models M2 and M2',
without one.]
[Figure 25: Graphical models, conditional independence? The panels show
models M3 and M3', with A, B and C fully linked, and models M4, M4' and
M4'', without a direct link between A and B.]
3.3.1 Causality and direction in graphical models
Normally, the identification of cause and effect must depend on one's under-
standing of the mechanisms that generated the data. There are several claims
or semi-claims that purely computational statistical methods can identify causal
relations among a set of variables. What is worth remembering is that these
methods create suggestions, and that even the concept of cause is not unambigu-
ously defined but a result of the way the external world is viewed. The claim
that causes can be found is based on the observation that directionality can in
some instances be identified in graphical models: Consider the models M4 and
M4' of figure 25. In M4, variables A and B could be expected to be marginally
dependent, whereas in M4' they would be independent. On the other hand,
conditional on the value of C, the opposite would hold: dependence between A
and B in M4' and independence in M4! This means that it is possible to identify
the direction of arrows in some cases in directed graphical models. It is difficult
to believe that the causal influence should not follow the direction of arrows in
those cases, and this opens the possibility, e.g., of scavenging through all kinds
of data bases recording important variables in society, looking for causes of un-
wanted things. Since causality immediately suggests the possibility of control
by manipulation of the cause, this is both a promising and a dangerous idea.
Certainly, this is a potentially useful idea, but it cannot be applied in isolation
from the application expertise, as the following example illustrates. It is known
as Simpson’s paradox, although it is not paradoxical at all.

3.3.2 Simpson’s paradox


Consider the application of drug testing. We have a new wonder drug that we
hope cures an important disease. We find a population of 800 subjects who have
got the disease; they are asked to participate in the trial and given a choice be-
tween the new drug and the alternative treatment currently assumed to be best.
Fortunately, half the subjects, 400, chose the new drug. Of these, 200 recover.
Of those 400 who chose the traditional treatment, only 160 recovered. Since the
test population seems large enough, we can conclude that the new drug causes
recovery of 50% of patients, whereas the traditional treatment only cures 40%.
Since this is known to be a nasty disease difficult to cure, particularly for women,
this seems to be a cause for celebration until someone suddenly remembers that
the recovery rate for men using traditional treatment is supposed to be better
than the 50% shown by the trial. Maybe the drug is not advantageous for men?
Fortunately, in this case it was easy to find the sex of each subject and to make
separate judgments for men and women.
So when men and women are separated, we find the following table:

                 recovery   no recovery   total   rec. rate
men
  treated           180          120        300      60%
  not treated        70           30        100      70%
women
  treated            20           80        100      20%
  not treated        90          210        300      30%
total
  treated           200          200        400      50%
  not treated       160          240        400      40%

Obviously, the recovery rate is lower for the new treatment, both for women
and for men! Examining the table reveals the reason, which is not paradoxical at
all: the disease is more severe for women, and the explanation for the apparent
benefits of the new treatment is simply that it was tried by more men, for
whom the disease is less severe. The sex influences both the severity of the
disease and the willingness to test the new treatment, in other words sex is a
confounder. This situation can always occur in studies of complex systems like
living humans and most biological, engineering or economic systems that are not
entirely understood, and the confounder can be much more subtle than sex. A
first remedy is to balance the design with respect to known confounders. But in
these types of systems there are always unknown confounders, and the only safe
way to test new treatments is to make a ’double blind’ test, where the treatment
is assigned randomly, and neither the subject nor the physician knows which
alternative was chosen for a subject. Needless to say, this design is ethically
questionable or at least debatable, and also one reason why drug testing is
extremely expensive. For complex engineered systems, however, systematically
disturbing the system is one of the most effective ways of understanding its
dynamics, hindered not by ethical but sometimes by economic concerns.
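The reversal in the table is easy to verify mechanically; a few lines of Python (our own check, not part of the original notes) reproduce the stratified and pooled rates:

# (recovered, total) per stratum, from the table above
men   = {"treated": (180, 300), "not treated": (70, 100)}
women = {"treated": (20, 100),  "not treated": (90, 300)}

for arm in ("treated", "not treated"):
    pooled_r = men[arm][0] + women[arm][0]
    pooled_t = men[arm][1] + women[arm][1]
    print(arm,
          "men %.0f%%" % (100 * men[arm][0] / men[arm][1]),
          "women %.0f%%" % (100 * women[arm][0] / women[arm][1]),
          "pooled %.0f%%" % (100 * pooled_r / pooled_t))
# Within each sex the untreated rate is higher, yet the pooled treated
# rate (50%) beats the pooled untreated rate (40%): sex is the confounder.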
When we want to find the direction of causal links, the same effect can occur.
In complex systems of nature, and even in commercial warehouse data bases, it
is not unlikely that we have not even measured the variable that will ultimately
become the explanation of a causal effect. Such an unknown and unmeasured
causal variable can easily turn the direction of causal influence indicated by the
comparison between models M4 and M4', at least unless the data is abundant. A
pedagogical example of this effect is when a causal influence from a barometer
reading to tomorrow's weather is found. The cause of tomorrow's weather is,
however, not the meter reading, but the air pressure it measures, and one can
even claim that the air pressure is caused by the weather in the vicinity that will
be tomorrow's weather here. This is not the total nonsense it may appear to be,
because it is probably more effective to control tomorrow's weather by controlling
the weather in the vicinity than by controlling the local air pressure or by
manipulating the barometer. In other cases this confounding can be much more
subtle.
Nevertheless, the new theories of causality have attracted a lot of interest,
and if applied with caution they should be quite useful[77, 49]. Their philosoph-
ical content is that a mechanism, causality, that could earlier not or only with
difficulty be formalized, has become available for analysis in observational data,
whereas it could earlier only be accessed in controlled experiments.

3.3.3 Segmentation and clustering
A second question that arises concerns the relationships between rows (cases)
in the data matrix. Are the cases built up from distinguishable classes, so that
each class has its data generated from a simpler graphical model than that of
the whole data set? In the simplest case these classes can be directly read off
in the graphical model. In a data matrix where inter-variable dependencies are
well explained by the model M4 , if C is a categorical variable taking only few
values, splitting the rows by the value of C could give a set of data matrices
in each of which A and B might be independent. However, the interesting
cases are where the classes cannot be directly seen in a graphical model because
then the classes are not trivially derivable. If the data matrix of the example
contained only variables A and B, because C was unavailable or not known to
interfere with A and B, the highest posterior probability graphical model might
be one with a link from A to B. The classes would still be there, but since C
would be latent or hidden, the classes would have to be derived from the A and
B variables only. A different case of classification is where the values of one
numerical variable are drawn from several normal distributions with different
means and variances. The full column would fit very badly to any single normal
distribution, but after classification, each class could have a set of values fitting
well to a normal distribution. The problem of identifying classes is known as
unsupervised classification. One comprehensive system for classification based
on Bayesian methodology is described by Cheeseman and Stutz[21]. In figure
26 we illustrate the use of segmentation: the segment number can be regarded
as a hidden variable, and if a segmentation that minimizes within-segment
dependencies can be found, then the data is given a much more explanatory
graphical model.
Finally, it is possible that a data matrix with many categorical variables
with many values gives a scattered matrix with very few cases compared to the
number of potentially different cases. Aggregation is a technique by which a
coarsening of the data matrix can yield better insight, such as replacing the age
and sex variables by the categories kids, young men, adults and seniors in a car
insurance application. The question of relevant aggregation is clearly related to
the problems of classification. For ordinal variables, this line of inquiry leads
naturally to the concept of decision trees, which can be thought of as a recursive
splitting of the data matrix by the size of one of its ordinal variables.

3.3.4 Graphical model analysis - overview


Graphical models are quite popular, but the literature is unfortunately some-
what incoherent. It is probably best to approach the problem using our reper-
toire of Bayesian analysis. Both directed and undirected models (Bayesian nets
or Markov Random Fields) describe a multivariate distribution. They are char-
acterized by (i) a graph describing ’dependencies’ in a visually attractive way,
and (ii) probability tables, describing the precise numerical nature of these de-
pendencies. In the choice between directed and undirected models there is really
no general rule to apply; in many areas one of these has become ’standard’ for
no obvious reason. The directed models are the most common today, particu-
larly when it is felt that qualitative causality reasoning can go a long way to
identify ’the right’ graph. The undirected (MRF) graphical models are however

simpler and easier to work with.

[Figure 26: Segmentation explains data: The variables themselves are highly
dependent (left). If a segmentation into components with independent variables
can be found, then the variable indicating segment number can be added to get
simpler dependencies. It is called a latent or hidden variable.]

In particular, they do not suffer from the
problem found for Bayesian nets that there are many different graphs defining
the same family of probability distributions, which makes inference about the
graph somewhat error-prone, if the inference is over a set of graphs where
one edge can have both directions.
After having chosen between a few main families of models, one has to use
either prior information or a past set of outcomes - a training data set - to fill in
the items (i) and (ii) above.
The prior information is normally not sufficient for deciding which graph to
use and which probability distributions to enter into the graph. Using Bayesian
inference one can find a posterior distribution over graphs and probability ta-
bles, and hopefully some standard estimation can select one exact model that is
good enough for the application. Typically, the graph structure is found with a
MAP estimate on the marginal distribution (integrating out probability tables
as nuisance parameters), followed by a Laplace (mean) estimate for the proba-
bility tables. This is another example of the Rao-Blackwellization method used
already in Ch 3.1.5. We will now go through some details in this procedure.

3.3.5 Graphical model choice - local analysis


We will analyze a number of models involving two or three variables of cate-
gorical type, as preparation for the task of determining likely decomposable or
directed graphical models. First, consider the case of two variables, A and B,

and our task is to determine whether or not these variables are dependent. The
analysis in section 3.2.2 shows how we can choose between models M1 and M2 :
Let dA and dB be the number of possible values for A and B, respectively.
It is natural to regard categorical data as produced by a discrete probability
distribution, and then it is convenient to assume Dirichlet distributions for the
parameters (probabilities of the possible outcomes) of the distribution.
We will find that this analysis is the key step in determining a full graphical
model for the data matrix.
We could also consider a different model M1', where the A column is generated
first and then the B column is generated for each value of A in turn. This
corresponds to the directed model M1' of figure 24. With uniform priors we get:
$$p(n|M_1') = \frac{\Gamma(n+1)\Gamma(d_A)\Gamma(d_B)^{d_A}}{\Gamma(n+d_A)} \prod_i \frac{\Gamma(n_{i.}+1)}{\Gamma(n_{i.}+d_B)} \qquad (33)$$
Observe that we are not allowed to decide between the undirected model M1 and
the directed model M1' based on equations (29) and (33). This is because these
models define the same set of pdf:s involving A and B, the difference lying only
in the structure of parameter space and parameter priors. They overlap on a
set of prior probability one.
In the next model M2 we assume that the A and B columns are indepen-
dent, each having its own discrete distribution. There are two different ways to
specify prior information in this case. We can either consider the two columns
separately, each being assumed to be generated by a discrete distribution with
its own prior; or we could follow the style of M1' above, with the difference
that each A value has the same distribution of B-values. The first approach
corresponds to the analysis in section 3.2.2, equation (30). From equations (29)
and (30) we obtain the Bayes factor for the undirected data model:

$$\frac{p(n|M_2)}{p(n|M_1)} = \frac{\Gamma(n+d_A d_B)\Gamma(d_A)\Gamma(d_B) \prod_j \Gamma(n_{.j}+1) \prod_i \Gamma(n_{i.}+1)}{\Gamma(n+d_A)\Gamma(n+d_B)\Gamma(d_A d_B) \prod_{ij} \Gamma(n_{ij}+1)}. \qquad (34)$$

The second approach to model independence between A and B gives the
following:
$$p(n|M_2') = \frac{\Gamma(n+1)\Gamma(d_A)}{\Gamma(n+d_A)} \int_{L_{d_B}} \prod_i \binom{n_{i.}}{n_{i1}, \ldots, n_{id_B}} \prod_j x_j^{n_{ij}} \, \Gamma(d_B) \, dx^B$$
$$= \frac{\Gamma(n+1)\Gamma(d_A)\Gamma(d_B)}{\Gamma(n+d_A)} \prod_i \binom{n_{i.}}{n_{i1}, \ldots, n_{id_B}} \int_{L_{d_B}} \prod_j x_j^{n_{.j}} \, dx^B$$
$$= \frac{\Gamma(n+1)\Gamma(d_A)\Gamma(d_B)}{\Gamma(n+d_A)} \prod_i \binom{n_{i.}}{n_{i1}, \ldots, n_{id_B}} \Big/ c(n_{.j}+1)$$
$$= \frac{\Gamma(n+1)\Gamma(d_A)\Gamma(d_B) \prod_i \Gamma(n_{i.}+1) \prod_j \Gamma(n_{.j}+1)}{\Gamma(n+d_A)\Gamma(n+d_B) \prod_{ij} \Gamma(n_{ij}+1)}. \qquad (35)$$

We can now find the Bayes factor relating models M1' (equation 33) and M2'
(equation 35), with no prior preference for either:
$$\frac{p(M_2'|D)}{p(M_1'|D)} = \frac{p(n|M_2')}{p(n|M_1')} = \frac{\prod_j \Gamma(n_{.j}+1) \prod_i \Gamma(n_{i.}+d_B)}{\Gamma(d_B)^{d_A-1}\Gamma(n+d_B) \prod_{ij} \Gamma(n_{ij}+1)} \qquad (36)$$
Consider now a data matrix with three variables, A, B and C (figure 25).
The analysis of the model M3 where full dependencies are accepted is very
similar to M1 above (equation 29). For the model M4 without the link between
A and B we should partition the data matrix by the value of C and multiply
the probabilities of the blocks with the probability of the partitioning defined
by C.
Since we are ultimately after the Bayes factors relating M4 and M3, respectively
M4' and M3', we can simply multiply the Bayes factors relating M2 and
M1 (equation 30), respectively M2' and M1' (equation 36), for each block of the
partition to get the Bayes factors sought:

$$\frac{p(M_4|D)}{p(M_3|D)} = \frac{p(n|M_4)}{p(n|M_3)} = \frac{\Gamma(d_A)^{d_C}\Gamma(d_B)^{d_C}}{\Gamma(d_A d_B)^{d_C}} \prod_c \frac{\Gamma(n_{..c}+d_A d_B) \prod_j \Gamma(n_{.jc}+1) \prod_i \Gamma(n_{i.c}+1)}{\Gamma(n_{..c}+d_A)\Gamma(n_{..c}+d_B) \prod_{ij} \Gamma(n_{ijc}+1)} \qquad (37)$$

and the directed case is similar[56]. The values of the gamma function are
rather large even for moderate values of its argument. For this reason the
formulas in this section are always evaluated in logarithm form, where products
like formula (37) translate to sums of logarithms.
The above analysis takes into account the probability of the data being
generated by the two models. It may give unintuitive results when there are
some data values for C that ’obviously’ make A and B dependent, and for these
values there is indeed strong evidence in favor of a link between A and B. But
the method is quite objective, and if other values of C are frequent, then this
data-dependent dependency will be considered a ’random accident’ by equation
(37). There are several recipes to overcome this problem: (i) when substantive
B; (ii) There are more detailed ways to describe dependencies in a multivariate
data set, although at present these methods have no really systematic and at the
same time practical treatment; (iii) The ultimate method is to go into the use
of the model and produce it using a formal utility optimization. Once quality of
model output can be quantified, cross-validation is a possible practical approach
(Ch. 2.1.12).
Exercise 24
Derive a formula similar to (37) for the directed case, $p(M_4'|D)/p(M_3'|D)$.

3.3.6 Graphical model choice - global analysis


If we have many variables, their interdependencies can be modeled as a graph
with vertices corresponding to the variables. The example of figure 27 is from
[66], and shows the dependencies in a data matrix related to heart disease.

[Figure 27: Symptoms and causes relevant to heart problems. Variables: Mental
Work, Lipoproteins, Blood Pressure, Anamnesis, Physical Work, Smoking.]

Of
course, a graph of this kind can give a data probability to the data matrix
in a way analogous to the calculations in the previous section, although the
formulae become rather involved, and the number of possible graphs increases
dramatically with the number of variables. It is completely infeasible to list and
evaluate all graphs if there is more than a handful of variables. An interesting
possibility to simplify the calculations would use some kind of separation, so
that an edge in the model could be given a score independent of the inclusion or
exclusion of most other potential edges. Indeed, the derivations of the last section
show how this works. Let C in that example be a compound variable, obtained
by merging columns $\{c_1, \ldots, c_d\}$. If two models G and G' differ only by the
presence and absence of the edge {A, B}, and if there is no path between A and
B except through the vertex set C, then the expressions for $p(n|M_4)$ and $p(n|M_3)$
above will become factors of the expressions for $p(n|G)$ and $p(n|G')$, respectively,
and the other factors will be the same in the two expressions. Thus, the Bayes
factor relating the probabilities of G and G' is the same as that relating M4 and
M3.
model, since the structure of the derivation follows the structure of the graph
of the model - it is equally valid for Gaussian or other data models, as long as
the parameters of the participating distributions are assumed independent in
the prior assumptions.
We can now think of various ’greedy’ methods for building high probability
interaction graphs relating the variables (columns in the data matrix). It is
convenient and customary to restrict attention to either decomposable(chordal)
graphs or directed acyclic graphs. Chordal graphs are fundamental in many
applications of describing relationships between variables (typically variables in
systems of equations or inequalities). They can be characterized in many dif-
ferent but equivalent ways, see (Rose [84], Rose, Lueker and Tarjan[85]). One

simple way is to consider a decomposable graph as consisting of the union of
a number of maximal complete graphs (cliques, or maximally connected sub-
graphs), in such a way that (i) there is at least one vertex that appears in only
one clique (a simplicial vertex), and (ii) if an edge to a simplicial vertex is re-
moved, another decomposable graph remains, and (iii) the graph without any
edges is decomposable. A characteristic feature of a simplicial vertex is that its
neighbors are completely connected by edges. This recursive definition can be
reversed into a generation procedure: Given a decomposable graph G on the
set of vertices, find two vertices s and n such that (i): s is simplicial, i.e., its
neighbors are completely connected, (ii): n is connected to all neighbors of s.
Then the graph G' obtained by adding the edge between s and n to G is also
decomposable. We will call such an edge a permissible edge of G. This proce-
dure describes a generation structure (a directed acyclic graph whose vertices
are decomposable graphs on the set of vertices) containing all decomposable
graphs on the variable set. An interesting feature of this generation process is
that it is easy to compute the Bayes factor comparing the posterior probabilities
of the graphs G and G' as graphical models of the data: let s correspond to A,
n to B, and the compound variable obtained by fusing the neighbors of s to C
in the analysis of section 3.3.5. Without explicit prior model probabilities we have:
$$\frac{p(G'|D)}{p(G|D)} = \frac{p(n|M_3)}{p(n|M_4)}.$$
A search for high probability graphs can now be organized as follows:
1. Start from the graph G0 without edges.
2. Repeat: find a permissible edge that gives the highest Bayes factor, and
add it if the factor is greater than 1. Keep a set of the highest-probability
graphs encountered.
3. Then repeat: for the high probability graphs found in the previous step,
find edges to simplicial vertices whose removal increases the Bayes factor the
most (or decreases it the least).
The above finds a high-probability graph that may be globally optimal and
similar to all high-probability graphs, but this is not always the case. It is pos-
sible to get a posterior distribution over the graph by using an MCMC style
approach: Each time the possibility of adding or deleting an edge is contem-
plated, the acceptance criterion of MCMC is used (i.e., accept a proposal that
improves the posterior probability, and accept other proposals by randomiza-
tion depending on the Bayes factor). We may still have convergence problems,
however.
For each graph found in this process, its Bayes factor relative to G0 can be
found by multiplying the Bayes factors in the generation sequence. A procedure
similar to this one is reported by (Madigan and Raftery[67]), and its results on
small variable sets were found to be good, in that it found the best graphs reported
in other approaches. It must be noted, however, that we have now passed into
the realm of approximate analysis, since we cannot know that we will find all
high probability graphs. For directed graphical models, a similar method of
obtaining high probability graphs is known as the K2 algorithm[9].
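A skeleton of the forward search, as our own sketch: the permissibility test is the one given above, and log_bf_add_edge(a, b, sep) is a user-supplied callback assumed to return the log Bayes factor of equation (37) for adding the edge {a, b} with separator sep.

import itertools

def greedy_chordal_search(n_vars, log_bf_add_edge):
    """Greedy forward construction of a high-probability decomposable graph:
    start from the empty graph and repeatedly add the permissible edge with
    the largest positive log Bayes factor."""
    adj = {v: set() for v in range(n_vars)}
    while True:
        best, best_edge = 0.0, None
        # try both orientations, so either vertex may play the simplicial role
        for a, b in itertools.permutations(range(n_vars), 2):
            if b in adj[a]:
                continue
            nb = adj[a]
            # permissible: a simplicial (its neighbours pairwise connected)
            # and b connected to all neighbours of a
            if not all(v in adj[u] for u, v in itertools.combinations(nb, 2)):
                continue
            if not nb <= adj[b]:
                continue
            score = log_bf_add_edge(a, b, frozenset(nb))
            if score > best:
                best, best_edge = score, (a, b)
        if best_edge is None:
            return adj
        a, b = best_edge
        adj[a].add(b)
        adj[b].add(a)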

3.3.7 Categorical, ordinal and Gaussian variables
We now consider data matrices made up from ordinal and real valued data,
and then matrices consisting of both ordinal, real and categorical data. The
standard choice for a real valued data model is the univariate or multivariate
Gaussian or normal distribution. It has nice theoretical properties manifest-
ing themselves in such forms as the central limit theorem, the least squares
method, principal components, etc. However, it must be noted that it is also
unsatisfactory for many data sets occurring in practice, because of its narrow
tail and because many real life distributions deviate terribly from it. Several
approaches to solve this problem are available. One is to consider a variable
as being obtained by mixing several normal distributions. This is a special
case of the classification or segmentation problem discussed below. Another
is to disregard the distribution over the real line, and to consider the variable
as just being made up of an ordered set of values. A quite useful and robust
method is to discretize the variables. This is equivalent to assuming that their
probability distribution functions are piecewise constant. Discretized variables
can be treated as categorical variables, by the methods described above. The
method wastes some information, but is quite simple and robust. Typically, the
granularity of the discretization is chosen so that a reasonably large number of
observations fall in each level. It is also possible to assign an observation to
several levels, depending on how close it is to the intervals of adjacent levels.
This is reminiscent of linguistic coding in fuzzy logic[110]. Discretization does
however waste information in the data matrix. It is possible to formulate the
theory of section 3.3.5 using inverse Wishart distributions as conjugate priors
for multivariate normal distributions, but this is leads to fairly complex formu-
las and is seldom implemented. Since most practically occurring distributions
deviate a lot from normality, it is in practice necessary to model the distribu-
tions as mixtures of normal distributions which leads to even higher conceptual
complexity. A compromise between discretization and use of continuous dis-
tributions is analyses of the rankings of the variables occurring in data tables.
When considering the association between a categorical and a continuous vari-
able one would thus investigate the ranks of the continuous variable, which are
independently and uniformly distributed over their range for every category if
there is no association. Using a model where the ranks are non-uniformly dis-
tributed (e.g. with a linearly varying density), we can build the system of model
comparisons of section 3.3.5. The difficulty is that the nuisance parameters can-
not be analytically integrated out, so a numerical quadrature procedure must
be used.
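A sketch of the simple equal-frequency discretization mentioned earlier in this section (function name ours): cut points are placed at quantiles, so that a reasonably large number of observations falls in each level.

import numpy as np

def discretize(col, levels=5):
    """Map a numerical column to integer level codes 0..levels-1 with
    (approximately) equal counts per level."""
    cuts = np.quantile(col, np.linspace(0, 1, levels + 1)[1:-1])
    return np.searchsorted(cuts, col)

x = np.random.lognormal(size=1000)   # heavy-tailed, far from normal
print(np.bincount(discretize(x)))    # roughly 200 observations per level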

3.4 Missing values and errors in data matrix


Data collected from experiments are seldom perfect. The problem of missing
and erroneous data is a vast field in the statistics literature. First of all there
is a possibility that ’missingness’ of data values is significant for the analysis,
in which case missingness should be modeled as an ordinary data value. Then
the problem has been internalized, and the analysis can proceed as usual, with
the important difference that the missing values are not available for analysis.
A more sceptical approach was developed by Ramoni and Sebastiani[81], who
consider an option to regard the missing values as adversaries (the conclusions

on dependence would then be true no matter what the missing values are). A
third possibility is that missingness is known to have nothing to do with the
objectives of the analysis. For example, in a medical application, if data is
missing because of the bad condition of the patient, missingness is significant
if the investigation is concerned with patients. But if data is missing because
of unavailability of equipment, it is probably not - unless maybe if the investi-
gation is related to hospital quality. In Bayesian data analysis, the problem of
missing or erroneous data creates significant complications, as we will see. As
an example, consider the analysis of the two-column data matrix with binary
categorical variables A and B, analyzed against models M1 and M2 of section
5. Suppose we obtained n00 , n01 , n10 and n11 cases with the values 00, 01,
etc. We then have a posterior Dirichlet distribution with parameters nij for the
probabilities of the four possible cases. If we now receive a case where both A
and B are unknown, it is reasonable that this case is altogether ignored. But
what shall we do if a case arrives where A is known, say 0, but B is unknown?
One possibility is to waste the entire case, but this is not orthodox Bayesian,
since we are not making use of information we have. Another possibility is to
use the current posterior to estimate a pdf for the missing value, in our case the
probability that B has value 0 is p0 = n00 /n0. . So our posterior is now either a
Dirichlet with parameters n00 , n01 − 1, n10 − 1 and n11 − 1 (probability p0 ) or
one with parameters n00 − 1, n01 , n10 − 1 and n11 − 1 (probability 1 − p0 ). But
this means that the posterior is now a weighted average of two Dirichlet distri-
butions, in other terms, is not a Dirichlet distribution at all! As the number of
missing values increases, the number of terms in the posterior will increase ex-
ponentially, and the whole advantage with conjugate distributions will be lost.
So wasting the whole case seems to be a reasonable option unless we find a more
clever way to proceed.
Assuming that data is missing at random, it is relatively easy to get an
adequate analysis. It is not necessary to waste entire cases just because they
have a missing item. Most of the analyses made refer only to a small number of
columns, and these columns can be compared for all cases that have no missing
data in these particular columns. In this way it is, e.g., possible to make a
graphical model for a data set even if every case has some missing item, since
all computations of section 3.3.5 refer to a small number of columns. In this
situation it is even possible to impute the values missing, because the graphical
model obtained shows which variables influence the missing one most. So every
missing value for a variable can be guessed by predicting it from the case's values
for the neighbors of the variable in the graph of the model. When this is done,
one must always remember that the value is guessed. It can thus never be used
to create a formal significance measure - that would be equivalent to using the
same data twice, which is not permitted in formal inference.
The method of imputing missing values has a nice formalization in the Ex-
pectation Maximization (EM) method: This method is used to create values for
missing data items by using a parameterized statistical model of the data. In
the first step, the non-missing data is used to create an approximation of the
parameters. Then the missing data values are defined (given imputed values)
to give highest probability to the imputed data matrix. We can then refine the
parameter estimates by maximization of probability over parameters with the
now imputed data, then over the missing data, etc. until convergence obtains.
This method is recommended for use in many situations despite the fact that it

is not strictly Bayesian and it violates the principle of not creating significance
from guessed (imputed) values. It is important to verify that data are really
missing at random, otherwise the distribution of missing data values cannot be
inferred from the distribution of non-missing data. A spot check that is easy to
perform is to code a variable with many missing items as a 0/1 indicator column
of missingness, and check its association to other columns using equation (30).
The most spectacular use of the EM algorithm is for automatic (unsupervised)
classification in the AUTOCLASS model (see section 3.5).
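The spot check mentioned above is easy to mechanize; a sketch (our own, assuming NaN-coded missing values and reusing the log_bf_independence sketch given after equation (30) in section 3.2.2):

import numpy as np

def missingness_table(col, other):
    """Contingency table of a 0/1 missingness indicator for `col` against
    the categorical column `other`, for use with equation (30)."""
    miss = np.isnan(col).astype(int)
    values = np.unique(other)
    table = np.zeros((2, len(values)), dtype=int)
    for i in (0, 1):
        for j, v in enumerate(values):
            table[i, j] = np.sum((miss == i) & (other == v))
    return table

# A Bayes factor favouring independence supports 'missing at random'
# with respect to that column:
# print(log_bf_independence(missingness_table(col, other)))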
Imputation of missing values can also be performed with the MCMC method,
by having one variable for each missing value. The trace will give a picture of
their joint posterior distribution, and this has the advantage of not creating
more significance than is justified by the model and the data. The method is
strongly recommended over simple imputation by Gelman.
The related case of errors in data is more difficult to treat. How do we
describe data where there are known uncertainties in the recording procedure?
This is a problem worked on for centuries when it comes to real valued quan-
tities as measured in physics and astronomy, and is one of the main features
of interpretation of physics experiments. When it comes to categorical data
there is less help in the literature - an obvious alternative is to relate recorded
vs actual values of discrete variables as a probability distribution, or - which is
fairly expedient in our approach - as an equivalent sample.

3.4.1 Decision trees


Decision trees are typically used when we want to predict a variable - the class
variable - from other - explanatory - variables in a case, and we have a data
matrix of known cases. When modeling data with decision trees, we are usually
trying to segment the data set into ranges - n-dimensional boxes of which some
are unbounded - such that a particular variable - the class variable - is fairly
constant over each box. If the class variable is truly constant in each box,
we have a tree that is consistent with respect to the data. This means that
for new cases, where the class variable is not directly available, it can be well
predicted by the box into which the case falls. The method is suitable where the
variables used for prediction are of any kind (categorical, ordinal or numerical)
and where the predicted variable is categorical or ordinal with a small domain.
There are several efficient ways to heuristically build good decision trees, and it
is a central technique in the field of machine learning[72]. Practical experience
has given many cases where the predictive performance of decision trees is good,
but also many counter-intuitive phenomena have been uncovered by practical
experiments. Recently, several treatments of decision trees have been published
where it is discussed whether or not the smallest possible tree consistent with all
cases is the best one. This turned out not to be the case, and the argument that
a smallest decision tree should be preferred because of some kind of Occam’s
razor argument is apparently not valid, neither in theory nor in practice[105, 15].
The explanation is that the data one wants to classify has usually not been
generated in such a way that the smallness of the tree gives a good indication
of generalizing power. An example is shown in Figure 28. The distribution
of the data is well approximated by two multivariate normal distributions, but
since these lie diagonally, the ortho-linear separating planes of the decision tree
produced from a moderate size sample fit badly to the distribution and depend a
lot on the sample. In this example, decision trees with general linear boundaries
perform much better, and in fact the optimal separator of the two distributions
is close to the diagonal line that separates the point sets with a largest possible
margin (section 2.8).

[Figure 28: Decision tree with bad generalization power]

The Bayesian approach gives the right information on the credibility and gen-
eralizing power of a decision tree. It is explained in recent papers by (Chipman,
George and McCulloch[22]) and by (Paass and Kindermann[75]). A decision
tree statistical model is one where a number of boxes are defined on one set
of variables by recursive splitting of one box into two by splitting the range
of one designated variable into two. Data are assumed to be generated by a
discrete distribution over the boxes, and for each box it is assumed that the
class variable value is generated by another discrete distribution. Both these
distributions are given uninformative Dirichlet prior distributions, and thus the
posterior probability of a decision tree can be computed from data. Since larger
trees have more parameters, there is an automatic penalization of large trees,
but the distribution of cases into boxes also enters the picture, so it is not clear
that the smallest tree giving perfect classification will be preferred, or even that
a consistent tree will be preferred over an inconsistent one. The decision trees
we described here do not give a clear cut decision on the value of the decision
variable for a case, but a probability distribution over values. If the probability
distribution is not peaked at a specific class value, then this indicates that pos-
sibly more data must be collected before a decision can be made. Also, since
the name of this data model indicates its use for decision making, one can get
better trees for an application by including information about the utility of the
decision in the form of a loss function and by comparing trees based on the
expected utility rather than model probability.
For a decision tree T with d boxes, data with c classes, and where the number
of cases in box $i$ with class value $k$ is $n_{ik}$, and $n = n_{..}$, we have, with uniform
priors on both the assignment of case to box and of class within box,
$$p(D|T) = \frac{\Gamma(n+1)\Gamma(d)}{\Gamma(n+d)} \prod_i \frac{\Gamma(n_{i.}+1)\Gamma(c)}{\Gamma(n_{i.}+c)}$$

However, in order to compare two trees T and T', we would have to form the
set of intersection boxes and ask about the probability of finding the data with
a common parameter over the boxes belonging to a common box of T, relative
to the probability of the data when the parameters are common in boxes of T'.
For the case where T and T' only differ by the splitting of one box $i$ into $i'$ and
$i''$, the calculation is easy ($n_{i'j} + n_{i''j} = n_{ij}$):
$$\frac{p(D|T')}{p(D|T)} = \frac{\Gamma(n_{i.}+c)}{\Gamma(n_{i'.}+c)\Gamma(n_{i''.}+c)} \prod_j \frac{\Gamma(n_{i'j}+1)\Gamma(n_{i''j}+1)}{\Gamma(n_{ij}+1)} \qquad (38)$$

A reasonably well generalizing decision tree can now be obtained top-down
recursively by selecting the variable and decision boundary that maximize (38),
until the maximum is less than 1, at which point we make a terminal node
deciding by the majority of the labels of the training cases falling in the box.
This is a slight generalization of the ’change-point’ problem described in section 3.1.4.
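Equation (38) is again a ratio of gamma functions, so it is evaluated with gammaln; a sketch in Python (our own), where a positive log gain means the split is accepted:

import numpy as np
from scipy.special import gammaln

def log_split_gain(n_i, n_i1, n_i2):
    """Log of equation (38) for splitting box i (class-count vector n_i)
    into two boxes with class counts n_i1 and n_i2, n_i1 + n_i2 = n_i."""
    c = len(n_i)
    return (gammaln(n_i.sum() + c)
            - gammaln(n_i1.sum() + c) - gammaln(n_i2.sum() + c)
            + gammaln(n_i1 + 1).sum() + gammaln(n_i2 + 1).sum()
            - gammaln(n_i + 1).sum())

# An informative split is accepted, an uninformative one rejected:
print(log_split_gain(np.array([10, 10]), np.array([9, 1]), np.array([1, 9])))  # > 0
print(log_split_gain(np.array([10, 10]), np.array([5, 5]), np.array([5, 5])))  # < 0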

3.5 Clustering and Mixture Modelling


Clustering is not a central topic in this course, but it is so closely related to
mixture modeling and segmentation that a short discussion seems appropriate.
Clustering is a practice-rich but theory-poor activity where one tries to find
groups of points such that the distances between points are small within clusters
and large between clusters. More correctly, one may say that the points of a
cluster are closer to that cluster's center than to the centers of other clusters. With this
view, clustering is apparently a form of mixture modeling where the components
are characterized by a cluster center and a pdf that depends on the distance
measure used - the density is a decreasing function of the distance to the center.
In clustering, however, we are normally not interested in the interpretation in
terms of mixtures and their posteriors, but the classes obtained by an algorithm
are taken as ’true’ - a kind of MAP estimation. The result of clustering must
always be discussed directly in application terms, and quite a lot of work in
preprocessing and defining the distance function is typically required before
results are presentable to application owners.
The main algorithms for clustering are K-means and hierarchical clustering.
K-means has the flavor of Expectation Maximization: the number of clusters
and a distance function are fixed, tentative cluster centers are chosen, and then
two refinement steps are iterated until convergence (a minimal sketch follows the list):
1. Assign each data point to the closest cluster center.
2. Recompute the cluster centers.
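A minimal Matlab sketch of this iteration, assuming Euclidean distance (the
function name and the handling of details such as empty clusters are illustrative):

function [lab, C] = kmeans_sketch(X, K)
% Minimal K-means sketch: X is n-by-d data, K the number of clusters.
% Iterates the two refinement steps until the labels stop changing.
% Empty clusters are not handled; requires implicit expansion (R2016b+).
  n = size(X,1);
  C = X(randperm(n,K), :);                % tentative cluster centers
  oldlab = zeros(n,1);
  while true
    % Step 1: assign each data point to the closest cluster center
    D = zeros(n,K);
    for k = 1:K
      D(:,k) = sum((X - C(k,:)).^2, 2);   % squared Euclidean distances
    end
    [~, lab] = min(D, [], 2);
    if isequal(lab, oldlab), break; end
    oldlab = lab;
    % Step 2: recompute the cluster centers
    for k = 1:K
      C(k,:) = mean(X(lab==k, :), 1);
    end
  end
end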
In hierarchical clustering one determines a method to compute the distance
between two clusters (for one-point clusters, the point distance is used). Then
one iteratively determines the two closest clusters and merges them. This pro-
cess ends, with n data points, after n − 1 steps, when there is only one cluster
left. A tree representation of the merging process, and goodness measures for

the clusterings on the way, can be used visually to guess which is the ’best’
clustering proposal encountered on the way.
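A naive Matlab sketch of the merging loop, here with single linkage (the cluster
distance is the minimum point distance; names and the O(n^3) bookkeeping are
illustrative):

function merges = hclust_sketch(D)
% Naive single-linkage hierarchical clustering sketch. D is a symmetric
% n-by-n matrix of point distances; row i of merges records which two
% clusters were joined at step i. The process ends after n-1 merges.
  n = size(D,1);
  D(1:n+1:end) = inf;                   % no self-merges
  alive = true(1,n);
  merges = zeros(n-1,2);
  for step = 1:n-1
    Dv = D; Dv(~alive,:) = inf; Dv(:,~alive) = inf;
    [~, idx] = min(Dv(:));              % the two closest clusters
    [i, j] = ind2sub([n n], idx);
    merges(step,:) = [i j];
    D(i,:) = min(D(i,:), D(j,:));       % single linkage update
    D(:,i) = D(i,:)'; D(i,i) = inf;
    alive(j) = false;
  end
end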
A considerable weakness with clustering compared to Gaussian mixture mod-
eling is that components that overlap cannot be found. An extreme example
is given in figure 29, where the two normal distributions are easily found with
MCMC, but not with any distance-based clustering method.
On the other hand, if we want to model the data as a mixture of multivariate
Gaussian distributions, we would write the probability for a given assignment
of points to mixture components with known variances and means as follows:

p(x, λ, µ, Σ, c) = ∏_i λ_{c_i} (2π)^{−d/2} |Σ_{c_i}|^{−1/2} exp(−(1/2)(x_i − µ_{c_i})^T Σ_{c_i}^{−1}(x_i − µ_{c_i})).

We can group the factors in the above product by the mixture component to
which the data points are assigned, so x_i^{(c)} is the sequence of data points for
component c and n_c is the number of points assigned to component c:

(2π)^{−nd/2} ∏_k λ_k^{n_k} |Σ_k|^{−n_k/2} exp(−(1/2) Σ_i (x_i^{(k)} − µ_k)^T Σ_k^{−1}(x_i^{(k)} − µ_k)).

The simple Rao-Blackwellization used for the univariate Gaussian mixture


modeling (see Exercise 19) can not immediately be applied to the multivariate
case since it is quite demanding to find a prior for the variance matrix of each
component and to integrate it out as a nuisance parameter. However, with a
little cheating we can integrate out the variance matrix. We do this cheating
by assigning a prior on Σk that depends on the points actually assigned to
component k. We prefer to do our integrals by only matching distributions and
normalization constants for the beautiful but difficult to integrate distributions
in classical tables. From a table, e.g., [16, App A] or [45], we find the Inverse
Wishart distribution which is parametrized by a d by d positive definite and
symmetric scale matrix S and a number of degrees of freedom ν:

p(W) = [2^{νd/2} π^{d(d−1)/4} ∏_{i=1}^{d} Γ((ν + 1 − i)/2)]^{−1} |S|^{ν/2} |W|^{−(ν+d+1)/2} exp(−(1/2) tr(SW^{−1})).

In order to integrate out a nuisance parameter, we must identify the exponentials
in the two last formulas, and we must also identify Σ_k with W. This is
done by first identifying tr(SW^{−1}) with Σ_i (x_i^{(k)} − µ_k)^T Σ_k^{−1}(x_i^{(k)} − µ_k), giving
S_k = q_k − µ_k s_k^T − s_k µ_k^T + n_k µ_k µ_k^T, where q_k = Σ_i x_i^{(k)}(x_i^{(k)})^T and s_k = Σ_i x_i^{(k)}.
We can now perform the integration over the covariance matrix, and the inverse
normalization constant for the Inverse Wishart distribution is obtained by
identifying the powers of |W|, leading to ν_k = n_k − d − 1.
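A small Matlab sketch of these sufficient statistics (function and variable names
are illustrative):

function Sk = scale_matrix(Xk, muk)
% Scale matrix S_k for component k as in the text: Xk is the nk-by-d
% matrix of points currently assigned to component k, muk is d-by-1.
  nk = size(Xk,1);
  qk = Xk' * Xk;                % q_k: sum over the component of x x^T
  sk = sum(Xk,1)';              % s_k: sum over the component of x
  Sk = qk - muk*sk' - sk*muk' + nk*(muk*muk');
end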


Figure 29: An extreme example of two overlapping Gaussian components in 2D.
The components are readily identifiable with mixture modeling, but not with
standard clustering methods, since the geometrical distance concepts used in
clustering do not capture the components.

3.6 Markov Random Fields and Image segmentation
A quite useful statistical model for Bayesian modeling and inference is the
Markov Random Field(MRF). In the discrete version (that is representable on
a finite computer) it is a set of dependent random variables located on the
vertices of a neighborhood graph. The joint distribution of the variables in a
Markov random field can be defined using a set of distributions over its closed
neighborhood sets (each such set consists of a node in the neighborhood graph
together with its neighbors), such that whenever two such sets intersect, the
marginal distributions over the intersection is the same for the two sets. The
defining information for the random properties of the field can also be expressed
with, for each node, its distribution conditional on the values of its neighbors.
With these conditional distributions it is also possible to simulate it, i.e., to
generate a large sample from the field. The MRF is actually the same family
of statistical distributions defined by directed graphical models. However the
name MRF is typically used when the graph is large but has a known structure,
whereas graphical model and Bayesian network are the names used when the
graphs are small or have unknown neighborhood relations. When graphs are
dense we cannot, for example, use the decomposability methods, because the
graphs cannot be decomposed by small separators. The most prominent case of
this is a rectangular grid which, despite every vertex having just four neighbors,
cannot be separated into two equally large parts by a small separator.
The family of distributions known as Gibbs Random Fields, GRF, are defined
using the same formula as the state distribution for a physical system: P (λ̄) ∝
exp(−E(λ̄)), where E is the energy, a function of the total state λ̄. For a system
with a neighborhood graph, the state is the vector of vertex states and the energy
decomposes additively into functions of the state projected on the cliques C of
the graph. In other words, there is an energy function Ec on each completely
connected part c ∈ C of the neighborhood graph, and E_c is a function of the
states of its vertices, λ̄_c. The distribution of a GRF is thus:

P(λ̄) ∝ exp(−Σ_{c∈C} E_c(λ̄_c)).
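As a concrete illustration (an Ising-like assumption, not from the text), a GRF
on a binary grid whose cliques are the neighbor pairs can be evaluated as follows
in Matlab:

function lp = grf_logp(S, J)
% Unnormalized log-probability of an Ising-like GRF: S is an m-by-n
% matrix of spins in {-1,+1}, and each horizontal and vertical neighbor
% pair (a clique) contributes energy -J*s(u)*s(v); log P = -E + const.
  E = -J * (sum(sum(S(1:end-1,:).*S(2:end,:))) + ...
            sum(sum(S(:,1:end-1).*S(:,2:end))));
  lp = -E;
end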

The following is the central characterization theorem of a MRF:

Proposition 4 (Hammersley, Clifford) For an undirected graph G, a distribution
over its vertices is an MRF of G if and only if it is a GRF of G.

The implication of this theorem is that an MRF distribution has a simple
characterization in terms of an energy function that decomposes into small local terms.
This has been used in one of the major applications of MRF, image analysis.
In many imaging instruments, the signal cannot be robustly transformed into
a meaningful image because the noise is too high and the image is underdeter-
mined on a voxel/pixel basis. The solution to this problem is to apply a MRF
prior as a regularization device. An example of this method is the SPECT image
reconstruction described by Green[52].
SPECT and PET cameras aim to reconstruct the distribution in an organ of
a radioactive substance, which is itself a biochemical compound that binds to
certain molecules in the organ, like oxygenated blood (haemoglobin) or molecules
participating in the signaling system of the brain. Say our aim is to reconstruct,
on a voxel basis, the concentration of radioactivity in the organ, xs , where we use
a one-dimensional indexing system (a la Matlab). The organ is surrounded by a
set of detectors, and the coefficients asd measure the effective spatial angle from
a voxel s that leads into a detector d. Many different camera geometries exist,
like an array of detectors moving around the scene (organ), and determining the
coefficients is a non-trivial matter supported by drawings of the camera and
calibration by means of ’phantoms’, synthetic scenes with known geometry.
The signal reaching detector d from voxel s is thus proportional to a_sd.
Specifically, the detector sees a Poisson variable with the intensity obtained by
summing over the scene, Poisson(Σ_{s∈S} a_sd x_s). The corresponding likelihood
function is obtained from the registered counts y_d at each detector, where
λ_d = Σ_{s∈S} a_sd x_s:

log P(Y|x) = Σ_d (−log(y_d!) + y_d log λ_d − λ_d)

The prior will inevitably have a certain ad hoc character; in this case, since
intensities are always non-negative, it is usually placed on a logarithmic scale, with
a scale factor γ that can be fixed or subject to inference, and a non-linear scaling
φ with φ(u) = δ(1 − δ) log cosh(u/δ):

log P(X) = −Σ_{s∼r} φ(γ(log x_s − log x_r)).

The MCMC procedure is repeatedly proposing changes in values of the in-


tensities xs . Each proposal is tested and conditionally accepted. It is important
to structure the computation so that only those terms that change are recom-
puted, the remainder being stored and remembered. The parameters γ and δ
can also be subject to inference, taking account of what is known about the film
recording process. The simple structure of the computation makes it easy to
adapt to new circumstances, such as, in the case of PET, modern sensors that
can get an estimate of the location of annihilation by precisely measuring the
time difference for detection in the two sensors.
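A minimal Matlab sketch of the unnormalized log-posterior evaluated in each
MCMC step, under the likelihood and prior above (the matrix A with A(d,s) = a_sd,
the neighbor list nbr, and all names are illustrative assumptions):

function lp = spect_logpost(x, y, A, nbr, gam, del)
% x: s-vector of voxel intensities; y: d-vector of detector counts;
% A(d,s) = a_sd; nbr: list of neighbor pairs [s r], one row per pair.
  lam = A * x;                                   % detector intensities
  loglik = sum(y .* log(lam) - lam - gammaln(y + 1));
  u = gam * (log(x(nbr(:,1))) - log(x(nbr(:,2))));
  logprior = -sum(del*(1-del) * log(cosh(u / del)));
  lp = loglik + logprior;                        % up to a constant
end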
The type of regularizing prior useful for noisy PET images has been used
with good results in numerous applications with a large range of likelihood
functions, like medical imaging with X-ray, MR and light, and for reconstructing
the atmosphere using spectral absorption from stars photographed through it
from satellites [54].

Figure 30: Josiah Willard Gibbs (1839-1903) founded and created many tools of
thermodynamics and statistical mechanics. Several of these are used or adapted
to use in applied statistics, sometimes mediated by Shannon's work in information
theory and Jaynes' Maximum Entropy ideas. Quite a few ideas from theoretical
physics have been introduced in Computer Science and statistics, but it
typically takes some time to find out how applicable they are and to give them
a goal-oriented motivation in computer applications. The Gibbs distribution,
a fundamental concept in statistical mechanics, is also useful in the analysis of
directed graphical statistical models. It provides a simple characterization of
the corresponding statistical distribution.

3.7 Regression and the linear model
The most used model in estimation is regression, a model where batches of
similar experiments are used to form inference on effects. One variable (the
response) is assumed to be a linear function of a set of explanatory variables,
with random noise added. The parameter is the set of coefficients of the linear
function, and it is often of interest to decide if one of these, or a linear function
of them, is different from zero. The probability model for the data is thus

y = Xβ + ε,    (39)

where each component of ε is assumed to be from a common normal distribution
with zero mean, with known or unknown variance. When an estimate β̂
is available, the residual vector is r = y − Xβ̂. The likelihood is the product
of the probabilities of the residuals. The probability density for a batch is a
function of the (Euclidean) norm of the residual vector, and the ML estimates
of the coefficients are given by the least squares method, which minimizes the
(squared) length of the residual vector: β̂ = argmin_β |y − Xβ|², a quantity easy
to compute by solving the normal equation: β̂ = (X^T X)^{−1}X^T y, where T
denotes transposition. It is easy to see that X^T X is invertible if the co-variates
are linearly independent (and thus no more than the number of samples). If
X T X is not invertible, the coefficients are not uniquely estimable, and it is not
possible, e.g., to test one of them for zero. However, a beautiful theory developed
by Scheffé and others shows that it is possible to test any linear hypothesis
Kβ = m, where the rows of K are linear combinations of the rows of the matrix
of explanatory variables X. In other words, K must be expressible as AX, for
some matrix A. A possible test statistic is y T Gy where G = K T (KK T )− K, and
where − is any generalized inverse, i.e., a matrix operator with XX − X = X
for all X. The test statistic is distributed as σ 2 χ2s , where s is the rank of G.
So testing depends on the often unknown σ 2 . In the univariate case it is not
very dangerous to estimate σ 2 . But the dependence on σ 2 is better removed
by using the test statistic y T Gy/y T (I − G)y, which has an F (s, n − s) distribu-
tion (this was pointed out, for functional neuro-science studies, in [65]). And of
course the most severe problem is that we do not normally have any proof that
the relationship between response and explanatory variables is linear - indeed,
it typically is not, even when regression models seem to give good results. The
remedy is to visualize the residuals or test them for normality.
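A minimal Matlab sketch of the estimate and the residual check described above
(illustrative):

% Least squares fit via the normal equation; X is n-by-m, y is n-by-1.
betahat = (X'*X) \ (X'*y);     % ML estimate of the coefficients
r = y - X*betahat;             % residual vector, to be visualized or
                               % tested for normality
s2 = (r'*r)/(size(X,1)-size(X,2));  % the usual noise variance estimate
% Numerically, betahat = X\y is preferable to forming X'*X explicitly.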
When several (p) responses are considered we have multivariate regression
and (39) is changed to

Y = XB + E, (40)
where Y is now an n × p matrix, B an m × p matrix, and each row of the n × p
matrix E is from a multivariate normal distribution with zero mean and known
covariance Σ. Many methods in univariate regression are specializations of methods in
multivariate regression, like the estimation and testing methods described above.
A possible use of multivariate regression is to make a joint regression on all N
voxels in an experiment. Since Σ is not known, it has to be estimated, and since
it has N (N − 1)/2 different elements, it is difficult to estimate it with good
confidence.

3.8 Bayesian regression
Many of the concepts in classical regression have their counterparts in Bayesian
regression. The probability model is normally the same, except that one should
have a prior for the coefficient set and for the noise variance. Estimation can
be made directly for the coefficients even if X is rank-deficient, but of course
rank-deficiency indicates that the coefficients get larger uncertainties, and in
general their estimates and credible intervals will depend a lot on the priors. A
particularly simple prior is an (improper) uniform prior for (β, log σ²). Then the
posterior for β is (see, e.g., [45])

N(β; β̂, V_β σ²),    (41)
V_β = (X^T X)^{−1},    (42)
σ² ∼ Inv-χ²(σ²; n − m, s²),    (43)
s² = (y − Xβ̂)^T(y − Xβ̂),    (44)

if X is n × m. Since σ² is itself a random variable, the marginal distribution
of the coefficients is not a normal distribution but the (in general) more vague
multivariate t, with mean β̂, variance s²V_β/(n − m) and n − m degrees of
freedom. The above holds only for X of full rank, however, and one of the
strengths of Bayesian regression in handling large sets of explanatory variables
will only emerge after a proper prior distribution has been assumed for the
coefficient set and the covariance of ε.
In Bayesian applications it is usually required to find the probability of data
given the parameters. This quantity (modulo a normalization constant) is, for
the Bayesian regression model,

p(Y|X, β, Σ) = N(Y; Xβ, Σ)    (45)
 = c exp(−(1/2)(Y − Xβ)^T Σ^{−1}(Y − Xβ)).    (46)

The normalization constant c is 1/√((2π)^n |Σ|).
A comprehensive analysis of Bayesian regression and related methods, with
quite useful Matlab code, can be found in [33].
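A minimal Matlab sketch of one joint posterior draw from (41)-(44) (illustrative;
chi2rnd and mvnrnd are assumed from the Statistics Toolbox):

% One draw of (sigma^2, beta) under the uniform prior on (beta, log sigma^2)
[n, m] = size(X);
betahat = (X'*X) \ (X'*y);
RSS = sum((y - X*betahat).^2);        % residual sum of squares
sigma2 = RSS / chi2rnd(n - m);        % a scaled inverse chi-square draw
Vbeta = inv(X'*X);
beta = mvnrnd(betahat', sigma2*Vbeta)';  % a coefficient draw given sigma2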

4 Approximate analysis with Metropolis-Hastings simulation
We have seen several simple examples of the Markov Chain Monte Carlo ap-
proach, and we will now explicate in detail their usages and mathematical jus-
tifications – as well as their limitations.
The basic problem solved by MCMC methods is sampling from a multivariate
distribution over many variables. The distribution can be given generically as
p(x, y, z, . . . , w). If some variable, e.g., y, represents measured signals, then
the actual values measured, say a, can be substituted, and sampling will be
from the conditional distribution proportional to p(x, a, z, . . . , w). If some other
variable, say x, represents the measured quantity, then sampling and selecting
the x variable will give samples from the posterior of x given the measurements.
In other words,

we will get a best possible estimation of the quantity given the measurements
and the statistical model of the measurement process.
The two basic methods for MCMC computation are the Gibbs sampler and
the Metropolis-Hastings algorithm. Both generate a Markov chain with states
over the domain of a multivariate target distribution and with the target dis-
tribution as its unique limit distribution. Both exist in several more or less
refined versions. The Metropolis algorithm has the advantages that it does not
require sampling from the conditional distributions of the target distribution
but only finding the quotient of the distribution at two arbitrary given points,
and that its proposal distribution can be chosen from a set with better
convergence properties. A thorough
introduction is given by Neal[73]. To sum it up, MCMC methods can be used
to estimate distributions that are not tractable analytically or numerically. We
get real estimates of posterior distributions and not just approximate maxima
of functions. On the negative side, the Markov chains generated have high au-
tocorrelation, so a sample over a sequence of steps can give a highly misleading
empirical distribution, much narrower than the real posterior. Although signifi-
cant advances have been made in the area of convergence assessment and choice
of samples, this problem is not yet completely solved.
The Metropolis-Hastings sampling method is, as mentioned before, organized
as follows: given the pdf p(x) of a state variable x over a state space X, and an
essentially arbitrary symmetric proposal function q(x, x′), a sequence of states
is created in a Markov chain. In state x, draw x′ according to the proposal
function q. If p(x′)/p(x) > 1, let x′ be the new state. Otherwise, let x′ be
the new state with probability p(x′)/p(x), and otherwise keep state x. We will
in the next subsection verify that p(x) is a stable distribution of the chain,
and from general Markov chain theory there are several conditions that ensure
that there is only one limiting distribution and that it will always be reached
asymptotically. It is much more difficult to say when we have a provably good
sample, and in practice all the difficulties of hill-climbing optimization methods
must be dealt with in order to assess convergence. Recent developments aim
at finding MCMC procedures which can detect when an unbiased sample has
been obtained by using various variations of 'coupling from the past' [78]. These
methods are however still complex and difficult to apply to classification and
graphical model problems.
Nevertheless, there have been great successes with MCMC methods for those
cases of Bayesian analysis where closed form solutions do not exist. With var-
ious adaptations of the method, it is possible to express multivariate data as
a mixture of multivariate distributions and to find the posterior distribution of
the number of classes and their parameters [34, 35, 37, 82].

4.1 Why MCMC and why does it work?


The main practical reason for using MCMC is that we often have an unnormal-
ized probability distribution over a sometimes complex space, and we want a
sample of it to get a feel for it, or we might want to estimate the average of a
function on the space under the distribution. This section is a summary of [13,
Ch 6], where a more complete account can be found. A typical way one obtains
a distribution which is not normalized is by using (3), where the product on the
right side is not a normalized distribution and is often impossible to normalize
using analytical methods (even in the case where we have analytical expressions

for the prior and the likelihood). So assume we have a distribution π(x) = cf (x),
with unknown normalization constant c, over a space X where we can evaluate
f in every point of X (we will think of cf as a probability density function -
a thorough measure-theoretic presentation is also possible, but this simplified
view is appropriate in most applications). Given a set of N independent and
identically distributed examples xi with pdf π(x), we can estimate the expected
value of any function v(x) over π by taking the average

f_N = Σ_i v(x_i)/N.    (47)
This is a well-known technique known as Monte Carlo estimation. Assuming
that X and π are reasonably well-behaved, like a Euclidean n-dimensional space,
the expected value is I = ∫_X π(x)v(x)dx and the estimate f_N converges to I
with probability 1 ('almost surely'). Moreover, if the variance of v can be
bounded, the central limit theorem can be used to give a convergence estimate,
namely that √N(f_N − I) converges in distribution to the normal distribution
with zero mean and the same variance as v.
It is not easy in general to generate an iid sample from cf where its variation
is large and irregular over X. If a bound M with f(x) ≤ M g(x) is known, we can
use rejection sampling. In rejection sampling we generate the samples according
to some distribution g(x), where the support of g covers the support of f (in
other words, g can generate any point with non-zero probability according to
cf), and keep a subset of them (i.e., we reject some of them). Specifically, if a
sample x′ was generated, we keep it with probability f(x′)/(M g(x′)). The set of
kept samples will be distributed as cf(x). Obviously, this method is practical
only when the bound M is reasonably tight and when g(x) is a reasonable
approximation of cf(x); if this is not true, the rejection probability can be
very close to 1 most of the time.
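A minimal Matlab sketch of rejection sampling (illustrative; f and g are density
handles for a scalar state, grnd a sampler for g, and M a constant with f <= M*g):

function xs = rejsample(f, g, grnd, M, N)
% Keep proposals x ~ g with probability f(x)/(M*g(x)); the kept
% samples are distributed as the normalized version of f.
  xs = zeros(N,1); k = 0;
  while k < N
    x = grnd();                       % propose from g
    if rand() < f(x) / (M * g(x))     % accept-reject step
      k = k + 1; xs(k) = x;
    end
  end
end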
Another technique to obtain a sample from an irregular distribution is impor-
tance sampling. Here we do not get a simple sample but a weighted sample, i.e.,
each point is equipped with a weight and represented by a pair (xi , wi ). If we
can produce a sample of the (preferably close to cf) distribution g, then setting
w_i = f(x_i)/g(x_i) and normalizing the weights so that they sum to one produces
a weighted sample that corresponds to the distribution cf. The weighted
sample can be used directly for estimation, changing (47) to f_N = Σ_i v(x_i)w_i.
The weighted sample can also be changed to an unweighted sample by resam-
pling. An unweighted sample of M elements is obtained by drawing, M times,
an integer j between 1 and N , according to the probability distribution w, and
selecting the corresponding sequence of xj . Obviously, the information about f
is filtered away in the sampling and resampling steps and the accuracy is less
than with iid sampling according to cf . Typically, M should be at least 10N
for obtaining reliable results.
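A minimal Matlab sketch of importance sampling with resampling (illustrative;
f, g and v are vectorized handles, grnd a sampler for g, and randsample is
assumed from the Statistics Toolbox):

N = 1000; M = 10*N;             % the text suggests M of at least 10N
x = grnd(N);                    % N draws from the proposal g
w = f(x) ./ g(x);               % unnormalized importance weights
w = w / sum(w);
Ihat = sum(w .* v(x));          % weighted estimate of E[v] under cf
idx = randsample(N, M, true, w);% resample M indices by the weights
xu = x(idx);                    % unweighted sample of size M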
The MCMC method produces a sequence (xi ) distributed asymptotically as
cf , where c is unknown. It is based on a combination of the above sampling
ideas, and it produces a sample in the form of a sequence of non-independent
points with stationary (i.e., limiting) distribution π(x) = cf . We can use equa-
tion (47) for this sample, but only if it is long enough to remove the effect of
local autocorrelation of the xi .
Basically, a Markov Chain is represented with a transition kernel K(xt−1 , xt )
giving the conditional distributions p(xt |xt−1 ). We want to design the transition

kernel so that the sequences it produces have a stationary distribution π(x) =
cf(x), where we can evaluate f but not c (and thus not π either). In other
words, f(x) = ∫_X K(x_{t−1}, x)f(x_{t−1})dx_{t−1}. We will achieve this by designing
the chain to be π-reversible, i.e., K(x, y)f (x) = K(y, x)f (y). Reversibility is
also called the detailed balance condition, a term which better explains why
reversibility leads to stationarity. Namely, if we have two disjoint subsets A and
B, the integral ∫_A∫_B K(x, y)cf(x)dx dy is the probability of a sample element in
B being followed by one in A, and this is by the detailed balance condition equal
to the probability that an element in A is followed by one in B. So if the chain is
ergodic, i.e., its ensemble distribution is the same as its path distribution, then
cf is a stationary distribution of the chain. A sufficient condition for this is
that the chain can reach every region in the support of f, i.e., every region with
non-zero probability can be reached from every point with non-zero probability.
We will not go into the detailed conditions and proofs of convergence of Markov
Chains; some more technical detail is given in [13] and a lot in [99, 98].
For simple (finite or of constant dimension) spaces X there is a simple way
to construct a Markov Chain that is obviously both irreducible and satisfies
the detailed balance condition if a relative (i.e., unnormalized as the function
f above) density of the target distribution can be evaluated at any point, and
thus has the required stationary distribution. The main problem, which does not
yet have a simple solution, is to design the chain in such a way that it has reasonably
low autocorrelation. Such a chain is said to mix rapidly.
Once state space and target distribution have been defined, the only design
choice is the proposal distribution q(x′|x), a distribution for a proposed new state
x′ dependent on the current state x. We design q by giving a procedure both
to generate a new proposal x′ given the current state x (using a random number
generator), and another to evaluate the density given x and x′. Using the latter
procedure and the unnormalized target density f, which can also be computed
for any state, we can compute an acceptance probability for a proposed state
that depends on the current state:

α(x, z) = min(1, f(z)q(x|z)/(f(x)q(z|x)))    (48)

A trace (x_0, x_1, . . .) is now produced using the following algorithm:

Start with any x_0 in the support of f
for i = 0, 1, . . . do
    generate z by the distribution q(z|x_i)
    with probability α(x_i, z), set x_{i+1} = z,
    otherwise set x_{i+1} = x_i
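A runnable Matlab sketch of this loop, with a Gaussian random-walk proposal
(symmetric, so the q factors in (48) cancel); the target f and all names are
illustrative:

function x = mh_trace(f, x0, step, n)
% Metropolis-Hastings with a random-walk proposal for a scalar state.
% f is a handle to an unnormalized density; returns a trace of n states.
  x = zeros(n,1); x(1) = x0;
  for i = 1:n-1
    z = x(i) + step*randn();             % propose from q(z|x_i)
    if rand() < min(1, f(z)/f(x(i)))     % acceptance probability (48)
      x(i+1) = z;
    else
      x(i+1) = x(i);
    end
  end
end

For example, f = @(x) exp(-(x-3)^2/2) + exp(-(x+3)^2/2) gives a bimodal
target whose trace will show the autocorrelation issues discussed below.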
There are several ways to prove that this gives a chain with stable distribu-
tion cf , despite the fact that we do not know the value of c. The only really
comprehensible argument can be found, e.g., in [13] and goes as follows:
First we need the useful consequence of (48) that

f (xt )q(xt+1 |xt )α(xt , xt+1 ) = f (xt+1 )q(xt |xt+1 )α(xt+1 , xt ). (49)
Equation (49) is proved by noting that exactly one of α(xt+1 , xt ) and α(xt , xt+1 )
is less than one (unless both are equal to one). In each case, a simple algebraic
expression without the minimum operator results from (48), that simplifies to
(49) in both cases (as well as in the case where both alphas are one).

We next show that the resulting chain is π-reversible, where π(x) = cf(x).
The distribution of x_{t+1} given x_t is

p(x_{t+1}|x_t) = q(x_{t+1}|x_t)α(x_t, x_{t+1}) + δ(x_{t+1} − x_t) ∫ q(z|x_t)(1 − α(x_t, z))dz.

The first term above comes from the case where a value for x_{t+1} is proposed and
accepted; the second term (where δ is Dirac's delta function) has contributions
from all possible proposal values z, weighted by the probability that the value is
generated and also rejected. Multiplying by π(x_t) and using (49) gives

π(x_t)p(x_{t+1}|x_t) = π(x_t)q(x_{t+1}|x_t)α(x_t, x_{t+1}) + π(x_t)δ(x_{t+1} − x_t)(1 − ∫ q(z|x_t)α(x_t, z)dz)
= ...
= π(x_{t+1})p(x_t|x_{t+1}),

the detailed balance equation which, together with aperiodicity and irreducibility,
means that every chain has π = cf as a limiting distribution.

Exercise 25 We believe that a sequence of real numbers, (y_i)_{i=1}^N, was generated
from a mixture of two normal distributions, f(y) = Σ_{i=1}^2 λ_i N(µ_i, σ_i), where
λ_1 + λ_2 = 1.
Design an MCMC procedure to make inference about the parameters of the
normal distributions and the source (1 or 2) of each y_i. Check what happens
when the sample is too small for reliable inference.

4.1.1 MCMC burn-in and mixing


The theorems stating that an MCMC chain converges to an invariant distribution
are only asymptotic. In practice some target distribution and proposal function
pairs give well behaved chains where it is easy to estimate quantities of interest
to a precision that can be reliably assessed. Such chains are said to have good
mixing behavior.
In other cases there are phenomena in the trace that make it difficult to say
that the chain has converged. There are quite many proposals in the literature
suggesting different approaches to MCMC convergence assessment.
The Kolmogorov-Smirnov test is originally formulated as a way to test if a given
sample has a given distribution. For a real-valued distribution, the cumulative
distribution of the given distribution is compared with the cumulative empirical
distribution of the sample, and the maximum absolute difference D_n is taken
as the test statistic. Both functions compared have the range 0 to 1, and the
statistic is thus in the interval [0, 1]. Under the null hypothesis that the sample
is actually from the given distribution, this statistic has a distribution that is
independent of what the given distribution is. Moreover, it is asymptotically
proportional to 1/√n for sample size n, and in a practical sample range it is
a good rule of thumb that a statistic √n D_n of 3 has a p-value of circa 0.022,
whereas 4 corresponds to a p-value of 0.001. If the value is above 4 and n is above
2, then the p-value is less than 0.01%. The same idea can be used to test if
two different samples can possibly be from the same distribution. Multiplying
the largest absolute difference between the two cumulative distributions by
√min(n_1, n_2) gives a fairly sample-size independent statistic. No matter what
the sample sizes are, the hypothesis that the two samples are from the same
distribution can be rejected at the 99% level if the computed statistic is 2.2 or
more. The Kolmogorov-Smirnov test is fairly strong. A major use today is for
checking the convergence status of an MCMC chain.
The Matlab code for computing the statistic is as follows, where the test
for a d-dimensional variable checks the marginal distribution along each dimen-
sion (checks for proper correlations can be obtained by preceding the test by a
random rotation):

function stat=KS(tr1,tr2)
% Kolmogorov-Smirnov two-sample test
% stat=KS(tr1,tr2);
% tr1 and tr2 are d by n1 and d by n2 matrices
% containing two (not necessarily equal-sized)
% samples from a d-dimensional distribution.
% Output is a d-vector of (scaled) Kolmogorov-Smirnov statistics,
% for any size samples. In each test, a statistic of 2.4
% corresponds to a p-value of circa 0.005. If significantly more than
% 1 test in 1000 is larger than 2.4, the samples are probably not from
% the same distribution. The p-values are:
% p stat
% 0.1 1.72
% 0.05 1.91
% 0.01 2.27
% 0.001 2.70
% 0.0001 3.20(?)
%
n1=size(tr1,2); n2=size(tr2,2);
% weight +1/n1 for elements of sample 1 and -1/n2 for elements of
% sample 2; the running sum of the weights in sorted order is the
% difference between the two empirical cumulative distributions
ord=[ones(n1,1)/n1; -ones(n2,1)/n2];
tr=[tr1 tr2]';                    % pooled sample, (n1+n2) by d
[ts,IX]=sort(tr);                 % sort each dimension separately
OM=ord(IX);                       % weights in sorted order
xx=sqrt(min(n1,n2))*cumsum(OM);   % scaled cdf difference
tie=(ts(1:end-1,:)==ts(2:end,:)); % positions inside runs of tied values
xx(1:end-1,:)=xx(1:end-1,:).*~tie;% evaluate only at the end of a tie run
stat=max(abs(xx));
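As an illustrative usage, drift in a chain after burn-in removal can be checked
by comparing the two halves of the trace:

% tr is a d-by-n trace; compare its first and second halves
half = floor(size(tr,2)/2);
stat = KS(tr(:,1:half), tr(:,half+1:end));
% entries of stat well above 2.7 suggest the halves differ in distribution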

There are several uses of the KS test in MCMC. One example is the break-
point test of section 3.1.5, where of course a rejection of the hypothesis of
uniform accident intensity can be followed by uniformity tests on parts of the
time axis or other non-linear distribution tests. An even more important usage
is for MCMC diagnostics, assessing the number of elements to reject at the be-
ginning of a chain (burn-in) and the number of trace points required to estimate
a quantity of interest to given precision.
An example of the use of the KS test in MCMC is shown in Figure 31. For
four different chains of 10000 sample points of dimension 68, blocks of 100 and
200 were compared for empirical distribution [106]. A KS-diagnostic above 3.5
is taken as a sign of difference between the distributions. The starting numbers
of the blocks are the coordinates, and a red dot marks a pair of blocks of length
100, a black dot a pair of length 200, with a low KS statistic (below 3.5). The
chain 'long' has apparently two persistent modes, indicated by white strips in
the figure; the chain 'short' mixes slowly, and 'conv' mixes well. The chain 'X'
is a control chain, generated with random numbers, and should have no
autocorrelation (which it also seems not to have). There are less drastic diagnostics
for MCMC burn-in and convergence oriented to the case where there is a
particular quantity of interest that the analysis wants to estimate; see, e.g., [106, 47].

[Figure 31 consists of four panels labelled 'long', 'short', 'conv' and 'X', with
block start coordinates 0-100 on both axes.]

Figure 31: Diagnostics using the KS test on four chains, 10000 points,
68-dimensional. Pairs of blocks of length 100 (red dot) and 200 (black dot) are
tested; the starting coordinates of the blocks in a pair give the coordinates of
the dot. A dot is shown where the KS diagnostic is below 3.5. The first chain
has two modes that switch slowly, the second has not at all converged, the third
mixes well (many dots). The fourth is synthetic, has no autocorrelation, and
gives the appearance of good mixing.

4.1.2 The particle filter


The particle filter is a computation scheme adapting MCMC to the recursive
and dynamic inference problems (5,6). It is described concisely in [13, Ch 6]
and a state-of-the-art survey is given in [36]. In the first case (5), we try to
estimate a static quantity and expect to get better estimates as more data flows
in. In the second case (6), we try to track a moving target. The method
maintains a sample that is updated for each time step and can be visualized as
a moving swarm of particles. The sample represents our information about the
posterior. The particle filter can handle highly non-Gaussian measurement and
process noise, and the likelihoods can be chosen in very general ways, the only
essential restriction being that relative likelihoods can be computed reasonably
efficiently, and that process states can be simulated stochastically from one time
step to the next. It is not necessary to have constant time steps, so jittered
observations can be easily handled provided the measurement times are known.
It is not certain that the method performs better than sophisticated non-linear
numerical methods; the attractiveness of the method stems more from its ease of
implementation and flexibility.
In the steady state our inference of the state at time t is represented by
a set of particles (x_i^{(t)})_{i=1}^N. Each particle is brought to time t + 1 by drawing
a state u_i^{(t+1)} using the process kernel f(u_i^{(t+1)}|x_i^{(t)}). In practice, we produce
many (say 10) u_i^{(t+1)} from each x_i^{(t)}, so that the resampling will produce a
better result. Now the data d_{t+1} are considered and used to weight the u_i^{(t+1)}
particles: w_i^{(t+1)} = f_{t+1}(d_{t+1}|u_i^{(t+1)}). In order to get back to the situation at
time t + 1, we now use the weights to draw N indices (i_1, . . . , i_N) from the
probability distribution defined by the (normalized) weights. Finally, the set of
particles defining our belief of the system state at time t + 1 is the set
(x_j^{(t+1)})_{j=1}^N = (u_{i_j}^{(t+1)})_{j=1}^N.
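A minimal Matlab sketch of one such time step (illustrative; fprop simulates
the process kernel for a vector of states, lik evaluates the data likelihood, and
randsample is assumed from the Statistics Toolbox):

function x = pf_step(x, d, fprop, lik)
% One particle filter step: x is an N-vector of particles at time t,
% d the datum at time t+1; returns N particles at time t+1.
  N = numel(x);
  u = fprop(repmat(x, 10, 1));       % 10 proposals per particle, moved
                                     % to t+1 by the process kernel
  w = lik(d, u);                     % weight by the data likelihood
  w = w / sum(w);
  idx = randsample(numel(u), N, true, w);  % resample back to N
  x = u(idx);
end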

4.2 Basic classification with categorical attributes


For data matrices with categorical attributes, the categorical model of Cheese-
man and Stutz is appropriate. We assume that rows are generated as a finite
mixture with a uniform Dirichlet prior for the component (class) probabilities,
and each mixture component has its rows generated independently, according
to a discrete distribution also with a uniform Dirichlet prior for each column
and class. Assume that the number of classes C is given, the number of rows is
n and the number of columns is K, and that there are dk different values that
can occur in column k. For a given classification, the data probability can be
computed; let n_i^{(c,k)} be the number of occurrences of value i in column k of the
rows belonging to class c. Let x_i^{(c,k)} be the probability of class c having the value
i in column k. Let n_i^{(c)} be the number of occurrences in class c of the row i, and
n^{(c)} the number of rows of class c. By (19) the probability of the class assignment
depends only on the number of classes and the table size, Γ(n + 1)Γ(C)/Γ(n + C).
The probability of the data in class c is, if i = (i_1, . . . , i_K):

∫ ∏_{k,i} (x_{i_k}^{(c,k)})^{n_i^{(c)}} dx = ∏_{k=1}^{K} ∫ ∏_{i=1}^{d_k} (x_i^{(c,k)})^{n_i^{(c,k)}} dx^{(c,k)}

The right side integral can be evaluated using the normalization constant of
the Dirichlet distribution, giving the total data probability of a classification:
[Γ(n + 1)Γ(C)/Γ(n + C)] ∏_{c,k} [Γ(d_k) ∏_i Γ(n_i^{(c,k)} + 1)/Γ(n^{(c)} + d_k)]

The posterior class assignment distribution is obtained by normalizing over all
class assignments. This distribution is intractable, but can be approximated
by searching for a number of local maxima and estimating the weight of the
neighborhood of each maximum.
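A Matlab sketch of the log of this data probability for a given class assignment,
following the formula above (illustrative names; counts{k} holds the C-by-d_k
table of value counts for column k and nc the class sizes):

function lp = class_logscore(counts, nc)
% Log data probability of a classification, computed with gammaln.
  C = numel(nc); n = sum(nc);
  lp = gammaln(n+1) + gammaln(C) - gammaln(n+C);
  for k = 1:numel(counts)
    dk = size(counts{k}, 2);
    lp = lp + sum(gammaln(counts{k}(:) + 1)) ...
            + C*gammaln(dk) - sum(gammaln(nc + dk));
  end
end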
Here the EM algorithm is competitive with MCMC calculations in many
cases, because of the difficulty of tuning the proposal distribution of MCMC
computations to avoid getting stuck in local optima. The procedure is to randomly
assign a few cases to classes, estimate the parameters x_i^{(c,k)}, assign the remaining
cases to their optimum classes, recompute the distribution of each case over
classes, reclassify each case to its optimum class, and repeat until convergence.
Repeating this procedure, one typically finds after a while a single most
probable class assignment for each number of classes. The set of local optima
so obtained can be used to guide an MCMC simulation giving more precise
estimates of the probabilities of the possible classifications. But in practice, a
set of high-probability classifications is normally a starting point for application
specialists trying to give application meaning to the classes obtained.

5 Two cases
5.1 Case 1: Individualized media distribution - significance of profiles
The application concerns individualized presentation of news items (and is
currently confidential in its details). The data used are historical records of
individual subscribers to an electronic news service. The purpose of the investigation
is to design a presentation strategy where an individual is first treated as the
’average customer’, then as his record increases he can be included in one of a
set of ’customer types’, and finally he can also get a profile of his own. Several
methodological problems arise when the data base is examined: customers are
identified by an electronic address, and this address can sometimes be used by
several persons, either in a household or among the customers of a store or
library where the terminal is located. Similarly, the same customer might be
using the system from several sites, and unless he has different roles on different
sites, he might be surprised to find that his profile differs depending
on where he is. After a proposed switch to log-in procedures for users, these
problems may disappear, but on the other hand the service may also become less
attractive. An individual can also be in different modes with different interests,
and we have no channel for a customer to signal this, except by seeing which
links he follows. On the other hand, recording the access patterns from different
customers is very easily automated and the total amount of data is overwhelm-
ing although many customers have few accesses recorded. The categorization of
the items to be offered is also by necessity coarse. Only two basic mechanisms
are available for evaluating an individuals interest in an item: To which degree
has he been interested in similar items before, and to which degree have similar
individuals been interested in this item? This suggests two applications of the
material of this chapter: Segmentation or classification of customers into types,
and evaluating a customer record against a number of types, to find out whether
or not they can confidently be said to differ. We will address these problems
here, and we will leave out many practical issues in the implementation.
The data base consists of a set of news items with a coarse classification (into
news, sport, culture, entertainment, economy, with a coarse 'hotness' indicator
and a location indicator of local, national or international, all in the perspective
of Stockholm, Sweden); a set of customers, each with a type, a 'profile' and a
list of rejected and accepted items; and a set of customer types, each with a
list of members. Initially, we have no types or profiles, but only classified news
items and the different individuals' access records. The production of customer
types is a fairly manual procedure: even if many automatic
classified news items and the different individuals access records. The produc-
tion of customer types is a fairly manual procedure: even if many automatic
classification programs can make a reasonable initial guess, it is inevitable that
the type list will be scrutinized and modified by media professionals – the types
are of course also used for direct advertising. The AUTOCLASS model has the
attractive property that it produces classes where the attributes are indepen-
dent: we will not get a class where a customer of the class is either interested
in international sports events or local culture items, but not both. In the end,
there is no unique objective criterion that the type list is ’right’, but if we have
two proposed type lists we can distinguish them by estimating the fraction of
accepted offers, and saying that the one giving a higher estimate is ’better’. In
other words, a type set giving either high or low acceptance probability to each
item is good. But this can be easily accomplished with many types, leading to
impreciseness in assigning customers to types – a case of over-fitting.
The assignment of new individuals to types cannot be done manually because
of the large volumes. Our task is thus now to say, for a new individual with
a given access list, to which type he belongs. The input to this problem is a
set of tables, containing for each type as well as for the new individual, the
number of rejected and accepted offers of items from each class. The modeling
assumptions required are that for each news category, there is a probability of
accepting the item for the new individual or for an average member of a type.
Our question is now if these data support the conclusion that the individual
has the same probability table as one of the types, or if we can say that he is
different from every type (and thus should get a profile of his own).
We can formulate the model choice problem by a transformation of the access
tables to a dependency problem for data tables that we have already treated
in depth. For a type t with ai accepts and ri rejects for a news category i, we
imagine a table with three columns and (ai + ri ) rows: a 1 in column 1 to
indicate an access of a type, the category number i in column 2 of (ai + ri ) rows,
ai of which contain 1 (for accept) and ri a 0 (for reject) in column 3. We add a
similar set of rows for the access list of the individual, marked with 0 in column
1. If the probability of a 0 (or 1) in column 3 depends on the category (column
2) but not on column 1, then the user cannot be distinguished from the type.
But columns 1 and 2 may be dependent if the user has seen a different mixture
of news categories compared to the type. In graphical modeling terms we could
use the model choice algorithm. The probability of the customer belonging to
type t is thus equal to the probability of model M 4 against M 3, where variable
C in figure 25 corresponds to the category variable (column 2).
Although this is an elegant way to put our problem in a graphical model
choice perspective, it is a little over-ambitious and a simpler method may be
adequate: the types can in practice be considered as sets of reasonably precise
accept probabilities, one for each news category, and an extension of the analysis
of the coin tossing experiment is adequate. The customers can be described as an
outcome of a sampling, where we record the number of accepted and rejected
items. If type i accepts an offer for category j with probability p_ij, and a
customer has accepted a_j out of n_j offers for category j, then the probability
of this, if the customer belongs to type i, is p_ij^{a_j}(1 − p_ij)^{n_j − a_j}. For the general
model H_u of an unbalanced probability distribution (the accept probability
is unknown but constant, and is uniformly distributed between 0 and 1), the
probability of the outcome is a_j!(n_j − a_j)!/(n_j + 1)!. Taking the product over
categories, we can now find the data probabilities {p(D|H_j)} and p(D|H_u), and
a probability distribution over the types and the unknown type by evaluating
formula (2). In a prototype implementation we have the following customer
types, described by their accept probabilities:

category Typ1 Typ2 Typ3 Typ4


news-int 0.9 0.06 0.82 0.23
news-loc 0.88 0.81 0.34 0.11
sports-int 0.16 0 0.28 0.23
sports-loc 0.09 0.06 0.17 0.21
cult-int 0.67 0.24 0.47 0.27
cult-loc 0.26 0.7 0.12 0.26
tourism-int 0.08 0.2 0.11 0.11
tourism-loc 0.08 0.14 0.2 0.13
entert 0.2 0.25 0.74 0.28

Three new customers have arrived, with the following access records of
presented (accepted) offers:

category Ind1 Ind2 Ind3


news-int 3(3) 32(25) 17(8)
news-loc 1(1) 18(9) 25(14)
sports-int 1(1) 7(2) 7(3)
sports-loc 0(0) 5(5) 6(1)
cult-int 2(2) 11(4) 14(6)
cult-loc 1(1) 6(2) 10(3)
tourism-int 0(0) 4(4) 8(8)
tourism-loc 1(1) 5(1) 8(3)
entert 1(1) 17(13) 15(6)
sum 10 105 110

We compute the data probabilities under the hypotheses that each new customer
belongs to one of types 1 to 4, or to H_u. Then we compute the probability
distribution of the individuals' types, under the uniform prior assumption:

Typ1 Typ2 Typ3 Typ4 Hu


Ind1 0.0639 0.0000 0.0686 0.0001 0.8675
Ind2 0.0000 0.0000 0.0002 0.0000 0.9980
Ind3 0.0000 0.0000 0.0000 0.0000 1.0000

Dropping the Hu possibility, we have:

Typ1 Typ2 Typ3 Typ4
Ind1 0.4818 0.0000 0.5176 0.0005
Ind2 0.0000 0.0000 1.0000 0.0000
Ind3 0.0000 0.0000 0.9992 0.0008

Clearly, quite few observations suffice for putting a new customer confidently
into a type. The profile of the third individual is significantly not in one of the
four types (p(H_u) = 1). For the first individual, the access record is too short
to confidently put him in a class, but the first and third classes fit reasonably.
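A Matlab sketch of the computation behind these tables (illustrative; requires
implicit expansion, R2016b or later):

% P: 9-by-4 matrix of type accept probabilities; a, n: 9-by-1 vectors of
% accepted and presented offers for one individual, per category
logpt = sum(log(P).*a + log(1-P).*(n-a), 1);   % log p(D|H_j), types 1..4
logpu = sum(gammaln(a+1) + gammaln(n-a+1) - gammaln(n+2));  % log p(D|H_u)
lp = [logpt logpu];
post = exp(lp - max(lp));
post = post / sum(post);     % posterior over types under a uniform prior
% note: a type probability of exactly 0 gives -Inf log-probability
% whenever the individual's record contradicts it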
Actually, the model for putting individuals into types may be inappropriate
considering what the application models: a type should really be considered
a set of individuals with a clustered span of profiles, whereas the model we
used considers a type to be the average accept profile over the individuals in the
type. This concept is too precise. We can make the types broader in two ways:
either by spanning them with a set of individuals and computing the probability
that the new individual has the same profile as one of the individuals defining
the type, or we can define a type as a probability distribution over accept pro-
files. The second option is quite attractive, since the type accept probabilities
can be defined with a sample that defines a set of Beta distributions of accept
probabilities. In graphical modeling terms, we test the adequacy of a type for
an individual by putting their access records into a three column table with
values individual/type, category of item and accept indicator. The adequacy of
the type for the individual is measured by the conditional independence of type
and accept given category, evaluated by formula (37), with news category corre-
sponding to variable C. The computation is of course done on the contingency
table without actually expanding the data table. The results in our example, if
the types are defined by a sample of 100 items of each category, are:

Typ1 Typ2 Typ3 Typ4


Ind1 0.2500 0.2500 0.2500 0.2500
Ind2 0.0000 0.0000 1.0000 0.0000
Ind3 0.0000 0.0001 0.9854 0.0145

It is now even more clear than before that the access record for individual 1 is
inadequate, and that the third individual is not quite compatible with any type.
It should be noted that we have throughout this example worked with uniform
priors. These priors have no canonic justification but should be regarded as
conventional. If specific information justifying other priors is available they can
easily be used, but this is seldom the case. The choice of prior will affect the
assignment of individual to type in rare cases, but only when the access records
are very short and when the individual does not really fit to any type.

5.2 Case 2: Schizophrenia research - hunting for causal relationships in complex systems
This application is directed at understanding a complex system - the human
brain. Similar methods have been applied to understanding of complex engi-
neered systems like paper mills, and micro-economic systems. Many investigations
on normal subjects have brought immense new knowledge about the
normal function of the human brain, but mental disorders still escape under-
standing of their causes and cures (despite a large number of theories, it is not
known why mental disorders develop except in special cases, and it is not known
which physiological and/or psychological processes cause them). In order to get
handles on the complex relationships between psychology, psychiatry, and phys-
iology of the human brain, a data base is being built with many different types of
variables measured for a large population of schizophrenia patients and control
subjects. For each volunteering subject, a large number of variables are obtained
or measured, like age, gender, age of admission to psychiatric care; volumes of
gray and white matter as well as cerebrospinal fluid in several regions of the
brain (obtained from MR images), genetic characterization and measurements
of concentrations in the blood of large numbers of substances and metabolites.
For the affected subjects, a detailed clinical characterization is recorded.
In this application one can easily get lost. There is an enormous amount of
knowledge in the medical profession on the relationships and possible significance
of these many (ca 150) variables. At the same time, the data collection
process is costly, so the total number of subjects is very small compared, for
example, with national registers that have millions of persons but relatively few
variables for each of them. This means that statistical significance problems
become important. A test set of 144 subjects, 83 controls and 61 affected
by schizophrenia, was obtained. This set was investigated with most methods
described in this course, giving an understanding of the strongest relationships
(graphical model), possible classifications into different types of the disease,
etc. In order to find possible causal chains, we tried to find variables and
variable pairs with a significant difference in co-variation with the disease, i.e.,
variables and tuples of variables whose joint distribution is significantly different
for affected person relative to control subjects. This exercise exhibited a very
large number of such variables and tuples, many of which were known before,
others not. All these associations point to probable mechanisms involved in
the disease, which seems to permeate every part of the organism. But it is
not possible to see which is the effect and what is the cause, and many of
the effects can be related to the strong medication given to all schizophrenia
patients. In order to single out the more promising effects, a graphical model
approach was tried: Each variable was standardized around its mean, separately
for affected and controls. Then we detected the pairs of variables giving the
highest probability to the leftmost graph in figure 32. Here D stands for the
diagnosis (classification of the subject as affected or control), and A and B
are the two variables compared. In the middle graph, the disease can be
described as affecting the two variables independently, whereas in the leftmost
graph it can be described as affecting one of them, but with a relationship to
the other that is the same for affected and for controls. Most
of the pairs selecting the first graph involved some parts of the vermis (a part
of cerebellum). Particularly important was the pair subject age and posterior
superior vermis volume. As shown in figure 33, this part of the brain decreases in
size with age for normal subjects. But for the affected persons the size is smaller
and does not change with age. Neither does the size depend significantly on
the duration of the disease or the medication received. Although these findings
could be explained by confounding effects, the more likely explanation presently
is that the reduction occurred before the outbreak of the disease and that processes
leading to the disease involve disturbing the development of this part of the
brain.
Several other variables were linked to the vermis in the same way: there
was an association for control subjects but not for the affected ones, indicating
that the normal co-variation is broken by the mechanisms of the disease. For
variables that were similarly co-varying with the vermis for controls and affected,
there is a possibility that these regions are affected by the disease similarly to
the vermis.
In order to get an idea of which physiological and anatomical variables give
the best predictors of the disease, a decision tree was produced using
the Bayesian approach of section 3.4.1, choosing the most significant variable
and decision boundary in a top-down fashion. This tree is shown in figure 34.
Note that it does not accurately classify all cases, but leaves 6 of the 144 cases
misclassified. A bigger decision tree would classify all cases correctly but also
over-fit to the training set, giving probably less accuracy on new cases.

[Figure 32 shows three small graphs, each on the variables A, D and B with
different edge structures.]

Figure 32: Graphical models detecting co-variation

6 Conclusions
I hope that you now agree with me that applied statistics in computer science is
a cool topic, and that you are familiar enough with it to start applying it. Large
numbers of computer systems with uncertainty management will be realized in
the coming decades. These systems differ a lot in their requirements: in many
cases standard application of known methods is adequate, but for many,
significant development of method and theory will be required.

[Figure 33: scatter plot of PSV (posterior superior vermis volume) against Age,
20-60 years.]

Figure 33: Association between age and posterior superior vermis volume
depends on diagnosis. The principal directions of variation for controls (o) and
affected subjects (+) are shown.

[Figure 34: decision tree with internal nodes PSV, TemCSF, VenWhite,
SubWhite, PIV and CH, and terminals labelled C (control) or A (affected)
with case counts.]
Figure 34: Robust decision tree. Numbers on terminals indicate number of cases
in the training set, misclassified subjects in parentheses.

Index

acceptance probability, 101
Adaline, 49
anomaly, 58
artificial intelligence, 3
atoms, 41
Bayes factor, 14
Bayesian analysis, 12
Bonferroni correction, 34
capacity, 41
causality, 79
classification problem, 48
classifier error, 48
combination rules, 43
composite model, 16
conditional independence, 79
confidence, 58
confidence interval, 33
confidence predictors, 57
confounder, 80
contingency table, 71
convex closure, 24
cost matrix, 22
Cournot's principle, 31
credal belief, 42
credibility, 57
credible interval, 15
credible set, 15
data mining, 3
Dempster's combination rule, 43
Dempster-Shafer, 41
detailed balance condition, 101
Dirichlet distribution, 59
discrete distribution, 59
DS-structure, 41
dynamic inference, 17
EM algorithm, 106
empirical error, 48
ergodic, 101
estimate, 22
evidence theory, 41
exchangeability, 16, 17
fair betting odds, 38
false discovery rate control, 34
family-wise error control, 34
FDR, 34
feature space, 48
focal elements, 41
frequentist statistics, 31
fuzzy logic, 47, 87
fuzzy observation, 15
FWE, 34
generalized function, 5
Gibbs Random Fields, 94
good mixing behavior, 102
GRF, 94
Hammersley-Clifford theorem, 94
hypotheses, 7
iid, 17
importance sampling, 100
imprecise probabilities, 41
improper priors, 63
inference, 6
information fusion, 3
inverse probability, 7
Inverse Wishart distribution, 92
Knowledge discovery, 3
Kolmogorov-Smirnov, 102
Laplace estimator, 22
Laplace's parallel combination, 21
likelihood, 12
loss function, 22
machine learning, 3
MAP, 22
marginal dependency, 79
marginal distribution, 16
Markov Random Fields (MRF), 76
mass assignment, 41
mathematical probability, 6
Maximum A Posteriori, 22
mixing coefficients, 64
mixture, 64
mixture of normals, 66
model-closed perspective, 24
model-completed perspective, 24
model-open perspective, 24
models, 6, 16
MRF, 76
non-conformance measure, 57
non-parametric, 64
nuisance parameter, 16
nuisance parameters, 63
null hypothesis, 33
objective probability, 31
observation space, 12
ordinal, 71
p-value, 33, 57
parameter space, 15
parameterized model, 15, 16
particle filter, 104
pattern recognition, 3
pdf, 7
perceptron, 49
permissible edge, 86
pignistic belief, 42
pignistic transformation, 42, 43
pignistic transformations, 43
possible world space, 15
posterior, 13
posterior odds, 14
precision, 62
precision matrix, 73
primal optimization problem, 51
prior, 13
prior odds, 14
probability space, 7
proposal distribution, 101
publication bias, 25
recursive inference, 16
regression problem, 48
rejection sampling, 100
relative plausibility transformation, 43
resampling, 100
residual vector, 97
retrodiction, 17
reversible, 101
robust Bayesian analysis, 47
rough set theory, 47
simple model, 16
simplicial vertex, 86
smoothed, 58
soft constraints, 54
state space, 12
subjective probability, 14
sufficient statistics, 62
symmetric proposal, 28
test statistic, 33
transition kernel, 17
uncertainty management, 3, 47
utility, 22
variational Bayes method, 28
weighted sample, 100
weights, 49
wide margin classifier, 49