Bayesian Methods in Structural Bioinformatics
Series Editors:
M. Gail
K. Krickeberg
J. Samet
A. Tsiatis
W. Wong
Editors

Thomas Hamelryck
Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark

Jesper Ferkinghoff-Borg
Department of Electrical Engineering, Technical University of Denmark, Lyngby, Denmark

Kanti Mardia
Department of Statistics, School of Mathematics, University of Leeds, Leeds, United Kingdom
ISSN 1431-8776
ISBN 978-3-642-27224-0 e-ISBN 978-3-642-27225-7
DOI 10.1007/978-3-642-27225-7
Springer Heidelberg Dordrecht London New York
Foreword
This is an extremely ambitious goal, as this area comprises no less than the Protein
Folding Problem, arguably the most fundamental riddle in the whole of the Life
Sciences.
The book begins with an up-to-date coverage, in Part I, of the concepts of
Bayesian statistics and of the versatile computational tools that have been developed
to make their use possible in practice. A particular highlight of the book is found
in Part II, which presents a review of the time-honored use of knowledge-based
potentials obtained by data mining, along with a critical reassessment of their exact
relationship with potentials of mean force in statistical physics. Its conclusion is
a highly original and welcome contribution to the solution of this long-standing
problem. Another highlight is the combined use in Part V of the type of directional
statistics that can be compiled by the methods of Part IV with the technology of
Bayesian networks to produce very efficient methods for sampling the space of
plausible conformations of protein and RNA molecules.
It is fitting that the book should return in its final Part VI to the interface between
Bayesian methods of learning the structural regularities of macromolecules from
known 3D structures on the one hand, and experimental techniques for determining
new 3D structures on the other. As is well known, most of these techniques
(with the exception perhaps of ultra-high resolution X-ray crystallography with
plentiful sources of experimental phase information) need to be supplemented
with some degree of low-level a priori knowledge about bond lengths and bond
angles to enforce sensible stereochemistry in the resulting atomic models. The
Bayesian picture turns this conventional approach on its head, making the process
look instead like structure prediction assisted by X-ray (or NMR) data, with the
measurements delivered by each experimental technique providing the likelihood
factor to supplement the prior probability supplied by previous learning, in order to
cast the final result of this inference into the form of a posterior distribution over
an ensemble of possible structures. It is only by viewing structure determination in
this overtly Bayesian framework that the classical problems of model bias and map
overinterpretation can be avoided; and indeed crystallographers, for example, are
still far from having made full use of the possibilities described here. Dr. Hamelryck
is thus likely to have paved the way for future improvements in his original field of
research, in spite of having since worked in a related but distant area, using tools
first developed and applied to address problems in medical diagnosis. This instance
of the interconnectedness of different branches of science through their methods,
and this entire book, provide a splendid illustration of the power of mathematical
approaches as the ultimate form of re-useable thought, and of Bayesian methods in
particular as a repository of re-useable forms of scientific reasoning.
I am confident that the publication of this book will act as a rallying call for
numerous investigators using specific subsets of these methods in various areas
of Bioinformatics, who will feel encouraged to connect their own work to the
viewpoints and computational tools presented here. It can therefore be expected
that this will lead to a succession of enlarged new editions in the future. Last but not
least, this book should act as a magnet in attracting young researchers to learn these
advanced and broadly adaptable techniques, and to apply them across other fields of
science as well as to furthering the subject matter of the book itself.
Preface

The protein folding problem is the loose denominator for an amalgam of closely
related problems that include protein structure prediction, protein design, the
simulation of the protein folding process and the docking of small molecules and
biomolecules. Despite some, in our view, overly optimistic claims,1 the development
of an insightful, well-justified computational model that routinely addresses these
problems is one of the main open problems in biology, and in science in general,
today [461]. Although there is no doubt that tremendous progress has been made
in the conceptual and factual understanding of how proteins fold, it has been
extraordinarily difficult to translate this understanding into corresponding algorithms and predictions. Ironically, the introduction of CASP2 [400], which essentially evaluates the current state of affairs every two years, has perhaps led to a community that is more focussed on pragmatically fine-tuning existing methods
than on conceptual innovation.
In the opinion of the editors, the field of structural bioinformatics would
benefit enormously from the use of well-justified machine learning methods and
probabilistic models that treat protein structure in atomic detail. In the last 5 years,
many classic problems in structural bioinformatics have now come within the scope
of such methods. For example, conformational sampling, which up to now typically
involved approximating the conformational space using a finite set of main chain
fragments and side chain rotamers, can now be performed in continuous space using
graphical models and directional statistics; protein structures can now be compared
and superimposed in a statistically valid way; from experimental data, Bayesian
methods can now provide protein ensembles that reflect the statistical uncertainty.
All of these recent innovations are touched upon in the book, together with some
more cutting-edge developments that have yet to prove the extent of their merits.
1 See for example "Problem solved* (*sort of)" in the news section of the August 8th, 2008 issue of Science, and the critical reply it elicited.
2 CASP stands for "Critical Assessment of protein Structure Prediction".
the equivalent amino acids are not known is subsequently discussed in the chapter
by Mardia and Nyirongo. They review a Bayesian model that is built upon a Poisson
process and a proper treatment of the prior distributions of the nuisance variables.
Part V is concerned with the use of graphical models in structural bioinformatics.
The chapter by Boomsma et al. introduces probabilistic models of RNA and protein
structure that are based on the happy marriage of directional statistics and dynamic
Bayesian networks. These models can be used in conformational sampling, but
are also vital elements for the formulation of a complete probabilistic description
of protein structure. Yanover and Fromer discuss belief propagation in graphical
models to solve the classic problem of side chain placement on a given protein main
chain, which is a key problem in protein design.
In the sixth, final part, the inference of biomolecular structure from experimental
data is discussed. This is of course one of the most fundamental applications
of statistics in structural biology and structural bioinformatics. It is telling that
currently most structure determination methods rely on a so-called pseudo-energy,
which combines a physical force field with a heuristic force field that brings in the
effect of the experimental data. Only recently have methods emerged that formulate this problem of inference in a rigorous Bayesian framework. Habeck discusses
Bayesian inference of protein structure from NMR data, while Hansen discusses
the case of SAXS.
Many of the concepts and methods presented in the book are novel, and have not yet been tested, honed, or proven in large-scale applications. However, the editors
have little doubt that many concepts presented in this book will have a profound
effect on the incremental solution of one of the great challenges in science today.
We thank Justinas V. Daugmaudis for his invaluable help with typesetting the book in LaTeX. In addition, we thank Christian Andreetta, Joe Herman, Kresten Lindorff-Larsen, Simon Olsson and Jan Valentin for comments and discussion.
This book was supported by the Danish Research Council for Technology
and Production Sciences (FTP), under the project “Protein structure ensembles
from mathematical models – with application to Parkinson's α-synuclein", and
the Danish Research Council for Strategic Research (NaBiIT), under the project
“Simulating proteins on a millisecond time-scale”.
Chapter 1
An Overview of Bayesian Inference and Graphical Models

Thomas Hamelryck
The Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
e-mail: [email protected]

1.1 Introduction
The term Bayesian refers to Thomas Bayes (1701?–1761; Fig. 1.1), a Presbyterian
minister who proved a special case of what is now called Bayes’ theorem [27, 31,
691, 692]. However, it was the French mathematician and astronomer Pierre-Simon
Laplace (1749–1827; Fig. 1.1) who introduced the general form of the theorem and
used it to solve problems in many areas of science [415, 692]. Laplace’s memoir
from 1774 had an enormous impact, while Bayes' posthumously published paper from 1763 did not address the same general problem and was only brought to general attention much later [183, 692]. Hence, the name Bayesian statistics
follows the well known Stigler’s law of eponymy, which states that “no scientific
discovery is named after its original discoverer.” Ironically, the iconic portrait of
Thomas Bayes (Fig. 1.1), which is often used in textbooks on Bayesian statistics, is of
very doubtful authenticity [31].
For more than 100 years, the Bayesian view of statistics reigned supreme, but this
changed drastically in the first half of the twentieth century with the emergence of
the so-called frequentist view of statistics [183]. The frequentist view of probability
gradually overshadowed the Bayesian view due to the work of prominent figures
such as Ronald Fisher, Jerzy Neyman and Egon Pearson. They viewed probability
not as a measure of a state of knowledge or a degree of belief, but as a frequency:
an event’s probability is the frequency of observing that event in a large number
of trials.
Fig. 1.1 (Left): An alleged picture of Thomas Bayes (1701?–1761). The photograph is reproduced
from the Springer Statistics Calendar, December issue, 1981, by Stephen M. Stigler (Springer-
Verlag, New York, 1980). The legend reads: “This is the only known portrait of him; it is taken
from the 1936 History of Life Insurance (by Terence O'Donnell, American Conservation Co.,
Chicago). As no source is given, the authenticity of even this portrait is open to question”. The
book does not mention a source for this picture, which is the only known picture of Thomas Bayes.
The photo appears on page 335 with the caption “Rev. T. Bayes: Improver of the Columnar Method
developed by Barrett." (Right): Pierre-Simon Laplace (1749–1827). Engraved by J. Posselwhite.
The difference between the two views is illustrated by the famous sunrise
problem, originally due to Laplace [342]: what is the probability that the sun will
rise tomorrow? The Bayesian approach is to construct a probabilistic model of the
process, estimate its parameters using the available data following the Bayesian
probability calculus, and obtain the requested probability from the model. For a
frequentist, the question is meaningless, as there is no meaningful way to calculate
its probability as a frequency in a large number of trials.
During the second half of the twentieth century the heterogeneous amalgam of methods known as frequentist statistics came increasingly under pressure. The
Bayesian view of probability was kept alive – in various forms and disguises – in
the first part of the twentieth century by figures such as John Maynard Keynes,
Frank Ramsey, Bruno de Finetti, Dorothy Wrinch and Harold Jeffreys. Jeffreys’
seminal book “Theory of probability” first appeared in 1939, and is a landmark
in the history of the Bayesian view of statistics [343]. The label “Bayesian” itself
appeared in the 1950s [183], when Bayesian methods underwent a strong revival.
By the 1960s it became the term preferred by people who sought to escape the
limitations and inconsistencies of the frequentist approach to probability theory
[342, 444, 509]. Before that time, Bayesian methods were known under the name of
inverse probability, because they were often used to infer what was seen as “causes”
from “effects” [183]; the causes are tied to the parameters of a model, while the
effects are evident in the data.1
The emergence of powerful computers, the development of flexible Markov chain Monte Carlo (MCMC) methods (see Chap. 2), and the unifying framework of graphical models have brought about many practical applications of Bayesian statistics
in numerous areas of science and computing. One of the current popular textbooks
on machine learning for example, is entirely based on Bayesian principles [62]. In
physics, the seminal work of Edwin T. Jaynes showed that statistical mechanics
is nothing else than Bayesian statistics applied to physical systems [335, 336].
Physicists routinely use Bayesian methods for the evaluation of experimental data
[154]. It is also becoming increasingly accepted that Bayesian principles underlie
human reasoning and the scientific method itself [342], as well as the functioning of
the human brain [334, 391]. After this short historical introduction, we now turn to
the theory and practice of the Bayesian view on probability.
1.3.1 Overview
1 It should be noted that this is a very naive view. Recently, great progress has been made regarding causal models and causal reasoning [569].
In this section, we first introduce the elements of the full Bayesian approach to statistical inference, before we introduce the various approximations.
The goal of the Bayesian probability calculus is to obtain the probability distribution
over a set of hypotheses, or equivalently, of the parameters of a probabilistic model,
given a certain set of observations. This quantity of interest is called the posterior
distribution. One of the hallmarks of the Bayesian approach is that the posterior
distribution is a probability distribution over all possible values of the parameters of
a probabilistic model. This is in contrast to methods such as maximum likelihood
estimation, which deliver one ’optimal’ set of values for the parameters, called a
point estimate.
The posterior distribution is proportional to the product of the likelihood, which
brings in the information in the data, and the prior distribution or, in short, prior,
which brings in the knowledge one had before the data was observed. The Bayesian
probability calculus makes use of Bayes’ theorem:
p(h \mid d) = \frac{p(d \mid h)\, p(h)}{p(d)} \qquad (1.1)
where:
• h is a hypothesis or the parameters of a model
• d is the data
• p(h | d) is the posterior distribution, or in short the posterior.
• p(d | h) is the likelihood.
• p(h) is the prior distribution, or in short the prior.
• p(d) is the marginal probability of the data or the evidence, with p(d) = ∫ p(d, h) dh.
p(d) is a normalizing constant that only depends on the data, and which in most cases does not need to be computed explicitly. As a result, Bayes' theorem is often applied in practice under the following form:

p(h \mid d) \propto p(d \mid h)\, p(h)
Conceptually, the Bayesian probability calculus updates the prior belief associated with a hypothesis in the light of the information gained from the data. This
prior information is of course embodied in the prior distribution, while the influence
of the data is brought in by the likelihood. Bayes’ theorem (Eq. 1.1) is a perfectly
acceptable and rigorous theorem in both Bayesian and frequentist statistics, but this
conceptual interpretation is specific to the Bayesian view of probability.
Another aspect of the Bayesian approach is that this process can be applied
sequentially, or incrementally. Suppose that one has obtained a posterior distribution
from one data set, and that a second, additional data set that is independent of the
first data set now needs to be considered. In that case, the posterior obtained from the
first data set serves as the prior for the calculation of the posterior from the second
data set. This can be easily shown. Suppose that in addition to the data d, we also have data d′. If d and d′ are indeed independent conditional on h, applying the usual rules of the Bayesian probability calculus leads to:

p(h \mid d, d') \propto p(d, d' \mid h)\, p(h) = p(d' \mid d, h)\, p(d \mid h)\, p(h) = p(d' \mid h)\, p(d \mid h)\, p(h)
Obviously, the last two factors correspond to the posterior distribution obtained from
d alone, and hence the ’posterior becomes the prior’. The second step is due to
the so-called product rule of probability theory, which states for any probability distribution over M variables x₁, x₂, …, x_M:

p(x_1, \ldots, x_M) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_M \mid x_1, \ldots, x_{M-1})
The third step is due to the assumed independence between d and d′ given h; in general, if random variables a and b are independent given c, by definition:

p(a, b \mid c) = p(a \mid c)\, p(b \mid c)
Note that it does not matter whether d or d′ was observed first; all that matters is
that the data are conditionally independent given h.
In the next section, a simple example illustrates these concepts. At this point, we
also need to recall the two fundamental rules of probability theory:
Sum rule: p(a) = \sum_b p(a, b)

Product rule: p(a, b) = p(a \mid b)\, p(b)
These two rules lead directly to Bayes’ theorem. Their justification will be discussed
briefly in Sect. 1.4.
In practice, n is typically given as part of the experimental design, and left out of
the expression. Now, suppose we observe k successful trials out of n draws, and we
want to infer the value of θ. Following the Bayesian probability calculus, we need to obtain the posterior distribution, which is equal to:

p(\theta \mid k) \propto p(k \mid \theta)\, p(\theta)
Obviously, the first factor, which is the likelihood, is the binomial distribution
given in Eq. 1.2. Now, we also need to specify the second factor, which is the prior distribution, which reflects our knowledge about θ before k was observed. Following Laplace and Bayes, we could adopt a uniform distribution on the interval [0, 1]. In
that case, we obtain the following result for the posterior:
p(\theta \mid k) \propto \theta^k (1 - \theta)^{n-k}
Fig. 1.2 The Beta distribution Be(θ | α₁, α₂) for α₁ = α₂ = 6. This corresponds to the posterior distribution of the binomial parameter θ resulting from five successes in ten trials (k = n − k = 5) combined with a uniform prior on θ. As expected, the posterior density of θ is symmetric and reaches a maximum at 0.5.
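As an illustration of this calculation (not taken from the book, and using the same numbers as Fig. 1.2), the posterior can be evaluated in a few lines of Python; with a uniform prior, observing k successes in n trials yields a Beta(k + 1, n − k + 1) posterior.

import numpy as np
from scipy import stats

k, n = 5, 10                                # five successes in ten trials, as in Fig. 1.2
posterior = stats.beta(k + 1, n - k + 1)    # uniform prior => Beta(k+1, n-k+1) posterior

theta = np.linspace(0, 1, 501)
print(theta[np.argmax(posterior.pdf(theta))])   # posterior mode, ~0.5
print(posterior.mean())                         # posterior mean, (k+1)/(n+2) = 0.5 here
print(posterior.interval(0.95))                 # a 95% credible interval for theta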
p(x \mid h) = \mathcal{N}(x \mid \mu_h, \sigma_h)
simply integrated away, by summing over all possible components. For a Gaussian
mixture model with H components, this corresponds to:
p(x) = \sum_{h=1}^{H} \mathcal{N}(x \mid \mu_h, \sigma_h)\, p(h)
For p(h), one typically uses a multinomial distribution. In practice, this distribution is parameterized by a vector θ that assigns a probability to each value of h. This
simple variant of the multinomial distribution – where h is a single outcome and not
a vector of counts – is often called the categorical or the discrete distribution.
Let us now consider Bayesian parameter estimation for such a model, and focus on the parameters θ of the multinomial distribution (see Fig. 1.3b). The goal of the Bayesian inference is to obtain a probability distribution for θ [697]. Following the rules of the Bayesian probability calculus, we need to specify a prior distribution over θ, with given, fixed parameters. As we will discuss in Sect. 1.6.2, the Dirichlet
distribution [164] is typically used for this purpose:
Di(\theta \mid \alpha) = \frac{1}{Z(\alpha)} \prod_{h=1}^{H} \theta_h^{\alpha_h - 1}
Fig. 1.4 The density of the Dirichlet distribution for H = 3, for different values of α = (α₁, α₂, α₃). For H = 3, the Dirichlet distribution is a probability distribution over three-dimensional probability vectors θ = (θ₁, θ₂, θ₃); the support of the distribution is thus a triangle with vertices at (1, 0, 0), (0, 1, 0) and (0, 0, 1). The figure shows θ₁ on the X-axis and θ₂ on the Y-axis; θ₃ is implied by the constraint θ₁ + θ₂ + θ₃ = 1. Increasing probability density is shown as darker shades of gray, and the black lines indicate equiprobability contours. Note that the distribution is uniform on the triangle for α = (1, 1, 1).
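The generative view of this hierarchical model can be sketched in a few lines of Python (all numerical values below are illustrative, not from the book): mixing weights θ are drawn from a Dirichlet prior, a component h from the categorical distribution with parameter θ, and finally an observation from the selected Gaussian.

import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet prior on the mixing weights theta
mu    = np.array([-2.0, 0.0, 3.0])       # component means (illustrative)
sigma = np.array([0.5, 1.0, 0.7])        # component standard deviations (illustrative)

theta = rng.dirichlet(alpha)             # draw mixing weights from the prior
h = rng.choice(3, size=1000, p=theta)    # draw component labels (categorical distribution)
x = rng.normal(mu[h], sigma[h])          # draw observations from the selected Gaussians

print("mixing weights:", np.round(theta, 3))
print("sample mean:", x.mean())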
whose support is not the familiar infinite N-dimensional Euclidean space, for which
the prominent Gaussian distribution is commonly used. In Chaps. 6, 7 and 10 we
will see that probability distributions on unusual manifolds such as the sphere or
the torus are important for the formulation of probabilistic models of biomolecular
structure. Such distributions also arise in probabilistic methods to superimpose
protein structures, as discussed in Chaps. 8 and 9.
We only discussed the inference of θ; of course, one would also use priors for the means and the standard deviations of the Gaussian distributions. Hierarchical
models are simple examples of graphical models, which will be discussed in more
detail in Sect. 1.9.
The posterior distribution contains all the information of interest on the parameters
or the hypothesis under consideration. However, sometimes one may prefer to use a
point estimate: a specific estimate of the parameter(s) that is in some way optimal.
There are two possible reasons for this way of proceeding.
First, the calculation of the posterior might be computationally intractable or
impractical, and therefore one settles for a simple point estimate. Whether this
produces meaningful results depends on the shape of the posterior, and the intended
use of the point estimate. Roughly speaking, point estimates will make most sense
when the posterior is sharply peaked and unimodal.
Second, one can be facing a decision problem: an optimal decision needs to be
made. In the latter case, consider buying a bottle of wine: having an idea of which
wines in a shop are good and which aren’t is a good start, but in the end one wants to
bring home a specific bottle that goes well with the dinner in the evening. For such
decision problems, one needs to define a loss function L(θ̂, θ) that measures the price of acting as if the point estimate θ̂ is true, when the real value is actually θ.
The point estimate θ̂ that minimizes the expected loss is called a Bayes estimator for that loss function. The expected loss L̄(θ̂) for given data d is:

\bar{L}(\hat{\theta}) = \int L(\hat{\theta}, \theta)\, p(\theta \mid d)\, d\theta
Two point estimates are especially common. First, one can simply use the
maximum of the posterior distribution, in which case one obtains the maximum a
posteriori (MAP) estimate:
\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(d \mid \theta)\, p(\theta)
The MAP estimate essentially follows from a specific loss function, called the zero-
one loss function:
L(\hat{\theta}, \theta) = 0 \quad \text{if } |\hat{\theta} - \theta| \le \varepsilon
L(\hat{\theta}, \theta) = 1 \quad \text{if } |\hat{\theta} - \theta| > \varepsilon
This loss function is zero if the point estimate θ̂ lies in a ball with small radius ε around θ, and one otherwise.
Second, if one assumes that the prior is uniform, one obtains the maximum
likelihood (ML) estimate:

\hat{\theta}_{ML} = \arg\max_{\theta}\, p(d \mid \theta)
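A minimal numerical sketch of the two point estimates, reusing the binomial/Beta example from above (the Beta prior parameters are arbitrary choices for illustration):

import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

k, n = 5, 10
alpha1, alpha2 = 6.0, 2.0   # an illustrative, informative Beta prior

def neg_log_likelihood(theta):
    return -stats.binom.logpmf(k, n, theta)

def neg_log_posterior(theta):
    return neg_log_likelihood(theta) - stats.beta.logpdf(theta, alpha1, alpha2)

ml = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded").x
map_ = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(ml, map_)   # ML = k/n = 0.5; MAP = (k+alpha1-1)/(n+alpha1+alpha2-2) = 0.625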
ML and MAP estimation provide a point estimate, and not a full posterior
distribution over the parameters. However, under some general assumptions, it is
still possible to obtain an approximation of the posterior distribution from these
point estimates. For this, we need the Fisher information [211].
For large datasets and under some general assumptions, the posterior distribution
will converge to a Gaussian distribution, whose mean is the ML estimate. The
variance σ² of this Gaussian distribution is proportional to the inverse of the Fisher information, I(θ):

\sigma^2 = \frac{1}{M\, I(\theta)}
where M is the number of observations, for M → ∞. The Fisher information I(θ) is minus the expectation of the second derivative with respect to θ of the log-likelihood:

I(\theta) = -E_{p(d \mid \theta)}\left\{ \frac{d^2 \log p(d \mid \theta)}{d\theta^2} \right\}
The expectation is with respect to d under the likelihood p(d | θ). The Fisher information has an interesting interpretation: it measures how much information is contained in the data about the parameter θ by considering the curvature of the log-
likelihood landscape around the ML estimate. A low value indicates that the peak
of the log-likelihood function around the maximum likelihood estimate is blunt,
Empirical Bayes methods are nothing else than ML estimation applied to certain
hierarchical models, so the name is rather misleading. Essentially, the probabilistic
model consists of what looks like a likelihood function and a corresponding prior
distribution. In a true Bayesian approach, the parameters of the prior would be
specified beforehand, reflecting the prior state of knowledge and without being
influenced by the data. In the empirical Bayes approach however, a ML point
estimate obtained from the data is used for the parameters of the prior. In Chap. 8, an
empirical Bayes approach is used to develop a probabilistic method to superimpose
protein structures.
A classic example of empirical Bayes is the Beta-binomial distribution, which
represents the probability of observing a pair of counts a; b for two events A and B.
The nature of this model (see Fig. 1.3b) is probably best understood by considering
the process of obtaining samples from it. First, the binomial parameter θ is drawn from the Beta distribution with parameters α₁, α₂. Note that the Beta distribution is nothing else than the two-dimensional Dirichlet distribution. Then, a set of n outcomes – falling into the two events A and B – is drawn from the binomial distribution with parameter θ. More explicitly, this can be written as:
p(\theta \mid \alpha_1, \alpha_2) = \frac{1}{Z(\alpha_1, \alpha_2)}\, \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_2 - 1}

p(a, b \mid \theta) = \binom{a + b}{a}\, \theta^a (1 - \theta)^b
In the empirical Bayes method applied to this hierarchical model, the parameters α₁, α₂ are estimated from the data by maximum likelihood [510]. In a true Bayesian treatment, the values of α₁, α₂ would be chosen based on the prior belief, independent of the data.
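As a sketch of the empirical Bayes step (with made-up count data, and assuming SciPy's beta-binomial distribution, scipy.stats.betabinom, is available), α₁ and α₂ can be estimated by maximizing the marginal likelihood of the observed counts:

import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Illustrative count data: pairs (a, b) for events A and B in several replicates.
counts = np.array([[7, 3], [5, 5], [8, 2], [6, 4], [9, 1]])
n = counts.sum(axis=1)

def neg_marginal_loglik(log_alpha):
    a1, a2 = np.exp(log_alpha)          # work on the log scale to keep the alphas positive
    return -stats.betabinom.logpmf(counts[:, 0], n, a1, a2).sum()

res = minimize(neg_marginal_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
alpha1, alpha2 = np.exp(res.x)
print("empirical Bayes estimates:", alpha1, alpha2)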
Empirical Bayes methods also lie at the theoretical heart of another type of point
estimation methods: the shrinkage estimators, of which the James-Stein estimator
is the most well known [169, 170, 688, 689]. The statistical community was shocked
when Charles Stein showed in 1955 that the conventional ML estimation methods
for Gaussian models are suboptimal – in dimensions higher than two – in terms of the expected squared error.
Suppose we have a set of observations y₁, y₂, …, y_m with mean ȳ drawn from an N-dimensional Gaussian distribution (N > 2) with unknown mean μ and given covariance matrix σ²I_N. The maximum likelihood estimate μ̂_ML of μ is simply the mean ȳ, while the James-Stein estimate μ̂_JS is:

\hat{\mu}_{JS} = \left(1 - \frac{\sigma^2 (N - 2)}{\|\bar{y}\|^2}\right) \bar{y}
Stein and James proved in 1961 that the latter estimate is superior to the ML estimate
in terms of the expected total squared error loss [689].
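A small simulation (illustrative, and using the positive-part variant of the shrinkage factor to avoid negative values) makes the risk difference visible:

import numpy as np

rng = np.random.default_rng(1)
N, sigma = 10, 1.0                          # dimension and known standard deviation
mu_true = rng.normal(0.0, 1.0, size=N)      # the unknown mean we try to estimate

def james_stein(y, sigma2, N):
    # positive-part variant of the shrinkage factor, shrinking towards 0
    factor = max(0.0, 1.0 - sigma2 * (N - 2) / np.sum(y ** 2))
    return factor * y

err_ml = err_js = 0.0
for _ in range(2000):
    y = rng.normal(mu_true, sigma)          # one observation; the ML estimate is y itself
    err_ml += np.sum((y - mu_true) ** 2)
    err_js += np.sum((james_stein(y, sigma ** 2, N) - mu_true) ** 2)

print("ML risk:", err_ml / 2000, "JS risk:", err_js / 2000)   # the JS risk is smaller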
James-Stein estimation is justified by an empirical Bayes approach in the
following way [169, 170]. First, one assumes that the prior distribution of the mean μ is a Gaussian distribution with mean 0 and unknown covariance matrix τ²I_N. Second, following the empirical Bayes approach, one estimates the unknown parameter τ² of the prior from the data. Finally, the James-Stein estimate is obtained
as the Bayes estimator under a quadratic loss function for the resulting empirical
Bayes “posterior”. The poor performance of the ML estimate can be understood as
the result of the de facto use of a uniform prior over the mean, which in this case is
actually overly informative.
Because the prior distribution has mean 0, the James-Stein estimate “shrinks” the
ML estimate towards 0, hence the name shrinkage estimators. Shrinkage estimators
can be extended beyond the Gaussian case discussed above, and have proven to be
very useful for obtaining point estimates in high dimensional problems [90, 628].
For point estimation, ML and MAP estimation are by far the most commonly used
methods. However, even these approximations to a full Bayesian analysis can be
intractable. In such cases, estimation based on the pseudolikelihood or the method
of moments can be an alternative. These methods are for example used in Chaps. 6
and 7, for parameter estimation of probability distributions over angular variables.
In the pseudolikelihood approach [55, 480, 696], the likelihood is approximated
by the product of the marginal, conditional probabilities. For example, suppose the
p(d \mid \theta) = p(d_1, d_2, \ldots, d_M \mid \theta)
          \approx p(d_1 \mid d_2, \ldots, d_M, \theta)\, p(d_2 \mid d_1, d_3, \ldots, d_M, \theta) \cdots p(d_M \mid d_1, \ldots, d_{M-1}, \theta)
          = \prod_{m=1}^{M} p(d_m \mid \{d_1, \ldots, d_M\} \setminus d_m, \theta)
Consider, for example, the Gamma distribution with shape parameter α and scale parameter β, whose mean and variance are:

\mu = \alpha\beta, \qquad \sigma^2 = \alpha\beta^2

Hence, we can estimate α, β by making use of the mean and variance calculated from the data:

\hat{\alpha} = \frac{\hat{\mu}^2}{\hat{\sigma}^2}, \qquad \hat{\beta} = \frac{\hat{\sigma}^2}{\hat{\mu}}
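A minimal sketch of this moment-matching estimator, using simulated Gamma data (the shape and scale values are arbitrary):

import numpy as np

rng = np.random.default_rng(2)

# Simulated data from a Gamma distribution with shape alpha = 3 and scale beta = 2.
data = rng.gamma(shape=3.0, scale=2.0, size=10_000)

mean, var = data.mean(), data.var()
alpha_hat = mean ** 2 / var      # from mean = alpha*beta and variance = alpha*beta^2
beta_hat = var / mean
print(alpha_hat, beta_hat)       # close to (3, 2)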
The Bayesian view of probability can be justified in several ways. Perhaps the most
convincing justification is its firm axiomatic basis: if one adopts a small set of
requirements regarding beliefs including respecting the rules of logic, the rules of
probability theory2 necessarily follow [127]. Secondly, de Finetti’s theorem states
that if a dataset follows certain common conditions, an appropriate probabilistic
model for the data necessarily consists of a likelihood and a prior [49]. Finally,
another often used justification, also due to de Finetti, is based on gambling, where
the use of beliefs that respect the probability calculus avoids situations of certain
loss for a bookmaker. We will take a look at these three justifications in a bit more
detail.
The Cox axioms [127,342], first formulated by Richard T. Cox in 1946, emerge from
a small set of requirements; properties that clearly need to be part of any consistent
calculus involving degrees of belief. From these axioms, the Bayesian probability
calculus follows. Informally, the Cox axioms and their underlying justifications
correspond to:
1. Degrees of belief are expressed as real numbers. Let's say the belief in event a is written as B(a).
2. Degrees of belief are ordered: if B(a) > B(b) and B(b) > B(c) then B(a) > B(c).
3. There is a function F that connects the beliefs in a proposition a and its negation ¬a:

B(a) = F[B(¬a)]
4. If we want to calculate the belief that two propositions a and b are true, we can
first calculate the belief that b is true, and then the belief that a is true given that b is true. Since the labelling is arbitrary, we can switch a and b around in this
statement, which leads to the existence of a function G that has the following
property:
B(a, b) = G[B(a | b), B(b)] = G[B(b | a), B(a)]
These axioms imply a set of important requirements, including:
2 In 1933, Kolmogorov formulated a set of axioms that form the basis of the mathematical theory of probability. Most interpretations of probability, including the frequentist and Bayesian interpretations, follow these axioms. However, the Kolmogorov axioms are compatible with many interpretations of probability.
• Consistency with logic, when beliefs are absolute (that is, true or false).
• Different ways of reasoning for the same problem within the rules of the calculus
lead to the same result.
• Identical states of knowledge, differing by labelling only, lead to the assignment
of identical degrees of belief.
Surprisingly, this simple set of axioms is sufficient to pinpoint the rules of proba-
bilistic inference completely. The functions F and G turn out to be F(x) = 1 − x and G(x, y) = xy, as expected. In particular, the axioms lead to the two central rules
of probability theory. To recall, these rules are the product rule:
Another argument, due to de Finetti and Ramsey [705], is based on linking beliefs
with a willingness to bet. If a bookmaker isn’t careful, he might propose a set of
bets and odds that make it possible for a gambler to make a so-called Dutch book.
A Dutch book guarantees a profit for the gambler, and a corresponding loss for
the bookmaker, regardless of the outcome of the bets. It can be shown that if the
bookmaker respects the rules of the probability calculus in the construction of the
odds, the making of a Dutch book is impossible.
A second argument often invoked to justify the Bayesian view of probability is
also due to de Finetti, and is called de Finetti’s representation theorem [49]. The
theorem deals with data that are exchangeable: that is, any permutation of the data
does not alter the joint probability distribution. In simple words, the ordering of the
data does not matter.
Let us consider the case of an exchangeable series of N Bernoulli random
variables, consisting of zeros and ones. For those data, de Finetti’s theorem
essentially guarantees that the joint probability distribution of the data can be
written as:
p(x_1, \ldots, x_N) \propto \int_0^1 \left\{ \prod_{n=1}^{N} \theta^{x_n} (1 - \theta)^{1 - x_n} \right\} p(\theta)\, d\theta
The two factors in the integral can be interpreted as a likelihood and a prior.
The interpretation is that exchangeability leads to the existence of a likelihood and a
prior, and can thus be interpreted as an argument in favor of the Bayesian viewpoint.
Although the theorem is stated here as it applies to binomial data, the theorem
extends to many other cases.
Information theory was developed by Claude Shannon at the end of the 1940s [342,
645]. It has its origins in the study of the transmission of messages through noisy
communication channels, but quickly found applications in nearly every branch of
science and engineering. As it has many fundamental applications in probability
theory – such as for example in the construction of suitable prior distributions [52] –
we give a quick overview.
Informally speaking, information quantifies the “surprise” that is associated with
gaining knowledge about the value of a certain variable. If the value was “expected”
because its probability was high, the information gain is minimal. However, if the
value x was “unexpected” because its probability was low, the information gain is
high. Several considerations lead to the following expression for the information I
associated with learning the value of a discrete variable x:

I(x) = -\log p(x)
The entropy is at its maximum when p(x) is the uniform distribution. The
information entropy becomes zero when x adopts one specific value with probability
one. Shannon’s information entropy applies to discrete random variables: it cannot
be extended to the continuous case by simply replacing the sum with an integral
[337]. For the continuous case, a useful measure is the Kullback-Leibler divergence,
which is discussed in the next section.
symmetric and does not respect the triangle inequality, which is why it is called a
’divergence’. In the discrete case, the KL divergence is defined as:
KL[p \parallel q] = \sum_x p(x) \log \frac{p(x)}{q(x)}
In the continuous case, the sum is replaced by an integral:
KL[p \parallel q] = \int p(x) \log \frac{p(x)}{q(x)}\, dx
The mutual information measures the mutual dependence of the two random
variables. Intuitively, it tells you how much information you gain about the first
random variable if you are given the value of the other one. The mutual information
I_{x,y} of two random variables x and y is:

I_{x,y} = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
In the corresponding version of the mutual information for continuous random vari-
ables, the sums are again replaced by integrals. The mutual information is nonneg-
ative and symmetric. If x and y are independent, their mutual information is equal
to zero.
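These three quantities are straightforward to compute for discrete distributions; the following sketch (with illustrative distributions) uses natural logarithms, so the results are in nats rather than bits:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def mutual_information(joint):
    joint = np.asarray(joint, float)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    return kl_divergence(joint.ravel(), np.outer(px, py).ravel())

p_uniform = np.full(6, 1 / 6)                   # an "honest" die
print(entropy(p_uniform))                       # log 6, the maximum entropy for 6 outcomes
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))    # asymmetric: compare with the reverse order
joint = np.array([[0.4, 0.1], [0.1, 0.4]])      # two dependent binary variables
print(mutual_information(joint))                # > 0 because x and y are dependent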
One of the most disputed and discussed aspects of the Bayesian view is the construction of the prior distribution. Typically, one wants to use a so-called non-informative prior. That is, a prior that correctly represents the "ignorance" in a
particular situation and does not obscure the information present in the data. In many
cases the choice of a non-informative prior is clear. For the finite, discrete case,
the maximum entropy principle and the related principle of indifference apply.
For the univariate, continuous case, Jeffreys’ priors are appropriate. In many other
situations, the choice of a suitable prior is less clear. The construction of suitable
non-informative priors is still the object of much discussion and research. We will
now briefly look at some of these methods to decide which prior distributions to use,
including some of a more pragmatic nature.
The earliest principle for the construction of a prior is due to Laplace and Bayes,
and is called the principle of insufficient reason or the principle of indifference. If a
variable of interest can adopt a finite number of values, and all of these values are
indistinguishable except for their label, then the principle of indifference suggests
to use the discrete uniform distribution as a prior. Hence, if the variable can adopt K values, each value is assigned a probability equal to 1/K. For continuous variables
the principle often produces unappealing results, and alternative approaches are
necessary (see Sect. 1.6.1.3). The principle of indifference can be seen as a special
case of the maximum entropy principle, which we consider next.
The principle of indifference can be seen as the result of a more general principle:
the principle of maximum entropy [337, 342], often called MaxEnt. Let’s again
consider a variable A that can adopt a finite number of discrete values a₁, …, a_K. Suppose now that some information is available about A; for example its mean value ā:

\bar{a} = \sum_{k=1}^{K} p(a_k)\, a_k
The classic illustration of this problem is known as the Brandeis dice problem [339].
Suppose we are given the following information, and nothing else, about a certain
dice: the average outcome ā of throwing the dice is 4.5, instead of the average of
an “honest” dice, which is 3.5. The question is now, what are the probabilities we
assign to throwing any of the K D 6 values? Clearly, these probabilities will differ
from the honest case, where the probability of throwing any of the six values is 1/6.
The problem is to come up with a plausible distribution p(A) that is compatible with the given information. The solution is to find the probability distribution with maximum entropy that is compatible with the given information – in this case the mean ā. The problem can be easily solved using the method of Lagrange multipliers,
which can be used to maximize a function under a given set of constraints. In this
case we want to maximize the information entropy:
S_A = -\sum_{k=1}^{K} p(a_k) \log p(a_k)

subject to the normalization constraint

\sum_{k=1}^{K} p(a_k) = 1 \qquad (1.4)
The resulting Lagrangian function for this problem is then, using p_k ≡ p(a_k) and p ≡ (p₁, …, p_K) for simplicity:

L(p, \alpha, \beta) = -\sum_{k=1}^{K} p_k \log p_k - \alpha\left(\sum_{k=1}^{K} p_k - 1\right) - \beta\left(\sum_{k=1}^{K} p_k a_k - \bar{a}\right) \qquad (1.5)
where α and β are the Lagrange multipliers. The second and third term impose the constraints of Eqs. 1.3 and 1.4, respectively. The solution is found by setting the partial derivatives with respect to p₁, …, p_K to zero:

\frac{\partial L(p, \alpha, \beta)}{\partial p_k} = 0 = -\log(p_k) - 1 - \alpha - \beta a_k

which leads to

p_k = \frac{1}{Z} \exp(-\beta a_k), \qquad Z = \sum_{k=1}^{K} \exp(-\beta a_k)
dishonest dice with an average equal to 4.5, we obtain the following (approximate)
probabilities [339]:
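A short numerical sketch of this solution: solve for the Lagrange multiplier β so that the exponential distribution derived above reproduces the given mean of 4.5 (the root-finding bracket is an arbitrary choice):

import numpy as np
from scipy.optimize import brentq

a = np.arange(1, 7)                      # the six faces of the die
target_mean = 4.5

def mean_for_beta(beta):
    p = np.exp(-beta * a)
    p /= p.sum()
    return p @ a

beta = brentq(lambda b: mean_for_beta(b) - target_mean, -5.0, 5.0)
p = np.exp(-beta * a); p /= p.sum()
print(np.round(p, 3))    # increasing probabilities, favouring the high faces
print(p @ a)             # 4.5, as required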
For discrete, finite cases, the maximum entropy framework provides a convenient
method to construct prior distributions that correctly reflect any prior knowledge.
For the continuous case, the situation is not so clear [337]. The definition of
information entropy does not simply generalize to the continuous case, and other
ways to construct priors are typically needed.
The principle of indifference and the MaxEnt method are intuitively reasonable,
but applying the principle to cases that are not finite and discrete quickly leads to
problems. For example, applying the principle to a variable x on the positive real line ℝ⁺ leads to a uniform distribution over ℝ⁺. However, consider a non-linear monotone transformation y(x) of x. Surely, ignorance on x should imply equal ignorance on y, which leads to the demand that p(y)dy should be equal to p(x)dx. Clearly, the uniform distribution on ℝ⁺ does not fulfill the demands associated
with these invariance considerations. In 1946, Jeffreys proposed a method to obtain
priors that take these invariance considerations into account. In the univariate case,
Jeffreys’ prior is equal to: p
p.x/ / I.x/
where I.x/ is the Fisher information of the likelihood. We also encountered the
Fisher information in Sect. 1.3.5.2. The Jeffreys’ priors are invariant under one-to-
one reparameterization.
If the likelihood is a univariate Gaussian with known mean μ and unknown standard deviation σ, Jeffreys' rule leads to a prior over σ equal to p(σ) ∝ 1/σ [428]. If the likelihood is the binomial distribution with parameter θ, the rule leads to a prior over θ equal to p(θ) ∝ 1/√(θ(1 − θ)) [428]. Jeffreys' rule typically leads to good
results for the continuous univariate case, but the extension to the multivariate case
is problematic. We will encounter Jeffreys’ priors in Chap. 12, where they serve as
priors over nuisance variables in protein structure determination from NMR data.
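As a small numerical check (not from the book), the Fisher information of a single Bernoulli trial can be approximated directly from its definition and compared with the analytic Jeffreys prior, which is a Beta(1/2, 1/2) distribution up to normalization:

import numpy as np
from scipy import stats

def fisher_information(theta, eps=1e-4):
    # I(theta) = minus the expected second derivative of the log-likelihood,
    # approximated by a central second difference of E[log p(x | t)] at t = theta.
    def expected_loglik(t):
        return theta * np.log(t) + (1 - theta) * np.log(1 - t)
    return -(expected_loglik(theta + eps) - 2 * expected_loglik(theta)
             + expected_loglik(theta - eps)) / eps ** 2

theta = np.linspace(0.05, 0.95, 10)
jeffreys = np.sqrt([fisher_information(t) for t in theta])   # unnormalized Jeffreys prior
analytic = stats.beta.pdf(theta, 0.5, 0.5) * np.pi           # Beta(1/2, 1/2) up to a constant
print(np.allclose(jeffreys, analytic, rtol=1e-3))            # True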
Maximum entropy and Jeffreys’ priors can be justified in a more general frame-
work: the reference prior [52]. Reference priors were proposed by José Bernardo
[43, 50], and provide an elegant and widely applicable procedure to construct non-
informative priors.
The central idea behind reference priors is that one aims to find a prior that has a
minimal effect on the posterior distribution, resulting in a reference posterior where
the influence of the prior is minimal. Again, information theory comes to the rescue.
Recall that the Kullback-Leibler divergence is a natural measure of the distance
between two probability distributions. The reference prior is defined [43] as the
prior distribution that maximizes the expected KL divergence between the prior p(θ) and the posterior distribution p(θ | d). The expectation is with respect to the marginal probability of the data p(d):

E_{p(d)}\left[ \int p(\theta \mid d) \log \frac{p(\theta \mid d)}{p(\theta)}\, d\theta \right]
Conjugate priors are priors that can arise from any of the above considerations,
but that have the computationally convenient property that the resulting posterior
distribution has the same form as the prior. A well known and widely used example
of a conjugate prior is the Dirichlet prior for the multinomial distribution, which
we already encountered in Sect. 1.3.5.3 as the Beta-binomial distribution for two
dimensions. Conjugate priors are typically used because they make the calculation
of the posterior easier, not because this is the prior that necessarily best describes
the prior state of knowledge.
The binomial distribution with parameters n, θ (with n a positive integer and 0 < θ < 1) gives the probability of a discrete variable x (with 0 ≤ x ≤ n) according to:

Bi(x \mid \theta, n) = \binom{n}{x}\, \theta^x (1 - \theta)^{n - x}
The Binomial distribution is typically interpreted as giving the probability of x
successes in n trials, where each trial can either result in success or failure.
Now, we want to infer θ given x and n, where n is the length of the sequence, and thus typically known. Following the standard approach, and using the Binomial distribution as the likelihood, we obtain:

p(\theta \mid x, n) \propto Bi(x \mid \theta, n)\, p(\theta)
For the prior, we now use a Beta distribution. The Beta distribution with parameters α₁ > 0, α₂ > 0 is the two-dimensional version of the Dirichlet distribution. The latter distribution is a probability distribution on the N-dimensional simplex, or alternatively, a probability distribution over the space of probability vectors – vectors of strictly positive real numbers that sum to one. In the case of the Beta distribution used as a prior for θ, this becomes:

Be(\theta \mid \alpha_1, \alpha_2) \propto \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_2 - 1}
Combining this prior with the binomial likelihood gives:

p(\theta \mid x, n) \propto \theta^{x + \alpha_1 - 1} (1 - \theta)^{n - x + \alpha_2 - 1}

Evidently, the obtained posterior has the same functional form as the prior.
In addition, this prior has an interesting interpretation as a so-called pseudocount.
Given x and n, the ML estimate for θ is simply x/n. Using a Beta prior with parameters α₁, α₂, the MAP estimate for θ becomes (x + α₁)/(n + α₁ + α₂). Hence, the use of a
Beta prior can be interpreted as adding extra counts – often called pseudocounts –
to the number of observed failures and successes. The whole approach can be easily
extended to the higher dimensional case, by combining a multinomial likelihood
with a Dirichlet prior.
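A two-line illustration of the conjugate update and the pseudocount interpretation (the prior and data values are arbitrary):

from scipy import stats

alpha1, alpha2 = 2.0, 2.0      # Beta prior: two pseudocounts for success and for failure
x, n = 7, 10                   # observed successes and number of trials

posterior = stats.beta(alpha1 + x, alpha2 + (n - x))   # conjugacy: the posterior is again a Beta
print(posterior.mean())        # (x + alpha1) / (n + alpha1 + alpha2) = 9/14
print(x / n)                   # ML estimate, for comparison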
Improper priors are priors that are not probability distributions, because they are
not properly normalized. Consider for example a parameter θ that can adopt any value on the real line (from −∞ to +∞). A uniform prior on such a parameter
is an improper prior, since it cannot be properly normalized and hence is not a
true probability density function. Nonetheless, such priors can be useful, provided
the resulting posterior distribution is still a well defined and properly normalized
probability distribution. Some of the above methods to construct priors, such as the
reference prior method or Jeffreys’ rule, often result in improper priors.
So far we have assumed that the problem of inference is limited to the parameters
of a given model. However, typically it is not known which model is best suited
for the data. A classic example is the mixture model, as discussed in Sect. 1.3.4.
The Bayes factor B is the ratio of the respective probabilities of the data d given the two different models M1 and M2:

B = \frac{p(d \mid M_1)}{p(d \mid M_2)}
Hence, the Bayes factor gives an idea of how the data affect the belief in M1 and M2 relative to the prior information on the models.
This Bayes factor is similar to the classic likelihood ratio, but involves integrating
out the model parameters, instead of using the maximum likelihood values. Hence:
B = \frac{p(d \mid M_1)}{p(d \mid M_2)} = \frac{\int p(d \mid \theta_1, M_1)\, \pi_1(\theta_1 \mid M_1)\, d\theta_1}{\int p(d \mid \theta_2, M_2)\, \pi_2(\theta_2 \mid M_2)\, d\theta_2} \qquad (1.6)

where θ₁ and θ₂ are the respective model parameters and π₁(·) and π₂(·) are the corresponding priors over the model parameters.
The Bayes factor is typically interpreted using a scale introduced by Jeffreys
[343]:
• If the logarithm of B is below 0, the evidence is against M1 and in favor of M2 .
• If it is between 0 and 0.5, the evidence in favor of M1 and against M2 is weak.
• If it is between 0.5 and 1, it is substantial.
• If it is between 1 and 1.5, it is strong.
Often, the calculation of the Bayes factor is intractable. In that case, one resorts
to an approximation. In order to calculate the Bayes factors, we need to calculate
p(d | M) for the models involved. This can be done using the following rough approximation:

\log p(d \mid M) \approx \log p(d \mid \theta_{ML}, M) - \frac{1}{2} Q \log R \qquad (1.7)

where θ_ML is the ML point estimate, R is the number of data points and Q is the number of free parameters in θ_ML. This approximation is based on the Laplace
approximation [62], which involves representing a distribution as a Gaussian
distribution centered at its mode.
Equation 1.7 is called the Bayesian Information Criterion (BIC) or the Schwarz criterion [638]. Model selection is done by calculating the BIC value for all models,
and selecting the model with the largest value. Conceptually, the BIC corresponds to
the likelihood penalized by the complexity of the model, as measured by the number
of parameters. The use of the ML estimate θ_ML in the calculation of the BIC is of course its weak point: the prior is assumed to be uniform, and any prior information
is not taken into account.
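A minimal sketch of BIC-based model selection using Eq. 1.7 (the data and the two competing Gaussian models are illustrative choices, not an example from the book):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=0.8, scale=1.0, size=200)   # illustrative data
R = len(data)

def bic(loglik, Q):
    # Eq. 1.7: log p(d | M) ~ log p(d | theta_ML, M) - (Q / 2) * log R
    return loglik - 0.5 * Q * np.log(R)

# Model 1: Gaussian with free mean and standard deviation (Q = 2).
loglik1 = stats.norm.logpdf(data, data.mean(), data.std()).sum()
# Model 2: Gaussian with mean fixed at 0 and free standard deviation (Q = 1).
sigma_ml = np.sqrt(np.mean(data ** 2))
loglik2 = stats.norm.logpdf(data, 0.0, sigma_ml).sum()

print(bic(loglik1, 2), bic(loglik2, 1))   # the model with the larger value is selected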
Often, the posterior distribution is only available in the form of a set of samples,
typically obtained using an MCMC algorithm. For the calculation of the AIC and the BIC, the ML or MAP estimate of θ is needed. These estimates are not
readily obtained from MCMC sampling. In that case, the deviance information
criterion (DIC) [683] can be used for model selection. All quantities needed for
the calculation of the DIC can be readily obtained from a set of samples from the
posterior distribution. Another advantage of the DIC is that it takes the prior into
account.
Like the BIC and AIC, the DIC balances the complexity of the model against the
fit of the data. In the case of the DIC, these qualities are numerically evaluated as
the effective number of free parameters p_D and the posterior mean deviance D̄. The DIC is subsequently given by:

DIC = p_D + \bar{D}

These quantities are calculated as follows. First, the deviance D(θ) is defined as:

D(\theta) = -2 \log p(d \mid \theta)
The posterior mean deviance D̄ is given by the expectation of the deviance under the posterior:

\bar{D} = E_{p(\theta \mid d)}[D(\theta)]
The effective number of parameters is defined as:

p_D = \bar{D} - D(\bar{\theta})

where θ̄ is the posterior mean of the parameters.
The Bayes factor, BIC, AIC or DIC can be used to select the best model among a set of models whose parameters were inferred previously. However, it is also
possible to sample different models, together with their parameters, following the
Bayesian probability calculus using an MCMC algorithm. This can be done with a
method introduced by Peter Green in 1995 called reversible jump MCMC [253].
1.8.1 Overview
Consider a physical system that can adopt K microstates s_k, each with an associated energy E(s_k). The internal energy U is the average energy over all states:

U = \bar{E} = \sum_{k=1}^{K} E(s_k)\, p(s_k)
Given only the average energy over all the states, and with the probabilities
of the states unknown, the question is now how do we assign probabilities to
all the states? A simple solution to this question makes use of the maximum
entropy principle [150,336]. The solution is a probability distribution over the states
which results in the correct average value for the energy and which maximizes the
entropy. Computationally, the solution can be found by making use of the following
Lagrangian function, using p_k ≡ p(s_k), p ≡ (p₁, …, p_K) and E(s_k) ≡ E_k for simplicity:
L(p, \alpha, \beta) = -\sum_{k=1}^{K} p_k \log p_k - \alpha\left\{\sum_{k=1}^{K} p_k - 1\right\} - \beta\left\{\sum_{k=1}^{K} p_k E_k - \bar{E}\right\}
The first term concerns the maximum entropy demand, the second term ensures the
probabilities sum to one, and the last term ensures that the correct average Ē is obtained. Of course, this Lagrangian function and its solution is formally identical to the one obtained for the Brandeis dice problem in Sect. 1.8.2 and Eq. 1.5. As mentioned before, the solution of the Lagrangian turns out to be an exponential distribution, called the Boltzmann distribution, that is independent of α:

p_k = \frac{1}{Z} \exp(-\beta E_k)
where Z is a normalization factor:
Z = \sum_{k=1}^{K} \exp(-\beta E_k)
In statistical mechanics, Z is called the partition function; the letter stems from the
German Zustandssumme. In physical systems, β has a clear interpretation: it is the inverse of the temperature.3 In addition, β is determined by its relation to Ē:

\bar{E} = \sum_{k=1}^{K} p_k E_k = \frac{1}{Z} \sum_{k=1}^{K} \exp(-\beta E_k)\, E_k
3 Here, we set β = 1/T for simplicity and without loss of generality. In physics, β = 1/(kT), where k is Boltzmann's constant.
F = -\frac{1}{\beta} \log Z
As the partition function Z is a function of p(s_k), the free energy function thus
assigns a value to a probability distribution over all possible microstates. Note that
the energy function E mentioned above assigns an energy value to each microstate,
as opposed to a probability distribution over all microstates.
The Helmholtz free energy has the useful property that the Boltzmann distribution is recovered by minimizing F, for given energies E(s_k), as a function of p(s_k). The free energy F can be expressed in terms of the internal energy U and the entropy S:

U = \sum_k p(s_k)\, E(s_k)

S = -\sum_k p(s_k) \log p(s_k)

F = U - TS
The Helmholtz free energy can be used to obtain much useful information on a
system. For example, the internal energy U can be obtained in the following way:
\frac{\partial (\beta F)}{\partial \beta} = -\frac{\partial \log Z}{\partial \beta} = -\frac{\partial}{\partial \beta} \log \sum_k \exp(-\beta E(s_k)) = U
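These relations are easy to verify numerically for a small toy system (the energies and temperature below are arbitrary):

import numpy as np

E = np.array([0.0, 1.0, 2.0, 5.0])   # illustrative microstate energies
beta = 1.0 / 1.5                      # inverse temperature, T = 1.5 (with k set to 1)

Z = np.sum(np.exp(-beta * E))         # partition function
p = np.exp(-beta * E) / Z             # Boltzmann distribution
U = np.sum(p * E)                     # internal energy
S = -np.sum(p * np.log(p))            # entropy
F = -np.log(Z) / beta                 # Helmholtz free energy

print(np.isclose(F, U - (1 / beta) * S))   # True: F = U - T*S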
A very useful concept for inference purposes is the variational free energy
G [786]. The variational free energy is used to construct approximations, when
working with the Helmholtz free energy directly is intractable. This energy can
be considered as a variant of the Helmholtz free energy F in the presence of some
constraints on the form of the probability distribution, which make the problem more
tractable. In the absence of any constraints, G becomes equal to F. Minimizing the variational free energy corresponds to finding the probability distribution b(x) that respects the constraints, and is 'close' to the unconstrained probability distribution p(x), where x = x₁, …, x_N is discrete and N-dimensional. In practice, b(x) is chosen so that it is a tractable approximation of p(x). For example, one can use the mean field approach, which insists on a fully factorized joint probability distribution for b(x):

p(x) \approx b(x) = \prod_{n=1}^{N} b_n(x_n)
where E(x) is the energy of x; S_b and U_b are the variational entropy and the variational internal energy; and F_p is the Helmholtz free energy. For S_b, U_b and F_p, the subscripts indicate which probability distribution is involved. The variational free energy G_b is then equal to:

G_b = U_b - T S_b
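A compact sketch of the mean-field idea for a toy system of two binary variables (the energy table is made up): the factors b₁ and b₂ are updated in turn, which lowers G_b, and the resulting variational free energy upper-bounds the true Helmholtz free energy.

import numpy as np

T = 1.0
beta = 1.0 / T
E = np.array([[0.0, 2.0],     # E(x1, x2) for two binary variables (illustrative values)
              [2.0, 1.0]])

# Exact Helmholtz free energy, for reference.
Z = np.sum(np.exp(-beta * E))
F = -T * np.log(Z)

# Mean field: b(x) = b1(x1) * b2(x2), updated coordinate-wise to lower G_b.
b1 = np.array([0.5, 0.5])
b2 = np.array([0.6, 0.4])
for _ in range(50):
    b1 = np.exp(-beta * (E @ b2));   b1 /= b1.sum()
    b2 = np.exp(-beta * (E.T @ b1)); b2 /= b2.sum()

b = np.outer(b1, b2)
U_b = np.sum(b * E)
S_b = -np.sum(b * np.log(b))
G_b = U_b - T * S_b
print(G_b, ">=", F)   # the variational free energy upper-bounds the true free energy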
1.9.1 Introduction
Graphical models [62, 126, 345, 568] have their roots in a research field that used to
be called artificial intelligence – often abbreviated to AI. One of the main outcomes
of artificial intelligence were so-called expert systems: these systems combined
a knowledge base from a particular branch of human activity with methods to
34 T. Hamelryck
perform inference. Expert systems for the diagnosis of disease are a classic example.
Here, the knowledge base is a set of diseases and their associated symptoms, and
the inference problem is the actual diagnosis of a patient’s disease based on his
particular symptoms. Such systems were developed in academia in the 1960s, and
were commercialized and applied to practical problems in the 1980s.
Expert systems were initially based on rules that make use of logical deduction:
A is true, therefore, B is true
B is false, therefore, A is false
However, it quickly became clear that systems solely based on deduction were
severely limited. For example, a single unexpected or absent symptom can corrupt
disease diagnosis completely. In addition, it is difficult to deal with unobserved
variables – for example, a patient’s smoking status is unknown – and the set of
rules quickly becomes huge for even simple applications.
Therefore, attention shifted towards expert systems based on logical induction:
B is true, therefore, A becomes more plausible
A is false, therefore, B becomes less plausible
Such expert systems were meant to reason in the face of uncertainty, by assigning
degrees of plausibility or certainty factors to various statements. For that, one
needs an algebra or calculus to calculate the plausibility of statements based on
those of other statements. Various approaches were tried, including fuzzy logic
and belief functions, all of which turned out to exhibit serious inconsistencies and
shortcomings. In retrospect, it is of course clear that such reasoning should be
performed according to the rules of the Bayesian probability calculus, but this was
seen as problematic by the AI community. The first reason was that joint probability
distributions over many variables were seen as intractable. The second reason
was more subtle: at that time the ruling paradigm in statistics was the frequentist
interpretation, which claims it is meaningless to assign probabilities to hypotheses.
Eventually both objections were overcome, thanks to pioneering work done
by researchers such as Finn V. Jensen, Steffen Lauritzen, Judea Pearl and David
Spiegelhalter [126, 345, 568]. The first problem was solved by making use of
conditional independencies in joint probability distributions. The second perceived
problem slowly faded away with the increasing acceptance of the Bayesian interpretation of probability. Finally, the availability of cheap and powerful computers brought
many computationally intensive Bayesian procedures within practical reach.
Graphical models represent a set of random variables and their conditional indepen-
dencies [62, 568]. This representation consists of a graph in which the nodes are the
variables and the edges encode the conditional independencies. The goal is to solve
problems of inference efficiently by taking the structure of the graph into account,
1 An Overview of Bayesian Inference and Graphical Models 35
As mentioned above, the graph of a Bayesian network is directed, that is, its edges
are arrows. In addition, cycles are not allowed: it is not allowed to encounter the
same node twice if one traverses the graph in the direction of the arrows. A directed
arrow points from a parent node to a child node. Let's consider a general joint probability distribution p(a, b, c) over three variables a, b, c. Using the product rule, this joint probability distribution can be factorized in different ways, of which we consider two possibilities:
Each factorization gives rise to a Bayesian network, as shown in Fig. 1.5. This example illustrates how a BN encodes a joint probability distribution. In general, the joint probability distribution encoded by a BN with nodes x_1, ..., x_N factorizes as:

p(x_1, ..., x_N) = ∏_{n=1}^{N} p(x_n | pa(x_n))
where pa(x_n) denotes the parents of node x_n. The conditional probability distributions are most often categorical distributions for discrete nodes and Gaussian distributions for continuous nodes, but many other distributions are of course possible. In practice, a categorical distribution is specified as a conditional probability table (CPT), which tabulates the probability of a child's values, given the values of the parents. For example, consider a discrete node which can adopt two values and that has two parents which can adopt three and four values, respectively. The resulting CPT will be a 4 × 3 × 2 table, with 12 unique parameters because the probabilities need to sum to one. The case of a Gaussian node with one discrete parent is essentially identical to the mixture model described in Sect. 1.3.4.
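To make the CPT bookkeeping concrete, the following sketch (in Python with NumPy; the array shape and random numbers are purely hypothetical) builds such a 4 × 3 × 2 table and checks that each row of child probabilities sums to one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPT for a binary child with two discrete parents
# (4 and 3 possible values respectively): shape (4, 3, 2).
cpt = rng.random((4, 3, 2))
cpt /= cpt.sum(axis=-1, keepdims=True)   # child probabilities sum to one

# 24 entries, but only 4 * 3 * (2 - 1) = 12 free parameters.
print(cpt.shape)
print(cpt.sum(axis=-1))                  # every (parent1, parent2) row sums to 1

# p(child = 1 | parent1 = 2, parent2 = 0)
print(cpt[2, 0, 1])
```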
The example also clarifies that different BNs can give rise to the same probability
distribution: in Fig. 1.5, both BNs correspond to different factorizations of the same
joint probability distribution. In our example, the graph is fully connected: all nodes
are connected to each other. The strength of BNs lies in the fact that one can leave
out connections, which induces conditional independencies in the joint probability
distribution. For example, if a is independent of b given c then:
p(a | b, c) = p(a | c)
p(a, b | c) = p(a | c) p(b | c)
Hence, the number of terms that need to be calculated – in this case 2,000 – is
dramatically lower. The strength of BNs lies in the fact that it is possible to construct
algorithms that perform these clever ways of inference in an automated way. This
will be discussed in Sect. 1.9.6.1.
Dynamic Bayesian networks (DBNs) are Bayesian networks that represent sequen-
tial variables [219, 220]. The name “dynamic” is misleading, as it refers to the
fact that these sequential variables often have a temporal character, such as for
example in speech signals. However, many sequential variables, such as protein
Fig. 1.7 Bayesian network diagram versus an HMM state diagram. (left) An HMM with three slices shown as a Bayesian network diagram. (middle) An example of a possible state diagram for an HMM where the hidden nodes can adopt three possible values (states 1, 2 and 3). The arrows denote non-zero transition probabilities, which are shown next to the arrows. (right) The transition matrix associated with the shown state diagram. This conditional probability table (CPT) specifies p(h_n | h_{n−1}) for all n > 1. Zeros in the matrix correspond to missing arrows in the state diagram
sequences, do not have a temporal character. Thus, there is nothing “dynamic” about
a DBN itself, and the name “sequential Bayesian network” would have been more
appropriate.
Formally, DBNs are identical to ordinary BNs. One of the simplest DBNs is the
hidden Markov model (HMM) [164,594]. An example of an HMM with length three
is shown in Fig. 1.7. Each sequence position, which corresponds to one slice in the
DBN, is represented by one hidden or latent node h that has one observed node x as
child. The use of a Markov chain of hidden nodes is a statistical “trick” that creates
conditional dependencies along the whole length of the observed sequence – and not
just between consecutive slices as is often erroneously claimed [757,794,795]. This
can be clearly observed when considering the marginal probability of the observed
sequence x, with length N , according to an HMM:
p(x) = Σ_h p(x, h) = Σ_h p(x_1 | h_1) p(h_1) ∏_{n=2}^{N} p(x_n | h_n) p(h_n | h_{n−1})
where the transition probabilities are shared between slices, that is, p(h_{n+1} = a | h_n = b) is the same for all n and for all possible states a and b. Why are two slices needed, instead of one? Consider
node h1 at the start of the HMM. This node has no parents, and thus its conditional
probability distribution will have a different number of parameters than the –
shared – probability distribution of the consecutive hidden nodes, which do have
a parent. Hence, we need to specify a starting slice and one consecutive slice to
specify the model fully.
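As an illustration of the marginal probability given above, the following sketch evaluates p(x) for a toy discrete HMM by brute-force summation over all hidden paths; the transition matrix, emission probabilities and observed sequence are made-up numbers, and in practice the forward algorithm mentioned below (Sect. 1.9.7) replaces this exponential-time enumeration.

```python
import itertools
import numpy as np

# A toy discrete HMM (hypothetical numbers): 3 hidden states, 2 symbols.
p_h1  = np.array([0.6, 0.3, 0.1])                 # p(h_1)
trans = np.array([[0.7, 0.3, 0.0],                # p(h_n | h_{n-1})
                  [0.0, 0.6, 0.4],
                  [0.5, 0.0, 0.5]])
emit  = np.array([[0.9, 0.1],                     # p(x_n | h_n)
                  [0.2, 0.8],
                  [0.5, 0.5]])

x = [0, 1, 1]                                     # an observed sequence

# Brute-force marginal: sum over all hidden paths h = (h_1, ..., h_N).
p_x = 0.0
for h in itertools.product(range(3), repeat=len(x)):
    p = p_h1[h[0]] * emit[h[0], x[0]]
    for n in range(1, len(x)):
        p *= trans[h[n - 1], h[n]] * emit[h[n], x[n]]
    p_x += p
print(p_x)
```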
The conditional probability distribution p(h_{n+1} | h_n) can be interpreted as the probability of evolving from one state at position n to another state at position n + 1. If the sequence position n is interpreted as an indicator of time, it is thus natural to represent p(h_{n+1} | h_n) as a state diagram. Such a diagram is shown in Fig. 1.7. Here, the nodes do not represent random variables, but states; that is, values adopted by the hidden nodes; arrows do not represent conditional independencies, but connect states for which p(h_{n+1} | h_n) is larger than zero. Figure 1.7 also gives the corresponding conditional probability table p(h_{n+1} | h_n) for the given state diagram: a zero in the table corresponds to an arrow that is absent. Given the superficial similarity of a DBN graph and the HMM state diagram, it is important to understand the distinction.
Another interesting conceptual difference is that "structure learning" in HMMs corresponds to inference of the conditional probability distribution p(h_{n+1} | h_n) in DBNs. Typically, the state diagram of an HMM, which specifies the possible state transitions but not their probabilities, is decided on before parameter estimation. Inference of the state diagram itself, as opposed to inference of the transition probabilities, is called structure learning in the HMM community; see for example [770] for structure learning in secondary structure prediction. However, in DBNs, the possible transitions, which correspond to non-zero entries in the conditional probability distribution p(h_{n+1} | h_n), are typically inferred from the data during parameter estimation. In HMMs, parameter estimation and structure learning are often considered as distinct tasks, and the state diagram is often decided using "prior insights". It is of course possible to include prior information on the properties of p(h_{n+1} | h_n) for DBNs, up to the point of specifying the possible "state transitions", just as is typical for HMMs.
Numerous extensions to the basic HMM architecture are possible, including higher order Markov models, where hidden nodes at slice n + 1 are not only connected to those in slice n, but also to one or more previous slices. Such models
quickly become intractable [219,220]. The great advantage of considering HMMs as
Bayesian networks is a unified view of standard HMMs and their numerous variants.
There is no need to develop custom inference algorithms for every variant under
this view. Inference and learning in DBNs and HMMs will be briefly discussed in
Sects. 1.9.6 and 1.9.7.
Bayesian networks are generative models; they provide a full joint probability distribution over all
variables, and can thus be used to generate “artificial data”. In addition, several
important sampling operations are trivial in BNs.
Generating samples from the joint probability distribution encoded in the BN can
be done using ancestral sampling [62]. In order to do this, one first orders all the
nodes such that there are no arrows from any node to any lower numbered node.
In such an ordering, any node will always have a higher index than its parents.
Sampling is initiated starting at the node with the lowest index (which has no
parents), and proceeds in a sequential manner to the nodes with higher indices.
At any point in this procedure, the values of the parents of the node to be sampled
are available. When the node with the highest index is reached, we have obtained a
sample from the joint probability distribution. Typically, the nodes with low indices
are latent variables, while the nodes with higher indices are observed.
Ancestral sampling can easily be illustrated by sampling in a Gaussian mixture
model. In such a model, the hidden node h is assigned index zero, and the observed
node o is assigned index one. Ancestral sampling thus starts by sampling a value for the hidden node from p(h): a random number r in [0, 1] is generated, and a value h for the hidden node is sampled according to:

h = argmin_{h'} { Σ_{i=1}^{h'} p(i) ≥ r }    (1.8)
This amounts to sampling from a multinomial distribution. For most BNs with
nodes that represent discrete variables, this will be the most common node type
and associated sampling method. Next, we sample a value for the observed node o,
conditional upon the sampled values of its parents. In a Gaussian mixture model, this conditional distribution is a Gaussian distribution with mean μ_h and standard deviation σ_h specified by h. Hence, we sample a value y from the Gaussian distribution N(y | μ_h, σ_h), and we are done. The pair (y, h) is a sample from the joint probability distribution represented by the Gaussian mixture model.
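A minimal sketch of ancestral sampling in this two-node Gaussian mixture could look as follows (Python/NumPy; the mixture weights, means and standard deviations are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-component Gaussian mixture.
p_h   = np.array([0.3, 0.7])       # p(h)
mu    = np.array([-2.0, 3.0])      # mean for each hidden value
sigma = np.array([0.5, 1.0])       # standard deviation for each hidden value

def ancestral_sample():
    # Sample the parentless hidden node first (inverse-CDF sampling, Eq. 1.8).
    r = rng.random()
    h = int(np.searchsorted(np.cumsum(p_h), r))
    # Then sample the observed node conditional on its parent.
    y = rng.normal(mu[h], sigma[h])
    return y, h

samples = [ancestral_sample() for _ in range(5)]
print(samples)
```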
Ancestral sampling can also be used when some nodes are observed, as long as
all observed nodes either have no parents, or only observed parents. The observed
nodes are simply clamped to their observed value instead of resampled. However,
if some observed nodes have one or more unobserved parents, sampling of the
parent nodes needs to take into account the values of the observed children, which
cannot be done with ancestral sampling. In such cases, one can resort to Monte
Carlo sampling techniques [62, 223] such as Gibbs sampling [212], as discussed in
Sect. 1.9.6.2 and Chap. 2.
Markov random fields or Markov networks [54, 62] are graphical models in which
the graph is undirected: there is no direction associated with the edges. In general,
such models are much harder to work with than BNs, and problems of inference
easily become intractable. In an MRF, the joint probability distribution is a normalized
product of potential functions, which assign a positive value to a set of nodes. Each
potential function is associated with a maximal clique in the graph; a clique is a
subgraph of the MRF in which all nodes are connected to each other, and a maximal
clique ceases to be a clique if any node is added. For the MRF in Fig. 1.8, the joint
probability distribution is given by:
p(a, b, c, d) = (1/Z) ψ_{acd}(a, c, d) ψ_{ab}(a, b)
The normalization factor Z is called the partition function, and is equal to:
Z = Σ_a Σ_b Σ_c Σ_d ψ_{acd}(a, c, d) ψ_{ab}(a, b)
Many models that arise in statistical physics, such as the Ising model, can be readily
interpreted as MRFs.
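For a small binary MRF, the partition function Z can be computed by explicit enumeration, as in the following sketch (the potential tables are random, purely illustrative stand-ins for ψ_{acd} and ψ_{ab}); for larger graphs this summation is exactly what becomes intractable.

```python
import itertools
import numpy as np

# Hypothetical binary MRF with maximal cliques {a, c, d} and {a, b} (cf. Fig. 1.8).
rng = np.random.default_rng(2)
psi_acd = rng.random((2, 2, 2))    # positive potential on the clique (a, c, d)
psi_ab  = rng.random((2, 2))       # positive potential on the clique (a, b)

# Partition function Z by brute-force summation over all joint states.
Z = sum(psi_acd[a, c, d] * psi_ab[a, b]
        for a, b, c, d in itertools.product((0, 1), repeat=4))

def p(a, b, c, d):
    """Normalized joint probability p(a, b, c, d)."""
    return psi_acd[a, c, d] * psi_ab[a, b] / Z

# The probabilities sum to one.
print(sum(p(*s) for s in itertools.product((0, 1), repeat=4)))
```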
Factor graphs provide a general framework for both Bayesian networks and Markov random fields [62, 402]. Factor graphs consist of two node types: factor nodes and variable nodes. The factor nodes represent (positive) functions that act on a subset of the variable nodes. The edges in the graph denote which factors act on which
variables. The joint probability distribution encoded in a factor graph is given by:
p(x) = (1/Z) ∏_k f_k(x_k)
For example, the factor graph in Fig. 1.9 represents the following joint probability
distribution:
p(a, b, c) = (1/Z) f_1(b) f_2(a, b, c) f_3(a, c)
To represent a BN as a factor graph, each node in the BN is represented by a variable
node (just as in the case of the MRF). Then, each conditional distribution in the
BN gives rise to one factor node, which connects each child node with its parents
(Fig. 1.10). An MRF can be represented as a factor graph by using one variable node
in the factor graph for each node in the MRF, and using one factor node in the factor
graph for each clique potential in the MRF. Each factor node is then connected to
the variables in the clique.
The goal of inference is often – but not always – to estimate p(h | x, θ), the distribution of the hidden nodes h given the observed nodes x and the model parameters θ. In other cases, one wants to find the hidden node values that maximize p(h | x, θ). In some cases, the goal of inference is to obtain quantities such as entropy or mutual information, which belong to the realm of information theory.
Belief propagation algorithms are complex, especially when the BN contains nodes
that are not discrete or Gaussian. An attractive alternative is to turn to sampling
methods, which are typically easy to implement and general due to the independence
relations that are encoded in the graph structure of a graphical model [62, 223]. We
will briefly discuss Gibbs sampling in BNs.
Gibbs sampling [212] in BNs makes use of Markov blanket sampling [62]. The
goal of Markov blanket sampling is to sample a value for one node, conditional
upon the values of all the other nodes. An important application of this type of
sampling is to generate samples when some node values are observed. In that case,
ancestral sampling typically does not apply as it assumes that all observed nodes
have observed or no parents. In Gibbs sampling, we sample a value for node x_i, conditional upon the values of all other nodes, {x_1, ..., x_N} \ x_i:

x_i ∼ p(x_i | {x_1, ..., x_N} \ x_i)
For a BN with N discrete nodes, this distribution can be written as:

p(x_i | {x_1, ..., x_N} \ x_i) = p(x_1, ..., x_N) / Σ_{x_i} p(x_1, ..., x_N)
                               = ∏_{n=1}^{N} p(x_n | pa(x_n)) / Σ_{x_i} ∏_{n=1}^{N} p(x_n | pa(x_n))

where the sum in the denominator runs over all possible values of x_i.
It can easily be seen that all factors that do not contain x_i can be taken out of the sum, and cancel in the numerator and the denominator. The remaining factors are p(x_i | pa(x_i)) and any factor in which the node x_i itself is the parent of a node. In simple words, Gibbs sampling of a node depends on the node's parents, the node's children and the parents of the node's children. This set of nodes is called the
Markov blanket of a node. For example, the Markov blanket of node b in Fig. 1.11 is {a, c, f, g}. Here, a and c are parents; f is a child; and g is a parent of the child. Gibbs sampling for node b is thus done by sampling from the following distribution:

b ∼ p(b | a, c) p(f | b, g)
For a discrete BN, one simply loops through all values of node b ∈ {1, ..., B} and calculates the value of the expression above. One then obtains a vector of probabilities {p(b = 1), ..., p(b = B)} by normalization. Finally, a value for node b is obtained by sampling from the normalized probabilities according to Eq. 1.8.
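The following sketch illustrates one such Markov blanket sampling step for node b (Python/NumPy; the CPTs and the number of values B are hypothetical, and only the two factors that involve b are used).

```python
import numpy as np

rng = np.random.default_rng(3)
B = 3   # number of values node b can adopt

# Hypothetical CPTs involving b (cf. Fig. 1.11): p(b | a, c) and p(f | b, g).
p_b_given_ac = rng.random((2, 2, B)); p_b_given_ac /= p_b_given_ac.sum(-1, keepdims=True)
p_f_given_bg = rng.random((B, 2, 2)); p_f_given_bg /= p_f_given_bg.sum(-1, keepdims=True)

def gibbs_step_b(a, c, f, g):
    # Unnormalized probability of each value of b given its Markov blanket.
    w = np.array([p_b_given_ac[a, c, b] * p_f_given_bg[b, g, f] for b in range(B)])
    w /= w.sum()                                   # normalize
    r = rng.random()
    return int(np.searchsorted(np.cumsum(w), r))   # sample via Eq. 1.8

print(gibbs_step_b(a=1, c=0, f=1, g=0))
```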
The disadvantage is that sampling methods can take a long time to converge,
and that the diagnosis of convergence itself is difficult. Popular sampling methods
include Gibbs sampling, Metropolis-Hastings sampling and importance sampling
[223]. These and other much more refined Monte Carlo methods are explained in detail in Chap. 2.
So far we have assumed that the parameters of the graphical model are known, and
that we are interested in inferring the values of the hidden variables, or in integrating
out hidden nuisance variables to obtain the probability of the observed variables.
Naturally, in most cases we also need to estimate the values of the parameters
themselves. We point out again that from a Bayesian point of view, there is no
distinction between inference of hidden node values and inference of the parameters
of the conditional probability distributions, the potentials or the factors. However,
in practice, these two tasks often give rise to different algorithms and methods
[62, 219, 223].
In most cases of parameter learning, only a subset of nodes is observed, which
complicates parameter estimation. Specifically, in parameter estimation we are
interested in:

p(θ | d) = Σ_h p(θ, h | d)
The sum becomes an integral in the continuous case. The fact that we need to
integrate the hidden variables away complicates things. It is impossible to give an
overview of all possible learning methods for graphical models; for many models,
parameter estimation is a topic of intense research. In this section, we will discuss
two of the most popular learning algorithms for Bayesian networks: ML estimation
using Expectation Maximization (EM) [144, 219, 353] and approximate Bayesian
estimation using Gibbs sampling [62, 212, 223].
Expectation maximization corresponds to simple maximum likelihood estima-
tion of the parameters, in the presence of hidden nodes.
θ_ML = arg max_θ Σ_h p(h, d | θ)
The E-step can be performed in several ways. In tractable models such as the
HMM, it is possible to calculate the probability distribution over the hidden nodes,
which is done using the well known forward-backward algorithm [164, 594]. The
resulting distribution is then used to update the parameters of the HMM in the M-
step. For HMMs, the EM algorithm is called the Baum-Welch algorithm [62, 164,
594].
For many Bayesian networks, the probability distribution of the hidden variables
can only be approximated, as deterministic methods quickly become intractable.
In general Bayesian networks, one can approximate this distribution by sampling,
as described in Sect. 1.9.6.2. The E-step thus consists of Gibbs sampling, which
corresponds to blanket sampling in BNs. The sampled values are then used in the M-
step. Such an EM algorithm is called a Monte Carlo EM algorithm [223]. Typically,
one approximates the distribution over the hidden nodes using a large number of samples.
However, it is also possible to draw a single value for each hidden node, thus effectively creating a "completed" dataset, where all the values of the hidden nodes are explicitly filled in. Such a version of the EM algorithm is called stochastic EM (S-EM) [223]. The S-EM method has many advantages. Classic EM and Monte
Carlo EM algorithms are notorious for getting stuck in local maxima or saddle
points; the S-EM method is much less prone to this problem [544]. In addition,
S-EM is extremely computationally efficient, as one only needs to sample a single
value for each hidden node.
Once the completed dataset is obtained, parameter estimation becomes trivial: the
parameters of the conditional probability distributions associated with the nodes and
their parents are simply estimated by ML or MAP estimation. In the case of a CPT,
for example, all occurrences of a node's value together with its parents' values in the completed data are collected in a suitable table of counts, which is subsequently normalized into a table of probabilities. We will encounter the S-EM algorithm again
in Chap. 10, where it is used to estimate the parameters of a probabilistic model of
the local structure of proteins.
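A minimal S-EM sketch for a two-component Gaussian mixture might look as follows (Python/NumPy, with synthetic data and made-up initial values); each iteration completes the data set by drawing a single hidden value per observation and then re-estimates the parameters by ML from the completed data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data from a hypothetical two-component Gaussian mixture.
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# Initial guesses for the parameters.
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # S-step: draw a single hidden value per data point, completing the data set.
    resp = w * np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2) / sigma
    resp /= resp.sum(axis=1, keepdims=True)
    h = (rng.random(len(data)) > resp[:, 0]).astype(int)   # 0 or 1
    # M-step: ML estimation on the completed data.
    for k in (0, 1):
        sel = data[h == k]
        w[k], mu[k], sigma[k] = len(sel) / len(data), sel.mean(), sel.std()

print(w, mu, sigma)
```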
The EM algorithm provides a ML estimate of the parameters, and makes the –
rather artificial – distinction between parameters and hidden variables. An analytic
Bayesian treatment is intractable for most models, but good approximations can be
obtained using stochastic methods based on sampling, typically using Markov chain
Monte Carlo methods [223]. Note that sampling is also used in the S-EM algorithm,
but in that case one still ends up with a ML point estimate of the parameters. Here
we want to approximate the probability distribution p(θ | d) by a set of samples.
A classic way to do this is by Gibbs sampling [62, 212, 223]. Here one creates
a BN where the prior distributions are explicitly represented as nodes with given
parameters. Gibbs sampling, and other Markov chain Monte Carlo methods that can
be used for the same purpose, are discussed at length in Chap. 2.
1.10 Conclusions
This introductory chapter provides a bird’s eye view on Bayesian inference and
graphical models, and points out some interesting references for further reading.
We end with mentioning some books that we found particularly useful. A short,
accessible introduction to Bayesian statistics can be found in Lee’s Bayesian
statistics [428]. Bernardo and Smith’s Bayesian theory [48], and Robert’s The
Bayesian choice [606] provide more in-depth discussion; the latter from the point
of view of decision theory. Jaynes’ classic book Probability theory: The logic of
science [342] provides a thorough defense of the Bayesian view, and presents many
examples of the paradoxes that arise when alternative views are adopted. Lemm’s
Bayesian field theory [433] gives a physicist's view on Bayesian statistics. Bishop's Pattern recognition and machine learning [62] presents a timely, Bayesian view
on machine learning methods, including graphical models and belief propagation.
Introductory books on Bayesian networks include Pearl’s seminal Probabilistic
reasoning in intelligent systems [568], Neapolitan’s Learning Bayesian networks
[538] and Cowell, Dawid, Lauritzen and Spiegelhalter’s Probabilistic networks and
expert systems [126]. Jordan’s Learning in graphical models [353] is one of the
few books that addresses approximate algorithms. Zuckerman’s Statistical physics
of biomolecules [801] and Dill and Bromberg’s Molecular driving forces [150] are
excellent introductions to statistical physics, with a firm basis in probability theory
and a focus on biomolecules.
Finally, the fascinating history of the rise, fall and resurrection of the Bayesian
view of probability is presented in a very accessible manner in McGrayne’s The
theory that would not die [494], while Stigler’s The history of statistics: The
measurement of uncertainty before 1900 [692] provides a scholarly view on its
emergence.
Chapter 2
Monte Carlo Methods for Inference
in High-Dimensional Systems
Jesper Ferkinghoff-Borg
2.1 Introduction
Modern Monte Carlo methods have their roots in the 1940s when Fermi, Ulam,
von Neumann, Metropolis and others began to use random numbers to examine
various problems in physics from a stochastic perspective [118, 413]. Since then,
these methods have established themselves as powerful and indispensable tools in
most branches of science. In general, the MC-method represents a particular type of
numerical scheme based on random numbers to calculate properties of probabilistic
models which cannot be addressed by analytical means. Its widespread use derives from its versatility and ease of implementation, and its scope of application has expanded considerably due to the dramatic increase in accessible computer power over the last two to three decades.
Chain Monte Carlo method (MCMC) as a tool for inference in high-dimensional
probability models, with special attention to the simulation of bio-macromolecules.
In Sect. 2.2.1 we briefly review the conceptual relations between probabilities,
partition functions, Bayes factors, density of states and statistical ensembles. We
outline central inference problems in Sect. 2.2.2, which will be the focus of the remainder of the chapter. This concerns the calculation of expectation values, marginal
distributions and ratios of partition functions. In particular, we discuss why these
calculations are not tractable by analytical means in high-dimensional systems.
Basic concepts from sampling theory are delineated in Sect. 2.3. Section 2.4
focuses on the MCMC-method with emphasis on the central detailed balance equa-
tion, the Metropolis-Hastings algorithm [293, 501] (Sect. 2.4.1), Gibbs sampling
[212] (Sect. 2.4.2) and specific concerns regarding the use of the MCMC-method
for continuous degrees of freedom (Sect. 2.4.3). In Sect. 2.5 we discuss how the
MCMC-simulation can be used to address the inference problems outlined in
where the weights, ω(x), are assumed to be known explicitly. In graphical models in general, and in Markov random fields in particular, ω would represent the product of the potential functions defining the model [62]. Part IV of this book includes several examples of such models for protein and RNA structures.
In statistical physics, ω is a particular function of the total energy, E(x), of the system and possibly other extensive variables as well, see Table 2.1. With respect to the table, the main focus will be on the microcanonical and canonical ensemble. For sampling and inference in other ensembles we refer to the excellent overviews given in the textbooks [5, 198, 413]. The interest in the first ensemble derives from the fact that all thermodynamic potentials can be calculated once the number of states Γ (or the density of states g in the continuous case) is known. As we shall see, many advanced Monte Carlo techniques also make direct use of Γ to enhance the efficiency of the sampling. The interest in the second ensemble is due to the fact that any probability distribution can be brought to the canonical form by a suitable (re-)definition of the energy function. For a continuous system this ensemble is defined by the probability density distribution
p_β(x) = exp(−βE(x)) / Z_β,    Z_β = ∫_Ω exp(−βE(x)) dx,    (2.2)
where β = (kT)^{−1} is the inverse temperature and k is the Boltzmann constant. For notational convenience we have suppressed the dependence on V and N in Eq. 2.2, compared to the expression given in Table 2.1. The weights in the canonical ensemble, ω_β(x) = exp(−βE(x)), are referred to as the Boltzmann weights. Note that the other statistical ensembles in Table 2.1 can be brought on the canonical form by a change of energy function. In the isobaric-isothermal ensemble, for example, the enthalpy function, H = E + pV, leads to the Boltzmann weights, ω_β(x) = exp(−βH(x)).
Comparing Eq. 2.2 to the general form, Eq. 2.1, one observes that by simply associating an "energy", E, to a state x according to the probability weights,

E(x) = −ln[ω(x)],    (2.3)

and subsequently setting β = 1, the two distributions become identical. In keeping
with this construction, we will in the following refer to both Eqs. 2.2 and 2.1 as the
canonical distribution, in distinction to various modified or extended ensembles to
be discussed later. Similarly, we shall refer to “energies” and “temperatures” without
limiting ourselves to distributions of thermal nature.
As indicated in Table 2.1, the partition functions, Z, play a central role in thermal
physics due to their relation to thermodynamic potentials. In Bayesian modeling, Z
corresponds to the evidence of the model, a quantity which forms an essential part of
Table 2.1 The most commonly encountered distributions in statistical mechanics, assuming a discrete state space. Here, x represents a specific microstate of the physical system, and Ω is defined from the thermodynamical variables specifying the boundary conditions. The minimum set of relevant conjugated pairs of thermodynamic variables typically considered is {(E, β), (V, p), (N, μ)}, where E is the energy, β = (kT)^{−1} is the inverse temperature, k is the Boltzmann constant, V is the volume, p is the pressure, N is the particle number and μ is the chemical potential. Note the special role played by the number of states, Γ(E, V, N), the knowledge of which allows the calculation of all thermodynamic potentials. As further discussed in the text, the continuous version of these ensembles is obtained by replacing Γ with the density of states g and summation with integrals. Consequently, p(x) becomes a probability density in Ω

Microcanonical ensemble
  Definition          Fixed E, V, N
  Domain              Ω = {x | E(x) = E, V(x) = V, N(x) = N}
  Partition function  Γ(E, V, N) = Σ_{x∈Ω} 1
  Potential           Entropy: S(E, V, N) = k ln Γ
  Distribution        p(x) = 1/Γ

Canonical ensemble
  Definition          Fixed β, V, N
  Domain              Ω = {x | V(x) = V, N(x) = N}
  Partition function  Z(β, V, N) = Σ_{x∈Ω} exp(−βE(x)) = Σ_E Γ(E, V, N) exp(−βE)
  Potential           Helmholtz free energy: F(β, V, N) = −β^{−1} ln Z
  Distribution        p(x) = exp(−βE(x)) / Z

Isobaric-isothermal ensemble
  Definition          Fixed β, p, N
  Domain              Ω = {x | N(x) = N}
  Partition function  Y(β, p, N) = Σ_{x∈Ω} exp(−βE(x) − βpV(x)) = Σ_{E,V} Γ(E, V, N) exp(−βE − βpV)
  Potential           Gibbs free energy: G(β, p, N) = −β^{−1} ln Y
  Distribution        p(x) = exp(−βE(x) − βpV(x)) / Y
considerations it is not Z per se that is of interest but rather the ratio, Z/Z_*, to some other partition function Z_*, and these ratios are tractable – as we shall see. Here, p_* is a particular probability distribution defined on the same domain Ω,

p_*(x) = ω_*(x) / Z_*,    Z_* = ∫_Ω ω_*(x) dx.    (2.4)
Bayesian model comparison, for instance, involves two distributions, p and p_*, and their relative evidence is precisely given by the Bayes factor Z/Z_*. The evidence function of a single model can also be made tractable by expressing it as a partition function ratio, Z/Z_*, provided that the normalization of the likelihood function is known. This is accomplished by identifying p_*(x) with the prior distribution over the model parameters x and p(x) with the posterior distribution, p(x) = p(x|D), where D are the acquired data. The posterior probability weights ω become
where the manifold Ω is non-Euclidean, implying that the proper integration measure is non-uniform, ω_* ≠ const.
Provided the correctness of the uniform measure, Γ_tot can be identified with the total number of states in Ω. Similarly, the density of states g(E) is given as

g(E) = ∫_Ω δ(E − E(x)) dx,    (2.7)
where δ(·) is the Dirac delta-function and E(x) is the physical energy function. The analogue of the discrete microcanonical partition function, Γ (see Table 2.1), then becomes the statistical weight [440], Γ(E), defined as

Γ(E) = ∫_{Δ_E} g(E') dE' ≈ g(E) ΔE,    Δ_E = [E − ΔE/2, E + ΔE/2],    (2.8)

where the last expression holds true when ΔE is chosen sufficiently small.
In order to treat inference in thermal and non-thermal statistical models at a unified level, even when the latter models involve non-uniform reference distributions, it will prove convenient to generalize the definition of the non-thermal "energy" associated with a state x, Eq. 2.3, according to

E(x) = −ln[ω(x)/ω_*(x)].    (2.9)

Note that when p_* is identified with the prior distribution in a Bayes model, Eq. 2.9 is the "energy" associated with the likelihood function. Similarly, if a uniform reference distribution is used, Eq. 2.9 reduces to the original definition, Eq. 2.3.
The first inference problem we shall consider is the calculation of the expectation value of some function f(x) with respect to p,

E_p[f(x)] = ∫_Ω f(x) p(x) dx.    (2.10)
The second, more elaborate inference problem is to determine the full marginal distribution, p_f(y), defined as

p_f(y) = ∫_Ω δ(f(x) − y) p(x) dx.    (2.11)
Since E_p[f(x)] = ∫_R p_f(y) y dy, it is straightforward to calculate the expectation value once p_f(y) is known. For a multi-valued function f = (f_1, ..., f_d)^T the corresponding formula becomes

p_f(y) = ∫_Ω ∏_{i=1}^{d} δ(f_i(x) − y_i) p(x) dx.    (2.12)
The final inference problem we wish to address is how to estimate the ratio of partition functions, Z/Z_*, as presented in Sect. 2.2.1.
Common to all three problems outlined above is that the calculation involves an integral over Ω. If the dimension, D, of Ω is low, analytical approaches or direct numerical integration schemes can be used. However, in most cases of interest D is large, which makes these approaches unfeasible. While approximate inference is partly tractable with deterministic approaches such as variational Bayes and belief propagation [62], as further discussed in Chap. 11, the Monte Carlo based sampling approach offers a particularly versatile and powerful computational alternative for solving inference problems for high-dimensional probability distributions.
As an introduction to Monte Carlo methods in general and the Markov Chain Monte Carlo method (MCMC) in particular, let us first focus on Eq. 2.10 for the average of a given function, f. The simplest possible stochastic procedure for evaluating the integral over Ω involves an unbiased choice of points, {x_i}_{i=1}^{N}, in the state space and the use of

E_p[f(x)] ≈ (1/N) Σ_{i=1}^{N} p(x_i) f(x_i).    (2.13)
This procedure is known as random sampling or simple sampling. However, this procedure is highly inefficient since p(x) typically varies over many orders of magnitude in Ω, so there is no guarantee that the region of importance for the average will be sampled at all. Furthermore, it is often the case that p(x) can only be evaluated up to a normalization constant, cf. Eq. 2.1, leaving the expression on the r.h.s.
E_p[f(x)] ≈ (Z_q/Z) (1/N) Σ_{i=1}^{N} r_i f(x_i),    (2.16)

where r_i = ω(x_i)/ω_q(x_i) are known as the importance weights. The ratio Z/Z_q can be evaluated as

Z/Z_q = (1/Z_q) ∫_Ω ω(x) dx = ∫_Ω [ω(x)/ω_q(x)] q(x) dx ≈ (1/N) Σ_{i=1}^{N} r_i.

Consequently,

E_p[f(x)] ≈ Σ_{i=1}^{N} r̃_i f(x_i),    r̃_i = r_i / Σ_j r_j = [ω(x_i)/ω_q(x_i)] / Σ_j [ω(x_j)/ω_q(x_j)].
Note that in the case when q(x) = p(x), Eq. 2.16 reduces to a simple arithmetic average

E_p[f(x)] ≈ (1/N) Σ_{i=1}^{N} f(x_i).    (2.17)
The efficiency of importance sampling relies crucially on how well q(x) approximates p(x). Often q(x) will be much broader than p(x), and hence the set of importance weights {r_i} will be dominated by a few weights only. Consequently, the effective sample size can be much smaller than the apparent sample size N. The problem is even more severe if q(x) is small in regions where p(x) is large. In that case the apparent variances of r_i and r_i f(x_i) may be small even though the estimate of the expectation is severely wrong. As for random sampling, importance sampling thus has the potential to produce results that are arbitrarily in error with no diagnostic indication [62].
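The following sketch shows self-normalized importance sampling for a one-dimensional toy problem (the target weights w(x) = exp(−x⁴), the Gaussian proposal and the function f(x) = x² are all arbitrary choices); the effective sample size computed from the normalized weights gives a rough indication of the problem described above.

```python
import numpy as np

rng = np.random.default_rng(5)

# Unnormalized target weights w(x) = exp(-x**4); broad Gaussian proposal q(x).
def w(x):     return np.exp(-x ** 4)
def q_pdf(x): return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, N)          # samples from the proposal
r = w(x) / q_pdf(x)                  # importance weights r_i
r_tilde = r / r.sum()                # normalized weights (self-normalized form)

estimate = np.sum(r_tilde * x ** 2)  # estimate of E_p[f(x)] with f(x) = x**2
ess = 1.0 / np.sum(r_tilde ** 2)     # effective sample size
print(estimate, ess)
```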
The merit of the importance sampling method in its iterative form is the ease with which, in principle, any distribution p(x) can be sampled. These algorithms can broadly be divided into two categories: the population Monte Carlo algorithms and the Markov Chain Monte Carlo (MCMC) algorithms. This distinction is not strict, however, as the individual iteration steps of population MC-methods may involve a Markov chain type of sampling, and conversely MCMC-methods may involve the use of a "population" of chains.
We will not be concerned with the population Monte Carlo algorithms any further, but simply give a list of relevant references. The population Monte Carlo method goes by many different names depending on the specific context: "quantum Monte Carlo" [634], "projection Monte Carlo" [104], "transfer-matrix Monte Carlo" [545, 648] or "sequential Monte Carlo" ([155] and references therein). The method has also been developed for polymer models by different groups ([210, 250, 251, 553]). Most notable is the PERM algorithm developed by Grassberger et al. [250, 251], which has been shown to be very efficient for computing finite-temperature properties of lattice polymer models [194]. Finally, the population Monte Carlo method has interesting analogies to Genetic Algorithms [304].
One recent variant of the Population Monte Carlo method which deserves special
attention is the nested sampling by Skilling [666, 669]. This sampling, developed
in the field of applied probability and inference, provides an elegant and straightforward approximation for the partition function, and calculation of expectation values becomes a simple post-processing step. Therefore, the inference problems outlined in Sect. 2.2.2 can all be addressed by the nested sampling approach. In this respect, it offers the same merits as the extended ensemble techniques to be discussed in Sects. 2.7 and 2.8. Furthermore, nested sampling involves only a few free parameters, the setting of which seems considerably less involved than the parameter setting in extended ensembles [562]. On the other hand, nested sampling relies on the construction of a sequence of states with strictly decreasing energies. While the method compares favorably to the popular parallel-tempering extended ensemble for Lennard-Jones cluster simulations [562] and has also found use in the field of astrophysics [181], it is not yet clear whether the restriction of this sequence construction can be effectively compensated by the choice of the population size for systems with more complicated energy landscapes. We shall briefly return to the method in Sect. 2.7.4 and to this discussion in Sect. 2.10.
The principle of the Markov chain Monte Carlo (MCMC) algorithm is to represent the probability distribution, p(x), by a chain of states {x_t}_t. Formally, a (time-independent) Markov chain is a sequence of stochastic variables/vectors {X_t}_t, for which the probability distribution of the t'th variable only depends on the preceding one [77]. While the MCMC-method can be applied equally well to continuous state spaces (see Sect. 2.4.3), we shall in the following assume Ω to be discrete for notational convenience. Consequently, the stochastic process in Ω is fully specified by a fixed matrix, W, of transition probabilities. Here, each matrix element, W(x'|x), represents the conditional probability that the next state becomes x' given that the current state is x. This matrix must satisfy

Σ_{x'} W(x'|x) = 1,    W(x'|x) ≥ 0.    (2.18)
Let p_0(x) be the probability distribution of the initial state. Starting the Markov chain from a unique configuration, x_0, the probability distribution will be 1 for this particular state and 0 otherwise. The distribution after t steps will then be given by

p_t(x) = Σ_{x'} W^t(x|x') p_0(x'),    (2.19)

where W^t represents the t'th power of W viewed as a matrix in the space of configurations. The so-called Perron-Frobenius theorem of linear analysis guarantees that a unique invariant limit distribution exists,

p_∞(x) = lim_{t→∞} p_t(x) = lim_{t→∞} Σ_{x'} W^t(x|x') p_0(x'),    (2.20)
The Perron-Frobenius theorem states that when the normalization (2.18) and these two conditions are fulfilled, the maximum eigenvalue of W will be 1, with p_∞(x) as the corresponding unique eigenvector:

Σ_{x'} W(x|x') p_∞(x') = p_∞(x).    (2.21)
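The convergence expressed by Eqs. 2.19–2.21 can be illustrated numerically with a small, hypothetical transition matrix: repeated application of W to an initial point distribution approaches the eigenvector of W with eigenvalue one.

```python
import numpy as np

# A hypothetical 3-state transition matrix W, with columns indexed by the current
# state x and rows by the next state x', so that each column sums to one.
W = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])

p = np.array([1.0, 0.0, 0.0])     # start from a unique configuration (Eq. 2.19)
for t in range(50):
    p = W @ p                     # p_t(x) = sum_x' W(x|x') p_{t-1}(x')
print(p)                          # approaches the invariant distribution p_inf

# p_inf is the eigenvector of W with eigenvalue 1 (Eq. 2.21).
vals, vecs = np.linalg.eig(W)
p_inf = np.real(vecs[:, np.argmax(np.real(vals))])
print(p_inf / p_inf.sum())
```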
The power of the Markov Chain Monte Carlo method is that it offers a simple yet general recipe for generating samples from any distribution q(x), as defined through the limiting behavior of the Markov chain, q(x) = p_∞(x). As before, we will focus on the case where the aim is to draw samples from a predefined probabilistic model, p(x), i.e. we set q(x) = p(x). From Eqs. 2.21 and 2.18 it follows that a sufficient requirement to ensure that the sampled distribution converges to the equilibrium distribution, p_∞(x) = p(x), is that the transition matrix elements satisfy the detailed balance equation,

W(x'|x) p(x) = W(x|x') p(x').    (2.22)
The chance of staying in the same state, W(x|x), is automatically defined by the rejection probability, W(x|x) = 1 − Σ_{x'≠x} q(x'|x) a(x'|x). According to Eq. 2.24 any pair of valid acceptance probabilities must have the same ratio as any other valid pair. Therefore they can be obtained by multiplying the Metropolis-Hastings probabilities by a quantity less than one. This implies a larger rejection rate, which harms the asymptotic variance of the chain. In this sense (although not by all measures) using Eq. 2.25 is optimal [530, 573].
Although the MCMC-procedure guarantees convergence to p(x) when the condition of detailed balance is satisfied, it is important to pay attention to the degree of correlation between the generated states in any particular implementation of the method. Typically, a certain number, t_skip, of the initial generated states, {x_t}_{t=0}^{t_skip}, are discarded from the statistics in order to minimize the transient dependence of the sampling on the (arbitrary) choice of x_0. This stage of the sampling is known as the "equilibration" or "burn-in" stage. If T denotes the total number of generated states, a standard MCMC-algorithm using the Metropolis-Hastings acceptance criterion then proceeds through the following steps:
Metropolis-Hastings sampling
0. Choose T and t_skip < T. Generate x^{t=0}.
1. for t = 1 ... T
2.   Propose x' ∼ q(· | x^{t−1})
3.   Compute a = min{1, [ω(x') q(x^{t−1} | x')] / [ω(x^{t−1}) q(x' | x^{t−1})]}
4.   Draw r uniformly in [0, 1).
5.   Set x^t = x' if r < a, otherwise x^t = x^{t−1}
6.   If t > t_skip, add x^t to the statistics.
7. end for
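A runnable counterpart of the listing above, for an illustrative one-dimensional double-well "energy" and a symmetric Gaussian proposal (so that the proposal densities cancel in the acceptance ratio), could look as follows; all numerical settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy target: weights w(x) = exp(-beta * E(x)) with a double-well "energy".
beta = 3.0
def energy(x): return (x ** 2 - 1.0) ** 2

T, t_skip, step = 50_000, 5_000, 0.5
x = 0.0
samples = []
for t in range(T):
    x_new = x + rng.normal(0.0, step)          # symmetric proposal, q cancels
    a = min(1.0, np.exp(-beta * (energy(x_new) - energy(x))))
    if rng.random() < a:
        x = x_new
    if t >= t_skip:
        samples.append(x)

samples = np.array(samples)
print(samples.mean(), samples.var())            # time averages, cf. Eq. 2.30
```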
According to Eq. 2.25 a high acceptance rate can be achieved when the proposal function is chosen so that q(x|x')/q(x'|x) ≈ p(x)/p(x'). Although the MCMC-procedure is brought into play only in situations where one cannot directly sample from the joint distribution p(x) of some multivariate quantity, x = (x_1, ..., x_D)^T, it is in many applications possible to sample from the conditional distributions, p(x_i | x_{\i}), where x_{\i} denotes the set of all variables except the i'th, x_{\i} = (x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_D)^T.
Since

p(x) = p(x_i | x_{\i}) p(x_{\i}),

the use of the i'th conditional distribution as proposal distribution leads to a Metropolis-Hastings acceptance rate of a = 1. In effect, by choosing i = 1, ..., D in turn (or selecting i ∈ {1, ..., D} at random) and sampling a new x = (x_i, x_{\i}) according to x_i ∼ p(x_i | x_{\i}), a Markov chain will be generated that converges to p(x). This sampling procedure is known as Gibbs sampling [212]. Examples of the use of Gibbs sampling in the context of inferential protein structure determination and in directional statistics can be found in Chaps. 12 and 6, respectively.
309, 505, 730]. This condition, known as the loop-closure condition [231], imposes particular constraints on the degrees of freedom which lead to non-trivial volume factors, a point first realized by Dodd et al. [153].
Whether the state space Ω itself involves a particular non-trivial metric (case I) or such a metric arises from the geometrical constraints involved in a particular proposal function (case II), the relative change of volume elements can be determined by the ratio of the associated Jacobians. The requirement of detailed balance, Eq. 2.22, implies that the Metropolis-Hastings acceptance becomes

a(x'|x) = min{1, [ω(x') q(x|x') J(x')] / [ω(x) q(x'|x) J(x)]},    (2.28)
where J(x) and J(x') are the Jacobians evaluated in x and x', respectively. In the following we shall assume that the target weights ω(x) include the proper geometrical distribution ω_*(x) = J(x) (case I). This implies that the general expression for the thermal probability weights becomes
Similarly, we assume that the Jacobian of the proposal distribution (case II) is "absorbed" into q(x). This will allow us – in both cases – to express the acceptance rate on the standard form, Eq. 2.25, and use the Euclidean measure, ∏_i dx_i, in expressions defined by integrals over Ω. We refer to the excellent review by Vitalis and Pappu [741] and references therein for more details on this subject.
2.5 Estimators
First, the expectation value, Eq. 2.10, of any function f(x) can be estimated as the arithmetic time average of the Markov chain, once this has converged to its equilibrium distribution, p_t(x) ≈ p_∞(x) = p(x). Specifically, since we assume p_t(x) ≈ p(x) when t > t_skip, the MCMC-estimate, Ê_p[f(x)], of the expectation value of f(x) follows Eq. 2.17,
Ê_p[f(x)] = f̄(x) = (1/N) Σ_t f(x_t).    (2.30)

Here, N = T − t_skip is the total number of states used for the statistics, and the bar over a quantity indicates an MCMC "time"-average. The summation is over recorded states only, Σ_t = Σ_{t=t_skip+1}^{T}.
When the aim is to estimate the full probability distribution of f , Eq. 2.11,
rather than the expectation value alone, some extra considerations must be taken.
Kernel density methods constitute one obvious approach for estimating p_f(y). For simplicity, and in keeping with most applications of the MCMC-method, we shall make use of histogram based estimators. According to Eq. 2.11 the probability, p(f(x) ∈ Y), of obtaining a value of f in some set Y = [ỹ_−, ỹ_+) is given by

p(f(x) ∈ Y) = ∫_Y p_f(y) dy = ∫_Ω χ_Y(f(x)) p(x) dx = E_p[χ_Y(f(x))],    (2.31)

where χ_Y(·) is the indicator function on Y,

χ_Y(y) = 1 if y ∈ Y, and 0 otherwise.
Since p(f(x) ∈ Y) can be expressed as an expectation value, Eq. 2.31, an estimator is given by

p̂(f(x) ∈ Y) = χ̄_Y(f) = (1/N) Σ_t χ_Y(f(x_t)).
A Taylor expansion of Eq. 2.31 in Δy = ỹ_+ − ỹ_− around y = (ỹ_− + ỹ_+)/2 yields

p(f(x) ∈ Y) = p_f(y) Δy + (1/2) p'_f(y) Δy² + O(Δy³).

Consequently, by choosing Δy ≪ |p_f(y)/p'_f(y)| = |d ln(p_f(y))/dy|^{−1}, the probability density p_f(y) can be estimated as

p̂_f(y) Δy = (1/N) Σ_t χ_Y(f(x_t)),    Δy ≪ |d ln p_f(y)/dy|^{−1}.    (2.32)
The histogram method for estimating the full density function p_f(y) arises naturally by partitioning the image of f. Specifically, let {Y_i}_{i=1}^{L} be such a partition, where Y_i = [ỹ_i, ỹ_{i+1}), y_i = (ỹ_i + ỹ_{i+1})/2 and Δy_i = ỹ_{i+1} − ỹ_i for ỹ_1 < ỹ_2 < ... < ỹ_{L+1}. Then

p̂_f(y_i) = n(y_i) / (N |Δy_i|),    (2.33)

where n(y_i) is the total number of states in the sampling belonging to the i'th bin. Although Eq. 2.32 suggests that bin sizes should be chosen according to the variation of ln[p_f(y)], it is quite common in the histogram method to set the bin widths constant.
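In code, the histogram estimator of Eq. 2.33 with constant bin widths amounts to a few lines (here applied to synthetic, stand-in MCMC output rather than a real simulation).

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical MCMC output: values f(x_t) recorded after the burn-in stage.
f_values = rng.normal(0.0, 1.0, 20_000)

# Histogram estimate p_f(y_i) ~ n(y_i) / (N * |dy_i|), with constant bin widths.
edges = np.linspace(-4.0, 4.0, 41)
counts, _ = np.histogram(f_values, bins=edges)
widths = np.diff(edges)
p_hat = counts / (len(f_values) * widths)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.trapz(p_hat, centers))   # integrates to approximately one
```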
Note that the histogram method easily generalizes to multi-valued functions, f = (f_1, ..., f_d), by considering Y_i to be a box with edge lengths Δy_i = (Δy_{i1}, ..., Δy_{id}) and volume |Δy_i| = ∏_{j=1}^{d} |Δy_{ij}|. This approach could in principle be used to estimate the probability density, p_f(y), in the multivariate case, Eq. 2.12. In practice, however, kernel methods offer a better alternative for estimating probability densities, particularly in higher dimensions, d > 1 [62].
where E refers to the physical energy function in the thermal case, and to Eq. 2.9 in the non-thermal case. Correspondingly, ω_E(E) is used as a general notation for the weight associated with E. For a thermal distribution ω_E(E) = ω_β(E) = exp(−βE), where β can attain any value, whereas for a non-thermal distribution ω_E(E) = exp(−E) (see Sect. 2.2). This notation will allow us to treat the estimation of partition functions in the same manner for both thermal and non-thermal problems.
Now by setting f(x) = E(x) in Eq. 2.11, we obtain

p_E(E) = ∫_Ω δ(E(x) − E) p(x) dx = g_*(E) ω_E(E) / Z,    (2.36)

Z = ∫_R g_*(E) ω_E(E) dE,

g_*(E) = ∫_Ω δ(E(x) − E) ω_*(x) dx.    (2.37)
When ω_*(x) = 1 (or ω_*(x) = J(x) for non-Euclidean geometries), g_*(E) takes the physical meaning of the density of states, Eq. 2.7. In particular, we obtain for the canonical ensemble

p_β(E) = g_*(E) exp(−βE) / Z_β,    Z_β = ∫_R g_*(E) exp(−βE) dE.    (2.38)
Similarly, we can introduce the statistical weights Γ_*(E_e) for the reference state, defined as

Γ_*(E_e) = ∫_{Δ_{E_e}} g_*(E) dE ≈ g_*(E_e) ΔE_e,    (2.39)

where e refers to the e'th bin of the partition of the energies E(x),

{Δ_{E_e}}_e = {[E_e − ΔE_e/2, E_e + ΔE_e/2)}_e.
Equation 2.39 reduces to the statistical weights for the thermal distribution when ω_* is the uniform (or geometrical) measure. The reference partition function can be expressed in terms of Γ_*(E_e) as

Z_* = ∫_Ω ω_*(x) dx = ∫_R g_*(E) dE ≈ Σ_e Γ_*(E_e).    (2.40)
In the context of statistical physics, estimating Γ_*(E) from Eq. 2.41 is known as the single histogram method [702]. Since multiplying the estimates {Γ̂_*(E_e)}_e with an arbitrary constant, {Γ̂_*(E_e)}_e → {c Γ̂_*(E_e)}_e, leads to the same rescaling of the partition function, Ẑ → c Ẑ, the statistical weights can only be estimated in a relative sense. This reflects the fact that MCMC-sampling only allows ratios of partition functions to be estimated. In practice, an absolute scale is most simply obtained by prescribing a particular value to either Z or Γ̂_*(E_ref), where E_ref is some selected reference bin. In the latter case, Z is then estimated from Eq. 2.42. The corresponding value for the reference partition function is subsequently found by inserting the estimated statistical weights into Eq. 2.40. This procedure will lead to a unique estimate of Z/Z_*.
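The idea can be sketched as follows for a canonical run: the energy histogram n(E_e) is proportional to Γ_*(E_e) ω_E(E_e), so dividing out the known weights gives the statistical weights up to an overall constant, which is then fixed by a reference bin; the synthetic energy samples and bin settings below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
beta = 1.0

# Hypothetical energy samples from a canonical MCMC run at inverse temperature beta.
energies = rng.gamma(shape=5.0, scale=1.0, size=50_000)

# n(E_e) is proportional to Gamma_*(E_e) * exp(-beta * E_e), hence
# Gamma_hat(E_e) is proportional to n(E_e) * exp(+beta * E_e).
edges = np.linspace(energies.min(), energies.max(), 60)
counts, _ = np.histogram(energies, bins=edges)
centers = 0.5 * (edges[:-1] + edges[1:])
mask = counts > 0
gamma_hat = counts[mask] * np.exp(beta * centers[mask])

# Fix the arbitrary scale by pinning a reference bin, then estimate Z and Z_*.
gamma_hat /= gamma_hat[0]
Z_hat      = np.sum(gamma_hat * np.exp(-beta * centers[mask]))
Z_star_hat = np.sum(gamma_hat)
print(Z_hat / Z_star_hat)
```

Note that bins with few or no counts (the tails of the sampled energy distribution) are poorly determined, which is the limitation that the extended ensemble methods below are designed to overcome.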
limited support of the density estimators, p̂_f(y) and p̂_E(E), discussed in Sect. 2.5.4. The realization of the ensemble approach will be reviewed in Sects. 2.7 and 2.8.
The common strategy to improve the Markov chain algorithm itself is to perform a rescaling of the MC-time. This can be particularly effective in cases where the average acceptance rate ā is very low, which is a problem typically encountered in the Metropolis-Hastings algorithm at low temperatures. The idea of the MC-time rescaling is, roughly, that instead of rejecting 1/ā moves on average one can increase the MC-time by 1/ā and enforce a move to one of the possible trial states according to the conditional probability distribution. This method, which is limited to discrete models, is known as event-driven simulation [421] or the N-fold way [71, 548]. The event-driven method provides a solution to the problem of trapping in local energy minima, in cases where these minima only comprise one or a few states, and where the proposal distribution is sufficiently local to make the successive rescaling of the MC-time feasible. However, this approach is more involved for continuous degrees of freedom and cannot circumvent the long correlation times associated with cooperative transitions.
When the q̃_i's are constant (independent of x) this expression reduces to its usual form, Eq. 2.24, for each move-type. In general, a compromise needs to be made between the characteristic step-size of the proposal and the acceptance rate. If the step-size is too large compared to the typical scale on which p(x) varies, the acceptance rate will become exceedingly small, but too small moves will make the exploration inefficient. Both cases compromise convergence [62].
A successful strategy is often to mix local and non-local types of proposals.
The best example of this approach is in MCMC-applications to spin-systems. The
cluster-flip algorithm, developed in the seminal works of R.H. Swendsen and J.-S. Wang [704] and U. Wolff [769], is used in conjunction with single spin flips and has proven extremely valuable in reducing critical slowing down near second order phase transitions.
Another important example is the pivot algorithm for polymer systems [462], which has been used with great success to calculate properties of self-avoiding random walks [734]. A pivot move is defined by a random choice of a monomer as a pivot and by a random rotation or reflection of one segment of the chain with the pivot as origin. The pivot move is also efficient for sampling protein configurations in the unfolded state, but must be supplemented with local types of moves to probe the conformational ensemble in the dense environment of the native state. These types of proposal distributions are commonly referred to as concerted rotation algorithms (see also Sect. 2.4.3). For instance, Favrin et al. have suggested
a semi-local variant of the pivot-algorithm for protein simulations [176]. This work
has given inspiration to the fully local variant suggested by Ulmschneider et al.
[730] and which provides improved sampling performance compared to earlier
concerted rotation algorithms with strict locality [730]. However, a key bottleneck
of all existing methodologies with strict locality is that some of the degrees of
freedom of the polypeptide chain are not controlled by the proposal function but
post-determined by the condition for chain closure. Based on an exact analytical
solution to the chain-closure problem we have recently devised a new type of
MCMC algorithm which improves sampling efficiency by eliminating this problem
[72]. Further examples of approximate distributions for protein and RNA structures
that serve as effective proposal distributions in a MCMC-sampling framework are
discussed in Chaps. 10 and 5.
In general, the extension of the proposal distribution needs to provide both a
reasonable constancy of the acceptance rate, Eq. 2.25, as well as ensuring that the
gain in sampling efficiency is not lost in the computer time to carry out the move.
In contrast to the method discussed in the next section, efficient choices of q.xjx 0 /
will always be system-specific.
suffers from slow relaxation (e.g. low temperatures/energies) to the part, where the
sampling is free from such problems (e.g. high temperatures/energies) [328]. This
construction can be viewed as a refinement of the chaining technique [62, 535].
While extended ensemble methods have previously been prevalent only in the field
of statistical physics, they are slowly gaining influence in other fields, such as
statistics and bioinformatics [197, 450, 530]. For an excellent and in-depth review
of extended ensembles and related methods in the machine-learning context, we
refer to the dissertation by Murray [530].
An attractive feature of the extended ensemble approaches is that they not only ensure more efficient sampling, but also provide estimators for the tails of the marginal distributions, p_f(y) or p_E(E), which are inaccessible to the standard MCMC-approach (Sect. 2.5.4). Consequently, extended ensembles are particularly suited for calculating key multivariate integrals, including evidence or partition functions. To appreciate this aspect of the extended ensembles in a general statistical context, we shall in the following retain the explicit use of the reference weights, ω_*. In a physical context, and when Ω is considered Euclidean, ω_* = 1 and can therefore be ignored.
The merit of the extended ensembles is the generality of the approach. They can in principle be combined with any legitimate proposal distribution and they can be applied to any system. However, unlike simple canonical sampling, the extended ensembles all introduce a set of parameters which are not a priori known. The central strategy in these methods is then to learn (or tune) the parameters of the algorithm in a step-by-step manner in preliminary runs, which typically involve a considerable amount of trial-and-error [179, 285]. This stage is termed the learning stage [328], which we shall discuss in some detail in Sect. 2.8. After the tuning has converged, a long run is performed where the quantities of interest are sampled. This stage is called the sampling or production stage. For all ensembles, it is then straightforward to reconstruct the desired statistics for the original target distribution, p(x). This reconstruction technique, which we shall return to in Sect. 2.9, is known as reweighting [702]. It should be emphasized that the statistical weights Γ play a key role in all extended ensembles, since both the unknown parameters as well as the reweighting can be expressed in terms of Γ.
It is convenient to distinguish between two types of extended ensembles, tempering and generalized ensembles [285]. In tempering based ensemble extensions, the new target distribution is constructed from a pre-defined set of inverse temperatures {β_r}_r, including the original β. These inverse temperatures enter in a Boltzmann-type of expression for a corresponding set of weights, {ω_r(x)}_r, where ω_r(x) ∝ exp(−β_r E(x)) ω_*(x). In generalized ensembles (GE), the Boltzmann form is abandoned altogether and ω_GE(E) can be any function designed to satisfy some desired property of the GE-target distribution. Parallel and simulated tempering, presented in Sects. 2.7.1 and 2.7.2, belong – as their names indicate – to the first category, whereas the multicanonical and 1/k-ensemble, presented in Sects. 2.7.3 and 2.7.4, belong to the second category. As clarified in Sect. 2.7.5 the tempering-based methods have a more limited domain of application compared to generalized ensembles, since in the former case certain restrictions are imposed on the allowed
p_PT(x_PT) = ∏_{r=1}^{R} p_{β_r}(x_r) = ∏_r exp(−β_r E(x_r)) ω_*(x_r) / Z(β_r).    (2.43)
ã_rs = [p_{β_r}(x_s) p_{β_s}(x_r)] / [p_{β_r}(x_r) p_{β_s}(x_s)] = exp[(β_r − β_s)(E(x_r) − E(x_s))].    (2.44)
Consequently, the exchange rate between replicas r and s becomes W^{re}_{rs} = q_rs a_rs. The simultaneous probability distribution p_PT will be invariant under this choice, so detailed balance is automatically satisfied. The temperatures of the two replicas r and s have to be close to each other to ensure non-negligible acceptance rates. In a typical application, the predefined set, {β_r}, will span from high to low temperatures, and only neighboring temperature-pairs will serve as candidates for a replica exchange. This construction is illustrated in Fig. 2.1.
The averages taken over each factor p_{β_r}(x_r) reproduce the canonical averages at inverse temperature β_r, because the replica-exchange move does not change the simultaneous distribution p_PT(x_PT). At the same time, the states of the replicas are
Fig. 2.1 An illustration of parallel tempering. The columns of circles represent states in the Markov chain (with state space Ω^4). Each row of circles has stationary distribution p_{β_r} for r ∈ {1, 2, 3, 4}. The horizontal arrows represent the conventional transition matrix, W_{β_r}, that makes independent transitions at inverse temperature β_r. The dotted arrows represent the replica exchange transition matrix, W^{re}_{rs}, that exchanges the states of chains with adjacent inverse temperatures r and s (Figure by Jes Frellsen [195], adapted from Iain Murray [530])
effectively propagated from high to lower temperatures, and the mixing of the Markov chain is facilitated by the fast relaxation at higher temperatures. Note that $p_{\beta=0}(\boldsymbol{x})$ is identical to the reference distribution, $p_{\beta=0}(\boldsymbol{x}) = \nu(\boldsymbol{x})$. Consequently, by setting $\beta_{r=1} = 0$ and $\beta_{r=R} = \beta$ in the replica set, where $\beta$ is the inverse temperature of the original target distribution ($\beta = 1$ in the non-thermal case), parallel tempering makes it feasible to estimate $Z/Z_\nu$ using reweighting, see Sect. 2.9.
The PT-method can be considered a descendant of the simulated annealing algorithm [375]. The term "annealing" indicates that the simulations are started at high temperatures, which are then gradually decreased to zero. The objective of this method is typically not statistical inference but merely a search for the ground state of complex problems. Simulated annealing is a widely used heuristic, and is one of the key algorithms used in protein structure prediction programs such as Rosetta from the Baker laboratory [426]. While annealing is useful for escaping from shallow local minima, it does not allow escape from deep meta-stable states. Such jumps are in principle facilitated by the PT-method, which allows the temperature to move "up and down" alternately.
In the field of biopolymer simulations the parallel tempering/replica exchange
method has become very popular in conjunction with molecular dynamics (MD)
algorithms [694], i.e., the direct integration of Newton's equations of motion. In these
approaches the sampling of each individual replica is simply given by the MD-
trajectory, whereas replica exchange is performed according to Eq. 2.44. The replica
exchange method has also been used successfully in the context of inferential
structure determination by Habeck and collaborators, as detailed in Chap. 12.
The challenge in the learning stage of the parallel tempering algorithm is to set the number of inverse temperatures $R$ and the spacing between them. The standard requirement is that exchanges between all adjacent pairs of replicas occur with a uniform and non-negligible probability. It can be shown [323, 324, 328] that this is satisfied when the spacing between adjacent inverse temperatures, $|\beta_{r+1} - \beta_r|$, is selected approximately as $Q(\beta_r)^{-1}$, where
$$Q(\beta) \propto \sqrt{E_\beta[E^2] - E_\beta[E]^2} = \sqrt{\sigma_\beta^2(E)}. \qquad (2.45)$$
Here, $E_\beta[E]$ denotes the expectation value of $E$ with respect to the Boltzmann distribution, Eq. 2.2. Note that $\sigma_\beta^2(E)$ is related to the heat capacity $C$ as $C = k\beta^2\sigma_\beta^2(E)$, see Eq. 2.72, and may also be expressed as
$$\sigma_\beta^2(E) = -E_\beta\!\left[\frac{d^2 \ln[p_\beta(\boldsymbol{x})]}{d\beta^2}\right] = I(\beta),$$
where $I(\beta)$ is the Fisher information, see Chap. 1. Since $E$ typically scales with the system size or number of degrees of freedom $D$ as $E \propto D$, so will the variance, $\sigma_\beta^2(E) \propto D$. Consequently, the required number of replicas scales with system size as
$$R \simeq \int_{\beta_{\min}}^{\beta_{\max}} Q(\beta)\, d\beta \;\propto\; \sqrt{D}.$$
However, as $\sigma_\beta^2(E)$ is not known a priori, the number of inverse temperatures and their internal spacing have to be estimated, typically through an iterative approach [322, 362]. Further aspects and perspectives of the PT-method can be found in the recent review by Earl and Deem [166].
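The spacing criterion $|\beta_{r+1} - \beta_r| \approx Q(\beta_r)^{-1}$ can be turned into a simple ladder construction once rough estimates of the energy fluctuations are available, e.g. from short pilot runs. The sketch below assumes a user-supplied estimate `sigma_E` of $\sigma_\beta(E)$; it only illustrates the scaling argument above.

```python
def build_beta_ladder(beta_min, beta_max, sigma_E, safety=1.0):
    """Place inverse temperatures with spacing ~ 1/Q(beta), where Q(beta) is
    proportional to the energy standard deviation sigma_E(beta) (Eq. 2.45)."""
    betas = [beta_min]
    while betas[-1] < beta_max:
        q = max(sigma_E(betas[-1]), 1e-12)
        betas.append(betas[-1] + safety / q)
    betas[-1] = beta_max      # clamp the last rung to the target temperature
    return betas

# For sigma_E ~ sqrt(D), the number of rungs grows like sqrt(D),
# in line with R ~ integral of Q(beta) d(beta).
```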
$$p_{ST}(\boldsymbol{x}, \beta_r) = \frac{1}{Z_{ST}}\exp\big(-\beta_r E(\boldsymbol{x}) + \eta(\beta_r)\big)\,\omega_\nu(\boldsymbol{x}), \qquad (2.46)$$
$$Z_{ST} = \sum_{r=1}^{R} e^{\eta(\beta_r)}\, Z_{\beta_r},$$
where $\eta$ is a weight function that controls the distribution over $\{\beta_r\}_r$. The Markov chain is constructed by simulating the system with the ordinary proposal function, $q(\boldsymbol{x}'\mid\boldsymbol{x})$, combined with a probability $q(s\mid r)$ of proposing the temperature move $\beta_r \to \beta_s$, where $\beta_r$ is the current inverse temperature. According to Eq. 2.46 the latter proposal has the Metropolis acceptance probability
Fig. 2.2 An illustration of simulated tempering. The circles represent states in the Markov chain. The rows represent different values of the inverse temperature. The horizontal arrows represent the conventional transition matrix, $W^{\beta_r}$, at inverse temperature $\beta_r$. The dotted arrows represent the transition matrix, $W^{st}_{r\to s}$, that changes the inverse temperature to an adjacent level (Figure by Jes Frellsen [195], adapted from Iain Murray [530])
$$a(s, \boldsymbol{x} \mid r, \boldsymbol{x}) = \min\Big\{1,\; \exp\big(-(\beta_s - \beta_r)E(\boldsymbol{x}) + \eta(\beta_s) - \eta(\beta_r)\big)\Big\}.$$
The unbiased choice, $\eta(\beta_i) = \text{const.}$, would on the other hand trap the system in one of the extreme ends of the temperature range. Accordingly, the learning stage of simulated tempering involves choosing the set of inverse temperatures, $\{\beta_r\}_{r=1}^{R}$, and finding the weight function $\eta$. The optimal spacing of the inverse temperatures can be estimated iteratively in a similar way as described for parallel tempering [371]. Equivalently, $\eta$ can be estimated iteratively [371], which amounts to estimating the partition functions $\{Z_{\beta_r}\}$ when $\eta$ is chosen according to Eq. 2.48. Thus, by setting $\beta_{r=1} = 0$ and $\beta_{r=R} = \beta$ – the target inverse temperature – the ratio $Z/Z_\nu = \exp(\eta_1 - \eta_R)$ can be obtained. An alternative estimation approach is discussed in Sect. 2.9.
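A minimal sketch of the simulated-tempering level move implied by the acceptance probability above; `eta` holds the level weights $\eta(\beta_r)$ (ideally $\eta_r \approx -\ln Z_{\beta_r}$), and the neighbouring-level proposal is kept symmetric by rejecting moves that fall outside the ladder. The energy function is assumed to be supplied by the user.

```python
import math
import random

def temperature_move(r, x, betas, eta, energy):
    """Propose beta_r -> beta_s for the current state x and accept with the
    simulated-tempering Metropolis probability (Sect. 2.7.2)."""
    s = r + random.choice((-1, 1))          # symmetric proposal between neighbours
    if not 0 <= s < len(betas):
        return r                            # outside the ladder: reject, stay at r
    log_a = -(betas[s] - betas[r]) * energy(x) + eta[s] - eta[r]
    return s if math.log(random.random()) < log_a else r
```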
The ST-method is less memory-demanding than PT, since only one replica is simulated at a time. However, the fact that the partition sums also need to be estimated iteratively makes the PT-ensemble a better choice both in terms of convenience and robustness [328], provided that memory is not a limiting factor.
A very popular alternative to parallel and simulated tempering is the multicanonical ensemble (MUCA) [40–42]. A more convenient formulation of the ensemble, which we shall adopt here, is given by J. Lee [427] under the name "entropic ensemble". Methods based on a similar idea are also known as adaptive umbrella sampling [23, 313, 504]. In their original form, multicanonical ensembles deal with extensions in the space of the total energy, while adaptive umbrella sampling focuses on extensions in the space of a reaction coordinate at fixed temperature. The two methods correspond to different choices of the variables $\boldsymbol{f}$ used for the weight parameterization, but from a conceptual point of view they are closely related. More recently, the methodology of adaptive umbrella sampling has been generalized in the metadynamics algorithm proposed by Laio and Parrinello [411], to which we return in Sect. 2.8.5.
As opposed to the ST- and PT-methods, the state space itself is not extended in the multicanonical approach, $\Omega_{MUCA} = \Omega$. Instead, the multicanonical ensemble is a particular realization of a generalized ensemble (GE), where the use of Boltzmann weights is given up altogether and replaced by a different weight function $\omega_{GE}$. As further discussed in Sects. 2.7.5 and 2.8.5, $\omega_{GE}$ can in principle be a function of any set of collective coordinates, $\boldsymbol{f} = (f_1, \ldots, f_d)^T$. For the sake of clarity we shall here assume $\omega_{GE}$ to be a function of the energy $E$ only,
$$\omega_{GE}(\boldsymbol{x}) = \omega_{E,GE}\big(E(\boldsymbol{x})\big)\,\omega_\nu(\boldsymbol{x}). \qquad (2.49)$$
The procedure of sampling according to $\omega_{GE}$ follows directly from the Metropolis–Hastings expression for the acceptance probability
$$a(\boldsymbol{x}'\mid\boldsymbol{x}) = \min\left\{1,\; \frac{\omega_{GE}(\boldsymbol{x}')\,q(\boldsymbol{x}'\to\boldsymbol{x})}{\omega_{GE}(\boldsymbol{x})\,q(\boldsymbol{x}\to\boldsymbol{x}')}\right\}. \qquad (2.50)$$
According to Eq. 2.36 the marginal distribution over $E$ for an arbitrary weight function $\omega_{E,GE}$ is given as
$$p_{E,GE}(E) = \frac{\omega_{E,GE}(E)\,g_\nu(E)}{Z_{GE}}, \qquad Z_{GE} = \int_{\mathbb{R}} g_\nu(E)\,\omega_{E,GE}(E)\,dE.$$
The question is how to choose $\omega_{E,GE}$ in an optimal way. The choice of Boltzmann weights generates a sampling which only covers a rather narrow energy region (Sect. 2.5.4). Instead, a uniform sampling in energy space would be realized by the function
$$\omega_{E,MUCA}(E) = g_\nu(E)^{-1}. \qquad (2.51)$$
These weights define the multicanonical ensemble.
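Given an estimate of $\ln g_\nu(E)$ on the energy bins, the multicanonical acceptance probability follows directly from Eq. 2.50 with $\omega_{E,MUCA} = 1/g_\nu$. A minimal sketch with a symmetric proposal; the binning function `bin_of` and the estimate `log_g` are assumed inputs, obtained for example from one of the learning algorithms of Sect. 2.8.

```python
import math
import random

def muca_step(x, log_g, energy, propose, bin_of):
    """One multicanonical Metropolis step with weights omega(E) = 1/g_nu(E),
    stored as log_g[e] ~ ln g_nu(E_e) on a fixed energy binning (Eq. 2.51)."""
    x_new = propose(x)                                   # symmetric proposal assumed
    e_old, e_new = bin_of(energy(x)), bin_of(energy(x_new))
    # acceptance ratio g_nu(E_old)/g_nu(E_new), i.e. log a = log_g[old] - log_g[new]
    if math.log(random.random()) < log_g[e_old] - log_g[e_new]:
        return x_new
    return x
```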
The multicanonical ensemble was originally devised as a method to sample first-order phase transitions [41], and in this respect it is superior to the "temperature-based" approaches of the PT- and ST-ensembles. In the original work [41], the method was shown to replace the exponential slowing down associated with canonical sampling of a first-order phase transition by a power-law behavior. The disadvantage of the MUCA-ensemble resides in the extensive number of parameters which need to be estimated in the learning stage. In principle, the weight for each individual energy (or energy bin), $\omega_{E,GE}(E)$, represents an independent tunable parameter. For the uniform reference state, the total entropy, $S = k\ln[\Gamma_{tot}]$, will scale with system size as $S \propto D$. Consequently, the number of weight parameters required to resolve $\ln[g_\nu(E)]$ sufficiently finely also scales proportionally to $D$, as opposed to the $\sqrt{D}$ scaling of the required number of temperatures in the PT- and ST-ensembles. We shall return to this problem in Sect. 2.8.
The multicanonical ensemble has been applied to a wide range of problems in statistical physics, including systems in condensed matter physics, gauge theories and different types of optimization problems (reviewed in [39, 328, 755]). U.H.E. Hansmann and Y. Okamoto pioneered its use in studies of proteins [284], and by now a quite extensive literature exists, compiled in [288, 332].
2.7.4 1/k-ensemble
The MUCA-ensemble is diffusive in energy space and may therefore spend an unnecessary amount of time in high-energy regions. In many statistical problems, one is more interested in the properties of the system around the ground state. However, an ensemble with too much weight at low energies may become fragmented into "pools" at the bottoms of "valleys" of the energy function, thus suffering from the same deficiencies as the canonical ensemble at low temperatures, where ergodicity is easily lost. It has been argued that an optimal compromise between ergodicity and low temperature/energy sampling is provided by the $1/k$-ensemble [303]. The $1/k$-ensemble is also a generalized ensemble, defined by the weight function
$$\omega_{1/k}(\boldsymbol{x}) = \omega_{E,1/k}\big(E(\boldsymbol{x})\big)\,\omega_\nu(\boldsymbol{x}), \qquad (2.52)$$
where
$$\omega_{E,1/k}(E) = \frac{1}{k_\nu(E)}, \qquad k_\nu(E) = \int_{E' < E} g_\nu(E')\,dE'. \qquad (2.53)$$
Consequently, the $1/k$-ensemble assigns larger weights to the low-energy region than the multicanonical ensemble does. Furthermore, the $1/k$-ensemble is less sensitive to errors in the estimate of the density of states, due to the integral form of the weights [303].
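On a binned energy axis, the $1/k$ weights of Eq. 2.53 follow from the cumulative density of states by a single cumulative sum. A small numpy sketch, assuming an estimated $\ln g_\nu$ on bins of constant width `dE`:

```python
import numpy as np

def one_over_k_log_weights(log_g, dE):
    """Return ln omega_{E,1/k}(E_e) = -ln k_nu(E_e), where k_nu(E_e) is the
    reference mass accumulated below (and, in this discretization, including)
    bin E_e (Eq. 2.53). Defined only up to an irrelevant additive constant."""
    g = np.exp(log_g - log_g.max())     # a common scale factor cancels in the end
    k = np.cumsum(g) * dE               # cumulative reference mass k_nu(E)
    return -np.log(k)
```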
It can be shown that samples from the $1/k$-ensemble have an approximately uniform marginal distribution over $\ln[g_\nu(E)]$ [303]. This appealing feature is shared by the nested sampling method [666, 669], briefly presented in Sect. 2.3.2. In fact, nested sampling parameterizes the partition or evidence function, Eq. 2.35, precisely according to the cumulative distribution $k_\nu(E)$. Specifically, the total
cumulative reference (or prior) mass covering all energies below a given $E$ can be expressed as $X(E) = k_\nu(E)/Z_\nu$. Since this is a monotonically increasing function, one may equally well parameterize $E$ according to $X$. Consequently, Eq. 2.35 can also be written as
$$Z = \int_0^1 \omega_E\big(E(X)\big)\,dX. \qquad (2.54)$$
This reparameterization is used in a quite original way in the nested sampling method [669].
So far we have discussed extended ensembles using the energy as the axis of extension. The energy, however, plays a special role, as this quantity dictates the probability weights $\omega(\boldsymbol{x})$ of the target distribution, as discussed in Sect. 2.2. In the following we briefly outline how generalized ensembles can be applied to facilitate broad sampling along any choice of variable(s) $\boldsymbol{f}$, here considered as a reaction coordinate. The digression will also serve to highlight the limitations of using tempering-based methods for this task.
When the axis of extension is not some simple function of the energies, Eq. 2.49 is not the most convenient form of the GE-weights. Rather, we shall use
$$\omega_{GE}(\boldsymbol{x}) = \omega_{f,GE}\big(\boldsymbol{f}(\boldsymbol{x})\big)\,\omega(\boldsymbol{x}), \qquad (2.55)$$
where $\omega(\boldsymbol{x})$ are the original target weights. The marginal probability distribution $p_{f,GE}(\boldsymbol{y})$ then becomes
$$p_{f,GE}(\boldsymbol{y}) = Z_{GE}^{-1}\int_{\Omega} \delta\big(\boldsymbol{f}(\boldsymbol{x}) - \boldsymbol{y}\big)\,\omega_{GE}(\boldsymbol{x})\,d\boldsymbol{x} = \frac{Z}{Z_{GE}}\,\omega_{f,GE}(\boldsymbol{y})\,p_f(\boldsymbol{y}), \qquad (2.56)$$
where
$$Z_{GE} = Z\int_{\mathbb{R}^d} \omega_{f,GE}(\boldsymbol{y})\,p_f(\boldsymbol{y})\,d\boldsymbol{y}. \qquad (2.57)$$
Consequently, the flat-histogram/multicanonical ensemble is realized with the choice
$$\omega_{f,MUCA}(\boldsymbol{y}) = \frac{1}{p_f(\boldsymbol{y})}. \qquad (2.58)$$
Equation 2.58 can be applied to any marginal distribution $p_f(\boldsymbol{y})$. Tempering-based extensions, on the other hand, are by and large limited to cases where $\ln[p_f(\boldsymbol{y})]$ is concave. This observation can be deduced from Eq. 2.56, which for a replica with inverse "temperature" $\lambda$, the Lagrange parameter conjugate to $\boldsymbol{y}$, would read
$$p_{f,\lambda}(\boldsymbol{y}) = \frac{Z}{Z_{f,\lambda}}\exp(-\lambda \boldsymbol{y})\,p_f(\boldsymbol{y}), \qquad Z_{f,\lambda} = Z\int_{\mathbb{R}} \exp(-\lambda \boldsymbol{y})\,p_f(\boldsymbol{y})\,d\boldsymbol{y}. \qquad (2.59)$$
If $\ln[p_f(\boldsymbol{y})]$ is not concave, this equation will have multiple solutions for some range of $\lambda$-values, implying that $p_{f,\lambda}(\boldsymbol{y})$ is not unimodal. Such distributions are generally difficult to sample efficiently (Sect. 2.6). It is precisely the ability of the GE-techniques to handle a non-concave behavior of $\ln[g(E)]$ that allows them to alleviate the exponential slowing down associated with first-order phase transitions. For a more detailed discussion of the caveats of tempering-based methods for sampling non-concave functions we refer to the excellent presentation by Skilling [669].
based on the best current estimates of $g_\nu = \Gamma/\Delta E$ (or $p_f$). In Sect. 2.8.1 we present the most common approach, originally used in the simulation field, which is based on the single histogram method (Sect. 2.5.2). Due to its poor convergence properties a number of alternative approaches have since been suggested, including transition-based methods (Sect. 2.8.2), hybrid methods (Sect. 2.8.3) and non-Markovian approaches (Sects. 2.8.4 and 2.8.5). In Sect. 2.8.6 we present a learning algorithm (Muninn), developed in our own group, which aims at combining the merits of the non-Markovian approaches in terms of adaptiveness and speed with the requirement of detailed balance. Each section concludes with references to model systems where the particular algorithm has been successfully applied.
The simplest approach [38] to obtaining the GE-weights is based on applying the single histogram method (see Sect. 2.5.3) iteratively, given some pre-partitioning $\{E_e\}_{e=1}^{L}$ of the energy space with constant bin sizes $\Delta E_e = \Delta E$. Accordingly, let $n_k(E_e)$ be the histogram obtained using the weight scheme $\omega_{E,k}$, where $e$ refers to the energy bin and $\omega_{E,k}(E_e)$ are the generalized ensemble weights for the $k$'th iteration. From Eq. 2.41 one obtains the following estimate of the statistical weights
$$\hat{\Gamma}(E_e) \propto \frac{n_k(E_e)}{\omega_{E,k}(E_e)}. \qquad (2.60)$$
In turn, this estimate can be used to define the GE-weights $\omega_{E,k+1}$ for the subsequent iteration, according to the choice of ensemble. Typically, constant weights are used for the first iteration.
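The single-histogram update of Eq. 2.60 amounts to a few lines of array arithmetic. The sketch below assumes a hypothetical routine `run_simulation(log_w, T)` that samples with the given log-weights for T steps and returns a histogram over the fixed bins; it also illustrates why unvisited bins are a weak point of the scheme.

```python
import numpy as np

def learn_ge_weights(run_simulation, n_bins, n_iter=10, T=10000):
    """Naive iterative GE-weight learning (Sect. 2.8.1): Gamma_k ~ n_k / omega_k,
    then omega_{k+1} = 1/Gamma_k for a flat-histogram (multicanonical) target."""
    log_w = np.zeros(n_bins)                  # constant weights in the first iteration
    log_gamma = np.zeros(n_bins)
    for _ in range(n_iter):
        n = run_simulation(log_w, T).astype(float)    # histogram n_k(E_e)
        visited = n > 0
        # Eq. 2.60: Gamma(E_e) proportional to n_k(E_e)/omega_k(E_e)
        log_gamma[visited] = np.log(n[visited]) - log_w[visited]
        # bins that were never visited simply keep their previous estimate
        log_w = -log_gamma                    # multicanonical choice, Eq. 2.51
    return log_gamma, log_w
```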
The standard iteration scheme, as defined by the single histogram method, entails a number of problems, which were first systematically addressed in Refs. [672, 673]. First of all, it suffers from the loss of statistics inherent in the updating rule, as the information obtained from the previous $k-1$ iterations is neglected [179, 180]. Secondly, the scheme requires a careful choice of the sampling time $T_k$ for each iteration $k$. If $T_k$ is not much longer than the convergence time $\tau_{conv}$ of the extended ensemble, the iteration procedure will fail to converge, as discussed in Sect. 2.7.5. However, choosing $T_k$ too large will compromise the speed of convergence. Thirdly, the scheme is sensitive to errors in $\hat{g}_\nu$ resulting from low statistics, which can cause the convergence process to become irregular [672]. Another aspect of this problem is how to assign weights to energies that have not yet been visited by the simulation. Finally, the choice of $\Delta E$ involves a compromise between the accuracy of the estimator $\hat{\Gamma}(E_e) \approx \hat{g}_\nu(E_e)\Delta E$ and the resolution (viz. the efficiency) of the ensemble. Ideally, $\Delta E$ should be chosen as a function of $E$ to ensure a uniform
Initially, $\hat{g}_\nu(E_e) \equiv 1$, and with the modification factor $f$ typically set to $f_0 = \exp(1)$, the simulation will visit a wide range of energies quickly [754]. The modification procedure is repeated until the accumulated histogram satisfies some prescribed criterion of flatness, which is checked periodically. Typically, the histogram is considered sufficiently flat when the minimum entry is higher than 80% of the mean value. At this point, the histogram is reset and the modification factor is reduced according to the recipe $f_{k+1} = \sqrt{f_k}$. Note that this procedure implicitly guarantees a proper choice of the simulation time $T_k$ at iteration $k$. The simulation is stopped when the modification factor is smaller than a predefined final value $f_{final}$. For continuous state spaces it is customary to also supply the algorithm with a specific energy window $[E_{min}, E_{max}]$ to which the sampling is constrained.
It is important to emphasize that detailed balance, Eq. 2.22, is only ensured in the limit $f \to 1$. Indeed, the error is known to scale as $\sqrt{\ln f}$ [796]. Consequently, $\ln(f_{final})$ is typically set very small, $\ln(f_{final}) \simeq 10^{-6}\text{–}10^{-8}$ [413, 753]. At this point, $\hat{g}_\nu$ readily provides an accurate estimate of the true density of states.
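The Wang–Landau recursion described above fits in a short routine. A minimal sketch for a system with a fixed energy binning; the flatness test and check interval are simplifications, and in practice the bins would be restricted to the target energy window $[E_{min}, E_{max}]$.

```python
import math
import random

def wang_landau(x0, energy, propose, bin_of, n_bins, ln_f_final=1e-8):
    """Wang-Landau estimate of ln g(E) on a fixed binning (Sect. 2.8.4)."""
    log_g = [0.0] * n_bins            # initially g(E_e) = 1
    ln_f = 1.0                        # modification factor f_0 = e
    x, e = x0, bin_of(energy(x0))
    while ln_f > ln_f_final:
        hist, steps = [0] * n_bins, 0
        while True:
            x_new = propose(x)
            e_new = bin_of(energy(x_new))
            if math.log(random.random()) < log_g[e] - log_g[e_new]:
                x, e = x_new, e_new
            log_g[e] += ln_f          # on-the-fly modification of g
            hist[e] += 1
            steps += 1
            if steps % 10000 == 0:    # periodic flatness check over visited bins
                visited = [h for h in hist if h > 0]
                if min(visited) > 0.8 * sum(visited) / len(visited):
                    break
        ln_f *= 0.5                   # f -> sqrt(f), i.e. ln f is halved
    return log_g
```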
The elegance and apparent simplicity of the WL-algorithm compared to other methods have led to a large variety of applications, including spin systems [753, 754], polymer models [561, 596, 640, 746], polymer films [331], fluids [725, 774], bio-molecules [217, 599] and quantum problems [727]. However, despite being generally regarded as very powerful, the algorithm suffers from a number of drawbacks and performance limitations [138, 413, 796], especially when applied to continuous systems [590]. These drawbacks stem from the difficulty of choosing the simulation parameters optimally, from dealing with the boundaries of the accessible conformational space, as well as from the practical discretization of the energy range [590]. The problems pertaining to the constant binning in the WL-approach have also been highlighted in recent work on the thermodynamics of Lennard-Jones clusters using nested sampling [562] (Sect. 2.3.2).
2.8.5 Metadynamics
In this sense, the method can be considered a learning algorithm for the multicanonical ensemble. Once a flat distribution has been realized, the original probability distribution $p_f(y)$ can be obtained from the weights as $p_f(y) \propto \omega_M(y)^{-1}$.
Metadynamics is more conveniently discussed in terms of the free energy surface, $F(y)$, defined as
$$e^{-\beta(F(y) - F)} = p_f(y), \qquad (2.62)$$
where $\beta$ is the inverse temperature of the original target distribution $p(\boldsymbol{x})$ and $F$ is the total free energy, here simply serving the role of a normalization constant. Similarly, Eq. 2.61 is expressed as
$$p_M(y) \propto e^{-\beta\left(F(y) + F_M(y)\right)},$$
where $F_M$ is the potential associated with the biasing weights of the metadynamics algorithm, $\exp(-\beta F_M(y)) = \omega_M(y)$. Metadynamics achieves a flat distribution in $\boldsymbol{f}$ by adding a Gaussian term to the existing biasing potential $F_M$ at regular time intervals $\tau$. For the one-dimensional case ($\boldsymbol{f} \to f$), the total biasing potential at time $t$ is then given by
$$F_M(y, t) = h \sum_{t' = \tau, 2\tau, 3\tau, \ldots} \exp\left(-\frac{\big(y - f(\boldsymbol{x}_{t'})\big)^2}{2\,\delta y^2}\right),$$
where $h$ and $\delta y$ are the height and width of the Gaussians [411, 412]. In an MD simulation the components of the forces coming from the biasing potential will discourage the system from revisiting the same spot. The same effect takes place in an MCMC simulation using the acceptance probability implied by the history-dependent weights, cf. Eq. 2.55,
$$a(\boldsymbol{x}'\mid\boldsymbol{x}) = \min\left\{1,\; \frac{\omega(\boldsymbol{x}')\,q(\boldsymbol{x}\mid\boldsymbol{x}')}{\omega(\boldsymbol{x})\,q(\boldsymbol{x}'\mid\boldsymbol{x})}\exp\Big(-\beta\big[F_M\big(f(\boldsymbol{x}')\big) - F_M\big(f(\boldsymbol{x})\big)\big]\Big)\right\}.$$
$$a\big(\{\boldsymbol{x}_s, \boldsymbol{x}_r\}\mid\{\boldsymbol{x}_r, \boldsymbol{x}_s\}\big) = \min\Big\{1,\; \exp\Big(\beta\big[F_{M,r}(\boldsymbol{x}_r) + F_{M,s}(\boldsymbol{x}_s) - F_{M,s}(\boldsymbol{x}_r) - F_{M,r}(\boldsymbol{x}_s)\big]\Big)\Big\}.$$
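A history-dependent bias of this kind is straightforward to prototype: store the centres of the deposited Gaussians and evaluate $F_M(y)$ as their sum. The sketch below shows the MCMC variant with the acceptance rule given above, for a symmetric proposal; `h`, `dy` and the deposition interval `tau` are tuning parameters, and `log_w` is an assumed user-supplied function returning the log target weights $\ln\omega(\boldsymbol{x})$.

```python
import math
import random

class MetaBias1D:
    """History-dependent bias F_M(y) built from Gaussians of height h, width dy."""
    def __init__(self, h, dy):
        self.h, self.dy, self.centres = h, dy, []

    def value(self, y):
        return sum(self.h * math.exp(-(y - c) ** 2 / (2.0 * self.dy ** 2))
                   for c in self.centres)

def metadynamics_mcmc(x, f, log_w, propose, beta, n_steps, tau, h, dy):
    """MCMC with a metadynamics bias along the collective variable y = f(x)."""
    bias = MetaBias1D(h, dy)
    for t in range(1, n_steps + 1):
        x_new = propose(x)                              # symmetric proposal assumed
        log_a = (log_w(x_new) - log_w(x)
                 - beta * (bias.value(f(x_new)) - bias.value(f(x))))
        if math.log(random.random()) < log_a:
            x = x_new
        if t % tau == 0:
            bias.centres.append(f(x))                   # deposit a new Gaussian
    return x, bias
```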
Muninn is a histogram-based method that aims at combining the merits of the non-Markovian approaches in terms of adaptiveness with the requirement of detailed balance. As opposed to other GE-learning algorithms, the estimates of the statistical weights $\Gamma$ are based on a maximum likelihood approach, which in principle allows one to combine all the information obtained from previous iterations. As detailed below, this approach also makes it possible to adapt the histogram bins according to the variations of $\ln(g_\nu)$ or $\ln(\omega_{E,GE})$ to ensure a uniform ensemble resolution [196]. Here, $\omega_{E,GE}$ refers to the target weights as defined by the multicanonical, $1/k$ or any other generalized ensemble.
from Eqs. 2.36 and 2.39 that the probability of visiting $E_e$ in simulation $i$ within a restricted energy domain $S_i$ is approximately given by
where $\Gamma(E_e) \approx g_\nu(E_e)\Delta E_e$ are the statistical weights associated with the reference state and $\chi_i$ is the indicator function for $S_i$,
$$\chi_i(E_e) = \begin{cases} 1 & \text{if } E_e \in S_i \\ 0 & \text{otherwise.} \end{cases}$$
Given that the $i$'th simulation is in equilibrium within the region $S_i$, the histogram $n_i$ will be a member of the multinomial probability distribution
where $N_i = \sum_e \chi_i(E_e)\,n_i(E_e)$ is the total number of counts within the region $S_i$ for the $i$'th simulation. Note that $n_i$ denotes all bins of the $i$'th histogram, while $n_i(E_e)$ only denotes the $e$'th bin. The probability of observing the full set of histograms is then given by
$$p\big(\{n_i\}_{i=1}^{M} \mid \Gamma\big) = \prod_{i=1}^{M} p_i(n_i \mid \Gamma). \qquad (2.63)$$
where $p(D\mid\Gamma)$ represents the likelihood, Eq. 2.63, and $p(\Gamma)$ represents the prior distribution for $\Gamma$. Choosing a constant prior, the most likely statistical weight function, $\hat{\Gamma}$, is obtained by maximizing Eq. 2.63 with respect to each $\Gamma(E_e)$. It can be shown [179, 180, 196] that this maximum likelihood estimator can be expressed as
$$\hat{\Gamma}(E_e) = \frac{\sum_{i=1}^{M} n_i(E_e)}{\sum_{i=1}^{M} \chi_i(E_e)\,N_i\,\omega_{E,i}(E_e)\,\hat{Z}_i^{-1}}. \qquad (2.65)$$
The partition sums $\hat{Z}_i$ must be estimated self-consistently from Eq. 2.65. This estimation problem can be formulated as finding the root of $M$ nonlinear equations in $\{\hat{Z}_i\}_{i=1}^{M}$, which in turn can be determined efficiently using the iterative Newton–Raphson method [180, 196]. This solution is then inserted into Eq. 2.65 to obtain the statistical weight estimates. While the multihistogram equations improve the estimates of the statistical weights compared to the single histogram method, Eq. 2.41, these estimators are still only defined up to an arbitrary scale. Consequently, one must rely on some choice of normalization procedure, as discussed in Sect. 2.5.3.
Equation 2.65 can be viewed as a generalization of the multi-histogram equations derived by Ferrenberg and Swendsen and used for combining simulations at different temperatures [703]. We shall therefore refer to Eq. 2.65 as the generalized multi-histogram (GMH) equations.
In fact, the GMH-equations can also be recast into the weighted histogram analysis method (WHAM) developed by Kumar and coworkers [409], which constitutes a powerful and popular approach for reconstructing free energy profiles along reaction coordinates from a set of equilibrium simulations. These equations arise by repeating the probabilistic reasoning above for the estimation of $p_f(y \in Y_e)$ from $M$ equilibrium distributions with weights $\omega_i(\boldsymbol{x}) = \omega_{f,i}\big(\boldsymbol{f}(\boldsymbol{x})\big)\,\omega(\boldsymbol{x})$, cf. Eq. 2.55:
$$\hat{p}_f(Y_e) = \frac{\sum_{i=1}^{M} n_i(y_e)}{\sum_{i=1}^{M} N_i\,\omega_{f,i}(y_e)\,\hat{Z}_i^{-1}}, \qquad \hat{Z}_i = \sum_e \hat{p}_f(Y_e)\,\omega_{f,i}(y_e). \qquad (2.67)$$
From a mathematical point of view the derivation of Eq. 2.65 only differs from that
of the multihistogram and WHAM-equations by the use of probability arguments
instead of a variance minimization technique. We shall retain the probabilistic
formulation as it generalizes more naturally into a Bayesian inference framework.
Note that the GMH/WHAM equations can be applied straightforwardly to multidimensional problems as well.
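The GMH/WHAM equations (2.65 and 2.67) can also be solved by direct self-consistent iteration; the Newton–Raphson formulation cited above converges faster but is longer to write down. A compact numpy sketch over binned histograms (in a production setting the sums would be carried out in log-space for numerical stability):

```python
import numpy as np

def gmh_solve(n, log_omega, chi, n_iter=500):
    """Self-consistent solution of Eq. 2.65.
    n[i, e]         : histogram counts of simulation i in bin e
    log_omega[i, e] : ln omega_{E,i}(E_e), the GE weights used in simulation i
    chi[i, e]       : 1 if bin e belongs to the supported region S_i, else 0"""
    M, L = n.shape
    N = (chi * n).sum(axis=1)                     # total counts N_i inside S_i
    log_Z = np.zeros(M)
    gamma = np.ones(L)
    for _ in range(n_iter):
        denom = (chi * N[:, None] * np.exp(log_omega - log_Z[:, None])).sum(axis=0)
        safe = np.where(denom > 0.0, denom, 1.0)
        gamma = np.where(denom > 0.0, n.sum(axis=0) / safe, gamma)
        # self-consistency: Z_i = sum_e Gamma(E_e) omega_{E,i}(E_e)
        log_Z = np.log((gamma[None, :] * np.exp(log_omega)).sum(axis=1))
    return gamma, log_Z          # Gamma is only defined up to one overall scale
```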
The reason for introducing the restricted regions, $S_i$, can be deduced from Eq. 2.65: due to the presence of the $\chi_i$ functions, the counts from simulation $i$ will only influence the overall estimate of the statistical weights at a given energy if this energy belongs to $S_i$. In the absence of a regularizing prior, these restrictions allow us to account in a simple manner for the sensitivity of the estimators to low statistics. Assuming that the histograms are sequentially ordered with the newest histogram first, the definition $S_i = \{E_e \mid \sum_{j=i}^{M} n_j(E_e) > \epsilon\}$ will ensure that bins are only included in the support of $\chi_i$ if the statistics accumulated at this point in the iteration scheme exceeds a given cut-off value, $\epsilon \simeq 20$–$30$. Muninn is insensitive to this choice provided $\epsilon \gg 1$.
The GMH-equations can be used to define an efficient iteration scheme for obtaining the generalized ensemble weights and the statistical weights. Applying Eq. 2.65 to the statistics obtained in the last $M-1$ iterations provides a much more accurate estimate of $\Gamma$ – and thus of the GE-weights for the next iteration – than the single ($M=1$) histogram method. This makes three improvements possible. First of all, it allows the weights to be changed more adaptively (using shorter simulation runs) in the spirit of the non-Markovian approaches, without compromising the requirement
Here, $T_k$ is the simulation time at iteration $k$, $\gamma > 1$ is a growth factor, and $S$ represents the domain where reliable estimates are readily available. $T_{k=1}$ can be set to some small constant times the total number of degrees of freedom, say $T_{k=1} \simeq 100\,D$ [180]. This scheme ensures fast weight changes while guaranteeing that the total statistics accumulated within the last $M$ histograms scales proportionally with the total simulation time in the limit of many iterations. The proportionality constant is given by $1 - \gamma^{-M}$ [196], which implies that the iteration scheme preserves $1 - 2^{-2} = 75\%$ of the statistics when the default values, $\gamma = 2^{1/10}$ and $M = 20$, are used. Consequently, the total simulation time $T$ will eventually satisfy $T \gg \tau_{conv}$, and the scaling of the accumulated statistics ensures that estimation errors decay as $\propto T^{-1/2}$ [196].
The second improvement related to Eq. 2.65 is that one can obtain reliable estimates of the slope $\alpha$ of the target log-weights $\ln[\omega_{GE}]$ by using the weights $\omega_k$ obtained from the latest iteration.
In other words, setting $r \approx 1$, the binning procedure will automatically ensure that the bin widths of the histogram technique are chosen appropriately according to the variation of the target weights $\omega_{E,GE}$, as required by Eq. 2.32. This procedure alleviates the deficiencies associated with a constant binning [562, 590], an advantage that Muninn shares with the nested sampling approach [562] (Sect. 2.3.2).
Finally, the "running" estimates of $\alpha(E_e)$ allow one to restrict the generalized ensemble sampling to a particular temperature window of interest [196], rather than to a particular energy window. For continuous systems it is most often easier to express the relevant region of state space in terms of temperatures rather than energies.
2.8.6.4 Summary
In summary, we have discussed how the main difficulties pertaining to the learning stage of generalized ensembles, as outlined in Sect. 2.8.1, are addressed in Muninn. This includes the estimation technique, the simulation time, the binning procedure and the assignment of weights to unobserved regions. The original algorithm was benchmarked on spin systems, where a marked improvement in accuracy compared to existing techniques, including transition-based methods and the WL-algorithm, was demonstrated, particularly for larger systems [180]. The method has subsequently been
applied to folding and design of simplified protein models [6–8, 276, 726], all-atom
protein and RNA structure determination from probabilistic models [69, 70, 197]
as well as from data obtained by small angle X-ray scattering [693] and nuclear
magnetic resonance [552]. A recent study clearly demonstrates that the method also
compares favorably to the WL-algorithm on protein folding studies and other hard
optimization problems [196], where order(s) of magnitude faster convergence is
observed.
While Muninn has not been applied to sampling along reaction coordinates, we would expect the algorithm to be equally efficient for these problems, due to the statistical equivalence between estimating $\Gamma$ and $p_f(y)$. The code is freely available at http://muninn.sourceforge.net.
We shall conclude this section by returning to the three main inference problems
outlined in Sects. 2.2.2 and 2.5, namely how to estimate expectation values, marginal
density distributions and partition functions from the sampling.
The basic premise of extended ensembles is that the sampling along the axis of
extension satisfies the dual purpose of enhancing the mixing of the Markov chain
as well as allowing the calculation of particular multivariate integrals of interest.
Assuming that the appropriate parameters of the extended ensembles have been learned, the inference problems outlined above can be solved from a sufficient number of samples, $T \gg \tau_{conv}$, from the equilibrium distribution.
along the coordinate $y = f(\boldsymbol{x})$. The metadynamics algorithm has been designed especially for this purpose and has been shown to be applicable to multivariate distributions as well [576]. Other generalized-ensemble algorithms, including those discussed in Sect. 2.8, could in principle be applied to this problem too, although it is still unclear how these various algorithms compare to each other in terms of accuracy, generality and efficiency. As discussed in Sect. 2.7.5, tempering-based methods are not suited for this task.
The calculation of partition function ratios, $Z/Z_\nu$, can be carried out by choosing $E$ as the axis of extension, where $E$ either refers to the physical energy function or to Eq. 2.9 for non-thermal problems. The energy extension is required in cases where $Z/Z_\nu$ is not of the order of unity; otherwise the standard MCMC approach will suffice [34]. More precisely, the estimation requires that the range of observed energies covers the bulk of the probability mass of both $p(\boldsymbol{x})$ and $\nu(\boldsymbol{x})$. The ratio of the partition functions can generally be obtained from the estimates of the statistical weights $\Gamma(E_e)$, where $\{E_e\}_{e=1}^{L}$ is some fine-grained partitioning of the observed energy range. In the parallel or simulated tempering approach, $\Gamma$ can be found from the multi-histogram equations (see Eq. 2.66) [703]. In the generalized ensemble approach, $\Gamma$ is given by the particular learning algorithm employed, as discussed in Sect. 2.8. While $\Gamma$ can only be estimated up to a constant factor, combining Eqs. 2.42 and 2.40 provides a unique estimate of the ratio $Z/Z_\nu$:
$$\frac{\hat{Z}}{\hat{Z}_\nu} = \frac{\sum_e \omega_E(E_e)\,\hat{\Gamma}(E_e)}{\sum_e \hat{\Gamma}(E_e)}. \qquad (2.68)$$
The weights, $\omega_E$, for the target distribution are defined by its associated inverse temperature, $\omega_E(E) = \exp(-\beta E)$, where $\beta = 1$ in the non-thermal case, as discussed in Sect. 2.2. As shown by Eq. 2.36, it now becomes straightforward to calculate the marginalized probability distribution over energies for any inverse temperature $\tilde{\beta}$ by reweighting,
$$p_{\tilde{\beta}}(E_e) = \frac{\Gamma(E_e)\exp(-\tilde{\beta}E_e)}{Z_{\tilde{\beta}}}, \qquad (2.69)$$
where
$$Z_{\tilde{\beta}} = \sum_e \Gamma(E_e)\exp(-\tilde{\beta}E_e). \qquad (2.70)$$
The only two prerequisites for this reweighting scheme are that the variation of the probability weights $\omega_E(E) = \exp(-\tilde{\beta}E)$ is small within each energy bin $\Delta E_e$, and that the bulk of the probability mass of $p_{\tilde{\beta}}$ lies within the observed energy range.
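Equations 2.68–2.70 reduce to a few weighted sums over the estimated statistical weights $\hat{\Gamma}(E_e)$; in log-space they are numerically robust. A small sketch:

```python
import numpy as np
from scipy.special import logsumexp

def reweighted_energy_distribution(log_gamma, E_bins, beta_tilde):
    """Marginal energy distribution p_beta~(E_e) by reweighting (Eqs. 2.69-2.70)."""
    log_p = log_gamma - beta_tilde * E_bins
    return np.exp(log_p - logsumexp(log_p))

def log_partition_ratio(log_gamma, E_bins, beta):
    """ln(Z/Z_nu) from Eq. 2.68, with Boltzmann target weights omega_E(E) = exp(-beta*E)."""
    return logsumexp(log_gamma - beta * E_bins) - logsumexp(log_gamma)
```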
Furthermore, the Helmholtz free energy $F(\beta)$ (see Table 2.1) and the thermodynamic entropy $S(\beta)$ can be evaluated from the partition function by using the relation $F = E_\beta[E] - TS$
where $\chi_{E_e}$ is the indicator function on $E_e$. Equation 2.75, which in a physical context represents the microcanonical average, can be estimated from the sampled states $\{\boldsymbol{x}_t\}_t$ as
$$\hat{E}_{p(\cdot\mid E_e)}[f(\boldsymbol{x})] = n(E_e)^{-1}\sum_t \chi_{E_e}\big(E(\boldsymbol{x}_t)\big)\,f(\boldsymbol{x}_t), \qquad (2.76)$$
where
$$n(E_e) = \sum_t \chi_{E_e}\big(E(\boldsymbol{x}_t)\big) \qquad (2.77)$$
is the total number of observed states belonging to $E_e$. Since $p(\boldsymbol{x}) = p(\boldsymbol{x}\mid E_e)\,p(E_e)$, the expectation value of $f$ with respect to $p(\boldsymbol{x})$ can be expressed as
$$E_{\boldsymbol{x}}[f(\boldsymbol{x})] = \sum_e E_{p(\cdot\mid E_e)}[f(\boldsymbol{x})]\,p(E_e) = \frac{1}{Z_\beta}\sum_e E_{p(\cdot\mid E_e)}[f(\boldsymbol{x})]\,\Gamma(E_e)\exp(-\beta E_e). \qquad (2.78)$$
Consequently, an estimate of $E_{p(\boldsymbol{x})}[f(\boldsymbol{x})]$ is obtained from $\hat{\Gamma}$ and Eq. 2.76.
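Equations 2.76–2.78 translate into a binning-and-averaging post-processing step over the sampled trajectory. A sketch, assuming arrays of sampled observable values, bin indices for the sampled energies, and the estimated $\ln\hat{\Gamma}(E_e)$; bins that were never visited are silently dropped, which is only harmless when the observed range covers the bulk of $p_\beta$:

```python
import numpy as np
from scipy.special import logsumexp

def canonical_expectation(f_vals, bin_idx, log_gamma, E_bins, beta):
    """Estimate E_p[f(x)] from generalized-ensemble samples (Eqs. 2.76-2.78)."""
    L = len(E_bins)
    micro = np.full(L, np.nan)                 # microcanonical averages, Eq. 2.76
    for e in range(L):
        sel = bin_idx == e
        if sel.any():
            micro[e] = f_vals[sel].mean()
    ok = ~np.isnan(micro)                      # keep only the bins that were visited
    log_p = log_gamma[ok] - beta * E_bins[ok]  # ln[Gamma(E_e) exp(-beta E_e)]
    p = np.exp(log_p - logsumexp(log_p))       # canonical bin probabilities p(E_e)
    return float((micro[ok] * p).sum())        # Eq. 2.78
```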
In this chapter we have focused on the MCMC method in general, and on the extended ensemble approach in particular, as a tool for inference in high-dimensional model systems described by a given (target) probability distribution $p(\boldsymbol{x})$. By introducing the general notion of a reference state (Sect. 2.2.1) we have aimed at presenting the use of these methods for inference in Bayesian models and thermal systems in a unified manner. This reference state $\nu(\boldsymbol{x})$ can be an alternative Bayesian model describing the same data, the prior distribution of a single Bayesian model, or simply the geometrical measure of the manifold $\Omega$, usually assumed to be Euclidean ($\omega_\nu = 1$). When the "energy" of a state is defined according to Eq. 2.9 for non-thermal models, the reference state will in all cases be associated with the inverse temperature $\tilde{\beta} = 0$ and the target distribution with the inverse temperature $\tilde{\beta} = \beta$. Using energy as the axis of extension, the extended ensemble approach facilitates a sampling which smoothly interpolates from $\tilde{\beta} = 0$ to $\tilde{\beta} = \beta$ while at the same time improving the mixing of the Markov chain (Sect. 2.7). This construction allows the calculation of partition function ratios used for evidence estimates, model averaging or thermodynamic potentials (Sect. 2.9), quantities which are usually not tractable by the standard MCMC procedure (Sects. 2.5 and 2.6). Obtaining the proper sample weights for the target distribution $p$ and estimating expectation values become simple post-processing steps (Sect. 2.9).
One noticeable drawback of the extended ensemble approach is the number of parameters which need to be learned before reliable inference can be made from the sampling. The popularity of the parallel tempering/replica exchange ensemble derives from the fact that this step is less involved than in the generalized ensemble approaches (Sect. 2.8). On the other hand, the Wang–Landau and metadynamics algorithms (Sects. 2.8.4 and 2.8.5) used for the multicanonical ensemble constitute a significant step forward with respect to the automation of this parameter learning, an advantage we believe Muninn shares with these non-Markovian approaches.
The nested sampling MC-method proposed by Skilling [666, 669] (Sect. 2.3.2) provides an interesting alternative to extended ensembles, because it only involves one essential parameter: the population size $K$. However, since nested sampling uses a sequence of strictly decreasing energies, the ergodicity and accuracy of the sampling have to be ensured by a sufficiently large choice of $K$. In this respect it is very different from the extended ensemble approach, where ergodicity is established by "seamless" transitions up and down the energy/temperature axis. How these two approaches compare in general awaits further studies.
As a concluding remark, it is an interesting curiosity that while Monte Carlo methods continue to play a central role in Bayesian modeling and inference [62], the method itself has never been subject to a full Bayesian treatment [272]. Rather than providing beliefs about quantities from the computations performed, MC-algorithms always lead to "frequentist" statistical estimators [272, 530, 598]. Rasmussen and Ghahramani have proposed a Bayesian MC methodology for calculating expectation values and partition function ratios using Gaussian processes as functional priors. This approach leads to superior performance compared to standard MC-estimators for small sample sizes [598]. However, in its current form the method does not extend easily to large samples, for computational reasons [598]. Skilling has argued that the nested sampling method is inherently Bayesian, because the reparameterization of the partition function, Eq. 2.54, imposes a definitive functional form on the estimated values $\hat{Z}$ during sampling, which in turn induces an unambiguous distribution for the estimates, $p(\hat{Z})$ [669]. One may tentatively argue that a general Bayesian treatment should include a prior distribution over $Z$ or – alternatively – over the statistical weights, $p(\Gamma)$. For a further discussion of the Bayesian aspects of nested sampling, we refer to Murray's dissertation [530].
One natural entry point to a Bayesian Monte Carlo methodology in the context of extended ensembles is the maximum-likelihood approach used in Muninn. Indeed, Eq. 2.64 is directly amenable to a full Bayesian treatment, where inference of the statistical weights is based on the posterior distribution $p(\Gamma\mid D)$ rather than on the likelihood function $p(D\mid\Gamma)$. In practice, this requires the formulation of a suitable mathematical framework, where general smoothness, scaling or other regularizing properties of $\Gamma$ can be incorporated in the prior $p(\Gamma)$, while at the same time allowing the posterior knowledge to be summarized in a tractable manner.
Part II
Energy Functions for Protein Structure
Prediction
Chapter 3
On the Physical Relevance and Statistical
Interpretation of Knowledge-Based Potentials
3.1 Introduction
In the concluding section, we present our recent work that uses simple Bayesian
reasoning to provide a self-consistent, quantitative and rigorous definition of KBPs
and the reference state, without introducing unwarranted assumptions such as
pairwise decomposability. Our work also elucidates how a proper definition of
the reference state relates to the intended application of the KBP, and extends the
scope of KBPs beyond pairwise distances. As summarized in Sect. 3.5, KBPs are in
general not related to physical potentials in any ordinary sense. However, this does
not preclude the possibility of defining and using them in a statistically rigorous
manner.
In this section a brief introduction to statistical physics and liquid state theory is
given, as most KBPs for protein structure prediction are based on concepts from
these areas. We summarize the theoretical background, and introduce the two main
concepts, the free energy and the potential of mean force, and show how they can
be related to the probability that a system adopts a specific configuration. The main
results are given by Eq. 3.9 relating physical free energies to the entropy and average
potential energy of a thermal system; Eq. 3.18 defining the potential of mean force
for liquid systems and Eq. 3.33 relating free energy differences to probability ratios.
The latter expression forms the basis for the construction of all KBPs.
¹ See [33] for a good presentation of Boltzmann statistics and the relation to information theory.
where
$$H(\boldsymbol{r}, \boldsymbol{p}) = E(\boldsymbol{r}) + \sum_i \frac{\boldsymbol{p}_i^2}{2m_i}. \qquad (3.2)$$
Here, $h$ is the Planck constant. Since this integral factorizes we can write $Z_{rp} = Z_r Z_p$, and
$$p_\beta(\boldsymbol{r}, \boldsymbol{p}) = p_\beta(\boldsymbol{r})\,p_\beta(\boldsymbol{p}) = \frac{e^{-\beta E(\boldsymbol{r})}\, e^{-\beta\sum_i \frac{\boldsymbol{p}_i^2}{2m_i}}}{Z_r\, Z_p}, \qquad (3.5)$$
where
$$Z_p = \int e^{-\beta\sum_i \frac{\boldsymbol{p}_i^2}{2m_i}} \prod_i \frac{d\boldsymbol{p}_i}{h^3} = \prod_i \frac{(2\pi m_i k_B T)^{3/2}}{h^3} \qquad\text{and}\qquad Z_r = \int_\Omega e^{-\beta E(\boldsymbol{r})}\, d\boldsymbol{r}. \qquad (3.6)$$
$$F = -k_B T \ln Z. \qquad (3.7)$$
This potential is a scalar function which represents the thermodynamic state of the system, as further discussed in Chap. 2. The entropy of the ensemble is defined as
$$S \equiv -k_B\, E_\beta\big[\ln p_\beta(\boldsymbol{r})\big] = -k_B \int_\Omega \frac{e^{-\beta E(\boldsymbol{r})}}{Z}\big(-\beta E(\boldsymbol{r}) - \ln Z\big)\, d\boldsymbol{r} = T^{-1}\big(E_\beta[E] - F\big), \qquad (3.8)$$
where $E_\beta[\cdot]$ represents the expectation value of a quantity with respect to the configurational Boltzmann distribution. Consequently, the free energy can be viewed as a sum of contributions from the average energy and the entropy,
$$F = E_\beta[E] - TS. \qquad (3.9)$$
According to the widely accepted Anfinsen hypothesis [11, 12], a protein sample under folding conditions is assumed to reach thermal equilibrium, and the state with the highest probability is the folded, or native, state. Protein structure prediction can thus be viewed as a search for the state with the highest probability. Free energies are in general difficult to compute, as information about the entire distribution of states is required. The development of new methods for calculating or estimating free energies from Monte Carlo or molecular dynamics simulations is an active field of research [386, 388, 728].
The potential of mean force was introduced in the twentieth century in theoretical studies of fluids [376]. Here, we follow the presentation given by McQuarrie [496]. We consider a system of $N$ particles that can be considered identical, for example atoms or rigid molecules, in a fixed volume $V$. As in the previous section, we omit the momenta of the particles (we assume that they can be integrated out), and represent a state as a point $\boldsymbol{r}$ in the $3N$-dimensional space defined by the $N$ atomic positions, $\boldsymbol{r} = \{\boldsymbol{r}_i\}$. With $d\boldsymbol{r} = d\boldsymbol{r}_1\, d\boldsymbol{r}_2 \cdots d\boldsymbol{r}_N$ we can write the probability of finding the system in a specific conformation as
where the factor in front accounts for the combinatorics of selecting the $n$ particles. For non-interacting particles, $E \equiv 0$, we have $Z = V^N$, and with $n \ll N$ we obtain
$$\tilde{\rho}^{(n)}(\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_n) = \frac{N!}{(N-n)!}\,\frac{V^{N-n}}{V^N} \approx \frac{N^n}{V^n} = \rho^n, \qquad E \equiv 0, \qquad (3.12)$$
where $\tilde{\rho}^{(n)}$ refers to the interaction-free distribution and $\rho = N/V$ is the average density of the system. The $n$-body correlation function, $g^{(n)}(\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_n)$, is now
The correlation function describes how the particles in the system are redistributed due to the interactions between them. Using Eq. 3.11, the correlation function can be written
$$g^{(n)}(\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_n) = \frac{V^n}{N^n}\,\frac{N!}{(N-n)!}\,\frac{\int\!\cdots\!\int e^{-\beta E(\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_N)}\, d\boldsymbol{r}_{n+1}\, d\boldsymbol{r}_{n+2}\cdots d\boldsymbol{r}_N}{Z}. \qquad (3.14)$$
The two-body correlation function $g^{(2)}(\boldsymbol{r}_1, \boldsymbol{r}_2)$ is of particular importance in liquid state physics. In a liquid consisting of spherically symmetric molecules, the two-body correlation function only depends on the pairwise distances $r_{ij} = |\boldsymbol{r}_i - \boldsymbol{r}_j|$. We omit the subscript and write the two-body radial distribution function as $g(r)$. For a liquid with density $\rho$, the probability of finding a particle at distance $r$, given that there is a particle at the origin, is $\rho\, g(r)\, dr$. Note that the integral of this probability is
$$\int \rho\, g(r)\, 4\pi r^2\, dr = N - 1 \approx N. \qquad (3.15)$$
At large distances the positions are uncorrelated, and thus $\lim_{r\to\infty} g(r) = 1$, whereas at small $r$ they are modulated by the inter-particle interactions. In Fig. 3.1 the radial distribution function is shown for a model liquid of hard spheres with diameter $\sigma$ and intermolecular potential $\Phi(r)$ [377],
$$\Phi(r) = \begin{cases} \infty & r < \sigma \\ 0 & r \geq \sigma. \end{cases} \qquad (3.16)$$
It is not possible to calculate the radial distribution function analytically even for this simple model system, but there are approximate solutions that yield results very close to those from computer simulations [789]. As seen in Fig. 3.1, the excluded volume of the spheres results in an increased density at close separation. The peak at $r = 2\sigma$ corresponds to the second coordination shell. For distances $r < \sigma$, the pair distribution function is identically zero.
In practice, the radial distribution function can be determined experimentally
using X-ray or neutron diffraction [107]. The radial distribution function obtained
from diffraction experiments is a superposition of the distributions from different
atomic species in the sample, which limits this approach to studies of simple liquids.
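For simulated configurations, $g(r)$ is estimated by histogramming all pairwise distances and dividing by the ideal-gas expectation $\rho\,4\pi r^2\Delta r$ per pair (cf. Eq. 3.15). A sketch for a set of particles in a volume $V$, ignoring periodic boundary conditions:

```python
import numpy as np

def radial_distribution(positions, volume, r_max, n_bins=100):
    """Estimate g(r) from an (N, 3) array of positions (no periodic images)."""
    N = len(positions)
    rho = N / volume
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))[np.triu_indices(N, k=1)]  # unique pairs
    counts, edges = np.histogram(dist, bins=n_bins, range=(0.0, r_max))
    r = 0.5 * (edges[1:] + edges[:-1])
    shell = 4.0 * np.pi * r ** 2 * np.diff(edges)       # spherical shell volumes
    expected = rho * shell * N / 2.0                    # ideal-gas pair counts
    return r, counts / expected
```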
Turning back to our $n$-body description of the system, we define a quantity $w^{(n)}(\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_n)$ as
Using Eq. 3.14, the gradient of $w^{(n)}(\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_n)$ with respect to the position of one of the molecules $j \in [1, n]$ can be expressed as
$$\nabla_{\boldsymbol{r}_j} w^{(n)}(\boldsymbol{r}_1, \ldots, \boldsymbol{r}_n) = \frac{\int e^{-\beta E}\, \nabla_{\boldsymbol{r}_j} E\; d\boldsymbol{r}_{n+1}\, d\boldsymbol{r}_{n+2}\cdots d\boldsymbol{r}_N}{\int e^{-\beta E}\; d\boldsymbol{r}_{n+1}\, d\boldsymbol{r}_{n+2}\cdots d\boldsymbol{r}_N} = E_{\beta,n}\big[\nabla_{\boldsymbol{r}_j} E\big](\boldsymbol{r}_1, \ldots, \boldsymbol{r}_n), \qquad (3.18)$$
where $E_{\beta,n}$ is the expectation value of a quantity for a fixed set of molecule positions $\boldsymbol{r}_1, \ldots, \boldsymbol{r}_n$. In other words, the gradient of $w^{(n)}(\boldsymbol{r}_1, \ldots, \boldsymbol{r}_n)$ with respect to one particle equals the expectation value of the gradient of the energy over the remaining $N - n$ particles. This corresponds to the force acting on that particle when the remaining particles are distributed according to the canonical distribution. Therefore, $w^{(n)}$ is called the potential of the mean force.
While Eq. 3.10 is useful in Monte Carlo simulations, it cannot be 'inverted': it is not possible to obtain the detailed energy potential, $E(\boldsymbol{r})$, by estimating $p(\boldsymbol{r})$ from observed frequencies, due to the high-dimensional nature of the configuration space $\Omega$. This inversion scheme can only be carried out by studying the probability of the system in some low-dimensional subspace, as a function of a few descriptors or coarse-grained variables, $\xi$.
The choice of descriptor could for example be the pairwise distances between atoms, the torsional angles or the radius of gyration of a protein molecule (see also
Sect. 3.3.2). It is defined as a function of the positions of the $N$ atoms of a state, e.g.,
$$\xi = \xi(\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_N), \qquad (3.19)$$
and corresponds to a hypersurface in conformational space. In the general case, $\xi$ may represent a $d$-dimensional function, $\xi = (\xi_1, \ldots, \xi_d)$, so the set
$$g_{ij} = \sum_{k=1}^{N} m_k\, \frac{\partial \boldsymbol{r}_k(\boldsymbol{q})}{\partial q_i}\cdot\frac{\partial \boldsymbol{r}_k(\boldsymbol{q})}{\partial q_j}, \qquad (3.23)$$
Therefore
$$p_\beta(\boldsymbol{q}, \boldsymbol{p}_q) = \frac{e^{-\beta H(\boldsymbol{q}, \boldsymbol{p}_q)}}{Z_{rp}} = \frac{e^{-\beta E(\boldsymbol{q}) - \frac{1}{2}\beta\sum_{ij}(p_q)_i\, g^{-1}_{ij}\,(p_q)_j}}{Z_{rp}}. \qquad (3.25)$$
The last expression on the right-hand side is obtained by expressing the linear momenta $\boldsymbol{p}_i = m_i\dot{\boldsymbol{r}}_i$ appearing in the original Hamiltonian, Eq. 3.1, as functions of the time derivatives of the generalized coordinates $\dot{q}_j$. From here we obtain
$$p_\beta(\xi_0) = \int p_\beta(\boldsymbol{s}, \xi_0, \boldsymbol{p}_q)\, \frac{d\boldsymbol{s}\, d\boldsymbol{p}_q}{h^{3N}} = Z_{rp}^{-1}\left(\frac{\sqrt{2\pi k_B T}}{h}\right)^{\!3N}\!\int e^{-\beta E(\boldsymbol{s}, \xi_0)}\, J(\boldsymbol{s}, \xi_0)\, d\boldsymbol{s}, \qquad (3.26)$$
where $J(\boldsymbol{s}, \xi_0)$ is a volume factor, $J = \sqrt{\det(g)}\,(\boldsymbol{s}, \xi_0)$, coming from the marginalization over the conjugate momenta. As opposed to the case where the configuration space is parameterized using atomic positions, this volume factor will in general depend on the coordinates, $\boldsymbol{q} = (\boldsymbol{s}, \xi_0)$, specifying the configuration. The extra contribution from $J$ is sometimes written in the form $J = \exp(V)$, where $V$ is called the Fixman potential [187, 188, 291, 403]. Taking the Fixman potential into account is important if a reparameterized potential is used, for instance for studying protein dynamics. Defining
$$Z(\xi_0) = e^{-\beta F(\xi_0)} = \int e^{-\beta E(\boldsymbol{s}, \xi_0)}\, J(\boldsymbol{s}, \xi_0)\, d\boldsymbol{s}, \qquad (3.27)$$
we can express the result, Eq. 3.26, in terms of the conditional probabilities
$$p_\beta(\boldsymbol{r}\mid\xi_0) = \begin{cases} \dfrac{e^{-\beta E(\boldsymbol{r})}\, J(\boldsymbol{r})}{Z(\xi_0)} & \text{if } \xi(\boldsymbol{r}) = \xi_0 \\ 0 & \text{otherwise,} \end{cases} \qquad (3.28)$$
where $E_{\beta,\xi} = E_{p_\beta(\boldsymbol{r}\mid\xi)}$ represents the expectation value over the $(3N - d)$-dimensional manifold $\Omega(\xi)$. In analogy with Eq. 3.9 we can write
$$F(\xi) = E_{\beta,\xi}[E](\xi) - T S(\xi), \qquad (3.30)$$
i.e., the free energy profile consists of contributions from both the average energy and the entropy profile, and it can therefore, strictly speaking, no longer be considered a potential of mean force.
The above derivation can conveniently be expressed as the following relation between marginalized probability distributions and free energies,
$$p_\beta(\xi) = \frac{Z(\xi)}{Z} = e^{-\beta(F(\xi) - F)} = e^{S(\xi) - \beta E_{\beta,\xi}[E] + \beta F}, \qquad e^{-\beta F} = \int e^{S(\xi) - \beta E_{\beta,\xi}[E]}\, d\xi, \qquad (3.31)$$
where $F$ is the free energy of the full system, as defined by Eq. 3.7. When $\xi$ is chosen to be a subset of atomic positions, $\xi = (\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_n)$, for a liquid system, the Fixman potential is identically zero and the entropy $S(\xi)$ is a constant. Hence $F(\xi)$ recovers its meaning as the potential of mean force, consistent with Eq. 3.18. Furthermore, if we set $E \equiv 0$, the corresponding distribution $\tilde{p}(\xi)$ becomes independent of temperature and equals the normalized density of states,
$$\tilde{p}(\xi) = e^{-\beta(\tilde{F}(\xi) - \tilde{F})} = \frac{e^{\tilde{S}(\xi)}}{\int e^{\tilde{S}(\xi)}\, d\xi}, \qquad E \equiv 0. \qquad (3.32)$$
Here, the $\tilde{F}$'s refer to the free energies in the interaction-free ensemble. Again, setting $\xi = (\boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_n)$ in the liquid system, this probability is simply given by Eq. 3.12. From Eqs. 3.13 and 3.17 we see that the potential of mean force $w^{(n)}$ can be viewed as the difference between the free energies of the two ensembles in the presence and absence of interactions, respectively.
In other words, the interaction-free or ideal-gas state serves as the reference state for $w^{(n)}$. Since both $F$ and $\tilde{F}$ are constants we can – without loss of generality – redefine $F(\xi)$ and $\tilde{F}(\xi)$ so that the above equation simplifies to
By analogy with the definition of the potential of mean force, one can for any choice of $\xi$ define the reference state for the free energies as the interaction-free ensemble. Consequently, we define
Note that $\tilde{F}(\xi)$ will in general be $\xi$-dependent. From Eqs. 3.31 and 3.32 this free energy difference can be expressed as the ratio of two probabilities
The final expression on the right-hand side represents the ratio between the observed statistics in the presence of interactions, $n_{obs}$, and the statistics expected in the absence of interactions, $n_{exp}$. This equation forms the basis for many KBPs that follow Sippl's seminal paper [661]. These KBPs are typically called "potentials of mean force". However, in Eq. 3.33 the reference state has a clear definition and refers to a system with the same degrees of freedom (e.g. molecules) taken at $E \equiv 0$ or – equivalently – $\beta = 0$. In contrast, KBPs are typically constructed by comparing statistics across different protein systems. We shall return to this and other problems with KBPs in the following sections.
used in many knowledge-based potentials [352, 621, 651, 661, 797]. In this case, the statistics required for calculating the potential are usually collected in the form of histograms with discrete distance bins. When calculating this kind of potential, some care must be taken in order to take finite-size effects into consideration; otherwise artifacts in the form of long-range repulsions between hydrophilic residues will result, as hydrophilic amino acids are over-represented on the surfaces of folded protein molecules. Another important consideration is to account for the correlations in the pair distributions which arise solely as a consequence of chemical bonds [110, 618].
For amino acids that are close in sequence, the possible relative inter-residue
distances are mainly determined by the chemical bonds connecting them. A
common approach to take these effects into account is to treat local and non-local
interactions separately [142, 661, 792]. Another way to achieve a higher accuracy is
to use variables that depend on the chemical environment, like solvent accessibility
[658], or to construct potentials that depend on orientation in addition to distance
[17, 87, 453, 518].
Many common structural motifs in proteins involve clusters of three or more
amino acids [360]. These motifs are often functionally important sites, such as
metal-binding pockets and active sites. For example, so-called zinc fingers often
consist of four amino acids in contact with a zinc ion [395]. In order to model
these structures, many-body interactions are required. Even though higher-order
interactions are more difficult to model rigorously due to the use of histograms, there
are several examples in the literature of many-body knowledge based potentials
[142, 178, 491]. In the future, probabilistic models that go beyond histograms will
undoubtedly aid the development of adequate higher-order KBPs.
While the connection between relative frequencies and free energy differences is
apparent from Eq. 3.33, this relation is only valid for two ensembles defined by
the same degrees of freedom. The database that is used for the construction of a
KBP will typically include both structural and sequential variations. Typically, the
database will be a set of known native structures from high-resolution data, and may
also include a number of decoys depending on the application. Consequently, in
order to obtain a general, transferable knowledge based potential (KBP) for protein
structures, we need to be able to find a relation between proteins with different
amino acid sequences. This is usually carried out by introducing a hypothetical
reference state which separates energy contributions that are independent of the
protein sequence, from the specific, sequence-dependent interactions [661]. Typi-
cally, variations in the experimental conditions such as temperature, pH and buffer
are averaged out in this approach. As for the construction of the potential of mean
force in liquid theory discussed in Sect. 3.2.3, the reference state is defined by the
statistics one would expect in the absence of the potential.
where the tilde refers to quantities calculated from the reference ensemble and $\delta(\cdot,\cdot)$ is the Kronecker delta. For consistency, we have also introduced the KBP, $E_s(s)$, as a function of any specific member $s$ of $S$. In most cases, this function will only be a function of the coarse-grained descriptors $\xi$, $E_s(s) = E_s(\xi(\boldsymbol{r}), \boldsymbol{a})$. Note that this 'energy' differs from the physical energy of the particular state $s$, as we have no a priori reason to assume that the structural and sequence variation in $S$ complies with the thermal/canonical distribution, a point we shall return to in Sect. 3.3.6. For the same reason, $\beta_s$ is merely a scaling factor which – however – is often set to the inverse thermodynamic temperature. The corresponding probabilities are
$$\tilde{p}_s(\xi, \boldsymbol{a}) = \frac{\tilde{Z}_s(\xi, \boldsymbol{a})}{\tilde{Z}_s} = \exp\big(-\beta_s(\tilde{F}_s(\xi, \boldsymbol{a}) - \tilde{F}_s)\big), \qquad p_s(\xi, \boldsymbol{a}) = \frac{Z_s(\xi, \boldsymbol{a})}{Z_s} = \exp\big(-\beta_s(F_s(\xi, \boldsymbol{a}) - F_s)\big). \qquad (3.36)$$
Consequently, we find
$$\ln\frac{p_s(\xi, \boldsymbol{a})}{\tilde{p}_s(\xi, \boldsymbol{a})} = -\beta_s\big[F_s(\xi, \boldsymbol{a}) - \tilde{F}_s(\xi, \boldsymbol{a}) + \tilde{F}_s - F_s\big]. \qquad (3.37)$$
It is common to neglect the last term, $\tilde{F}_s - F_s$, on the right-hand side, as it is independent of $\boldsymbol{a}$ and $\xi$, as discussed in Sect. 3.2.3. In analogy with Eq. 3.33, the KBP, $U_s$, is defined from the 'free energy' difference, $\Delta F_s(\xi, \boldsymbol{a}) = F_s(\xi, \boldsymbol{a}) - \tilde{F}_s(\xi, \boldsymbol{a})$, as
$$\beta_s U_s(\xi_\alpha, \boldsymbol{a}_\alpha) = \beta_s \Delta F_s(\xi_\alpha, \boldsymbol{a}_\alpha) = -\ln\frac{n_{obs}(\xi_\alpha, \boldsymbol{a}_\alpha)}{n_{exp}(\xi_\alpha, \boldsymbol{a}_\alpha)}. \qquad (3.40)$$
Using Eqs. 3.38 or 3.40, a knowledge based potential can be obtained from a set
of protein structures by extracting the distribution of the descriptor variable and
calculating the expected statistics for the reference set for each sequence. Different
choices of descriptors and reference states thus result in different potentials.
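In its simplest distance-dependent form, Eq. 3.40 is just minus the log-ratio of observed and expected bin counts. The sketch below builds such a potential for one residue-type pair; the `expected_counts` argument encodes the reference state (for instance counts pooled over all residue types, or a DFIRE-like $r^\alpha$ law), which is precisely where the published KBPs differ.

```python
import numpy as np

def knowledge_based_potential(observed_counts, expected_counts, pseudocount=1.0):
    """U(r_bin) = -ln[n_obs(r_bin)/n_exp(r_bin)] in units of 1/beta_s (Eq. 3.40).
    A pseudocount keeps sparsely populated bins finite; both histograms are
    normalised so that only the shapes of the distributions matter."""
    n_obs = np.asarray(observed_counts, float) + pseudocount
    n_exp = np.asarray(expected_counts, float) + pseudocount
    return -np.log((n_obs / n_obs.sum()) / (n_exp / n_exp.sum()))

# Hypothetical usage with a DFIRE-like reference, where expected counts grow as
# r**alpha within each distance bin (alpha ~ 1.61 for folded proteins, Sect. 3.4):
r_centres = np.linspace(0.75, 14.75, 29)
expected_dfire = r_centres ** 1.61
```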
Miyazawa and Jernigan (MJ) [517] chose $\xi$ to be the set of amino-acid contact pairs. They constructed the corresponding contact energies by considering a quasi-chemical equilibrium [260] between residues of species I and J, and solvent S,
$$\frac{\bar{n}_{ij}\,\bar{n}_{ss}}{\bar{n}_{is}\,\bar{n}_{js}} = e^{-e_{ij}}, \qquad (3.42)$$
where $\bar{n}_{kl}$ is the average number of contacts between species $k$ and $l$. In this case, the reference state is isolated and solvated residues, corresponding to an unfolded state. These energies are subsequently used in constructing a second set of contact energies, where the unfolded reference state is replaced with the average residue environment. The quasi-chemical equilibrium under consideration is
$$\text{I:R} + \text{J:R} \;\rightleftharpoons\; \text{I:J} + \text{R:R}, \qquad (3.43)$$
where R represents the average residue. Effective inter-residue contact energies are extracted through the relation
$$\frac{\bar{n}_{ij}\,\bar{n}_{rr}}{\bar{n}_{ir}\,\bar{n}_{jr}} = e^{-(e_{ij} + e_{rr} - e_{ir} - e_{jr})}. \qquad (3.44)$$
The energy $e_{ij} + e_{rr} - e_{ir} - e_{jr}$ is the energy difference between forming a specific contact I:J and forming contacts with the average environment, I:R + J:R. In the random mixing approximation, effects due to chain connectivity are neglected, so that the expected number of contacts $n_{ij}$ in the reference state depends only on the chemical composition of the protein. Although this may seem like a severe approximation, Skolnick et al. compared potentials obtained using the quasi-chemical approximation with the corresponding potentials using more physical reference states [671]. The reference states used in the comparison were the Gaussian random coil and a reference state based on threading each sequence onto a library of structures with a compactness similar to that of the native conformation. In both cases, the quasi-chemical approximation was shown to be very good. More
$$e_{ij}^{excess} = e_{ij} - \frac{e_{ii} + e_{jj}}{2}, \qquad (3.45)$$
which describes the difference between real proteins and ideal solutions of amino acids. Using this quantity, high correlations between different contact potentials could be demonstrated [233]. Betancourt and Thirumalai re-examined the relations between the MJ potential and the potentials obtained by Skolnick et al., and showed that the potentials were very similar if threonine was chosen as the reference solvent in the derivation of the MJ potential [59]. Li, Tang and Wingreen analyzed the MJ interaction matrix using eigenvalue decomposition to show that the major driving forces in protein folding are the hydrophobic effect and the force of demixing [439]. Other analyses of the MJ matrix have been carried out in order to partition the amino acids into groups with similar physicochemical characteristics [114, 652, 756].
Lu and Skolnick [452] constructed a statistical potential where the reference state was taken as
$$n_{i,j}^{exp}(r) = \chi_i\, \chi_j\, n_{obs}(r), \qquad (3.47)$$
where $\chi_i$ and $\chi_j$ are the mole fractions of species $i$ and $j$, respectively.
The reference state is by definition the conformation where the KBP has its zero point. Zhou and Zhou noted that for reference states where $n^{i,j}_{\mathrm{exp}}(r) \propto n_{\mathrm{obs}}(r)$, the zero point implies that the attractive and repulsive interactions of folded proteins average to zero. This observation led to the introduction of the distance-scaled finite ideal-gas reference state (DFIRE) [797]. The DFIRE reference state is based on the uniform distribution of non-interacting points in finite spheres. The number of atomic pairs separated by a distance $r$ is given by the uniform density of pairs times the volume of a spherical shell with radius $r$,
where $n_i$ ($n_j$) is the number of entities of species $i$ ($j$), and $V$ is the volume of the system. $\alpha$ is a parameter that reflects the finite volume of protein molecules and was estimated from a database of folded proteins. For a liquid, $\alpha = 2$, whereas for folded protein molecules it was found to be slightly lower, 1.61, which reflects the existence of cavities. Zhou and Zhou also introduced a cut-off distance $r_{\mathrm{cut}}$ so that the potentials $u(i, j, r) = 0$ for $r > r_{\mathrm{cut}}$. The approach was used to extract two statistical potentials, one all-atom potential and one based on the main chain plus Cβ atoms, and showed good discrimination of decoys, with slightly better results for the former [793, 797].
Shen and Sali used similar reasoning to develop a reference state based on the distribution of distances in spheres [651]. This reference state consists of uniformly distributed atoms in a sphere with the same radius of gyration and density as the native structure. The resulting statistical potential, the Discrete Optimized Protein Energy (DOPE), performs very well in decoy discrimination and is used as the energy function in the widely used modeling package MODELLER-8 [619, 651]. Fitzgerald et al. refined the DOPE potential further by reducing the model to a Cβ representation and by separating local from non-local interactions [186]. Rykunov and Fiser also recognized that the sizes of the protein molecules should be taken into account in order to avoid systematic errors in the statistical potentials, and constructed reference ensembles by shuffling the amino acid sequences of protein structures [618].
Most knowledge based potentials are derived following the approach outlined above. The different potentials are distinguished by the protein representation, the choice of reaction coordinates, the data sets used for extracting the potentials, and the choice of reference state. The latter is illustrated in Fig. 3.2, where three different potentials are generated using the same data, but with different reference states.
The database, $S$, used for the construction of the knowledge based potential, $U_s$, will invariably represent only a small fraction of the possible configurations any given sequence can adopt. A requirement which is often not met by these constructions is the self-consistency of the derived potentials. Self-consistency implies that the statistics obtained by sampling from $U_s$ match the statistics observed in $S$.
The question of self-consistency can be addressed by applying efficient sampling algorithms, such as Markov chain Monte Carlo (MCMC) sampling, as discussed in Chap. 2. This method can, for instance, be employed to sample the distribution of amino acids in some confined volume $V$ according to a given potential, so as to assess the self-consistency of, e.g., contact potentials. An alternative self-consistent approach for these potentials has recently been proposed [58], where
residue correlation effects are reduced based on a Taylor expansion of the number
of observed contacts.
Typically, the use of MCMC methods for assessing self-consistency has been limited to applications where the potentials are of a physical nature. Here, the statistical ensemble is given by the configuration space $\Omega(a)$ for the given amino-acid sequence, $a$. As outlined in Chap. 2, the MCMC method enables one to estimate
$$p_\beta(\xi \mid a) = Z^{-1}(a) \int_{\Omega(a)} \exp(-\beta E(\mathbf r, a))\, \delta(\xi - \xi(\mathbf r))\, d\mathbf r, \qquad (3.49)$$
where $E$ is a physical energy and $\beta$ is the inverse physical temperature.² Since the reference state corresponds to the distribution when $E \equiv 0$, we also have
$$\tilde p(\xi \mid a) = \tilde Z^{-1}(a) \int_{\Omega(a)} \delta(\xi - \xi(\mathbf r))\, d\mathbf r. \qquad (3.50)$$
Assuming that the potential is only a function of $\xi$, $E(\mathbf r, a) = U(\xi, a)$, Eq. 3.49 becomes
$$p_\beta(\xi \mid a) \propto \exp(-\beta U(\xi, a))\, \tilde p(\xi \mid a),$$
Fig. 3.2 Distance-dependent potentials between the side-chain centers of mass of valine and phenylalanine, using different reference states. A: reference state according to Eq. 3.46; B: average frequencies for valine and phenylalanine residues only; and C: $n_{\mathrm{exp}}(r) \propto r^{1.61}$, similar to [797]. The resulting scales were translated so that the zero points are at $r = 15.5$ Å in order to facilitate comparison. The data were extracted from a dataset of 4,978 structures at a resolution of 2 Å or better and with sequence similarities of 30% or less. The data set was obtained from the PISCES server [751]
² Here, we have for simplicity assumed the Fixman potential to be zero.
or, equivalently,
$$U(\xi, a) \propto -k_B T \ln \frac{p_\beta(\xi \mid a)}{\tilde p(\xi \mid a)}. \qquad (3.51)$$
However, since the potential is often derived under the erroneous assumption of decomposability and/or with inaccurate choices of the reference state, Eq. 3.40, self-consistency is not necessarily satisfied. On the other hand, the sampling approach can be extended to a scheme where the KB potential is refined iteratively to eventually ensure self-consistency. This so-called iterative Boltzmann inversion was used by Reith et al. for simulations of polymers [600], and in the context of protein folding by Thomas and Dill [716] and Májek and Elber [464]. Huang and Zhou used iterative Boltzmann inversion to calculate a statistical potential for protein-ligand interactions [319, 320]. The potential at iteration $i+1$ is calculated as
$$U_{i+1}(\xi, a) = U_i(\xi, a) - \beta^{-1} \ln \frac{p^{\mathrm{native}}(\xi \mid a)}{p_i(\xi \mid a)}. \qquad (3.52)$$
Starting with an initial $U_0$, the potential can be refined using Eq. 3.52 until convergence.
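The following sketch illustrates the refinement of Eq. 3.52 on a discretized descriptor. The callable `sample_distribution`, which returns the statistics generated by simulating with the current potential (for instance by MCMC), is an assumed user-supplied function, not part of the original formulation.

```python
import numpy as np

def iterative_boltzmann_inversion(p_native, sample_distribution, u0,
                                  beta=1.0, n_iter=20, tol=1e-3):
    """Refine a potential over a discretized descriptor xi until the
    statistics it generates match the target distribution (cf. Eq. 3.52).

    p_native            : normalized target histogram over xi
    sample_distribution : callable taking a potential U and returning the
                          normalized histogram over xi obtained by simulating
                          with U (assumed to be provided)
    u0                  : initial potential on the same grid
    """
    u = np.asarray(u0, dtype=float)
    for _ in range(n_iter):
        p_i = sample_distribution(u)               # statistics generated by U_i
        # U_{i+1} = U_i - (1/beta) ln(p_native / p_i)
        correction = -np.log(p_native / p_i) / beta
        u = u + correction
        if np.max(np.abs(correction)) < tol:       # stop once self-consistent
            break
    return u
```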
Mullinax and Noid have developed a method to extract a statistical potential, termed the generalized potential of mean force, that is designed to maximize
transferability, so that samples reproduce structural properties of multiple systems
(a so-called extended ensemble) [527]. Initial results for modeling protein structures
are promising [528].
Another MCMC approach to ensure self-consistency is to make use of Boltzmann learning, as proposed by Winther and Krogh [768]. Winther and Krogh used the submanifold of conformational space consisting of native conformations, $N$, as the structural descriptor, $\xi$, i.e.
$$\xi(\mathbf r) = \begin{cases} 1 & \text{if } \mathbf r \in N \\ 0 & \text{otherwise,} \end{cases} \qquad (3.53)$$
where the sum runs over all proteins in a training set. If the energy function is
differentiable, one can optimize by simple gradient ascent,
X
0 D C
r ln p .i D 1 j ai ; / ; (3.56)
i
where r is the gradient operator. Using Eq. 3.54, the second term can be written as
X h i h i
ˇ Eˇ r E .ri ; ai / Eˇ;i D1 r E .ri ; ai / ; (3.57)
i
where Eˇ Œ is the expectation value over the whole conformational space and
Eˇ;i D1 is the expectation value over the subspace that makes up the native
conformations of protein $i$. The drawbacks of this approach are that it requires a classification of the subspace of native structures $N_i$, and that it is computationally expensive to obtain expectation values over the entire conformational space. Hinton developed a computationally efficient approximate variant, called contrastive divergence [306]. This technique, which is presented in Chap. 5, was used by Podtelezhnikov et al. to model hydrogen bonds [586] and the packing of secondary structure [584].
Irrespective of the choice of statistical descriptors, $\xi$, the above discussion indicates that the requirement of self-consistency cannot be separated from the intended application of the KBPs. Indeed, when KBPs are used to sample plausible protein states, one must be aware of the details of the sampling scheme involved, a point we shall return to in Sect. 3.4.2.
Here, $F(\xi)$ is the typical (w.r.t. different sequences) physical free energy for the structural features $\xi$, and $\beta_s$ represents the inverse 'conformational temperature'. Here,
$$\ln \frac{p_\beta(N \mid a)}{p_\beta(U \mid a)} = -\beta\,\big(F_N(a) - F_U(a)\big)$$
represents the stability free energy of the native state. Using the $\delta$-approximation of $p_s(\xi)$ we get for the corresponding construction of the KBP
where
$$p_s(\xi \mid a) = \frac{p_s(\xi, a)}{p_s(a)}$$
represents the statistics obtained from S. The difference between Eqs. 3.59 and
3.60 elucidates the problems of identifying KBPs with physical free energies. Even
if one assumes that the structural features $\xi$ of $S$ (i.e. across different sequences) are representative for the conditional Boltzmann distribution, i.e. that $p_s(\xi \mid a) \approx p_\beta(\xi \mid N, a)$, Eq. 3.59 shows that one still needs to include the stability of the native
state as well as the statistical features of the unfolded state in order to fully translate
KBPs into physical free energies. We believe that these two points are too often
overlooked when physical concepts are brought into play to justify knowledge-based
approaches.
To conclude this section, there is no a priori reason to believe that a particular
KBP is connected to physical free energies. For some simple choices of $\xi$, the statistics observed among known protein structures, $S$, will presumably just reflect the overall restraint of native stability rather than other, more specific properties of $S$. In this case, the observed Boltzmann-like statistics should come as no surprise, as discussed above. However, irrespective of the choice of $\xi$, the relation between KBPs and free energies can in general only be assessed by using the KBPs in a full canonical calculation. This is required both to probe the statistics of non-native states and to expose the shortcomings of the typical assumption of decomposability, as in Eq. 3.40.
Comparing to Eq. 3.38, we can identify the negative log likelihood and negative log prior as the free energies of the observed and the reference state, respectively.
For the likelihood $p(a \mid \mathbf r)$, an unfounded but convenient assumption of amino acid pair independence leads to the expression:
$$p(a \mid \mathbf r) = \prod_{i,j} p(a_i, a_j \mid r_{ij}), \qquad (3.63)$$
where the product runs over all amino acid pairs. Bayes' theorem is then applied to the pairwise factors, which results in:
$$p(a_i, a_j \mid r_{ij}) = \frac{p(r_{ij} \mid a_i, a_j)}{p(r_{ij})}\, p(a_i, a_j) \;\propto\; \frac{p(r_{ij} \mid a_i, a_j)}{p(r_{ij})}. \qquad (3.64)$$
Combining this likelihood with Eq. 3.62 obviously results in the “potentials of mean
force” based on pairwise distances as proposed by Sippl [661].
The likelihood actually used in the Rosetta force field, $p(a \mid \mathbf r)$, is in turn of the form
$$p(a \mid \mathbf r) \approx \prod_i p(a_i \mid e_i) \prod_{i,j} \frac{p(a_i, a_j \mid e_i, e_j, r_{ij})}{p(a_i \mid e_i, e_j, r_{ij})\; p(a_j \mid e_i, e_j, r_{ij})}, \qquad (3.65)$$
where $r_{ij}$ is the distance between residues $i$ and $j$, and $e_i$ represents the environment of residue $i$ and depends on solvent exposure and secondary structure. The formal
justification of this expression is not discussed by Simons et al., but similar
expressions involving pairwise and univariate marginal distributions result from
the so-called Bethe approximation for joint probability distributions of graphical
models [274, 784].
While the Bayesian formulation by Simons et al. had a great impact on protein
structure prediction, the theory rests upon the assumption of pairwise decomposability and only provides a qualitative explanation. Furthermore, the need for applying Bayes' theorem to obtain Eq. 3.64 is quite unclear, since the expression $p(a_i, a_j \mid r_{ij})$
is already directly available. In the following, we present a Bayesian formulation of
knowledge based potentials that does not assume pairwise independence and also
provides a quantitative view. This method, called the reference ratio method, is also
discussed from a purely probabilistic point of view in Chap. 4. Our discussion has a
more physical flavor and builds on the concepts introduced earlier in this chapter. As
is also discussed in Chap. 4, we show that when a KBP is applied for the purpose of
sampling plausible protein conformations, the reference state is uniquely determined
by the proposal distribution [276] of the sampling scheme.
Assume that some prior distribution, $q(\mathbf r \mid a)$, is given. In protein structure prediction, $q(\mathbf r \mid a)$ is often embodied in a fragment library; in that case, $\mathbf r$ is a set of atomic coordinates obtained from assembling a set of polypeptide fragments. Of course, $q(\mathbf r \mid a)$ could also arise from a probabilistic model, a pool of known protein structures or any other conformational sampling method. Furthermore, we consider some probability distribution $p_s(\xi \mid a) = p(\xi \mid a)$, given from a set of known protein structures. Here, $\xi$ again represents some particular set of coarse-grained structural variables, which are deterministic functions of the structure, $\xi = \xi(\mathbf r)$.
Consequently, we can express the prior distribution for the joint variables $(\xi, \mathbf r)$ as³
$$\tilde q(\mathbf r, \xi \mid a) = \delta(\xi - \xi(\mathbf r))\, q(\mathbf r \mid a).$$
We consider the case where $\tilde q(\xi \mid a)$ differs substantially from the known $p(\xi \mid a)$; hence, $\tilde q(\xi \mid a)$ can be considered as incorrect. On the other hand, we also assume that the conditional distribution $q(\mathbf r \mid \xi, a)$ is essentially correct.
³ The tilde refers to the fact that $\tilde q$ corresponds to the reference state, as will be shown.
As only $p(\xi \mid a)$ is given, we need to make a reasonable choice for $p(\mathbf r \mid \xi, a)$. The choice most consistent with our prior knowledge is to set it equal to $q(\mathbf r \mid \xi, a)$, which leads to:
$$p(\mathbf r, \xi \mid a) = p(\xi \mid a)\, q(\mathbf r \mid \xi, a). \qquad (3.69)$$
In the next step, we apply the product formula of probability theory to obtain
$$q(\mathbf r \mid \xi, a) = \frac{\delta(\xi - \xi(\mathbf r))\, q(\mathbf r \mid a)}{\tilde q(\xi \mid a)}, \qquad (3.70)$$
and consequently,
$$p(\mathbf r, \xi \mid a) = p(\xi \mid a)\, \frac{\delta(\xi - \xi(\mathbf r))\, q(\mathbf r \mid a)}{\tilde q(\xi \mid a)}. \qquad (3.71)$$
Finally, we integrate out the, now redundant, coarse-grained variable $\xi$ from the expression:
$$p(\mathbf r \mid a) = \int p(\mathbf r, \xi \mid a)\, d\xi = \frac{p(\xi(\mathbf r) \mid a)}{\tilde q(\xi(\mathbf r) \mid a)}\, q(\mathbf r \mid a). \qquad (3.72)$$
The distribution $p(\mathbf r \mid a)$ clearly has the right marginal distribution $p(\xi \mid a)$. In fact, it can be shown that Eq. 3.72 minimizes the Kullback-Leibler divergence between $p(\mathbf r \mid a)$ and $q(\mathbf r \mid a)$ under the constraint of having the correct marginal distribution $p(\xi(\mathbf r))$ (see Chap. 4 for a proof). The influence of the fine-grained distribution $q(\mathbf r, \xi \mid a)$ is apparent in the fact that $p(\mathbf r \mid \xi, a)$ is equal to $q(\mathbf r \mid \xi, a)$. The ratio in this expression corresponds to the usual probabilistic formulation of a knowledge based potential, where the distribution $\tilde q(\xi \mid a)$ uniquely defines the reference state. We refer to this explicit construction as the reference ratio method [276].
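As an illustration, the correction of Eq. 3.72 can be applied by reweighting samples drawn from the prior $q(\mathbf r \mid a)$. In the minimal sketch below the reference distribution $\tilde q(\xi \mid a)$ is estimated by a simple histogram of the samples; the callables `xi` (the coarse-graining function) and `log_p_xi` (the target log-density over $\xi$) are assumed to be provided and are not part of the original method's notation.

```python
import numpy as np

def reference_ratio_weights(samples_r, xi, log_p_xi, n_bins=50):
    """Importance weights implementing Eq. 3.72 for samples drawn from q(r|a).

    samples_r : samples r drawn from the prior q(r|a)
    xi        : callable mapping a sample r to its coarse-grained value xi(r)
    log_p_xi  : callable returning log p(xi|a), the target coarse-grained pdf
    The reference q~(xi|a) is estimated here with a histogram; a kernel
    density estimate could be used instead.
    """
    xi_vals = np.array([xi(r) for r in samples_r])
    hist, edges = np.histogram(xi_vals, bins=n_bins, density=True)
    # Reference density q~(xi|a) evaluated at each sample's xi value.
    idx = np.clip(np.digitize(xi_vals, edges) - 1, 0, n_bins - 1)
    log_q_tilde = np.log(hist[idx] + 1e-12)
    log_w = np.array([log_p_xi(x) for x in xi_vals]) - log_q_tilde
    w = np.exp(log_w - log_w.max())
    return w / w.sum()          # normalized weights, p(r|a) proportional to w * q(r|a)
```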
It may be instructive to translate Eq. 3.72 into the 'energy' language. Suppose a prior energy is given, according to $\beta_s E_s(\mathbf r, a) = -\ln q(\mathbf r \mid a)$. Then Eq. 3.72 states that the new energy, $E_s'$, that correctly reproduces the observed statistics, $p(\xi \mid a)$, with minimal correction of the original energy function, is given as
$$\beta_s E_s'(\mathbf r, a) = \beta_s E_s(\mathbf r, a) + \beta_s U_s(\xi(\mathbf r), a),$$
where the KBP, $U_s$, is obtained from the difference of two 'free energies'
$$\beta_s U_s(\xi, a) = \beta_s \big[F_s(\xi, a) - \tilde F_s(\xi, a)\big], \qquad (3.73)$$
$$\beta_s F_s(\xi, a) = -\ln p(\xi \mid a), \qquad (3.74)$$
$$\beta_s \tilde F_s(\xi, a) = -\ln \int \delta(\xi - \xi(\mathbf r))\, \exp(-\beta_s E_s(\mathbf r, a))\, d\mathbf r. \qquad (3.75)$$
In other words, the reference free energy is the free energy associated with the prior energy function $E_s$. In the absence of prior knowledge, $E_s \equiv 0$, or equivalently if $q(\mathbf r \mid a)$ is constant, the reference state reduces to the interaction-free state discussed hitherto.
For cases where $\xi$ is of high dimensionality, it may become intractable to determine $\tilde q(\xi \mid a)$ directly. This problem can be overcome by applying Eq. 3.72 iteratively [276]. In the first iteration ($i = 0$), we simply set $\tilde q_{i=0}(\xi \mid a)$ equal to the uniform distribution. In iteration $i+1$, the distribution $p_i(\mathbf r \mid a)$ is improved using the samples generated in iteration $i$:
$$p_{i+1}(\mathbf r \mid a) = \frac{p(\xi(\mathbf r) \mid a)}{\tilde q_i(\xi(\mathbf r) \mid a)}\, p_i(\mathbf r \mid a), \qquad (3.76)$$
where $\tilde q_i(\xi \mid a)$ is estimated from the samples generated in the $i$-th iteration and $p_0(\mathbf r \mid a) = q(\mathbf r \mid a)$. With each iteration, the reference distribution $\tilde q_i(\xi \mid a)$ is estimated progressively more precisely.
In many applications, a KBP is used in conjunction with an MCMC-type sampling method, where putative conformational changes, $\mathbf r'$, are proposed according to some conditional distribution $q(\mathbf r' \mid \mathbf r)$, where $\mathbf r$ is the current state. Again, this proposal function may be derived from assembling fragments from a fragment library, a generative probabilistic model as discussed in Chap. 10, or some other valid sampling algorithm [276]. Setting the potential to zero, the sampling scheme will lead to some stationary distribution, $q(\mathbf r \mid a)$, which now implicitly represents the prior structural knowledge of the problem at hand. Again, Eq. 3.72 shows that the proper reference distribution for KBP construction in this case is obtained simply by calculating the marginal distribution, $\tilde q(\xi \mid a)$, implied by the proposal distribution alone.⁴
In a canonical sampling, the MCMC algorithm is constructed to ensure a uniform sampling, $q(\mathbf r \mid a) = \mathrm{const.}$, when the potential is set to zero, $E \equiv 0$. In this case, as discussed in Sect. 3.2.3, $\tilde q(\xi \mid a)$ will simply be the normalized density of states, $\tilde q(\xi \mid a) \propto e^{S(\xi)}$, for the given amino acid sequence, and Eq. 3.76 reduces to the iterative Boltzmann inversion, Eq. 3.52. Thus, the present probabilistic formulation demonstrates that the iterative Boltzmann inversion is necessitated by any choice of reference distribution that differs from the true density of states.
It is important to stress, however, that one does not have to insist that the
KBPs should correspond to physical free energies. Indeed, Eq. 3.72 shows that the
⁴ The specific requirement for this statement to be true in general is that $q(\mathbf r \mid a)$ satisfies the detailed balance equation $q(\mathbf r \mid a)\, q(\mathbf r' \mid \mathbf r) = q(\mathbf r' \mid a)\, q(\mathbf r \mid \mathbf r')$ [276].
KBP is uniquely defined for any well-defined choice of $\xi$ and prior distribution $q$, which may be defined either explicitly or implicitly through a given sampling procedure. This latter point deserves special attention, since different applications of KBPs, from threading and fold recognition to structure prediction, typically involve very different sampling methodologies. Therefore, we conclude that KBPs cannot in general be defined independently from their domain of application.
3.5 Summary
Knowledge based potentials (KBP) have proven to be surprisingly useful not only
for protein structure prediction but also for quality assessment, fold recognition
and threading, protein-ligand interactions, protein design and prediction of binding
affinities. The construction of KBPs is often loosely justified by analogy to the
potential of mean force in statistical physics. While the two constructions are
formally similar, it is often unclear how to define the proper reference state in
the knowledge-based approach. Furthermore, these constructs are typically based
on some unfounded assumptions of statistical independence, the implications of
which are far from trivial. Therefore, KBPs are in general neither real potentials,
free energies, nor potentials of mean force in the ordinary statistical mechanics sense.
If KBPs are intended to be used as physical potentials, they most often need to
be iteratively refined to ensure self-consistency within the canonical ensemble. This
can for instance be achieved by means of Markov chain Monte Carlo sampling.
However, KBPs have many practical applications which do not rely on their specific
link to physical energies. The fact that appropriate KBPs can be constructed from
any well-defined sampling procedure and any choice of coarse-grained variables, $\xi$, as shown in Sect. 3.4.2, opens up a wide range of possible applications based on
sound probabilistic reasoning [276].
The steady increase in available structural data will enable even more detailed
potentials, including for example detailed many-body effects or the presence of
metal ions and cofactors. However, the development of better potentials is not so
much limited by the amount of available data, as by better formulations that use the
data in a more efficient manner. This is an area where methods from the field of
machine learning are extremely promising [274], and it will indeed be interesting to
see whether the next generation of KBPs can take advantage of modern statistical
methods.
4.1 Introduction
The recently introduced reference ratio method [276] allows combining distribu-
tions over fine-grained variables with distributions over coarse-grained variables in
a meaningful way. This problem is a major bottleneck in the prediction, simulation
and design of protein structure and dynamics. Hamelryck et al. [276] introduced
the reference ratio method in this context, and showed that the method provides a
rigorous statistical explanation of the so called potentials of mean force (PMFs).
These potentials are widely used in protein structure prediction and simulation,
but their physical justification is highly disputed [32, 390, 715]. The reference ratio
method clarifies, justifies and extends the scope of these potentials.
In Chap. 3 the reference ratio method was discussed in the context of PMFs. As
the reference ratio method is of general relevance for statistical purposes, we present
the method here in a more general statistical setting, using the same notation as in
our previous paper on the subject [803]. Subsequently, we discuss two example
applications of the method. First, we present a simple educational example, where
the method is applied to independent normal distributions. Secondly, we reinterpret
an example originating from Hamelryck et al. [276]; in this example, the reference
ratio method is used to combine a detailed distribution over the dihedral angles of
a protein with a distribution that describes the compactness of the protein using
the radius of gyration. Finally, we outline the relation between the reference ratio
method and PMFs, and explain the origin of the name “reference ratio”. For clarity,
the formal definitions of the probability density functions used in the text, as well as
their assumed properties, are summarized in an appendix at the end of the chapter.
(iii) the pdf $g_1(y)$ of the coarse-grained variable $Y$ can be obtained from $g(\mathbf x)$, while (iv) the conditional pdf $g_2(\mathbf x \mid y)$ of $X$ given $Y$ is unknown and not easily obtained from $g(\mathbf x)$.
where $\hat f_1(y)$ and $\hat f_2(\mathbf x \mid y)$ denote, respectively, the distribution of the coarse-grained variable $Y$ and the conditional distribution of $\mathbf X$ given $Y$ for $\hat f(\cdot)$.
It would be straightforward to construct $\hat f(\mathbf x)$ if the conditional pdf $g_2(\mathbf x \mid y)$ were known. In particular, generation of samples would be efficient, since we could sample $\tilde y$ according to $f_1(\cdot)$ and subsequently sample $\tilde{\mathbf x}$ according to $g_2(\cdot \mid \tilde y)$, if efficient sampling procedures were available for the two distributions. However, as previously stated, $g_2(\mathbf x \mid y)$ is assumed unknown. An approximate solution for sampling could be to approximate the density $g_2(\mathbf x \mid y)$ by drawing a large number of samples according to $g(\mathbf x)$ and retaining those with the required value of $Y$. Obviously, this approach would be intractable for a large sample space. The solution to the
problem was given by Hamelryck et al. [276] and is summarized in the following
theorem.
Theorem 4.1. The conditions (v) and (vi) are satisfied for the pdf given by
$$\hat f(\mathbf x) = g(\mathbf x)\, \frac{f_1(y)}{g_1(y)}, \qquad (4.2)$$
where $y = m(\mathbf x)$.
Proof. First consider an arbitrary pdf $h(\mathbf x)$ of $\mathbf X$. Since $Y$ is a function of $\mathbf X$, we can express the density of $\mathbf X$ in terms of the pdf $h_1(y)$ of the coarse-grained variable $Y$ and the conditional pdf $h_2(\mathbf x \mid y)$ of $\mathbf X$ given $Y$ by
$$h(\mathbf x) = h_1(m(\mathbf x))\, h_2(\mathbf x \mid m(\mathbf x)). \qquad (4.3)$$
In particular, $\hat f(\mathbf x) = \hat f_1(m(\mathbf x))\, \hat f_2(\mathbf x \mid m(\mathbf x))$, where $\hat f_1(y)$ and $\hat f_2(\mathbf x \mid y)$ denote the pdf of $Y$ and the conditional pdf of $\mathbf X$ given $Y$ implied by $\hat f(\mathbf x)$. By inserting the desired pdfs from Eq. 4.1 in the expression above we obtain
$$\hat f(\mathbf x) = f_1(m(\mathbf x))\, g_2(\mathbf x \mid m(\mathbf x)) = g(\mathbf x)\, \frac{f_1(m(\mathbf x))}{g_1(m(\mathbf x))},$$
where we used Eq. 4.3 to expand the term $g_2(\mathbf x \mid m(\mathbf x))$. By construction $\hat f(\mathbf x)$ satisfies the conditions (v) and (vi). □
$$\hat h = \operatorname*{argmin}_{h \in D}\; \mathrm{KL}[g \,\|\, h]$$
$$= \operatorname*{argmin}_{h \in D}\; \int_{\mathbf x} g(\mathbf x) \log \frac{g(\mathbf x)}{h(\mathbf x)}\, d\mathbf x$$
$$= \operatorname*{argmin}_{h \in D}\; \int_y \int_{\mathbf x \in \{\tilde{\mathbf x} \mid m(\tilde{\mathbf x}) = y\}} g_1(y)\, g_2(\mathbf x \mid y) \left[\log \frac{g_1(y)}{h_1(y)} + \log \frac{g_2(\mathbf x \mid y)}{h_2(\mathbf x \mid y)}\right] d\mathbf x\, dy$$
$$= \operatorname*{argmin}_{h \in D}\; \Bigg[\underbrace{\int_y g_1(y) \log \frac{g_1(y)}{f_1(y)}\, dy}_{A} \;+\; \underbrace{\int_y g_1(y) \int_{\mathbf x \in \{\tilde{\mathbf x} \mid m(\tilde{\mathbf x}) = y\}} g_2(\mathbf x \mid y) \log \frac{g_2(\mathbf x \mid y)}{h_2(\mathbf x \mid y)}\, d\mathbf x\, dy}_{B}\Bigg],$$
where we have used $h_1(y) = f_1(y)$ in the term $A$. The first term, $A$, does not depend on $h$. It follows from Jensen's inequality [344] that the integral $B$ is non-negative, and consequently it attains the minimal value of zero only when $h_2 = g_2$. This means that the whole expression is minimized when the conditional pdf of $\mathbf X$ given $Y$ for $h$ is equal to $g_2$. As this conditional density is indeed equal to $g_2$ for the reference ratio density, this shows that the reference ratio density minimizes the Kullback-Leibler divergence to $g$. □
In the following sections we will present two applications of the reference ratio
method.
The purpose of our first example is purely educational. It is a simple toy example
based on independent normal distributions, which simplifies the functional form of
the pdfs involved. Let $\mathbf X = (X_1, X_2)$, where $X_1$ and $X_2$ are independent normals with
$$X_1 \sim \mathcal N(\mu, 1) \quad \text{and} \quad X_2 \sim \mathcal N(0, 1).$$
Under the approximate distribution $g(\cdot)$, both components are standard normal, so that
$$g(\mathbf x) = d\, e^{-\frac{1}{2}x_1^2 - \frac{1}{2}x_2^2},$$
where $d$ is the normalizing constant. Suppose that $Y = m(\mathbf X) = X_1$. This means that the pdf of $Y$ for $f(\cdot)$ is
$$f_1(y) = c'\, e^{-\frac{1}{2}(y - \mu)^2},$$
while the pdf of $Y$ implied by $g(\cdot)$ is $g_1(y) = d'\, e^{-\frac{1}{2}y^2}$, where $c'$ and $d'$ are the appropriate normalizing constants. Note that for both $f(\cdot)$ and $g(\cdot)$ the conditional density of $\mathbf X$ given $Y$ is the same and equal to the pdf of the normal distribution $\mathcal N(0, 1)$.
By applying the ratio method from Eq. 4.2, we obtain the expression
$$\hat f(\mathbf x) = \frac{c'\, e^{-\frac{1}{2}(x_1 - \mu)^2}\; d\, e^{-\frac{1}{2}x_1^2 - \frac{1}{2}x_2^2}}{d'\, e^{-\frac{1}{2}x_1^2}} = c\, e^{-\frac{1}{2}(x_1 - \mu)^2 - \frac{1}{2}x_2^2}. \qquad (4.5)$$
In this example we observe that $\hat f(\cdot) = f(\cdot)$, which is expected since the conditional distribution of $\mathbf X$ given $Y$ is the same for both $f(\cdot)$ and $g(\cdot)$. Accordingly, it is now trivial to check that the distribution of $Y$ for $\hat f(\cdot)$ is equal to $f_1(\cdot)$ and that the conditional distribution of $\mathbf X$ given $Y$ is $g_2(\mathbf x \mid y)$, as stated in (v) and (vi).
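The toy example can also be checked numerically by importance weighting: samples from $g(\mathbf x)$ reweighted by $f_1(y)/g_1(y)$ should reproduce the moments of $f(\mathbf x)$. The sketch below assumes $\mu = 1$ for concreteness and uses the closed-form normal densities.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0                                    # assumed mean of the desired marginal f1(y)

# Samples from g(x): both X1 and X2 standard normal (the 'wrong' marginal for Y = X1)
x = rng.normal(size=(100_000, 2))
y = x[:, 0]                                 # coarse-grained variable Y = m(X) = X1

# Reference ratio weights f1(y) / g1(y); both densities are known in closed form here.
log_w = (-0.5 * (y - mu) ** 2) - (-0.5 * y ** 2)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Importance-weighted moments under fhat(x) = g(x) f1(y) / g1(y)
mean_x1 = np.sum(w * x[:, 0])               # approximately mu, as predicted by Eq. 4.5
mean_x2 = np.sum(w * x[:, 1])               # approximately 0: the conditional g2(x|y) is untouched
print(mean_x1, mean_x2)
```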
In most relevant applications of the reference ratio method, the conditional
density f2 .xjy/ is unknown. In the next section we will consider such an example.
Fig. 4.1 An application of the reference ratio method. The purpose of this application is to sample protein structures with a given distribution over the radius of gyration and a plausible local structure. The desired radius of gyration is given by the pdf $f_1(y)$ (left box). In this example $f_1(y)$ is the normal distribution $\mathcal N(22\,\text{Å}, 4\,\text{Å}^2)$, but typically this distribution would be derived from known structures in the protein data bank (PDB) [45]. TORUSDBN is a distribution, $g(\mathbf x)$, over the sequence of dihedral angles, $\mathbf X$, in the protein main chain (right box). TORUSDBN describes protein structure on a local length scale. The distribution over the radius of gyration imposed by TORUSDBN is $g_1(y)$. The desired distribution over the radius, $f_1(y)$, and TORUSDBN, $g(\mathbf x)$, can be combined in a meaningful way using the reference ratio method (formula at the bottom). The reference ratio distribution, $\hat f(\mathbf x)$, has the desired distribution for the radius of gyration (The figure is adapted from Fig. 1 in Hamelryck et al. [276])
(b) The pdf, $g(\mathbf x)$, of the fine-grained variable $\mathbf X$ is given by TORUSDBN.
(c) Let $Y = m(\mathbf X)$ be the radius of gyration of the protein, and assume that the pdf $f_1(y)$ of $Y$ is the normal distribution $\mathcal N(22\,\text{Å}, 4\,\text{Å}^2)$.
(d) The density $g_1(y)$ of the coarse-grained variable is obtained by generalized ensemble sampling [180] from $g(\mathbf x)$ [276], which can be done since TORUSDBN is a generative model.
The reference ratio method is now applied to construct the density $\hat f(\cdot)$, based on the normal distribution over the radius of gyration, $f_1(y)$, the TORUSDBN distribution, $g(\mathbf x)$, and the distribution over the radius of gyration for TORUSDBN, $g_1(y)$. It is important to stress that typical samples generated from TORUSDBN, $g(\mathbf x)$, are unfolded and non-compact, while typical samples from $\hat f(\mathbf x)$ will be more compact, as the radius of gyration is controlled by the specified normal distribution. Accordingly, samples from the reference ratio distribution, $\hat f(\mathbf x)$, are expected to look more like folded structures than samples from $g(\mathbf x)$.
Hamelryck et al. [276] test this setup on the protein ubiquitin, which consists of 76 amino acids. Figure 4.2 shows the distribution over $Y$ obtained by sampling from $g(\mathbf x)$ and $\hat f(\mathbf x)$, respectively. The figure also shows the normal density $f_1(y)$. We observe that samples from $g(\mathbf x)$ have an average radius of gyration around 27 Å,
Fig. 4.2 The reference ratio method applied to sampling protein structures with a specified distribution over the radius of gyration ($Y$). The distribution over the radius of gyration $Y$ for samples from TORUSDBN, $g(\mathbf x)$, is shown as triangles, while the distribution for samples from the ratio distribution, $\hat f(\mathbf x)$, is shown as circles. The pdf, $f_1(y)$, for the desired normal distribution over $Y$, $\mathcal N(22\,\text{Å}, 4\,\text{Å}^2)$, is shown as a solid line. The samples are produced using the amino acid sequence of ubiquitin (The figure is adapted from Fig. 3 in Hamelryck et al. [276])
while samples from $\hat f(\mathbf x)$ indeed have a distribution very near $f_1(y)$. As expected, samples from $\hat f(\mathbf x)$ are compact, unlike samples from $g(\mathbf x)$.
A key question here is how to sample from $\hat f(\mathbf x)$ efficiently. From a generative point of view, we could use Eq. 4.4 directly and generate a sample, $\tilde{\mathbf x}$, using the two steps:
1. Sample $\tilde y$ according to $f_1(y)$, and
2. Sample $\tilde{\mathbf x}$ according to $g_2(\mathbf x \mid \tilde y)$.
However, the problem lies in step 2, as there is no efficient way to sample from $g_2(\mathbf x \mid y)$; TORUSDBN only allows efficient sampling from $g(\mathbf x)$. One could consider using rejection sampling or the Approximate Bayesian computation (ABC) method [29, 486, 591] for step 2, but either method would be very inefficient. Hamelryck et al. [276] have given a highly efficient method, which does not, in principle, involve any approximations. The idea is to use the Metropolis-Hastings algorithm with $g(\mathbf x)$ as proposal distribution and $\hat f(\mathbf x)$ as target distribution. In this case, the probability of accepting a proposed value $\mathbf x'$ given a previous value $\mathbf x$ becomes
$$\alpha(\mathbf x' \mid \mathbf x) = \min\!\left(1,\; \frac{f_1(y')\, g(\mathbf x')/g_1(y')}{f_1(y)\, g(\mathbf x)/g_1(y)} \cdot \frac{g(\mathbf x)}{g(\mathbf x')}\right) = \min\!\left(1,\; \frac{f_1(y')}{f_1(y)} \cdot \frac{g_1(y)}{g_1(y')}\right), \qquad (4.6)$$
where $y = m(\mathbf x)$ and $y' = m(\mathbf x')$. In practice, the proposal distribution in the MCMC algorithm would only change a randomly chosen consecutive subsequence of the dihedral angles.
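A minimal sketch of this sampler is given below. The callables `sample_g`, `m`, `log_f1` and `log_g1` are assumed to be provided (in the application above they would correspond to TORUSDBN, the radius of gyration, and the two densities over it); for simplicity the proposal resamples a complete conformation rather than a subsequence, which is exactly the case covered by the acceptance probability in Eq. 4.6.

```python
import numpy as np

def reference_ratio_mh(sample_g, m, log_f1, log_g1, n_steps, rng):
    """Independence Metropolis-Hastings sampler for fhat(x) (cf. Eq. 4.6).

    sample_g : callable drawing a proposal x' from g(x)
    m        : callable returning the coarse-grained value y = m(x)
    log_f1   : callable, log of the desired pdf over y
    log_g1   : callable, log of the pdf over y implied by g(x)
    Because proposals are drawn from g itself, g(x)/g(x') cancels and only
    the coarse-grained densities enter the acceptance probability.
    """
    x = sample_g()
    y = m(x)
    samples = []
    for _ in range(n_steps):
        x_new = sample_g()
        y_new = m(x_new)
        log_alpha = (log_f1(y_new) - log_g1(y_new)) - (log_f1(y) - log_g1(y))
        if np.log(rng.uniform()) < log_alpha:   # accept with probability min(1, alpha)
            x, y = x_new, y_new
        samples.append(x)
    return samples
```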
In the PMF approach, the pairwise-distance potential is defined as
$$W(\mathbf r) \propto -\log \frac{f_1(\mathbf r)}{g_1(\mathbf r)},$$
where $f_1(\mathbf r)$ is a pdf estimated from a database of known protein structures, and $g_1(\mathbf r)$ is the pdf for a so-called reference state. The reference state is typically defined based on physical considerations. The pdf $f_1(\mathbf r)$ is constructed by assuming that the individual pairwise distances are conditionally independent, which constitutes a crude approximation. In practice, the potential of mean force is combined with an additional energy function that is concerned with the local structure of proteins. This additional energy term is typically brought in via sampling from a fragment library [658] – a set of short fragments derived from experimental protein structures – or any other sampling method that generates protein-like conformations. From a statistical point of view, this means that the samples are generated according to the pdf
$$\hat f(\mathbf x) \propto g(\mathbf x)\, \frac{f_1(\mathbf r)}{g_1(\mathbf r)}, \qquad (4.7)$$
where $\mathbf x$ are the dihedral angles in the protein, $\mathbf r$ are the pairwise distances implied by $\mathbf x$, and $g(\mathbf x)$ is the pdf of the dihedral angles embodied in the sampling method.
In this formulation, it can be seen that PMFs are justified by the reference ratio
method; their functional form arises from the combination of the sampling method
(which concerns the fine-grained variable) with the pairwise distance information
(which concerns the coarse-grained variable). This interpretation of PMFs also
provides some surprising new insights. First, $g_1(\mathbf r)$ is uniquely defined by $g(\mathbf x)$, and
does not require any external physical considerations. Second, if the three involved
probability distributions are properly defined, the PMF approach is entirely rigorous
and statistically well justified. Third, the PMF approach generalizes beyond pairwise
distances to arbitrary coarse-grained variables. Fourth, PMFs should not be seen as
physical potentials, but rather as statistical constructs.
Obviously, the name “reference ratio” refers to the fact that we now understand
the nature of the reference state, and why the ratio arises in the first place. In
conclusion, the reference ratio method settles a dispute over the validity of PMFs
that has been going on for more than twenty years, and opens the way to efficient
and well-justified probabilistic models of protein structure.
4.6 Conclusions
The reference ratio method is important for the development of a tractable and
accurate probabilistic model of protein structure and sequence. Probabilistic models
such as those described in Chap. 10 are tractable and computationally efficient,
but only capture protein structure on a local length scale. It is important to point
out that the two properties are closely linked: these models are computationally
attractive because they only capture dependencies on a fairly local length scale.
The reference ratio method provides a convenient, mathematically rigorous and
computationally tractable way to “salvage” such models by including information
on nonlocal aspects of protein structure.
A surprising aspect of the reference ratio method is that it finally provides the
mathematical underpinnings for the knowledge based potentials that have been
widely used – and hotly debated – for more than twenty years. In addition, the
method opens the way to “potentials” that go beyond pairwise distances. We
illustrated this in this chapter by using the radius of gyration; in ref. [276] the
reference ratio method is also applied to hydrogen bonding. The latter application also illustrates how the method can be applied in an iterative way.
Finally, we would like to end by pointing out that the reference ratio method
could find wide applications in statistical modelling. The method reconciles the
advantages of relatively simple models that capture “local” dependencies with the
demand for capturing “nonlocal” dependencies as well. For example, the reference
ratio method makes it possible to correctly combine two priors that respectively
bring in information on fine- and coarse-grained variables. The method might thus
be applied to image recognition, document classification or statistical shape analysis
in general.
Acknowledgements The authors acknowledge funding by the Danish Research Council for
Technology and Production Sciences (FTP, project: Protein structure ensembles from mathematical
models, 09-066546).
Appendix
In this appendix, we give an overview of all the pdfs mentioned in the chapter, their formal definitions and their assumed properties. We also briefly outline the ratio method at the end.
We start by noting that $\mathbf X$ and $Y$ are random variables, with $Y = m(\mathbf X)$, where $m(\cdot)$ is a many-to-one function. The random variables $\mathbf X$ and $Y$ are called fine-grained and coarse-grained, respectively. Normally, the dimensionality of $Y$ is smaller than that of $\mathbf X$. To emphasize this, we will use a bold vector notation for $\mathbf X$ but not for $Y$, though $Y$ is often a vector as well.
The pdfs are:
• $f_1(y)$: the pdf over the coarse-grained variable $Y = m(\mathbf X)$. This pdf is assumed to be "true" for practical purposes. We assume that this pdf can be evaluated.
• $g(\mathbf x)$: the approximate pdf over $\mathbf X$. This pdf is assumed to be approximate, in the sense that its associated conditional pdf $g_2(\mathbf x \mid y)$ is "true" for practical purposes. However, its associated pdf $g_1(y)$ of the coarse-grained variable $Y$ differs from the desired pdf $f_1(y)$. One way to view $g(\mathbf x)$ is as approximately correct on a local scale, but incorrect on a global scale. The pdfs $g_1(y)$ and $g_2(\mathbf x \mid y)$ are defined below. We assume that $g(\mathbf x)$ can be simulated.
• $g_1(y)$: the pdf for $Y$ implied by $g(\mathbf x)$. We assume that this pdf can be correctly obtained from $g(\mathbf x)$, and can be evaluated. However, $g_1(y)$ is not "true" in the sense that it significantly differs from the desired pdf $f_1(y)$. The pdf $g_1(y)$ is given by
$$g_1(y) = \int_{\mathbf x \in \{\mathbf x' \mid m(\mathbf x') = y\}} g(\mathbf x)\, d\mathbf x.$$
• $g_2(\mathbf x \mid y)$: the conditional pdf of $\mathbf X$ given $Y$, as implied by $g(\mathbf x)$. This pdf is also assumed to be "true" for practical purposes. However, this pdf cannot be easily evaluated or simulated. Formally, the distribution is given by
$$g_2(\mathbf x \mid y) = \begin{cases} 0 & \text{if } y \neq m(\mathbf x) \\ \frac{1}{Z(y)}\, g(\mathbf x) & \text{if } y = m(\mathbf x). \end{cases}$$
• $f(\mathbf x)$: this pdf is unknown and needs to be constructed. This pdf is given by
$$f(\mathbf x) = f_1(m(\mathbf x))\, g_2(\mathbf x \mid m(\mathbf x)).$$
The pdf $f(\mathbf x)$ results from combining the correct distribution over $Y$ with the correct conditional over $\mathbf X$. As explained above, the problem is that $g_2(\mathbf x \mid y)$ cannot be easily evaluated or simulated. The reference ratio method re-formulates $f(\mathbf x)$ as
$$f(\mathbf x) = \frac{f_1(y)}{g_1(y)}\, g(\mathbf x),$$
where $y = m(\mathbf x)$. In this way, $f(\mathbf x)$ can be evaluated and simulated, because $g(\mathbf x)$ allows simulation and the pdfs in the ratio $f_1(y)/g_1(y)$ can be evaluated.
Part III
Directional Statistics for Biomolecular
Structure
Chapter 5
Inferring Knowledge Based Potentials Using
Contrastive Divergence
5.1 Introduction
Interactions between amino acids define how proteins fold and function. The
search for adequate potentials that can distinguish the native fold from misfolded
states still presents a formidable challenge. Since direct measurements of these
interactions are impossible, known native structures themselves have become the
best experimental evidence. Traditionally, empirical ‘knowledge-based’ statistical
potentials were proposed to describe such interactions from an observed ensemble
of known structures. This approach typically relies on some unfounded assumptions
of statistical independence as well as on the notion of a reference state defining
the expected statistics in the absence of interactions. We describe an alternative
approach, which uses a novel statistical machine learning methodology, called
contrastive divergence, to learn the parameters of statistical potentials from data,
thus inferring force constants, geometrical cut-offs and other structural parameters
from known structures. Contrastive divergence is intertwined with an efficient
Metropolis Monte Carlo procedure for sampling protein main chain conformations.
Applications of this approach have included a study of protein main chain hydrogen bonding, which yields results that are in quantitative agreement with experimental characteristics of hydrogen bonds. From a consideration of the requirements for
efficient and accurate reconstruction of secondary structural elements in the context
of protein structure prediction, we demonstrate the applicability of the framework
to the problem of reconstructing the overall protein fold for a number of commonly
studied small proteins, based on only predicted secondary structure and contact map.
5.2 Background
Knowledge based potentials are typically derived from simple statistical analysis
of the observed protein structures. CD learning is, on the other hand, a very powerful
technique to optimize the interaction parameters of an arbitrarily chosen protein
model. CD learning achieves thermodynamic stability of the model in agreement
with a set of observed protein structures. In this chapter, we describe a protein
model which features an all-atom representation of rigid peptide bonds elastically connected at Cα atoms [583, 585]. Amino acid side-chains are reduced to single Cβ atoms. CD learning relies heavily on the performance of a Monte Carlo (MC)
sampler. The reduced number of degrees of freedom in our model helps speed up
the simulations [385]. In addition, the conformations of just a few amino acids
are perturbed locally on each step, leaving the rest of the chain intact, which
increases the acceptance probability of the attempted moves and the efficiency of
the Metropolis MC procedure [172]. We model the polypeptide chain in Cartesian
space using efficient crankshaft moves [61].
The remainder of this chapter is structured as follows. Because of significant
overlap between the assumptions behind the CD learning approach and other
methodologies, we start with a brief overview of the previous chapter regarding
statistical potentials and native structure discriminants. We then describe the CD
learning approach and how it can be applied to a physically motivated Boltzmann
distribution. A critical comparison of the CD learning procedure with traditional
knowledge-based potentials is an important focus of this chapter. In the second half
of the chapter, we introduce our protein model and MC sampler and provide an
overview of the application of CD learning to protein interactions. In particular,
we investigate hydrogen bonding and the side-chain interactions that stabilize
secondary structure elements. We test our approach by reconstructing the tertiary
structure of two proteins: protein G and SH3 domain. We conclude by discussing
outstanding questions in protein energetics that can potentially be addressed by
contrastive divergence learning.
Fig. 5.1 Distribution of O-H distances and angles that characterize hydrogen bonding in globular
proteins. Distributions observed in helices and sheets are significantly different (Reproduced from
[392] with permission)
The statistical potential $U(\xi_i)$ is obtained as
$$U(\xi_i) = -k_B T \ln \frac{n_{\mathrm{obs}}(\xi_i)}{n_{\mathrm{exp}}(\xi_i)}, \qquad (5.1)$$
where $n_{\mathrm{obs}}(\xi_i)$ is the number of observations in crystal structures taken from the Protein Data Bank (PDB) [45], $n_{\mathrm{exp}}(\xi_i)$ is the expected (reference) number of observations, and $k_B T$ is the Boltzmann constant times the temperature. This
framework was first introduced by Pohl to analyze dihedral angle potentials using
a uniform reference [587]. It was later proposed to study residue and atomic
pairing frequencies [516, 706], with more detailed distance-dependent pairing
potentials introduced much later [661]. Subsequently, statistical potentials were
applied to residue contacts in protein complexes [519]. Figure 5.1 [392] illustrates
the application of statistical potentials to distance- and angle-dependent hydrogen
bonding.
Shortle [656] argues that the Boltzmann hypothesis represents a genuine evo-
lutionary equilibrium, which has maintained the stability of each protein within a
narrow set of values for these parameters. Bastolla et al. [24, 25] also proposed the
idea that protein sequences have evolved via a process of neutral evolution about
an optimal mean hydrophobicity profile. According to Jaynes [342], the maximum
entropy principle (MaxEnt) offers a way to choose a probability distribution for
which constraints imply that only the averages of certain functions (such as
the hydrophobicity profile) are known. In the present application these may be
considered to be evolutionary constraints on the protein sequence. Application of
the MaxEnt principle would then lead to a Boltzmann-like distribution, where the
partition function and “inverse temperature” are set to satisfy these constraints
[342, 457].
There are several objectionable assumptions underlying the standard knowledge-
based potentials, and their physical interpretation has been the subject of much
criticism and debate [32,276,715]. First of all, the ansatz of statistical independence
is generally very poor. Secondly, the reference state $n_{\mathrm{exp}}(\xi_i)$ lacks a rigorous definition. However, recent research has vindicated these potentials as approximations
of well-justified statistical quantities, clarified the nature of the reference state and
extended their scope beyond pairwise interactions [276]. We refer to Chaps. 3 and 4
for a thorough discussion of the statistical and physical interpretation and validity
of these potentials.
Unlike polypeptides with a random sequence, proteins have a unique native state
with energy much lower than any alternative conformation [11]. Furthermore,
protein sequences have evolved so that they quickly find their native state as if
following a funnel-like energy landscape [86]. According to the model proposed
by Rose and co-workers, on the folding pathways, protein segments rapidly twist
into compact ˛-helices, or accumulate into ˇ-sheets [21,22]. The entropy losses are
compensated by favorable short-range interactions when the secondary structure
elements are formed by amino acids meticulously joined in the sequence.
Figure 5.2 illustrates the folding funnel and a few selected meta-stable energy
levels. This is considered a fundamental property of proteins; any model system
attempting to mimic folding should mimic this property.
A simple model of a protein with sequence $a$ assumes that its energy is completely determined by the pairwise contacts between residues. In this model, the protein conformation is reduced to a binary contact map (a proximity matrix) between its $N$ residues, $\Gamma = \{C_{ij}\}$. The total energy is thus given by the following equation
$$E(\Gamma, \lambda) = \sum_{i=1}^{N} \sum_{j=1}^{i-1} E_{ij}, \quad \text{with } E_{ij} = C_{ij}\, \lambda_{a_i a_j}, \qquad (5.2)$$
where the interaction parameters $\lambda_{a_i a_j}$ are defined by the types of residues $a_i$ and $a_j$, giving the set of 210 parameters for 20 amino acid types. It is feasible that different adjustable parameter vectors, $\lambda = \{\lambda_1, \ldots, \lambda_{210}\}$, may render model proteins with different folding properties. The goal is to find the vector $\lambda$ that reproduces the fundamental characteristics of protein folding and gives the native state the lowest energy. The optimal parameter values, in this case, may reflect the relative strength of pairwise residue interactions. In general,
$$\lambda = \operatorname*{arg\,min}_{\lambda}\; \frac{E(\Gamma_0, \lambda) - D(\Gamma, \lambda)}{\delta E(\Gamma, \lambda)}, \qquad (5.3)$$
where $E(\Gamma_0, \lambda)$ is the energy of the native state $\Gamma_0$, $D(\Gamma, \lambda)$ is the energy of alternative (decoy) conformations, and $\delta E(\Gamma, \lambda)$ is a scaling factor. This approach
has been proposed in a number of studies since the early 1990s, with implementation
details largely driven by limited computer power. Instead of properly sampling
alternative conformations in continuous space, researchers either generated decoys
by threading the sequence onto the fixed main chain of other proteins [238, 463] or
simulated hypothetical proteins on a cubic lattice [289, 512]. The set of alternative
conformations was then reused throughout the optimization protocol.
In one implementation of this optimization, $D(\Gamma, \lambda)$ was estimated as the mean energy of all decoy conformations and the scaling factor $\delta E(\Gamma, \lambda)$ was equated to the standard deviation of the decoy energies [238, 289, 512]. In this approach Eq. 5.3 minimizes the negative Z-score of the native energy against the decoy energies. The solution draws heavily on the linear dependence of Eq. 5.2 with respect to the parameters. The numerator and the denominator in Eq. 5.3 can then be expressed as a dot product $\lambda \cdot \mathbf A$ and the square root of a quadratic form, $\sqrt{\lambda^{\mathsf T} \mathbf B\, \lambda}$. The vector $\mathbf A$ and the matrix $\mathbf B$ can be evaluated for any native structure and the corresponding set of decoys. This optimization is reminiscent of feed-forward neural network approximations and explicitly gives $\lambda = \mathbf B^{-1} \mathbf A$ [238]. Gradient descent schemes have also been deployed for this optimization [289].
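A sketch of this closed-form Z-score solution is given below, assuming that the native contact-type counts and a matrix of decoy contact-type counts are available; the small regularization added to the covariance matrix is an assumption made to keep the linear solve stable.

```python
import numpy as np

def zscore_optimal_parameters(native_counts, decoy_counts):
    """Contact parameters that optimize the Z-score gap (cf. Eq. 5.3).

    native_counts : (P,) array of contact-type counts for the native structure
                    (P = 210 for 20 amino acid types)
    decoy_counts  : (D, P) array, one row of contact-type counts per decoy
    For a linear energy E = lambda . counts, the optimum is proportional to
    B^{-1} A, with A the mean contact difference and B the decoy covariance.
    """
    a = decoy_counts.mean(axis=0) - native_counts   # gap to be maximized along lambda
    b = np.cov(decoy_counts, rowvar=False)
    b += 1e-8 * np.eye(b.shape[0])                  # regularize a near-singular covariance
    return np.linalg.solve(b, a)
```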
Another implementation sought to guarantee that the native energy is below any decoy energy [463, 522, 736]. As above, since the energy in Eq. 5.2 is a linear function of $\lambda$, this is equivalent to a number of simultaneous dot-product inequalities, $\lambda \cdot \mathbf A_K > 0$, where $\mathbf A_K$ is defined by the difference between the contact maps of the $K$th decoy and the native state. Each of these inequalities dissects the 210-dimensional parameter space with a hyperplane, resulting in only a small region of allowed parameters. It was recognized early that only a small number of inequalities define the region boundaries. This means that the native energy has to be compared to a few low-energy decoys that define $D(\Gamma, \lambda)$ in Eq. 5.3,
effectively maximizing the gap between the native state and the low-energy decoys.
The denominator $\delta E(\Gamma, \lambda)$ in this approach is kept constant and disregarded.
The parameter optimization in this approach was done either by linear programming [463, 522], support vector machine learning [317], or perceptron neural network learning algorithms [736]. The latter is an iterative scheme of cyclically presenting each $\mathbf A_K$ and updating the parameters if they appear to contradict the inequality, i.e. if $\lambda \cdot \mathbf A_K \not> 0$:
$$\lambda^{(i+1)} := \frac{\lambda^{(i)} + \eta\, \mathbf A_K}{\left|\lambda^{(i)} + \eta\, \mathbf A_K\right|}, \qquad (5.4)$$
where $\eta$ is a small positive learning rate. The procedure essentially moves the vector $\lambda$ towards the hyperplane along its normal $\mathbf A_K$ and normalizes it afterwards. The
procedure is supposed to converge when the inequalities for all decoys are satisfied.
Note that, in 210-dimensional space, the solution does not necessarily exist when the
number of inequalities is 210 or greater. Indeed, Vendruscolo and Domany [736]
found that this procedure did not always converge, especially if the training set
contained several hundreds of decoys. The pairwise contact potentials are, therefore,
demonstrably insufficient (rather than unsuitable) to accurately discriminate the
native state. The later inclusion of solvent accessibility surface and secondary
structure information in a neural network framework greatly improved the accuracy
of discrimination [750].
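A minimal sketch of the perceptron scheme of Eq. 5.4, assuming the difference vectors A_K between decoy and native contact counts have been precomputed, might look as follows.

```python
import numpy as np

def perceptron_contact_parameters(lam0, decoy_contact_diffs, eta=0.01, n_epochs=100):
    """Perceptron-style learning of contact parameters (cf. Eq. 5.4).

    lam0                : initial parameter vector (length 210)
    decoy_contact_diffs : iterable of vectors A_K, the difference between the
                          contact-type counts of decoy K and of the native state
    The parameters are nudged whenever a decoy violates lambda . A_K > 0 and
    are renormalized after every update.
    """
    lam = np.asarray(lam0, dtype=float)
    for _ in range(n_epochs):
        n_violations = 0
        for a_k in decoy_contact_diffs:
            if np.dot(lam, a_k) <= 0:             # inequality violated by this decoy
                lam = lam + eta * a_k
                lam /= np.linalg.norm(lam)        # keep |lambda| = 1
                n_violations += 1
        if n_violations == 0:                     # all inequalities satisfied
            break
    return lam
```

As noted above, the loop need not terminate when the inequalities cannot all be satisfied simultaneously, which is why a maximum number of epochs is imposed in this sketch.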
In the discussion so far we have given Eq. 5.3 and outlined the optimization
procedures in applications to a single native state. For a dataset of a few hundred proteins the solution is obtained by either cycling through different proteins in
iterative schemes or averaging the final results for individual proteins. Parameter
optimization was also performed for more complex models of protein interactions,
where the total energy was not necessarily a linear function of parameters, or
contained a distance dependence instead of using a binary contact definition. In
this case, gradient ascent methodologies were used to find the optimal energy
discriminant [506, 611].
Finally, we note that in describing native structure discriminant methods we never
mentioned the thermal energy, kB T . Fundamentally, discriminant analysis does not
address the question of thermodynamic stability of proteins. The parameters that
are optimal as native structure discriminants may only reflect the relative strength
of interactions.
In equilibrium, the conformations of the model follow the Boltzmann distribution,
$$p(\Gamma \mid \lambda) = \frac{1}{Z(\lambda)}\, \exp\left[-E(\Gamma, \lambda)\right], \qquad Z(\lambda) = \int d\Gamma\, \exp\left[-E(\Gamma, \lambda)\right], \qquad (5.5)$$
where $Z(\lambda)$ is the partition function. Here, the energy is expressed in units of $k_B T$.
Assuming that $\Gamma_0$ is an observed conformation with energy near the minimum, the inverse problem of estimating the values of the parameters, $\lambda$, can be solved by maximum likelihood (ML) optimization using the gradient ascent method [223]:
$$\lambda^{(i+1)} := \lambda^{(i)} + \eta \frac{\partial}{\partial \lambda} \ln p(\lambda \mid \Gamma_0) = \lambda^{(i)} + \eta \left[ -\frac{\partial \ln Z(\lambda)}{\partial \lambda} - \frac{\partial E(\Gamma_0, \lambda)}{\partial \lambda} \right], \qquad (5.6)$$
where we used the Bayesian equality $p(\lambda \mid \Gamma_0)\, p(\Gamma_0) = p(\Gamma_0 \mid \lambda)\, p(\lambda)$ and differentiated Eq. 5.5, disregarding the prior $p(\lambda)$. A positive learning rate $\eta$ needs to be small enough for the algorithm to converge. In general, the first term in the square brackets is equal to the expectation value of the energy gradient with respect to the parameters, $\lambda$,
$$\frac{\partial \ln Z(\lambda)}{\partial \lambda} = \frac{1}{Z(\lambda)} \frac{\partial Z(\lambda)}{\partial \lambda} = -\frac{1}{Z(\lambda)} \int d\Gamma\, \frac{\partial E(\Gamma, \lambda)}{\partial \lambda}\, \exp\left[-E(\Gamma, \lambda)\right] = -\left\langle \frac{\partial E(\Gamma, \lambda)}{\partial \lambda} \right\rangle_{\!\infty}. \qquad (5.7)$$
Here, $\langle \cdot \rangle_\infty = \mathrm E_p(\cdot)$ is the expectation of a quantity with respect to the equilibrium distribution $p$, corresponding to an infinitely long sampling time. After substituting Eq. 5.7 into Eq. 5.6 we obtain the generalized Boltzmann machine learning rule [307]:
$$\lambda^{(i+1)} := \lambda^{(i)} + \eta \frac{\partial}{\partial \lambda} \ln p(\lambda \mid \Gamma_0) = \lambda^{(i)} + \eta \left[ \left\langle \frac{\partial E(\Gamma, \lambda)}{\partial \lambda} \right\rangle_{\!\infty} - \frac{\partial E(\Gamma_0, \lambda)}{\partial \lambda} \right]. \qquad (5.8)$$
Equation 5.8 can be generalized to the case of multiple observed protein structures, $\Gamma_0$. From information theory, ML optimization by gradient ascent follows the gradient of, and minimizes, the Kullback-Leibler divergence,
$$\mathrm{KL}\left[p(\Gamma_0) \,\|\, p(\Gamma_0; \lambda)\right] = \sum_{\Gamma_0} p(\Gamma_0) \ln \frac{p(\Gamma_0)}{p(\Gamma_0; \lambda)}, \qquad (5.9)$$
which reflects the difference between the model distribution $p(\Gamma_0; \lambda)$ and the distribution of observations $p(\Gamma_0)$. Differentiating this equation produces essentially the same result,
$$\lambda^{(i+1)} := \lambda^{(i)} - \eta \frac{\partial}{\partial \lambda} \mathrm{KL}\left[p(\Gamma_0) \,\|\, p(\Gamma_0; \lambda)\right] = \lambda^{(i)} + \eta \left[ \left\langle \frac{\partial E(\Gamma, \lambda)}{\partial \lambda} \right\rangle_{\!\infty} - \left\langle \frac{\partial E(\Gamma, \lambda)}{\partial \lambda} \right\rangle_{\!0} \right], \qquad (5.10)$$
where the subscript 0 signifies averaging over the dataset of originally observed
conformations. This expression is as attractive as it is difficult to use in practice,
because evaluating the expectation value of the energy gradient may require
tremendously long equilibrium simulations. There was, however, a recent attempt
to undertake such a feat [768].
Instead of extensively sampling conformations to determine the expectation value, Hinton [306] proposed an approximate ML algorithm called contrastive divergence (CD) learning. The intuition behind CD learning is that it is not necessary to run the MCMC simulations to equilibrium. Instead, after just a few steps, we should notice that the conformations start to diverge from the initial distribution. Iterative updates to the model interaction parameters will eventually reduce the tendency of the chain to leave the initial distribution. CD learning follows the gradient of a difference between two KL divergences,
$$\mathrm{CD}_K = \mathrm{KL}\left[p(\Gamma_0) \,\|\, p(\Gamma; \lambda)\right] - \mathrm{KL}\left[p(\Gamma_K) \,\|\, p(\Gamma; \lambda)\right]. \qquad (5.11)$$
To obtain $\Gamma_K$ in this expression, the original conformations are perturbed in the field of the model potentials $E(\Gamma, \lambda)$ using a $K$-step MCMC procedure. The CD learning rule can be obtained by differentiating Eq. 5.11:
$$\lambda^{(i+1)} := \lambda^{(i)} - \eta \frac{\partial}{\partial \lambda} \mathrm{CD}_K = \lambda^{(i)} + \eta \left[ \left\langle \frac{\partial E(\Gamma, \lambda)}{\partial \lambda} \right\rangle_{\!K} - \left\langle \frac{\partial E(\Gamma, \lambda)}{\partial \lambda} \right\rangle_{\!0} \right]. \qquad (5.12)$$
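A single update of the form of Eq. 5.12 can be sketched as follows. The callables `grad_E` (the analytic energy gradient with respect to the parameters) and `mcmc_k_steps` (a K-step Metropolis perturbation of one conformation under the current parameters) are assumed to be supplied; they stand in for the model-specific machinery described later in this chapter.

```python
import numpy as np

def cd_update(lam, native_confs, grad_E, mcmc_k_steps, eta=0.001):
    """One contrastive-divergence parameter update (cf. Eq. 5.12).

    lam          : current interaction parameters
    native_confs : list of observed (native) conformations
    grad_E       : callable grad_E(conf, lam) -> dE/dlam for one conformation
    mcmc_k_steps : callable running K Metropolis steps from a conformation
                   under the energy defined by lam, returning the perturbed
                   conformation
    """
    grad_data = np.mean([grad_E(c, lam) for c in native_confs], axis=0)
    perturbed = [mcmc_k_steps(c, lam) for c in native_confs]
    grad_model = np.mean([grad_E(c, lam) for c in perturbed], axis=0)
    # lam := lam + eta * ( <dE/dlam>_K - <dE/dlam>_0 )
    return lam + eta * (grad_model - grad_data)
```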
The Boltzmann distribution defines the probability $p(\mathbf R, a)$ that a protein sequence $a$ adopts a conformation $\mathbf R$. This probability can be factorized into the product of the sequence likelihood for a given conformation and the prior distribution of conformations, $p(\mathbf R, a) = p(a \mid \mathbf R)\, p(\mathbf R)$. This can be rewritten in energetic terms
Fig. 5.3 (a) Hydrogen bond geometry. The distance and two angular parameters of hydrogen bonds are shown. (b) Schematic one-dimensional approximation of the hydrogen bond energy with a square-well potential. This approximation sharply discriminates between strong and weak hydrogen bonds. Weak bonds do not contribute to the total energy and are dropped from consideration in this work. The hydrogen bond strength $\varepsilon_H$ corresponds to an average strength of hydrogen bonds (Reproduced from [586] with permission)
where $r(\mathrm{O}, \mathrm{H})$ is the distance between oxygen and hydrogen, and the symbol $\angle$ denotes the angle between the three atoms (see Fig. 5.3a). The lower bound on the separation between the atoms ($r(\mathrm{O}, \mathrm{H}) > 1.8$ Å) was implicitly set by the hard-sphere collision between oxygen and nitrogen. We used the same hydrogen bond potential regardless of the secondary structure adopted by the peptide main chain. The energy of the hydrogen bond (Fig. 5.3b) was described by a square-well potential,
$$E^{\mathrm{HB}}_{ij} = -n_h\, \varepsilon_H, \qquad (5.14)$$
where $\varepsilon_H$ is the strength of each hydrogen bond, and $n_h$ is the number of hydrogen bonds between the amino acids $i$ and $j$. Determining the strength of the hydrogen bonds, $\varepsilon_H$, as well as the three cutoff parameters (one distance and two angular cutoffs), is the task of the CD-learning procedure.
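The square-well energy of Eq. 5.14, together with geometric cutoffs of the kind shown in Fig. 5.3a, can be sketched as below. The specific cutoff values and angle names used here are placeholders chosen for illustration; in the chapter these quantities are exactly what contrastive divergence is asked to learn.

```python
import numpy as np

def hb_square_well(r_oh, angle1, angle2, eps_h,
                   r_max=2.5,
                   ang1_min=np.deg2rad(120.0),
                   ang2_min=np.deg2rad(110.0)):
    """Square-well hydrogen-bond energy (cf. Eq. 5.14 and Fig. 5.3).

    A bond contributes -eps_h when the O...H distance and both angular
    parameters fall inside the allowed region; otherwise it contributes
    nothing. The lower distance bound of 1.8 A reflects the hard-sphere
    collision mentioned in the text; the other cutoffs are placeholders.
    """
    strong = (1.8 < r_oh < r_max) and (angle1 > ang1_min) and (angle2 > ang2_min)
    return -eps_h if strong else 0.0
```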
The sequence-dependent part of the potential (the negative log-likelihood) was
approximated in our model by pair-wise interactions between side-chains. Our
main focus was on the resulting effect of these interactions and how they stabilize
secondary structural elements. We did not consider the detailed physical nature of
these forces or how they depend on the amino acid types. We introduced these
interactions between the polypeptide side chains as an effective Gō-type potential
[230] dependent on the distance between Cβ atoms,
binary matrix, two types of contacts were defined in the context of protein secondary structure. First, only lateral contacts in parallel and anti-parallel β-sheets were indicated by 1s. Second, the contacts between amino acids $i$ and $i+3$ in α-helices were also represented by 1s. The contacts of the first and second type typically have the closest Cβ–Cβ distance among non-adjacent contacts in native proteins. The force constants depended on the secondary structure type, introducing positive constants for the α- and β-type contacts. Non-adjacent contacts in secondary structural elements were, therefore, stabilized by attractive potentials.
We also modeled interactions between sequential residues. This interaction was defined by the mutual orientation of adjacent residues that are involved in secondary structure elements,
$$E^{\mathrm{SC}}_{i,i+1} = -\varepsilon\, \cos\theta_{i,i+1}. \qquad (5.16)$$
The total energy of the model is
$$E(\mathbf R, \lambda) = \sum_{i=1}^{N} E^{\mathrm B}_i + \sum_{i=1}^{N} \sum_{j=1}^{i} \left( E^{\mathrm{vdW}}_{ij} + E^{\mathrm{HB}}_{ij} + E^{\mathrm{SC}}_{ij} \right), \qquad (5.17)$$
where we consider harmonic valence elasticity, $E^{\mathrm B}_i$, hard-sphere van der Waals repulsions, $E^{\mathrm{vdW}}_{ij}$, and square-well hydrogen bonding, $E^{\mathrm{HB}}_{ij}$. The valence elasticity,
van der Waals repulsions, and hydrogen bonding that contribute to this potential
have a clear physical meaning and are analogous to traditional ab initio approaches.
The side-chain interactions, EijS C in this model were introduced as a long-range
quadratic Gō-type potential based on the contact map and secondary structure
assignment. This pseudo-potential had two purposes: it was needed to stabilize
the secondary structure elements and to provide a biasing force that allows
reconstruction of the main chain conformation in the course of Metropolis Monte
Carlo simulations [585, 586].
Because peptide bonds are rigid and flat, polypeptides are often modeled in the
space of '- angles, which reduces the number of degrees of freedom and speeds
up MC simulations [385]. As an alternative, we proposed a sampler that utilized
local crankshaft rotations of rigid peptide bonds in Cartesian space. In our model,
Fig. 5.4 (a) Polypeptide model. The orientations of perfectly planar and rigid peptide bonds are given by the orthonormal triplets (x, y, z), with z pointing along the C_α–C_α direction. Other peptide bond atoms lie in the yz plane. The position of the side-chain atoms R is specified by the vectors n and c. (b) Local Metropolis moves. Two types of moves are used in this work: a crankshaft rotation around the line connecting two C_α atoms in the middle of the chain, and a random rotation at the termini around a random axis passing through the C_α atom (Reproduced from [585] with permission)
the primary descriptors of the polypeptide chain conformation were the orientations of the peptide bonds in the laboratory frame (Fig. 5.4a). For a chain of N amino acids the orientations of the peptide bonds were specified by the orthonormal triplets (x_i, y_i, z_i), i = 0 … N. In this representation most of the uninteresting covalent geometry was frozen by fixing the positions of the peptide bond atoms in these local coordinate frames. In principle, any mutual orientation of peptide bonds was allowed. In practice, the orientations were governed by the Boltzmann distribution with the energy given by Eq. 5.17.
To obtain the canonical ensemble of polypeptide conformations, we developed a novel MCMC procedure. New chain conformations were proposed by rotating a few adjacent peptide bonds and applying the regular Metropolis-Hastings acceptance criterion [293, 501]. We used crankshaft rotations in the middle of the chain and pivot rotations at the termini to preserve chain connectivity. Each move was local, and the conformation of the rest of the chain was not altered. Local moves are extremely important in achieving efficient sampling of dense polypeptide conformations [172]. To satisfy detailed balance, any rotation and its inverse are picked with equal probability. Figure 5.4b illustrates the rotations used in our procedure. We refer the reader to Chap. 2 for a general discussion of polymer sampling using MCMC algorithms and to [585] for further details of the sampler employed in the current work.
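As an illustration of the local moves just described, the following Python sketch applies a crankshaft rotation to a coarse-grained chain of C_α positions and accepts or rejects it with the Metropolis criterion. The chain representation (a plain array of C_α coordinates, at least four residues long) and the generic energy callable are simplifications introduced here for illustration; they are not the full rigid peptide-bond parameterization used in [585].

import numpy as np

def rotate_about_axis(points, origin, axis, angle):
    """Rotate points about the line through `origin` with direction `axis` (Rodrigues formula)."""
    axis = axis / np.linalg.norm(axis)
    p = points - origin
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    rotated = (p * cos_a
               + np.cross(axis, p) * sin_a
               + axis * (p @ axis)[:, None] * (1.0 - cos_a))
    return rotated + origin

def crankshaft_metropolis_step(chain, energy, beta, max_angle=0.5, rng=np.random.default_rng()):
    """One local crankshaft move on a chain of C-alpha coordinates (N x 3 array, N >= 4)."""
    n = len(chain)
    i = rng.integers(1, n - 2)          # pivot pair (i, i+2); the atom between them moves
    j = i + 2
    axis = chain[j] - chain[i]
    angle = rng.uniform(-max_angle, max_angle)   # symmetric proposal, so detailed balance holds
    proposal = chain.copy()
    proposal[i + 1:j] = rotate_about_axis(chain[i + 1:j], chain[i], axis, angle)
    dE = energy(proposal) - energy(chain)
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        return proposal                  # accept
    return chain                         # reject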
Secondary structure elements appear early in protein folding [21, 22]. They are stabilized both by sequence-dependent side-chain interactions and by sequence-independent interactions dominated by hydrogen bonds between main chain atoms [565, 566]. A careful balance between the two contributions is crucial for secondary structure element stability at room temperature. In the context of our protein model this requires careful optimization of the hydrogen bonding parameters and of the interactions between side chains, as mimicked by Gō-type interactions between C_β atoms (see Sect. 5.6). Overall, eight model parameters were simultaneously optimized using contrastive divergence: four parameters characterizing the hydrogen bonding and four parameters characterizing the side-chain interactions.
The strength of the hydrogen bond is a subject of ongoing discussion in the literature (see the recent review [189]). Pauling et al. [566] suggested that the strength of the hydrogen bond is about 8 kcal/mol. Some experimental evidence suggests that the strength is about 1.5 kcal/mol [533, 630]. Others suggest that hydrogen bonding has a negligible or even a destabilizing effect [19]. At present the consensus is that the strength of the hydrogen bond is in the range of 1–2 kcal/mol [189]. Figure 5.5 shows the parameter learning curves produced by the iterative CD learning procedure. We found that the hydrogen bond strength converges to ε_H/k_B T = 1.85, or 1.1 kcal/mol, in excellent agreement with this consensus.
In the literature, geometric criteria for hydrogen bonding have been designed to capture as many reasonable hydrogen bonds as possible, which in general has produced rather loose criteria [18, 189, 690]. For the first time, to our knowledge, we were able to simultaneously optimize hydrogen bond geometry and strength using the CD learning procedure. We found the H···O distance cutoff δ < 2.14 Å, and minimum
allowed angles ∠COH > 140° and ∠OHN > 150°, respectively. This means that all four atoms are approximately co-linear when they form a strong hydrogen bond. The established view until now was that only N–H···O need be co-linear. We determined that the same hydrogen bonding potential works reasonably well in both α-helices and β-sheets, validating the assumption that intra-peptide hydrogen bonding is independent of secondary structure.
We also found the values of the force constants for the attractive potential between amino acids in secondary structure elements to be 0.10 Å⁻² (in units of k_B T) for α-helices and 0.09 Å⁻² for β-sheets. The two values are very close to each other, indicating that these interactions are indeed similar in helices and sheets. This is an expected result, because the separation between the C_β atoms in both α-helices and β-sheets is about 5.4 Å. The effective force constants for the interactions between adjacent residues (Eq. 5.16) were determined to be λ_α/k_B T = 0.6 and λ_β/k_B T = −4.5. In agreement with our expectations, these force constants have opposite signs, stabilizing alternative mutual orientations of adjacent residues in α-helices and β-strands.
Spontaneous folding of α-helices and β-hairpin structures is difficult under the influence of hydrogen bonds alone and is greatly facilitated by side-chain interactions [585]. To improve the stability of α-helices and β-sheets in these and further simulations, we moderately adjusted some force constants (0.12 Å⁻² for α-helices and 0.11 Å⁻² for β-sheets, in units of k_B T) and hydrogen bonding parameters (ε_H/k_B T = 2.5, δ = 2.19 Å, and minimum angles of 130° and 140°). The difference between these values and those determined in the contrastive divergence procedure is less than 30%. Unfortunately,
Fig. 5.6 Reconstruction of the protein G fold by specifying predicted interactions. Panel A shows the regularized contact map with predicted interactions in the α-helix and β-sheets, with the predicted contact map in the background. The grey levels in the predicted contact map represent the predicted probability of a particular contact. The regularized diagonal contacts pass through the local maxima on the predicted contact map and extend until the predicted contact probability levels off. The best structure, corresponding to the maximum fraction of predicted contacts at relatively low energy, is shown in panel B (Reproduced from [584] with permission)
fraction of predicted contacts specified in the regularized contact map during the simulations can be found in Fig. 5.7.
Figure 5.6b shows the structure that corresponds to the maximum fraction of predicted contacts at relatively low total energy. The fold of the structure corresponds to the native fold of protein G, although the RMSD from the native structure is 10 Å. Because the lengths of both the α-helix and the β-sheets were underpredicted, larger portions of the chain were left as coil in comparison to the native structure. This may partly explain why the α-helix does not pack against the β-sheets in our simulated structures. Both the anti-parallel and parallel β-sheets between the termini of the chain were able to form in our simulations. It is, therefore, impossible to rule out the anti-parallel conformation on the basis of the contact map prediction alone. Our results indicate that it is crucial to obtain a good quality predicted contact map and secondary structure to faithfully reconstruct the 3D fold of a protein.
In another example, the general applicability of the modeling procedure described above was further demonstrated by modeling the Src tyrosine kinase SH3 domain (SH3, PDB code 1SRL) [788]. This protein is often used as a folding model in simulation studies [529, 650]. The native 56-residue SH3 domain has a 5-stranded β-barrel structure. The simulation results for the SH3 domain are shown in Fig. 5.8. The quality of the secondary structure prediction was comparable to that of protein G: β-strand locations in the sequence were correctly predicted, whereas their
length was slightly underpredicted. The contact map prediction (shown in panel A) presented a challenge for our further modeling because all possible combinations between the β-strands were predicted with comparable probability. The correctly predicted β-hairpin contacts allowed unambiguous interpretation in terms of residue contacts and Gō-type potentials in our protein model (see the protein G modeling above). In contrast, the other predicted contacts were mostly false positives. For further simulations, we used three β-hairpin contacts and one longer-range contact between the N- and C-termini that were reliably predicted and corresponded to the native structure.
The structures shown in Fig. 5.8b were selected based on the maximum fraction of predicted contacts at relatively low total energy. In the case of the
Fig. 5.8 Reconstruction of the Src tyrosine kinase SH3 domain. Panel A shows the regularized contact map with the predicted contact map in the background. The regularized diagonal contacts pass through the local maxima on the predicted contact map and extend until the predicted contact probability levels off. The selected regularized contacts correspond to the native fold of SH3. False positive predictions are not included in the reconstruction. The best structure, corresponding to the maximum fraction of predicted contacts at relatively low energy, is shown in panel B (Reproduced from [584] with permission)
SH3 domain, four out of five strands correctly packed into the β-sheet, with small misalignments of up to two residues. The C-terminus was correctly packed against the rest of the structure, but the hydrogen bonds required to close the barrel did not fully form. The RMSD of the structure shown from the native fold was 5.8 Å. In all these cases, we speculate that the resulting structures are stable folding intermediates, with the native structure being adopted as a result of the formation of the final contacts and small adjustments in the alignment between the β-strands.
5.10 Conclusions
Understanding the interactions that stabilize proteins has been an ongoing challenge for the scientific community for decades [20]. The microscopic size of protein molecules makes it impossible to measure these interactions directly. Several major contributors to protein stability have, however, been identified. Non-specific interactions that are only weakly determined by amino acid type have received the most attention. In particular, main chain hydrogen bonding [189] and hydrophobic core interactions [106] have been extensively studied, both theoretically and experimentally. Although non-specific interactions may explain why proteins adopt a compact shape, they are not sufficiently detailed to explain how proteins adopt a particular fold. Only specific
interactions defined by the amino acid type determine the secondary structure and the tertiary fold of each particular sequence.
Non-specific interactions are ubiquitous and usually well represented in datasets of even moderate size. Studies of specific interactions between particular types of amino acids may require significantly larger datasets of known structures for these interactions to be well represented and for contrastive divergence to converge properly. Optimization of the potentials has to be done on a large set of 500–1,000 known structures. The data set has to be representative of the structures and sequences observed in native proteins. CulledPDB [752] and ASTRAL [78] contain such diverse structures with low sequence identity.
The results from contrastive divergence optimization of interaction parameters
may provide valuable insights into protein energetics. They will help select the
model of protein interactions that best corresponds to the native protein structures.
For example, the value of the dielectric permittivity, ε, inside a protein is hotly debated in the scientific community, with experimental and theoretical estimates ranging from 2 to 20 [637]. CD learning can also optimize the Kauzmann hydrophobicity coefficient, k_h, relating the hydrophobic energy to the buried surface area, ΔS,

E_i^HP = k_h ΔS_i.   (5.18)

Current estimates for this parameter range from 9 to 20 kJ mol⁻¹ nm⁻² [106]. Machine learning of these parameters with contrastive divergence should provide important information about the magnitude of the corresponding interactions in native proteins.
As with statistical potentials, CD learning relies on the Boltzmann hypothesis that the observed protein structures and their details correspond to a canonical ensemble. The success of either methodology depends on the magnitude of the systematic errors in the crystal structures. Based on our preliminary studies, we have no reason to believe that this presents a significant obstacle for the contrastive divergence approach. Discrepancies between the stability of the model and of the actual protein may result from systematic errors in the original data set of crystal structures.
Contrastive divergence learning could also be applied to study protein-protein and protein-ligand interactions, where statistical potentials have been successfully used as scoring functions for docking [519, 675]. In these applications, the fundamentals of docking are similar to protein folding approaches, and include a system representation with reduced dimensionality, a global conformational space search, and evaluation of conformations using a scoring function [273]. As discussed above, scoring functions based on statistical potentials currently rely on assumptions regarding the reference state [511, 519, 798], which are often dubious [715]. This problem and its recent solution [276] are discussed extensively in Chaps. 3 and 4. The scoring function could also be optimized using a CD learning procedure that does not depend on any reference state, but requires a small number of MC steps in the chosen protein interface representation. A suitable dataset of crystallized protein complex structures is available in the Protein Data Bank, with non-redundant subsets discussed in the literature (see [604] and references therein).
Acknowledgements This work was supported by a grant from the National Institutes of Health
(1 P01 GM63208). DLW acknowledges support from an EU Marie-Curie IRG Fellowship (46444).
Chapter 6
Statistics of Bivariate von Mises Distributions
6.1 Introduction
Fig. 6.1 Two Ramachandran plots for the 100 protein structures in the Top100 database [771]. The left plot is a conventional scatter plot in the plane. The right plot is a scatter plot of the same data on the torus, where one of the dihedral angles describes the rotation about the axis through the center of the hole in the torus and the other describes the rotation about the axis through the center of the torus tube. Areas for β-strands, right-handed α-helices and left-handed α-helices are indicated in both plots
In this chapter, we will discuss how to describe distributions on the torus using directional statistics. We will focus on bivariate von Mises distributions, which in several studies have been successfully applied to modelling the Ramachandran map [68, 478]. We will start by presenting some basic concepts in directional statistics in Sect. 6.2, including a short description of the univariate von Mises distribution. Following this, we will introduce the full bivariate von Mises distribution and a number of submodels in Sect. 6.3. Like the standard bivariate normal distribution, these submodels have five parameters: two means, two concentrations, and a parameter controlling “correlation”. In Sect. 6.4, we will describe some of the key properties of these distributions, and in Sects. 6.5 and 6.6 we will discuss inference and simulation. In Sect. 6.7, we will introduce a multivariate extension of the bivariate von Mises distribution, and finally in Sect. 6.8 we describe conjugate priors for the von Mises distributions.
C̄ = R̄ cos θ̄,   S̄ = R̄ sin θ̄,   (6.2)

where R̄ = (C̄² + S̄²)^{1/2} is the mean resultant length. When the mean resultant length is positive, R̄ > 0, the mean direction θ̄ is obtained as the solution of Eq. 6.2,

θ̄ = atan2(S̄, C̄),   (6.3)

while for a zero mean resultant length, R̄ = 0, the mean direction is undefined. If we return to the example from the introduction and consider the mean of the two angles 1° and 359° using this definition, we obtain the more intuitive value of 0°.
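As a quick illustration of Eqs. 6.2–6.3, the short Python snippet below computes the mean resultant length and mean direction of a set of angles; applied to 1° and 359° it returns a mean direction of (essentially) 0°.

import numpy as np

def circular_mean(angles_rad):
    """Mean resultant length and mean direction (Eqs. 6.2-6.3) of angles given in radians."""
    C = np.mean(np.cos(angles_rad))
    S = np.mean(np.sin(angles_rad))
    R = np.hypot(C, S)                  # mean resultant length
    theta_bar = np.arctan2(S, C)        # mean direction (undefined if R == 0)
    return R, theta_bar

R, mean_dir = circular_mean(np.radians([1.0, 359.0]))
print(np.degrees(mean_dir))             # approximately 0 degrees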
The univariate von Mises distribution [743] is the best known circular distribution in directional statistics. It can be considered the circular analogue of the univariate normal distribution. Like the normal distribution, the von Mises distribution has two parameters, a mean and a concentration parameter; the latter can be considered as an anti-variance. The density of the von Mises distribution M(μ, κ) is given by (see for example [474])

f(θ) = [2π I₀(κ)]⁻¹ exp{κ cos(θ − μ)},   (6.4)

where I₀(·) denotes the modified Bessel function of the first kind and order 0, the parameter −π ≤ μ < π is the mean direction, and κ ≥ 0 is the concentration parameter. In fact, the von Mises distribution can be approximated by a normal density for high concentration, which corresponds to a small variance in the normal distribution [474].
For a set of angular observations θ₁, …, θ_n, it can be shown that the maximum likelihood estimate μ̂ of the mean parameter in the von Mises distribution is given by the mean direction θ̄ of the observations [474], as described in Eq. 6.3. Furthermore, the maximum likelihood estimate κ̂ of the concentration parameter is given by the solution to

I₁(κ̂)/I₀(κ̂) = R̄,   (6.5)

where R̄ is the mean resultant length of the observations and I₁(·) denotes the modified Bessel function of the first kind and order 1. However, there is no general closed-form solution to this equation. Accordingly, the maximum likelihood estimate of the concentration parameter is normally obtained by a numerical or an
Algorithm 1 A sampling algorithm for the univariate von Mises distribution. This algorithm was originally developed by Best and Fisher [56].
Require: mean value μ ∈ [0, 2π) and concentration parameter κ > 0
1: Set a = 1 + √(1 + 4κ²)
2: Set b = (a − √(2a)) / (2κ)
3: Set r = (1 + b²) / (2b)
4: repeat
5:   Sample u₁, u₂ ∼ Uniform(0, 1)
6:   Set z = cos(π u₁)
7:   Set f = (1 + r z) / (r + z)
8:   Set c = κ (r − f)
9: until c(2 − c) − u₂ > 0 or log(c/u₂) + 1 − c ≥ 0
10: Sample u₃ ∼ Uniform(0, 1)
11: Set θ = μ + sign(u₃ − 1/2) arccos(f)
12: return θ
approximate solution to Eq. 6.5. Several different approximations are given in the literature; here we give the approximation described by Lee [428]

κ̂ ≈ R̄(2 − R̄²)/(1 − R̄²)      if R̄ ≤ 2/3,
κ̂ ≈ (R̄ + 1)/(4R̄(1 − R̄))     if R̄ > 2/3.    (6.6)
There is no closed-form expression for the distribution function of the von Mises distribution [56]. This complicates the simulation of samples from the distribution. However, an efficient acceptance/rejection algorithm has been given by Best and Fisher [56, 474]; see Algorithm 1.
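A direct Python transcription of Algorithm 1 is shown below; it is an illustrative sketch and not taken from software accompanying the chapter. (NumPy also provides an equivalent built-in generator, numpy.random.Generator.vonmises.)

import numpy as np

def sample_von_mises(mu, kappa, rng=np.random.default_rng()):
    """Draw one sample from M(mu, kappa), kappa > 0, using the Best-Fisher algorithm."""
    a = 1.0 + np.sqrt(1.0 + 4.0 * kappa**2)
    b = (a - np.sqrt(2.0 * a)) / (2.0 * kappa)
    r = (1.0 + b**2) / (2.0 * b)
    while True:
        u1, u2 = rng.random(), rng.random()
        z = np.cos(np.pi * u1)
        f = (1.0 + r * z) / (r + z)
        c = kappa * (r - f)
        if c * (2.0 - c) - u2 > 0.0 or np.log(c / u2) + 1.0 - c >= 0.0:
            break
    u3 = rng.random()
    return mu + np.sign(u3 - 0.5) * np.arccos(f)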
A general bivariate extension of the univariate von Mises distribution was introduced by Mardia in 1975 [468]. This model, which we will call the full bivariate von Mises distribution, has probability density function proportional to

f(φ, ψ) ∝ exp{κ₁ cos(φ − μ) + κ₂ cos(ψ − ν) + (cos(φ − μ), sin(φ − μ)) A (cos(ψ − ν), sin(ψ − ν))ᵀ},   (6.7)

where μ and ν are mean angles, κ₁ and κ₂ are concentrations, and A is a 2 × 2 matrix of dependence parameters. A commonly considered submodel with six parameters is obtained by taking A diagonal, A = diag(α, β); its density is proportional to

f(φ, ψ) ∝ exp{κ₁ cos(φ − μ) + κ₂ cos(ψ − ν) + α cos(φ − μ) cos(ψ − ν) + β sin(φ − μ) sin(ψ − ν)}.   (6.8)

In 2002 Singh et al. [660] presented a special case of the 6-parameter model, with α = 0 and β = λ in Eq. 6.8. We call this the sine model, and its probability density function is given by

f_s(φ, ψ) = C_s exp{κ₁ cos(φ − μ) + κ₂ cos(ψ − ν) + λ sin(φ − μ) sin(ψ − ν)},   (6.9)
where the normalizing constant is given by

C_s⁻¹ = 4π² Σ_{m=0}^{∞} ( (2m)! / (m!)² ) (λ² / (4κ₁κ₂))^m I_m(κ₁) I_m(κ₂),   (6.10)

and I_m(·) is the modified Bessel function of the first kind and order m. The marginal and conditional distributions of the sine model were also given by Singh et al. [660].
The marginal density of φ is given by the expression

f_s(φ) = 2π C_s I₀(κ₂(φ)) exp{κ₁ cos(φ − μ)},   (6.11)

where κ₂(φ)² = κ₂² + λ² sin²(φ − μ). Note that the marginal density is symmetric about φ = μ but is not von Mises, except in the trivial case λ = 0. The marginal probability density of ψ is given by an analogous expression.
It can be shown from Eqs. 6.9 and 6.11 that the conditional distribution of Ψ given Φ = φ is von Mises with concentration κ₂(φ) and mean ν + ν(φ), where tan ν(φ) = (λ/κ₂) sin(φ − μ).
The cosine model with positive interaction was introduced and studied by Mardia et al. in 2007 [478], while the naming convention was given by Kent et al. [369]. This model is obtained by setting α = β = −κ₃ in the 6-parameter model given in Eq. 6.8 and has the probability density function

f_c(φ, ψ) = C_c exp{κ₁ cos(φ − μ) + κ₂ cos(ψ − ν) − κ₃ cos(φ − μ − ψ + ν)}.   (6.12)

For the cosine model with positive interaction, the marginal probability density of ψ is given by

f_c(ψ) = 2π C_c I₀(κ₁₃(ψ)) exp{κ₂ cos(ψ − ν)},   (6.14)

where κ₁₃(ψ)² = κ₁² + κ₃² − 2κ₁κ₃ cos(ψ − ν) [478]. The marginal distribution of ψ is symmetric about ν, and for small values of κ₃ it is approximately a von Mises distribution. For κ₁ = κ₃ = 0 the marginal distribution is von Mises with mean angle ν and concentration parameter κ₂, and trivially the marginal is uniform for κ₁ = κ₂ = κ₃ = 0. The marginal density of φ is given by an expression analogous to Eq. 6.14.
It can also be shown that the conditional distribution of Φ given Ψ = ψ is a von Mises distribution M(μ − μ*(ψ), κ₁₃(ψ)), where tan μ*(ψ) = κ₃ sin(ψ − ν)/(κ₁ − κ₃ cos(ψ − ν)) [478]. The conditional distribution of Ψ given Φ = φ is von Mises with analogous parameters.
An alternative cosine model was also given by Mardia et al. in 2007 [478]. This model is called the cosine model with negative interaction. It is obtained by setting α = −κ₃′ and β = κ₃′ in the 6-parameter model and has the probability density function

f_c′(φ, ψ) = C_c exp{κ₁ cos(φ − μ) + κ₂ cos(ψ − ν) − κ₃′ cos(φ − μ + ψ − ν)},

where κ₁, κ₂ ≥ 0 and the normalizing constant is the same as for the model with positive interaction given in Eq. 6.13. Note that the cosine model with negative interaction can be obtained by applying the transformation (φ, ψ) ↦ (φ, −ψ) to the model with positive interaction, which corresponds to a reflection of the density function in Eq. 6.14. So far the cosine model with negative interaction has only been discussed briefly in the literature. This will also be reflected in this chapter, where we will primarily be concerned with the model with positive interaction.
In 2008 Kent et al. [369] suggested a new model which is a hybrid between the sine and cosine models. The authors gave the following motivation for the model. Unimodal cosine and sine models have elliptical equiprobability contours around the mode. Generally, this elliptical pattern becomes distorted away from the mode. However, for the cosine model with positive interaction the pattern is least distorted under positive correlation, that is κ₃ < 0, while for the cosine model with negative interaction the pattern is least distorted under negative correlation, that is κ₃′ < 0 (see Fig. 6.2). Thus, to attain the least distortion in the contours of constant probability, it would be ideal to use the cosine model with positive interaction for positively correlated sin φ and sin ψ and the cosine model with negative interaction for negatively correlated sin φ and sin ψ.
To address this issue, Kent et al. [369] suggested a hybrid model that provides a smooth transition between the two cosine models via the sine model. Taking the means to be zero, the probability density function for this hybrid model is given by

f(φ, ψ) ∝ exp{κ₁ cos φ + κ₂ cos ψ + σ[(cosh τ − 1) cos φ cos ψ + sinh τ sin φ sin ψ]},

where σ is a tuning parameter which Kent et al. suggest setting to 1 for simplicity. If σ were a free parameter, the hybrid model would just be a reparameterization of the 6-parameter model [369].
For large τ > 0 the hybrid model behaves approximately as a cosine model with positive interaction with |κ₃| ≈ exp(τ)/2, while for large −τ > 0 it behaves approximately as a cosine model with negative interaction with |κ₃′| ≈ exp(−τ)/2. For τ ≈ 0 the hybrid model is approximately a sine model with λ ≈ στ. In other
Fig. 6.2 An illustration of the distortion in the contours of constant probability density for the cosine model with positive interaction. The densities in the two plots have the same values of the parameters μ, ν = 0 and κ₁, κ₂ = 1, but different values of κ₃. The left plot shows the density with κ₃ = 0.5 and the right plot shows the density with κ₃ = −0.5. In general, the elliptical pattern is least distorted for the positive interaction cosine model when κ₃ < 0
words, for small correlations the hybrid model is approximately a sine model, and in line with the motivation above, the hybrid model is approximately a cosine model of the appropriate type for large correlations [369].
The hybrid model is a promising construction. However, we will not discuss this model further in the remainder of the chapter, since at the time of writing the model has not been fully developed.
In this section we will give some of the properties of the bivariate von Mises models. We will discuss the conditions under which the models become bimodal and the approximate normal behaviour of the models, and finally give some interim conclusions on how to choose a model. As we will see in Sects. 6.5 and 6.6, these properties become important for parameter estimation and sampling in these models.
Here, we will state some of the key results on bimodality for the sine and cosine models. The proofs are given by Mardia et al. [478], except the proof of Theorem 6.3, which is given by Singh et al. [660].
The following two theorems describe the conditions under which the sine model and the cosine model with positive interaction are bimodal.
Theorem 6.1. The joint density function of the sine model in Eq. 6.9 is unimodal if κ₁κ₂ > λ² and is bimodal if κ₁κ₂ < λ², when κ₁ > 0, κ₂ > 0 and −∞ < λ < ∞.
Theorem 6.2. The joint density function of the positive interaction cosine model in Eq. 6.12 is unimodal if κ₃ < κ₁κ₂/(κ₁ + κ₂) and is bimodal if κ₃ > κ₁κ₂/(κ₁ + κ₂), when κ₁ > κ₃ > 0 and κ₂ > κ₃ > 0.
Now we will consider the conditions under which the marginal distributions
for these two models are bimodal. It turns out that these conditions in general
are different from those of the joint densities. This may not be directly apparent,
but there exist sets of parameters for these models, where the marginal density is
unimodal although the bivariate density is bimodal. An example of this is illustrated
in Fig. 6.3. The following two theorems state the conditions under which the
marginal distributions are bimodal for the sine model and the positive interaction
cosine model.
Theorem 6.3. For the sine model given in Eq. 6.9 with λ ≠ 0, the marginal distribution of Φ is symmetric around φ = μ and is unimodal with mode at φ = μ (respectively bimodal with modes at μ − φ₀ and μ + φ₀ for some φ₀ > 0) if and only if

A(κ₂) ≤ κ₁κ₂/λ²   (respectively A(κ₂) > κ₁κ₂/λ²),

where A(κ) = I₁(κ)/I₀(κ).
Fig. 6.3 An example where the bivariate density of a von Mises distribution is bimodal while the marginal density is unimodal. The left plot shows the contours of the bimodal density for the positive interaction cosine model with parameters κ₁ = 1.5, κ₂ = 1.7, κ₃ = 1.3 and μ, ν = 0. The right plot shows the unimodal density of the corresponding marginal distribution of Φ
Theorem 6.4. For the positive interaction cosine model given in Eq. 6.12 with κ₃ ≠ 0, the marginal distribution of Φ is symmetric around φ = μ and is unimodal with mode at φ = μ (respectively bimodal with modes at μ − φ₀ and μ + φ₀ for some φ₀ > 0) if and only if the corresponding condition on κ₁, κ₂ and κ₃ given in [478] holds.
The sine and cosine models all behave as bivariate normal distributions under high concentration, and we have approximately

(Φ − μ, Ψ − ν)ᵀ ∼ N₂(0, Σ),

where the inverse covariance matrix is

Σ_s⁻¹ = [ κ₁   −λ
          −λ    κ₂ ]   for the sine model,   Σ_c⁻¹ = [ κ₁ − κ₃   κ₃
                                                        κ₃        κ₂ − κ₃ ]   for the positive interaction cosine model.   (6.15)

Note that for the sine model it is possible to choose the parameters in such a way that the matrix Σ_s⁻¹ can match any given positive definite inverse covariance matrix. However, by definition the cosine models have the additional restriction κ₁, κ₂ ≥ 0, which limits the set of positive definite inverse covariance matrices that can be matched by the cosine models.
Closed-form expressions for the maximum likelihood estimators cannot be derived for any of the models we have considered. However, the value of the normalizing constant can be calculated using numerical integration. This means that maximum likelihood estimates can be obtained using numerical optimization (e.g. the Nelder-Mead downhill simplex algorithm [540]).
Here, we will give an example of maximum likelihood estimation for the cosine model with positive interaction using numerical optimization and integration. The approach is similar for the other models.
The density for the cosine model with positive interaction is given in Eq. 6.12. For a set of observations {(φᵢ, ψᵢ)}ᵢ₌₁ⁿ this gives us the log-likelihood function

LL_c+(μ, ν, κ₁, κ₂, κ₃ | {(φᵢ, ψᵢ)}ᵢ₌₁ⁿ) = n log C_c + Σᵢ₌₁ⁿ {κ₁ cos(φᵢ − μ) + κ₂ cos(ψᵢ − ν) − κ₃ cos(φᵢ − μ − ψᵢ + ν)}.

Using Eq. 6.14, the normalizing constant can be expressed as the single integral

C_c⁻¹ = 2π ∫₀^{2π} I₀(κ₁₃(ψ)) exp{κ₂ cos(ψ − ν)} dψ.   (6.16)

This expression is somewhat simpler to calculate than the original expression for the normalization in Eq. 6.13 or the double integral of the bivariate density function from Eq. 6.12.
The log-likelihood function LL_c+ can be optimized using a numerical optimization method. It is important for such a method to have good initial values. These can be obtained by assuming that the marginal distributions are von Mises, that is M(μ, κ₁) and M(ν, κ₂). The initial values are then set to the maximum likelihood estimates under this assumption. This means that the initial values for the estimates μ̂ and ν̂ are given by the circular means of {φᵢ}ᵢ₌₁ⁿ and {ψᵢ}ᵢ₌₁ⁿ, respectively, as described in Sect. 6.2.1. Similarly, the initial values for the estimates κ̂₁ and κ̂₂ can be obtained using the approximation to the maximum likelihood estimate of the concentration parameter of the von Mises distribution given in Eq. 6.6. For the estimate κ̂₃, the mean of the initial values of κ̂₁ and κ̂₂ can be used as a starting value.
The approach described here was used in a study by Mardia et al. to fit the dihedral angles in the Ramachandran plot for a small set of proteins [478].
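A minimal sketch of this fitting procedure is given below, using SciPy for the numerical integration of Eq. 6.16 and for the optimization. The function names are our own, and the starting values are simplified (fixed constants rather than the moment-based initial values described above), so this should be read as an illustration under those assumptions rather than the original implementation.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.special import i0

def log_norm_const_cosine(k1, k2, k3):
    """log of C_c^{-1} for the positive-interaction cosine model via the single integral (Eq. 6.16)."""
    integrand = lambda psi: i0(np.sqrt(k1**2 + k3**2 - 2*k1*k3*np.cos(psi))) * np.exp(k2*np.cos(psi))
    val, _ = quad(integrand, 0.0, 2.0 * np.pi)   # shift-invariant, so nu can be taken as 0
    return np.log(2.0 * np.pi * val)

def neg_log_likelihood(params, phi, psi):
    mu, nu, k1, k2, k3 = params
    if k1 <= 0 or k2 <= 0:
        return np.inf
    ll = np.sum(k1*np.cos(phi - mu) + k2*np.cos(psi - nu) - k3*np.cos(phi - mu - psi + nu))
    return len(phi) * log_norm_const_cosine(k1, k2, k3) - ll

def fit_cosine_model(phi, psi):
    x0 = np.array([np.arctan2(np.mean(np.sin(phi)), np.mean(np.cos(phi))),
                   np.arctan2(np.mean(np.sin(psi)), np.mean(np.cos(psi))),
                   1.0, 1.0, 0.5])                 # crude initial values for illustration
    return minimize(neg_log_likelihood, x0, args=(phi, psi), method="Nelder-Mead")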
While the maximum likelihood estimates obtained using numerical optimization are quite accurate, the method is computationally demanding. A faster strategy might therefore be preferable in some cases. The method of moments offers such an alternative. The basic idea behind the method of moments is to equate the sample moments with the corresponding distribution moments. Estimators of the distribution's parameters are then constructed by solving this set of equations with respect to the parameters.
Let us consider how to obtain moment estimates for the sine and the cosine models. Assume that we have a set of observations {(φᵢ, ψᵢ)}ᵢ₌₁ⁿ. Using the first moment, the means μ and ν can simply be estimated by the marginal circular means φ̄ and ψ̄, as given in Eq. 6.3.
For highly concentrated data, the models behave approximately as normal distributions (see Sect. 6.4.2). The moment estimate of the covariance matrix in the normal distribution is given by the second moment of the sample, that is

Σ̂ = S̄ = [ S̄₁    S̄₁₂
            S̄₁₂   S̄₂ ],   (6.17)

where

S̄₁ = (1/n) Σᵢ₌₁ⁿ sin²(φᵢ − φ̄),
S̄₂ = (1/n) Σᵢ₌₁ⁿ sin²(ψᵢ − ψ̄),
S̄₁₂ = (1/n) Σᵢ₌₁ⁿ sin(φᵢ − φ̄) sin(ψᵢ − ψ̄).

By taking the inverse of Σ̂ and equating it to the expression for the inverse covariance matrix in Eq. 6.15, we can obtain estimates of the concentration parameters κ₁, κ₂ and λ/κ₃/κ₃′ (depending on the choice of model). For the sine model the moment estimates become

κ̂₁ = (Σ̂⁻¹)₁₁,   κ̂₂ = (Σ̂⁻¹)₂₂,   λ̂ = −(Σ̂⁻¹)₁₂,

and for the cosine model with positive interaction the estimates become

κ̂₃ = (Σ̂⁻¹)₁₂,   κ̂₁ = (Σ̂⁻¹)₁₁ + κ̂₃,   κ̂₂ = (Σ̂⁻¹)₂₂ + κ̂₃.
A similar result can be obtained for the cosine model with negative interaction.
The moment estimates are not as accurate as the maximum likelihood estimates obtained by numerical optimization, but they are considerably faster to calculate and have been shown to be sufficiently accurate in some areas of application [68]. Alternatively, the moment estimates can be used as initial values in the numerical maximum likelihood approach. Moment estimates were used for fitting the TORUSDBN model described in Chap. 10.
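The following sketch implements the moment estimation just described for the sine model, using the circular means for μ and ν and the inverse of the second-moment matrix in Eq. 6.17 together with the high-concentration identification of Eq. 6.15. It is our own illustrative code, not taken from the original fitting software.

import numpy as np

def moment_estimates_sine(phi, psi):
    """Moment estimates (mu, nu, kappa1, kappa2, lambda) for the bivariate sine model."""
    mu = np.arctan2(np.mean(np.sin(phi)), np.mean(np.cos(phi)))
    nu = np.arctan2(np.mean(np.sin(psi)), np.mean(np.cos(psi)))
    s1 = np.mean(np.sin(phi - mu) ** 2)
    s2 = np.mean(np.sin(psi - nu) ** 2)
    s12 = np.mean(np.sin(phi - mu) * np.sin(psi - nu))
    precision = np.linalg.inv(np.array([[s1, s12], [s12, s2]]))   # inverse of Eq. 6.17
    kappa1, kappa2 = precision[0, 0], precision[1, 1]
    lam = -precision[0, 1]                                        # off-diagonal of Eq. 6.15
    return mu, nu, kappa1, kappa2, lam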
The pseudolikelihood for a parameter vector q is defined as

PL = Πᵢ₌₁ⁿ f(φᵢ | ψᵢ, q) f(ψᵢ | φᵢ, q),

where f(· | ·, q) are the conditional probability densities. This is also known as the full conditional composite likelihood, which in the bivariate case also equals the pairwise conditional composite likelihood [480]. The maximum pseudolikelihood method proceeds by maximizing this pseudolikelihood with respect to the parameter vector, which yields the estimate of q.
It turns out that the sine and cosine models are non-closed exponential family models [480]. However, they can be regarded as approximately closed, and the estimates obtained by maximizing the pseudolikelihood are an approximation to the maximum likelihood estimates. For the sine model the maximum pseudolikelihood estimator has been shown to have high efficiency in most cases [480].
For the sine model the conditional distributions are von Mises, as described in Sect. 6.3.1. With the parameter vector q_s = (μ, ν, κ₁, κ₂, λ), the
pseudolikelihood becomes a product of von Mises conditional densities, f(φᵢ | ψᵢ, q_s) with parameters (μ + μᵢ, κ₁(ψᵢ)) and f(ψᵢ | φᵢ, q_s) with parameters (ν + νᵢ, κ₂(φᵢ)), where

tan μᵢ = (λ/κ₁) sin(ψᵢ − ν),   κ₁(ψᵢ)² = κ₁² + λ² sin²(ψᵢ − ν),
tan νᵢ = (λ/κ₂) sin(φᵢ − μ),   κ₂(φᵢ)² = κ₂² + λ² sin²(φᵢ − μ).

The expression for the pseudolikelihood can be shown to be equal to the somewhat simpler expression [476]

PL_s = Πᵢ₌₁ⁿ [ exp{κ₁ cos(φᵢ − μ) + λ sin(φᵢ − μ) sin(ψᵢ − ν)} / (2π I₀(κ₁(ψᵢ))) ]
              × [ exp{κ₂ cos(ψᵢ − ν) + λ sin(φᵢ − μ) sin(ψᵢ − ν)} / (2π I₀(κ₂(φᵢ))) ].   (6.18)
In this section we will discuss how to sample from the sine model and the cosine model with positive interaction. Generally, there are two approaches: Gibbs sampling and rejection sampling. Both approaches work well in practice, but the latter is the more efficient [478]. Both make use of the fact that there exists an efficient algorithm for simulating from the univariate von Mises distribution, as described in Sect. 6.2.1 and Algorithm 1.
A straightforward approach to generating samples from the sine model and the cosine models is Gibbs sampling. In Gibbs sampling, a sequence of samples is obtained by alternately simulating values from the two conditional distributions f(φ | ψ) and f(ψ | φ). That is, we start with an initial random value ψ⁰ and then sample φ⁰ ∼ f(φ | Ψ = ψ⁰). In the next step we sample ψ¹ ∼ f(ψ | Φ = φ⁰), continue by sampling φ¹ ∼ f(φ | Ψ = ψ¹), and so forth. This is straightforward, since the conditional distributions are all von Mises for these models. A sketch is given below.
The main problem with this sampling approach is that a suitable burn-in period is needed to make the samples independent of the starting point, ψ⁰, and thinning is needed to ensure that there are no sequential dependencies between the samples [478]. This makes the Gibbs sampling approach inefficient.
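The following is a minimal Python sketch of this Gibbs sampler for the sine model. It assumes the von Mises form of the conditionals with the parameters given in Sect. 6.5.3 (mean μ + arctan[(λ/κ₁) sin(ψ − ν)] and concentration κ₁(ψ), and analogously for ψ given φ), and uses NumPy's built-in von Mises generator; burn-in and thinning are left to the caller.

import numpy as np

def gibbs_sine_model(mu, nu, k1, k2, lam, n_samples, rng=np.random.default_rng()):
    """Gibbs sampling from the bivariate von Mises sine model."""
    samples = np.empty((n_samples, 2))
    psi = rng.uniform(-np.pi, np.pi)                 # arbitrary starting point
    for t in range(n_samples):
        # phi | psi is von Mises
        conc1 = np.sqrt(k1**2 + (lam * np.sin(psi - nu))**2)
        mean1 = mu + np.arctan2(lam * np.sin(psi - nu), k1)
        phi = rng.vonmises(mean1, conc1)
        # psi | phi is von Mises
        conc2 = np.sqrt(k2**2 + (lam * np.sin(phi - mu))**2)
        mean2 = nu + np.arctan2(lam * np.sin(phi - mu), k2)
        psi = rng.vonmises(mean2, conc2)
        samples[t] = (phi, psi)
    return samples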
A more efficient approach to sampling from the sine model and the cosine models is to first use rejection sampling to simulate a value φ⁰ from the marginal density f(φ), and then simulate a value ψ⁰ from the conditional distribution f(ψ | Φ = φ⁰) [68, 478]. The description here is mainly based on that of Mardia et al. [478].
In order to draw a sample φ⁰ from the marginal density f(φ) using rejection sampling, we first need to determine whether the marginal density is unimodal or bimodal. This is done by checking the condition of Theorem 6.3 or 6.4, respectively. Depending on the modality we use either a single von Mises distribution or a mixture of two von Mises distributions as the proposal distribution:
• If the marginal distribution is unimodal, we propose a candidate value φ⁰ by sampling from the univariate von Mises distribution M(μ, κ).
• If the marginal distribution is bimodal, we propose a candidate value φ⁰ by sampling from an equal mixture of the two von Mises densities M(μ − φ₀, κ) and M(μ + φ₀, κ), where φ₀ is given by Theorem 6.3 or 6.4.
In the expressions above, κ should be chosen so that the distance between the proposal density and the marginal density f(φ) is minimized. The next step is to sample a value u from the uniform distribution on [0, 1] and test the condition u < f(φ⁰)/(L g(φ⁰)), where g is the probability density function of the proposal distribution and L > 1 is an upper bound on f(φ)/g(φ). If the condition is true, then φ⁰ is accepted as a realization of f(φ); otherwise φ⁰ is rejected and the candidate sampling step is repeated.
The proportion of proposed samples that are accepted is given by 1/L, which is also called the efficiency. L should therefore be minimized under the conditions L > 1 and L ≥ f(φ)/g(φ) for all φ ∈ (−π, π]. In practice this can be done by finding max f(φ)/g(φ) over φ ∈ (−π, π] using numerical optimization.
For many sets of parameters the marginal densities are close to von Mises and the rejection sampling algorithm has high efficiency. However, for large values of the interaction parameter (λ, κ₃ or κ₃′) relative to κ₁ and κ₂ the algorithm becomes quite inefficient. As an example, for the positive interaction cosine model with (κ₁, κ₂, κ₃) = (100, 100, 90) the density is strongly bimodal and the efficiency is around 69% [478].
Once a sample φ⁰ has been obtained from the marginal distribution f(φ) using rejection sampling, ψ⁰ can be sampled from the conditional distribution f(ψ | Φ = φ⁰), which is von Mises for both the sine model and the cosine model with positive interaction. A sketch of the full procedure is given below.
Using this procedure we can generate a sequence of samples {(φᵢ, ψᵢ)}ᵢ₌₁ⁿ from either the sine model or the cosine models. These samples are independent, and contrary to the Gibbs sampler no burn-in period or thinning is required. This sampling approach has been employed for models of local protein structure [68, 478], and in particular it was used for TORUSDBN, described in Chap. 10.
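The sketch below illustrates the rejection-then-conditional scheme for the sine model in the unimodal case only (a single von Mises proposal with a crudely chosen concentration); it is a simplification of the procedure described above, and the marginal and conditional formulas are those reconstructed earlier in this chapter, so treat it as an illustration under those assumptions.

import numpy as np
from scipy.special import i0
from scipy.optimize import minimize_scalar

def sample_sine_model_rejection(mu, nu, k1, k2, lam, n, kappa_prop=None,
                                rng=np.random.default_rng()):
    """Rejection sampling from the sine model via its phi-marginal (unimodal case only)."""
    if kappa_prop is None:
        kappa_prop = k1                                   # crude choice of proposal concentration
    marg = lambda p: i0(np.sqrt(k2**2 + (lam*np.sin(p - mu))**2)) * np.exp(k1*np.cos(p - mu))
    prop = lambda p: np.exp(kappa_prop*np.cos(p - mu)) / (2*np.pi*i0(kappa_prop))
    # numerical upper bound L on marg/prop; only the ratio matters, so marg may stay unnormalized
    res = minimize_scalar(lambda p: -marg(p)/prop(p), bounds=(-np.pi, np.pi), method="bounded")
    L = -res.fun
    out = np.empty((n, 2))
    for t in range(n):
        while True:
            phi = rng.vonmises(mu, kappa_prop)
            if rng.random() < marg(phi) / (L * prop(phi)):
                break                                     # accept phi from the marginal
        conc = np.sqrt(k2**2 + (lam*np.sin(phi - mu))**2)
        mean = nu + np.arctan2(lam*np.sin(phi - mu), k2)
        out[t] = (phi, rng.vonmises(mean, conc))          # psi from the conditional
    return out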
In this section, we will look beyond the bivariate angular distribution and consider
a general multivariate angular distribution. In 2008, Mardia et al. [479] presented a
multivariate extension of the bivariate sine model from Eq. 6.9. This distribution is
called the multivariate von Mises distribution and is denoted Mp .; ; / in
the p-variate case. The probability density function for D .
1 ;
2 ; : : : ;
p /T is
given by
˚
fp ./ D fT .; /g1 exp T c.; / C s.; /T s.; /=2 ; (6.19)
D .
1 ;
2 ; : : : ;
p /T Np .0; †/ ;
176 K.V. Mardia and J. Frellsen
with (
i for i D j
.†1 /ij D ;
ij for i ¤ j
where Np .0; †/ denotes a p-variate normal distribution with mean 0 and covariance
matrix †.
Inference in the multivariate von Mises distribution can be done using the method of moments or the maximum pseudolikelihood method. The approaches are similar to those described for the bivariate sine model in Sects. 6.5.2 and 6.5.3. Sampling can be done using Gibbs sampling. We refer to Mardia et al. for further details [479].
As an example of its use in structural bioinformatics, Mardia et al. [479] considered a dataset of gamma turns in protein structures divided into triplets of amino acids. A trivariate von Mises distribution was fitted to the triplets of φ and of ψ angles separately, and a reasonable fit to the data was reported.
There has been renewed interest in directional Bayesian analysis since the paper of Mardia and El-Atoum [473], as indicated below. In this section we will consider conjugate priors for the various von Mises distributions; in particular we focus on priors for the mean vector. For a general introduction to conjugate priors we refer to Chap. 1.
Consider the von Mises distribution with probability density function given in Eq. 6.4. Recall that in this expression μ is the mean direction and κ is the concentration (precision) parameter. It has been shown by Mardia and El-Atoum [473] that, for a given κ, the von Mises distribution is a conjugate prior for μ. In other words, the von Mises distribution is self-conjugate for fixed κ.
Guttorp and Lockhart [265] have given the joint conjugate prior for μ and κ, and Mardia [470] has considered a slight variant. However, the distribution for κ is not straightforward. Various suggestions have appeared in the literature; for example, take the prior for κ independently as a chi-square distribution, use the non-informative prior, and so forth.
Mardia [472] has shown that a conjugate prior of the full bivariate von Mises distribution for the mean vector (μ, ν), given κ₁, κ₂ and A, is the full bivariate von Mises itself, as defined in Eq. 6.7. Furthermore, Mardia has also obtained a compact form of the normalizing constant for this general case, in order to write down the full conjugate prior and the posterior. We will now show how this result applies to the bivariate sine model from Sect. 6.3.1. Let (φᵢ, ψᵢ)ᵢ₌₁ⁿ be distributed according to a sine model with known concentrations (κ₁, κ₂) and dependence parameter λ. Denote the center of mass of (φᵢ)ᵢ₌₁ⁿ by (C₁, S₁), and the center of mass of (ψᵢ)ᵢ₌₁ⁿ by (C₂, S₂), as defined in Eq. 6.1. Now, let the prior distribution of (μ, ν) be given by a sine model with mean (μ₀, ν₀), concentrations (κ₀₁, κ₀₂) and dependence λ₀. It can be shown [472] that the posterior density of (μ, ν) is given by the full bivariate von Mises distribution with mean (μ₀*, ν₀*), concentrations (κ₁*, κ₂*) and matrix A as defined below:

κ₁* cos μ₀* = κ₀₁ cos μ₀ + κ₁ C₁,   κ₁* sin μ₀* = κ₀₁ sin μ₀ + κ₁ S₁,
κ₂* cos ν₀* = κ₀₂ cos ν₀ + κ₂ C₂,   κ₂* sin ν₀* = κ₀₂ sin ν₀ + κ₂ S₂,

A = [ λ₀ sin μ₀ sin ν₀ + λ Σᵢ₌₁ⁿ sin φᵢ sin ψᵢ      −λ₀ sin μ₀ cos ν₀ − λ Σᵢ₌₁ⁿ sin φᵢ cos ψᵢ
      −λ₀ cos μ₀ sin ν₀ − λ Σᵢ₌₁ⁿ cos φᵢ sin ψᵢ      λ₀ cos μ₀ cos ν₀ + λ Σᵢ₌₁ⁿ cos φᵢ cos ψᵢ ].
This result was also obtained independently by Lennox et al. [435] (see correction). A key point is that the posterior density of (μ, ν) is not a sine density. However, for the cosine model, it can be shown that the posterior distribution of the mean is another cosine submodel of the full bivariate von Mises, but with only six parameters; details are given by Mardia [472].
Lennox et al. [434] have provided a template-based approach to protein structure prediction using a semiparametric Bayesian model based on the Dirichlet process mixture model. This work relies on priors for the bivariate sine model, which have been studied in detail by Lennox et al. [435] and Mardia [471, 472].
Finally, we will consider priors for the multivariate von Mises distribution from Sect. 6.7. When the multivariate von Mises distribution is used as a prior for the mean vector μ and the likelihood is a multivariate von Mises with known κ and Λ, then the posterior density of μ has been shown [472] to belong to an extension of the expression given by Mardia and Patrangenaru [475] in their Eq. 3.1. For κ and Λ we can use an independent prior distribution, such as a Wishart distribution on the matrix with diagonal elements κᵢ and off-diagonal elements −λᵢⱼ, following the proposal for p = 2 by Lennox et al. [435]. Full details are given by Mardia [472].
6.9 Conclusion
The various von Mises distributions are of prime interest in structural bioinformatics, as they are extremely well suited to formulating probabilistic models that involve dihedral angles, arguably the most important degrees of freedom in the parameterization of biomolecules. Together with the Kent distribution discussed in Chap. 7, which concerns unit vectors and thus covers another potentially important and common degree of freedom, these distributions form a powerful tool for the structural biologist in search of rigorous statistical solutions to challenging problems.
Part IV
Shape Theory for Protein Structure
Superposition
Chapter 7
Statistical Modelling and Simulation Using
the Fisher-Bingham Distribution
John T. Kent
7.1 Introduction
is included because we generally think of vectors as column vectors. Thus the unit sphere can be written as Ω₃ = {x ∈ R³ : xᵀx = x₁² + x₂² + x₃² = 1}.
It is useful to use polar coordinates to represent x,

x = (sin θ cos φ, sin θ sin φ, cos θ)ᵀ.   (7.1)

Here θ ∈ [0, π] and φ ∈ [0, 2π) define the colatitude and longitude, respectively. For a point on the earth, with the x₃-axis pointing towards the north pole, θ ranges from 0 to π as x ranges between the north pole and the south pole. If θ = 0 or π, the longitude is undefined; otherwise, for fixed θ, 0 < θ < π, the set of points {x : x₃ = cos θ} defines a small circle, and the longitude φ identifies the position of x along this small circle.
The uniform measure on the sphere can be written as

ω(dx) = sin θ dθ dφ.

The left-hand side is more convenient for mathematical descriptions; the right-hand side is useful for explicit calculations. Note that the sin θ factor on the right-hand side occurs because a small circle has a smaller radius for θ near the poles, 0 or π, than for θ near the equator, θ = π/2. The surface area of the sphere can be found as

∫₀^{2π} ∫₀^{π} sin θ dθ dφ = 4π.

The equal area projection coordinates

u₁ = 2 sin(θ/2) cos φ,   u₂ = 2 sin(θ/2) sin φ,   (7.2)

are defined so that the sphere, excluding the south pole, is mapped into the disk {(u₁, u₂) : u₁² + u₂² < 4}. The uniform measure on the sphere takes the forms

ω(dx) = {1 − (x₁² + x₂²)}^{−1/2} dx₁ dx₂ = du₁ du₂.   (7.3)
Let us start by recalling the key properties of the bivariate normal distribution for a random vector x = (x₁, x₂)ᵀ. This distribution has the density

f(x) = |2πΣ|^{−1/2} exp{−½ (x − μ)ᵀ Σ⁻¹ (x − μ)},   x ∈ R².   (7.4)

There are five parameters: the means μ = (μ₁, μ₂)ᵀ of x₁ and x₂, the variances σ₁₁, σ₂₂ of x₁ and x₂, and the covariance σ₁₂ = σ₂₁ between x₁ and x₂, where the 2 × 2 covariance matrix has elements

Σ = [ σ₁₁   σ₁₂
      σ₂₁   σ₂₂ ].

This distribution is the most important distribution for bivariate data. One way to think about this distribution is to plot contours of constant probability. These are given by ellipses centered at the mean vector of the distribution, with size and orientation governed by Σ. A typical ellipse is given in Fig. 7.1a, together with the major and minor axes.
major and minor axes.
The surface of a sphere is a (curved) two-dimensional surface. Hence it is natural
to look for an analogue of the bivariate normal distribution. Thus we wish to find
a distribution which can have any modal direction, with ellipse-like contours of
constant probability about this modal direction.
Fig. 7.1 (a) An ellipse representing a contour of constant probability for the bivariate normal distribution, together with the major and minor axes. (b) An oval-shaped contour of constant probability for the FB5 density (Eq. 7.5). The plot is given in equal area coordinates; the point at the center represents the north pole, the outer circle represents the south pole and the intermediate circle represents the equator. The distribution is centered at the north pole and the major and minor axes have been rotated about the north pole to coincide with the horizontal and vertical axes
where

γ₍ⱼ₎ = (γ₁ⱼ, γ₂ⱼ, γ₃ⱼ)ᵀ

denotes the jth column of Γ. In matrix form the orthogonality condition can be written ΓᵀΓ = I₃, the identity matrix in three dimensions, and so the inverse of Γ is its transpose, Γ⁻¹ = Γᵀ. Hence the determinant satisfies |Γ| = ±1. If an orthogonal matrix satisfies |Γ| = +1, it is called a rotation matrix. For our purposes, the sign of the final column γ₍₃₎ is important. However, for the other two columns only the axis information is relevant; i.e. γ₍₁₎ and −γ₍₁₎ contain the same information, and similarly for γ₍₂₎ and −γ₍₂₎. Hence, without loss of generality, we may restrict Γ to be a rotation matrix.
The more general FB5 density takes the form

f(x) = C(κ, β)⁻¹ exp{κ γ₍₃₎ᵀ x + β[(γ₍₁₎ᵀ x)² − (γ₍₂₎ᵀ x)²]}.   (7.6)
In this case the modal direction is given by γ₍₃₎ and the major and minor axes by γ₍₁₎ and γ₍₂₎, respectively. We write the distribution as FB5(κ, β, Γ) in terms of the concentration parameters κ, β and the rotation matrix Γ.
A general orthogonal matrix G can be decomposed as a product of two simpler matrices, G = HK, where

H = H(δ, φ) = [ cos δ cos φ   −sin φ   sin δ cos φ
                cos δ sin φ    cos φ   sin δ sin φ
                −sin δ         0       cos δ ],
K = K(ψ) = [ cos ψ   −sin ψ   0
             sin ψ    cos ψ   0
             0        0       1 ].   (7.7)

The angles δ ∈ [0, π] and φ ∈ [0, 2π) represent the colatitude and longitude of the modal direction, and the angle ψ ∈ [0, 2π) represents the rotation needed to align the major and minor axes with the coordinate axes once the modal axis has been rotated to the north pole. Note that K(ψ + π) changes the sign of the first two columns of K(ψ), so that ψ and ψ + π determine the same two axes for γ₍₁₎ and γ₍₂₎. Hence ψ can be restricted to the interval [0, π). If x is a random unit vector following the density given by Eq. 7.6 and we set y = Hᵀx and z = Kᵀy, so that z = Gᵀx, then z follows the standardized density given by Eq. 7.5.
Under high concentration, that is, for large κ with ρ = 2β/κ held fixed, 0 ≤ ρ < 1, the FB5 distribution (Eq. 7.5) is approximately a bivariate normal distribution. In this case most of the probability mass is concentrated near x₁ = 0, x₂ = 0, x₃ = 1. In particular, virtually all the mass lies in the northern hemisphere x₃ > 0, which we can represent by the orthogonal tangent coordinates x₁ and x₂, with x₁ and x₂ small. Using the approximation x₃ = {1 − (x₁² + x₂²)}^{1/2} ≈ 1 − ½(x₁² + x₂²), and noting that {1 − (x₁² + x₂²)}^{−1/2} ≈ 1 in Eq. 7.3, yields the asymptotic formula for the density given by Eq. 7.5 (with respect to dx₁ dx₂),

g(x₁, x₂) ∝ exp{−(κ/2)[(1 − ρ)x₁² + (1 + ρ)x₂²]}.   (7.8)

That is, κ^{1/2} x₁ and κ^{1/2} x₂ are approximately independently normally distributed with zero means and variances 1/(1 − ρ) and 1/(1 + ρ), respectively.
7.4 Estimation
2. Project the data onto the tangent plane at the north pole, and compute the 2 2
matrix of second moments about the origin.
3. Rotate the data in the plane so that the second moment matrix is diagonal and
use these diagonal values to estimate and ˇ.
Next we fill out this sketch, using a hat (ˆ) to indicate estimated quantities. Let x̄ = n⁻¹ Σᵢ xᵢ denote the sample mean vector of the data, also called the resultant vector, and write it in the form x̄ = r₁ x̄₀, where the resultant length r₁ = (x̄ᵀx̄)^{1/2} is the norm of x̄ and x̄₀ is a unit vector pointing in the same direction as x̄. In practice 0 < r₁ < 1, with r₁ close to 1 for concentrated data. Write x̄₀ = (sin δ̂ cos φ̂, sin δ̂ sin φ̂, cos δ̂)ᵀ in polar coordinates, and define the orthogonal matrix Ĥ = H(δ̂, φ̂) as in Eq. 7.7. Define yᵢ = Ĥᵀ xᵢ, i = 1, …, n, to be the rotated data. The mean vector of the {yᵢ} now points towards the north pole (0, 0, 1)ᵀ.
Define the 2 × 2 covariance matrix about the origin of the first two coordinates of the {yᵢ} data,

S = (1/n) Σᵢ₌₁ⁿ [ yᵢ₁²       yᵢ₁ yᵢ₂
                  yᵢ₂ yᵢ₁    yᵢ₂² ].

The moment estimates of the concentration parameters are then

κ̂ = ½{(2 − 2r₁ − r₂)⁻¹ + (2 − 2r₁ + r₂)⁻¹},   β̂ = ½{(2 − 2r₁ − r₂)⁻¹ − (2 − 2r₁ + r₂)⁻¹}.   (7.9)
For data that are not highly concentrated, [365] developed a numerical algorithm based on a series expansion for the normalization constant C(κ, β). This algorithm is straightforward to implement on a computer and a program in R is available from the author.
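The following is an illustrative NumPy version of the moment-estimation sketch above for highly concentrated data. The definition of r₂ as the difference of the eigenvalues of the rotated second-moment matrix is our reading of a step elided from this excerpt, so treat that detail as an assumption rather than a literal transcription of the chapter.

import numpy as np

def fit_fb5_moments(X):
    """Moment estimates (kappa, beta) for the FB5 distribution; X is an (n, 3) array of unit vectors."""
    xbar = X.mean(axis=0)
    r1 = np.linalg.norm(xbar)                     # resultant length
    x0 = xbar / r1
    # rotation H taking the north pole to the mean direction (Eq. 7.7)
    delta = np.arccos(x0[2])
    phi = np.arctan2(x0[1], x0[0])
    H = np.array([[np.cos(delta)*np.cos(phi), -np.sin(phi), np.sin(delta)*np.cos(phi)],
                  [np.cos(delta)*np.sin(phi),  np.cos(phi), np.sin(delta)*np.sin(phi)],
                  [-np.sin(delta),             0.0,         np.cos(delta)]])
    Y = X @ H                                     # rotated data, mean now near the north pole
    S = (Y[:, :2].T @ Y[:, :2]) / len(Y)          # 2x2 second-moment matrix about the origin
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
    r2 = eigvals[0] - eigvals[1]                  # assumed definition of r2 (eigenvalue difference)
    kappa = 0.5*(1.0/(2 - 2*r1 - r2) + 1.0/(2 - 2*r1 + r2))   # Eq. 7.9
    beta  = 0.5*(1.0/(2 - 2*r1 - r2) - 1.0/(2 - 2*r1 + r2))
    return kappa, beta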
7.5 Simulation
In modern Monte Carlo statistical methods, distributions such as FB5 are building blocks in a larger construction, and efficient algorithms are needed to simulate from such distributions. Here we give an exact simulation method with good efficiency properties for the whole range of κ and β values with 0 ≤ 2β ≤ κ. In this case note that the exponent in (7.5), κ cos θ + β sin²θ (cos²φ − sin²φ), is a decreasing function of θ ∈ [0, π] for each φ. (On the other hand, if β > κ/2, the density increases and then decreases in θ when φ = 0.) The algorithm developed here was first set out in [367].
For the purposes of simulation it is helpful to use the equal area projection (Eq. 7.2) with coordinates (u₁, u₂). For algebraic convenience, set t₁ = u₁/2, t₂ = u₂/2, so that r² = t₁² + t₂² < 1.
In (t₁, t₂) coordinates, the probability density (with respect to dt₁ dt₂ in the unit disk t₁² + t₂² < 1) takes the form

f(t₁, t₂) ∝ exp{−2κr² + 4β(r² − r⁴)(cos²φ − sin²φ)}
          = exp{−2κ(t₁² + t₂²) + 4β[1 − (t₁² + t₂²)](t₁² − t₂²)}
          = exp{−½[a t₁² + b t₂² + γ(t₁⁴ − t₂⁴)]},   (7.10)

where a = 4(κ − 2β), b = 4(κ + 2β) and γ = 8β satisfy 0 ≤ a ≤ b and γ ≤ b/2. Here we have used the double angle formulas cos θ = 1 − 2 sin²(θ/2) and sin θ = 2 sin(θ/2) cos(θ/2).
Note that the density splits into a product of a function of t₁ alone and a function of t₂ alone. Hence t₁ and t₂ would be independent except for the constraint t₁² + t₂² < 1. Our method of simulation, as sketched below, will be to simulate |t₁| and |t₂| separately by acceptance-rejection using a (truncated) exponential envelope, and then additionally to reject any values lying outside the unit disk. For a general background on acceptance-rejection sampling see, for example, [131].
The starting point for our simulation method is the simple inequality
Fig. 7.2 Acceptance-rejection simulation using Eq. 7.14. The lower curve is a Gaussian density.
The upper curve is proportional to a double exponential density and always sits above the lower
curve
½(μ|w| − λ)² ≥ 0   (7.12)

for any parameters λ, μ > 0 and for all w. Hence

−½ μ² w² ≤ ½ λ² − λ μ |w|.   (7.13)

After exponentiation, this inequality provides the basis for simulating a Gaussian random variable from a double exponential random variable by acceptance-rejection,

f(w) = (2π)^{−1/2} e^{−w²/2},   g(w) = ½ e^{−|w|},   f(w) ≤ C g(w),   (7.14)

for all w, where C = (2e/π)^{1/2} ≈ 1.3. Simulation from the bilateral exponential distribution is straightforward using the inverse method for |w| [131]: set w = −s log u, where u is uniform on (0, 1), independent of s, which takes the values ±1 with equal probability. The constant C gives the average number of simulations from g needed to generate each simulation from f, and 1/C ≈ 0.78 is called the efficiency of the method. Figure 7.2 gives a plot of f(w) and C g(w) in this case.
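As a concrete illustration of Eq. 7.14, the Python snippet below draws standard normal variates using a double exponential envelope with the bound C = (2e/π)^{1/2}; it is a small worked example, not part of the FB5 sampler itself.

import numpy as np

def sample_gaussian_rejection(n, rng=np.random.default_rng()):
    """Standard normal samples via acceptance-rejection from a double exponential envelope."""
    C = np.sqrt(2.0 * np.e / np.pi)              # envelope constant; efficiency 1/C ~ 0.78
    out = []
    while len(out) < n:
        s = rng.choice([-1.0, 1.0])
        w = -s * np.log(rng.random())            # double exponential via the inverse method
        f = np.exp(-0.5 * w * w) / np.sqrt(2.0 * np.pi)
        g = 0.5 * np.exp(-abs(w))
        if rng.random() < f / (C * g):
            out.append(w)                        # accept
    return np.array(out)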
For our purposes some refinement of this simple method is needed. For t₁ we need to apply Eq. 7.13 twice, first with μ = γ^{1/2}, λ = 1 and w = t₁², and second with μ = (a + 2γ^{1/2})^{1/2}, λ = 1 and w = t₁, to get

−½(a t₁² + γ t₁⁴) ≤ ½ − ½(a + 2γ^{1/2}) t₁² ≤ c₁ − λ₁ |t₁|,   (7.15)

where

c₁ = 1,   λ₁ = (a + 2γ^{1/2})^{1/2}.   (7.16)
If β = 0 (and so γ = 0), then Eq. 7.17 continues to hold with λ₂ = 0 and c₂ = 0. This construction is most effective for large κ. For small κ, say κ ≤ 1, it is more efficient to use a simple uniform envelope on the square |t₁| ≤ 1, |t₂| ≤ 1. Table 7.1 summarizes some sample efficiencies for various values of κ and β, with ρ = 2β/κ, where the uniform envelope has been used if κ ≤ 1. Note that the efficiencies range between about 25% and 75% for all choices of the parameters.
Our first model for the protein structure will be a third-order Markov process. Consider an FB5(κ, β, R) distribution with fixed parameters, where R is a 3 × 3 rotation matrix. Then, given vᵢ, the next vertex vᵢ₊₁ = vᵢ + eᵢ₊₁ is simulated by letting

Gᵀ eᵢ₊₁ ∼ FB5(κ, β, R),

where the different FB5 simulations are independent for each i. The process is third-order Markov because of the need to determine a frame of reference at each vertex.
Fig. 7.3 Simulation from three FB5 distributions; 1,000 samples from each are shown in red,
green and blue, respectively. The mean directions are indicated by arrows. Figure taken from [275]
The previous model is a bit too simplistic to be useful in practice. Hence we let
the parameters .; ˇ; R/ vary according to a hidden Markov model (HMM) with a
finite number of states. This model is discussed in more detail in Chap. 10, so we
only give a short outline here. We call this HMM with discrete hidden nodes and
observed FB5 nodes FB5HMM [275]. The discrete hidden nodes of the FB5HMM
can be considered to model a sequence of fine-grained, discrete ‘local descriptors’.
These can be considered as fine grained extensions of helices, sheets and coils. The
observed FB5 nodes translate these ‘local descriptors’ into corresponding angular
distributions.
Hamelryck et al. [275] give a thorough investigation of the FB5HMM model. Figure 7.3 is taken from that paper and shows a mixture of three FB5 distributions on the sphere with different location and concentration parameters. The three groups correspond to nodes which are typical representatives for different types of secondary protein structure: blue for coil, red for α-helix, and green for β-strand.
Chapter 8
Likelihood and Empirical Bayes Superposition
of Multiple Macromolecular Structures
Douglas L. Theobald
8.1 Introduction
8.1.1 Overview
The Gauss–Markov theorem guarantees that, given certain general assumptions, the least-squares criterion renders estimates of unknown parameters that are optimal in the following specific statistical sense.
A least-squares estimate is unbiased (i.e., on average it equals the true value of the
parameter), and of all unbiased estimates the least-squares estimate has the least
variance (i.e., on average it is closest to the true value, as measured by the mean
squared deviation). For the guarantee of the Gauss-Markov theorem to be valid, two
important requirements must be met: the data points must be uncorrelated, and they
must all have the same variance (i.e., the data must be homoscedastic) [641].
In terms of a macromolecular superposition, the homoscedastic assumption
stipulates that the corresponding atoms in the structures must have equal variances.
However, the requirement for homogeneous variances is generally violated with
macromolecular superpositions. For instance, the experimental precision attached
to the atoms in a crystal structure varies widely as gauged by crystallographic
temperature factors. Similarly, the variances of main chain atoms in reported
superpositions of NMR models commonly range over three orders of magnitude.
In comparisons of homologous protein domains, the structures deviate from each
other with varying degrees of local precision: some atoms “superimpose well”
and others do not. In molecular dynamics simulations, particular regions of a
macromolecule may undergo relatively large conformational changes (a violation of
the Eckart conditions [167], which have been used to provide a physical justification
for using least-squares superpositions to refer different conformational states of
a molecule to a common reference frame [404]). Ideally, the atomic positions
should be uncorrelated, but this assumption is also violated for macromolecular
superpositions. Adjacent atoms in a protein main chain covary strongly due to
covalent chemical bonds, and atoms remote in sequence may covary due to
other physical interactions. These theoretical problems are not simply academic.
In practice, researchers usually perform an OLS superposition, identify regions
that do not “superposition well”, and calculate a new superposition in which the
more variable regions have been subjectively excluded from the analysis. Different
superpositions result depending on which pieces of data are discarded.
A likelihood-based treatment, in contrast, can account for uneven variances and correlations by weighting by the inverse of the covariance matrix.
Estimation of the covariance matrix has historically been a significant impedi-
ment to a viable non-isotropic likelihood-based Procrustes analysis [159, 224, 242,
243, 430–432]. Simultaneous estimation of the sample covariance matrix and the
translations is generally impossible. We permit joint identifiability by regularizing
the covariance matrix using a hierarchical, empirical Bayes treatment in which the
eigenvalues of the covariance matrix are considered as variates from an inverse
gamma distribution.
In general, all the estimates of the unknown parameters are interdependent and
cannot be solved for analytically. Furthermore, the smallest eigenvalues of the
sample covariance matrix are zero due to collinearity imparted by the centering
operation necessary to estimate the unknown translations. We treat these smallest
eigenvalues as “missing data” using an Expectation-Maximization (EM) algorithm.
For simultaneous estimation, we use iterative conditional maximization of the
joint likelihood augmented by the EM algorithm. This method works very well
in practice, with excellent convergence properties for the thousands of real cases
analyzed to date.
$$X_i = (\mu + E_i)\,R_i^T - 1_k\,\tau_i^T, \qquad (8.1)$$
where $\tau_i$ is a $3\times 1$ column vector for the translational offset, $1_k$ denotes the $k\times 1$ column vector of ones, and $R_i$ is a proper, orthogonal $3\times 3$ rotation matrix.
The likelihood equation for the model given in Eq. 8.1 is obtained from a matrix normal distribution [136]. First, we define
$$E_i = (X_i + 1_k\,\tau_i^T)\,R_i - \mu.$$
The corresponding log-likelihood is
$$\ell(R, \tau, \mu, \Sigma \mid X) = -\frac{3n}{2}\ln|\Sigma| - \frac{1}{2}\sum_i^n \operatorname{tr}\!\left\{E_i^T\,\Sigma^{-1}\,E_i\right\}. \qquad (8.3)$$
In order to find the maximum likelihood solution for the Gaussian model given
in Eq. 8.1, the maximum likelihood estimates of the four classes of unknowns
($R$, $\tau$, $\mu$, $\Sigma$) must be determined jointly. In the following sections we provide the
ML solutions for the translations, rotations, and mean form, assuming that the
covariance matrix is known. In general the covariance matrix is not known and
must be estimated as well. However, joint estimation of the translations and the
covariance matrix presents certain challenges that will be given special attention
(Sects. 8.5–8.8).
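As a concrete reading of Eqs. 8.1 and 8.3, the following sketch evaluates the matrix-normal log-likelihood for a set of $k \times 3$ structures. It is an illustration only (not the THESEUS implementation), and it assumes the covariance matrix passed in is already invertible.

```python
import numpy as np

def log_likelihood(X, R, tau, mu, Sigma):
    """Matrix-normal log-likelihood of Eq. 8.3 (up to an additive constant).

    X     : list of (k, 3) coordinate arrays, one per structure
    R     : list of (3, 3) proper rotation matrices
    tau   : list of (3,) translation vectors
    mu    : (k, 3) mean form
    Sigma : (k, k) atomic covariance matrix (assumed invertible here)
    """
    n = len(X)
    k = mu.shape[0]
    ones = np.ones((k, 1))
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    ll = -1.5 * n * logdet
    for Xi, Ri, ti in zip(X, R, tau):
        Ei = (Xi + ones @ ti.reshape(1, 3)) @ Ri - mu   # residuals E_i as defined in the text
        ll -= 0.5 * np.trace(Ei.T @ Sigma_inv @ Ei)
    return ll
```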
$$\hat{\tau}_{\mathrm{gen},i} = \frac{\left(R_i\,\mu^T - X_i^T\right)\Sigma^{-1}1_k}{1_k^T\,\Sigma^{-1}\,1_k} \qquad (8.4)$$
$$\tilde{X}_i = X_i + 1_k\,\hat{\tau}_i^T \qquad (8.6)$$
$$\tilde{X}_i = C\,X_i \qquad (8.7)$$
where
$$C = I - \frac{1_k\,1_k^T\,\Sigma^{-1}}{1_k^T\,\Sigma^{-1}\,1_k}. \qquad (8.8)$$
The centering matrix $C$ is also both singular and idempotent, since $C1_k = 0$ and $CX_i = CCX_i$. These properties of the centering matrix will be important later for understanding the difficulties with estimation of the covariance matrix.
The conditional ML estimates of the rotations are calculated using a singular value decomposition (SVD). Let the SVD of an arbitrary matrix $D$ be $U\Lambda V^T$. The optimal rotations $\hat{R}_i$ are then estimated by
$$\mu^T\,\Sigma^{-1}\,\tilde{X}_i = U\Lambda V^T, \qquad \hat{R}_i = V P U^T \qquad (8.9)$$
Rotoinversions are prevented by ensuring that the rotation matrix $\hat{R}_i$ has a positive determinant: $P = I$ if $|V|\,|U| = 1$ or $P = \operatorname{diag}(1, 1, -1)$ if $|V|\,|U| = -1$. If the mean is row centered, as discussed in the previous section, then the estimate of the rotation is in fact independent of the translations, and it is strictly unnecessary to use the centered structure $\tilde{X}_i$ in Eq. 8.9; one can just use $X_i$ in the SVD, a fact which can simplify some superposition algorithms.
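The rotation update of Eq. 8.9 maps directly onto a standard SVD routine. The sketch below is a minimal illustration (not the author's code) of the conditional ML rotation for a single structure, including the determinant correction that excludes rotoinversions.

```python
import numpy as np

def optimal_rotation(X_centered, mu, Sigma_inv):
    """Conditional ML rotation for one structure (Eq. 8.9)."""
    D = mu.T @ Sigma_inv @ X_centered          # 3 x 3 weighted cross-product matrix
    U, _, Vt = np.linalg.svd(D)                # D = U Lambda V^T
    V = Vt.T
    # force a proper rotation (det = +1) to prevent rotoinversions
    P = np.diag([1.0, 1.0, np.sign(np.linalg.det(V) * np.linalg.det(U))])
    return V @ P @ U.T
```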
The mean structure is estimated as the arithmetic average of the optimally rotated and translated structures:
$$\hat{\mu} = \frac{1}{n}\sum_i^n \tilde{X}_i\,R_i \qquad (8.10)$$
$$\hat{\mu} = \frac{1}{n}\sum_i^n C\,X_i\,R_i = C\left(\frac{1}{n}\sum_i^n X_i\,R_i\right) \qquad (8.11)$$
It follows that $C\hat{\mu} = \hat{\mu}$, due to the idempotency of the centering matrix.
$$\hat{\Sigma}_s = \frac{1}{3n}\sum_i^n (\tilde{X}_i R_i - \hat{\mu})(\tilde{X}_i R_i - \hat{\mu})^T \qquad (8.12)$$
However, as is clear from the likelihood function 8.2, the Gaussian model requires
that the covariance matrix be invertible, i.e., all its eigenvalues must be positive.
The ML estimates of the translations and rotations also both require an invertible
covariance matrix. These facts present three crucial difficulties for joint ML
estimation of the parameters in the superposition problem.
First, the ML estimate of the covariance matrix exists only when sufficient data
are present to avoid rank deficiency, i.e., when $n \ge (k/3) + 1$ [165]. In practice,
this condition is rarely satisfied due to limited data.
Second, the ML estimate of the covariance matrix requires centering the structures (see Eq. 8.5), which imparts a common linear constraint on the columns of the structure matrices. This can be seen by representing $\hat{\Sigma}_s$ in terms of the centering matrices:
$$\hat{\Sigma}_s = \frac{1}{3n}\sum_i^n (C X_i R_i - C\hat{\mu})(C X_i R_i - C\hat{\mu})^T = C\left[\frac{1}{3n}\sum_i^n (X_i R_i - \hat{\mu})(X_i R_i - \hat{\mu})^T\right]C^T \qquad (8.13)$$
Thus, $\hat{\Sigma}_s$ is both row and column centered, and since $C$ is singular, the sample covariance matrix $\hat{\Sigma}_s$ is also singular. Even with sufficient data, the sample
covariance matrix is rank deficient, with at least one zero eigenvalue [431, 432],
and therefore it is non-invertible.
Third, the displacements from the mean due to the covariance matrix $\Sigma$ and the translations $\tau_i$ are linearly entangled (see Eq. 8.1), and simultaneous estimation is not possible [242, 243, 431]. This last problem is particularly acute, since the unrestrained ML estimates of the translations $\tau_i$ and of the covariance matrix $\Sigma$ are
strongly interdependent. Translations that closely superpose a given atom decrease
the estimated variance, and a small variance in turn weights the translations to center
on that atom. It is always possible to find translations that exactly superposition any
selected atom, making its sample variance zero, the covariance matrix $\hat{\Sigma}$ singular,
and the likelihood infinite. The possibility of an infinite likelihood suggests that the
covariance matrix should be regularized so that small eigenvalues are penalized. If
all eigenvalues are constrained to be finite and positive, then the covariance matrix
will be invertible and each of these problems solved.
In real applications the variances cannot take arbitrary values. For instance, the
atoms in a macromolecule are linked by chemical bonds, and the atomic variances
are similar in magnitude. Very small or large variances are improbable and
physically unrealistic. Thus, simultaneous estimation of the covariance matrix and
the translations can be enabled by restricting the variances to physically reasonable
values.
To estimate the covariance matrix, we therefore adopt a hierarchical model where the eigenvalues ($\lambda_j$) of the covariance matrix $\Sigma$ are inverse gamma distributed.
The eigenvalues of the covariance matrix can be considered as variances with all
$$p(\lambda_j) = \sqrt{\frac{\alpha}{\pi}}\;\lambda_j^{-3/2}\,e^{-\alpha/\lambda_j} \qquad (8.14)$$
where ˛ is the scale parameter. This distribution is also known as the Lévy
distribution, which is one of the few analytically expressible stable distributions. It
corresponds to a scaled inverse chi squared distribution with one degree of freedom,
and thus has a straightforward Bayesian interpretation as a minimally informative,
conjugate hierarchical prior. The inverse gamma distribution conveniently places a
low probability on both small and large eigenvalues, with zero probability on zero-
valued eigenvalues. The corresponding log-likelihood for the $k$ eigenvalues is (up
to a constant)
$$\ell(\alpha \mid \lambda) = \frac{k}{2}\ln\alpha - \frac{3}{2}\sum_i^k \ln\lambda_i - \alpha\sum_i^k\frac{1}{\lambda_i} = \frac{k}{2}\ln\alpha - \frac{3}{2}\ln|\Sigma| - \alpha\operatorname{tr}\Sigma^{-1} \qquad (8.15)$$
The complete joint log-likelihood $\ell_h$ for this hierarchical model (an extended likelihood [63, 567], also known as an h-likelihood [429] or penalized likelihood [255]) is then the sum of the “pure” log-likelihood from Eq. 8.3 and the log-likelihood of an inverse gamma distribution for the random eigenvalues (Eq. 8.15):
$$\ell_h = -\frac{3n}{2}\ln|\Sigma| - \frac{1}{2}\sum_i^n \operatorname{tr}\!\left\{E_i^T\,\Sigma^{-1}\,E_i\right\} + \frac{k}{2}\ln\alpha - \frac{3}{2}\ln|\Sigma| - \alpha\operatorname{tr}\Sigma^{-1}. \qquad (8.17)$$
The hierarchical model described by the likelihood in Eq. 8.17 is in fact identical to putting a diagonal inverse Wishart prior on the covariance matrix, with one degree of freedom and scale matrix equal to $\alpha I$.
Various pragmatic methods are available for likelihood inference using the extended
likelihood function presented above. The first that we examine, presented in [712],
is to treat the covariance matrix (with its associated eigenvalues) and the hyperpa-
rameter ˛ as parameters of interest and maximize over them. This is appropriate,
for instance, whenever the covariance matrix itself is considered informative and
the correlation structure of an ensemble of molecular conformations is desired (e.g.,
[382, 714]).
The extended ML estimate $\hat{\Sigma}_h$ of $\Sigma$ is a linear function of the unrestricted conditional ML estimate $\hat{\Sigma}_s$ from Eq. 8.12:
$$\hat{\Sigma}_h = \frac{3n}{3n+3}\left(\frac{2\alpha}{3n}\,I + \hat{\Sigma}_s\right) \qquad (8.18)$$
In this ML hierarchical model, the point estimate of the inverse gamma parameter $\alpha$ is determined by the data, unlike when using a bona fide Bayesian prior. The $\hat{\Sigma}_h$ estimate can be viewed as a shrinkage estimate that contracts the eigenvalues of the covariance matrix to the mode of the inverse gamma distribution.
It will also be useful to specify the conditional ML estimate of the inverse gamma distributed eigenvalues $\hat{\Lambda}_h$ of the covariance matrix:
$$\hat{\Lambda}_h = \frac{3n}{3n+3}\left(\frac{2\alpha}{3n}\,I + \hat{\Lambda}_s\right) \qquad (8.19)$$
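In code, the shrinkage of Eqs. 8.18 and 8.19 is a one-liner; the helper below (a hypothetical name, shown only for illustration) contracts the sample eigenvalues toward the mode of the inverse gamma distribution.

```python
import numpy as np

def shrink_eigenvalues(lam_sample, n, alpha):
    """Extended-ML (shrinkage) estimate of the covariance eigenvalues, Eq. 8.19."""
    lam_sample = np.asarray(lam_sample, dtype=float)
    return (3.0 * n / (3.0 * n + 3.0)) * (2.0 * alpha / (3.0 * n) + lam_sample)
```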
The marginal distribution (Eq. 8.20) has the form of a matrix Student-t density, which is difficult to treat analytically. However, the Expectation-Maximization
where $\hat{\Sigma}_s$ is the sample covariance matrix from Eq. 8.12. Due to the rank deficiency of the sample covariance matrix, when using the marginal likelihood model it will generally be necessary to assume a diagonal structure (i.e., $\Sigma = \Lambda$) for the covariance matrix so that the expected inverses of the eigenvalues can be found easily:
$$E\!\left[\Lambda_h^{-1} \mid R, \tau, \mu, \alpha, X\right] = \frac{3n+1}{3n}\left(\frac{2\alpha}{3n}\,I + \hat{\Lambda}_s\right)^{-1}. \qquad (8.22)$$
Note that there is an important complication with each of the estimates of the
covariance matrix given above in Eqs. 8.18, 8.19, and 8.21. Namely, the smallest
sample eigenvalues are nonidentifiable due to the rank degeneracy of the sample
covariance matrix. Without special care for the rank degeneracy problem, then, the
naive covariance estimates given in Eqs. 8.18, 8.19, and 8.21 are invalid (see, for
example, the algorithm presented in [497], which results in degenerate solutions
having arbitrary atoms perfectly superpositioned with zero variance). We deal with
this problem by treating the missing eigenvalues with the EM algorithm. Recall
that the sample covariance matrix has multiple zero eigenvalues, regardless of
the number of structures used in the calculation. The sample covariance matrix
is of maximum effective rank $k - 4$ (one linear translational constraint and three non-linear rotational constraints) and can be less when there are few structures (rank $= \min(3n - 7,\, k - 4)$). We treat these missing eigenvalues as missing data
(from a left-truncated distribution) and estimate ˛ conditional on the “observed”
sample eigenvalues. In previous work, for simplicity we calculated the parameters
of the inverse gamma distribution using the usual ML estimates but omitting the
zero eigenvalues [713], a reasonable but inexact approximation. Here we give an
exact solution, using an EM algorithm that determines the expected values for the
inverses of the missing eigenvalues.
In the following equations it is assumed that the eigenvalues are ordered from largest to smallest. The naive conditional ML estimate of $\alpha$ is given by
$$\hat{\alpha} = \frac{k}{2\operatorname{tr}\hat{\Sigma}_h^{-1}} = \frac{k}{2\operatorname{tr}\hat{\Lambda}_h^{-1}} = \frac{k}{2\sum_i^k \lambda_i^{-1}} \qquad (8.23)$$
Because the $m$ smallest eigenvalues are missing, the eigenvalue distribution is left-truncated. Hence, the EM estimate of $\alpha$ is given by
$$\hat{\alpha} = \frac{k}{2\left(m\,E\!\left[\lambda_{sm}^{-1} \mid \alpha, \gamma, c\right] + \sum_i^{k-m}\lambda_i^{-1}\right)} \qquad (8.24)$$
where $m$ is the number of missing eigenvalues, and $E[\lambda_{sm}^{-1} \mid \alpha, \gamma, c]$ is the expected value of the inverse of the $m$ smallest missing eigenvalues, conditional on the smallest observed eigenvalue $c$. The expected inverse of the smallest eigenvalues
can be expressed analytically:
$$E\!\left[\lambda_{sm}^{-1} \mid \alpha, \gamma, c\right] = \frac{\Gamma(\gamma + 1,\, x)}{\hat{\alpha}\,\Gamma(\gamma,\, x)} \qquad (8.25)$$
where $x = \hat{\alpha}/c$, $c$ is the smallest observed eigenvalue, $\gamma$ is the shape parameter of the inverse gamma distribution, and $\Gamma(a, s)$ is the (unnormalized) upper incomplete gamma function:
$$\Gamma(a, s) = \int_s^{\infty} t^{a-1} e^{-t}\,dt$$
for $a$ real and $s \ge 0$. Since we here assume that $\gamma = \tfrac{1}{2}$, Eq. 8.25 can be simplified:
$$E\!\left[\lambda_{sm}^{-1} \mid \alpha, c\right] = \frac{\Gamma\!\left(\tfrac{3}{2},\, x\right)}{\hat{\alpha}\,\Gamma\!\left(\tfrac{1}{2},\, x\right)} \qquad (8.26)$$
$$= \frac{1}{2\hat{\alpha}} + \frac{e^{-x}\sqrt{x}}{\hat{\alpha}\,\sqrt{\pi}\,\operatorname{erfc}(\sqrt{x})} \qquad (8.27)$$
This EM algorithm, then, allows for valid estimation of the covariance matrix and its eigenvalues. Given a positive $\alpha$ parameter, the hierarchical model guarantees an invertible $\hat{\Sigma}_h$ by ensuring that all its eigenvalues (and variances) are positive, as can be seen from Eqs. 8.19 and 8.22. Hence the hierarchical model is sufficient to overcome all three of the difficulties with estimation of the covariance matrix enumerated above.
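The EM update of $\alpha$ in Eqs. 8.23–8.27 can be sketched as a simple fixed-point iteration. The code below is an assumption-laden illustration ($\gamma = 1/2$, SciPy's erfc for Eq. 8.27, convergence by a fixed number of iterations), not the author's implementation.

```python
import numpy as np
from scipy.special import erfc

def expected_inverse_missing(alpha, c):
    """Expected inverse of the missing (left-truncated) eigenvalues, Eq. 8.27."""
    x = alpha / c
    return 1.0 / (2.0 * alpha) + np.exp(-x) * np.sqrt(x) / (alpha * np.sqrt(np.pi) * erfc(np.sqrt(x)))

def em_alpha(observed_lams, m, k, n_iter=100, alpha0=1.0):
    """EM estimate of the inverse gamma scale alpha (Eqs. 8.23-8.24).

    observed_lams : the k - m observed (non-zero) sample eigenvalues
    m             : number of missing (zero) eigenvalues
    k             : total number of eigenvalues (atoms)
    """
    c = min(observed_lams)                               # smallest observed eigenvalue
    inv_sum_obs = sum(1.0 / l for l in observed_lams)
    alpha = alpha0
    for _ in range(n_iter):
        e_inv = expected_inverse_missing(alpha, c)       # E-step: expected inverse of missing values
        alpha = k / (2.0 * (m * e_inv + inv_sum_obs))    # M-step, Eq. 8.24
    return alpha
```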
8.9 Algorithm
$E(\Lambda_h^{-1})$ for the marginal likelihood model), which has already been determined.
8. Loop: Return to step 2 and loop until convergence.
When assuming that the variances are all equal and that there are no correlations (i.e., when $\Sigma \propto I$), then the above algorithm is equivalent to the classic least-squares algorithm for simultaneous superpositioning of multiple structures [149, 216, 363, 647]. Examples of ML superpositions, with the corresponding OLS superpositions, are presented in Figs. 8.1 and 8.2.
Fig. 8.3 Superpositions of a simulated ensemble of protein structures. (a) In grey is the
true superposition, before random rotation and translation. (b) The least squares superposition.
(c) A maximum likelihood superposition using the inverse gamma hierarchical matrix normal
model (Modified from [712])
Fig. 8.4 Comparison of true and estimated covariance matrices, by both LS and ML superposi-
tioning methods. In these plots, blue indicates positive covariance whereas red indicates negative
covariance; white indicates no covariance. The top row (panels a, b, and c) shows results from
a simulation with very weak correlations. The bottom row (d, e, and f) shows results from a
simulation with strong correlations. The true covariance structure is shown in the left-most column
(panels a and d). Covariance matrices were estimated using the least-squares criterion (middle
panels b and e) and the maximum likelihood method (right-most panels c and f)
Ensembles of structures were simulated assuming a matrix Gaussian distribution with known mean and known covariance
matrices. The simulations were based on two different covariance matrices. The first
has extremely weak correlations and uneven variances that ranged over three orders
of magnitude (Fig. 8.4a). The second has the same variances but contained strong
correlations/covariances (Fig. 8.4d). The covariance matrices were then estimated
from these simulations using both least-squares (Fig. 8.4b and e) and ML (Fig. 8.4c
and f).
The ML estimate of the covariance matrix is considerably more accurate than
the least-squares estimate (Fig. 8.4). The least-squares estimate is markedly biased
and shows a strong artifactual pattern of correlation (Fig. 8.4b and e). The ML
estimate, in contrast, is nearly visually indistinguishable from the matrix assumed
in the simulations (Fig. 8.4c and f).
The likelihood analysis described above does not address the uncertainty in the
estimates of the parameters. Bayesian analysis, however, can provide posterior
distributions for the parameters and can also incorporate prior knowledge from
other data (e.g., crystallographic B-factors or NMR order parameters). Due to
the close philosophical relationship between likelihood and Bayesian methods, the
likelihood treatment described above provides a natural theoretical foundation for a
full Bayesian treatment of the multiple superposition problem.
208 D.L. Theobald
where $p(X \mid \Sigma, \mu, R, \tau)$ is the likelihood given in Eq. 8.2, and $p(\Sigma)$, $p(\mu)$, $p(R)$, and $p(\tau)$ are the prior distributions. To enable a fully heteroskedastic, correlated model we adopt a hierarchical, diagonal inverse Wishart prior for $\Sigma$ in which the scale matrix of the inverse Wishart is proportional to $I$.
In this Bayesian formulation we use standard conjugate priors, and thus the conditional distributions of all of the parameters except the rotations have convenient analytical representations. The posterior for the rotations belongs to the matrix von Mises-Fisher distribution, which has a normalization constant with no known analytical form. We have derived the MAP estimates for this Bayesian superposition model and have developed a hybrid Gibbs-Metropolis sampling algorithm to approximate the joint posterior distribution of the parameters [481, 710, 711].
Future versions of our THESEUS software will include the option of performing
a Bayesian superposition analysis based on this methodology.
Acknowledgements Much of this methodology was initially developed with Deborah S. Wuttke
at the University of Colorado at Boulder. I thank Phillip Steindel, Thomas Hamelryck, Kanti
Mardia, Ian Dryden, Colin Goodall, and Subhash Lele for helpful comments and criticism. This
work was supported by NIH grants 1R01GM094468 and 1R01GM096053.
Chapter 9
Bayesian Hierarchical Alignment Methods
Two objects have the same form or size-and-shape if they can be translated and
rotated onto each other so that they match exactly. In other words, the objects
are rigid-body transformations of each other [159]. Quantitatively, form is all the
geometrical information remaining when orientation and translation are filtered
out. On the other hand, shape is the geometrical information remaining when
transformations in general – including orientation, translation and additionally
scaling – are filtered out. Landmark points on an object can be used to describe
or represent its form or shape. Two objects with the same form are equivalent up to
translation and rotation. That is, they not only have the same shape, but also the same
size. Thus, shape analysis deals with configurations of points with some invariance.
Landmarks in labeled shape analysis are supposed to be uniquely defined for similar
objects [333]. Thus, for labeled shape analysis, landmark matrices and alignments
are known and given. Figure 9.1 is an example of a set of pre-specified landmarks
that describe shape. Shown in the figure is an outline and three landmarks for a
fossil, microscopic planktonic foraminiferan, Globorotalia truncatulinoides, found
in the ooze on the ocean bed. Given a set of fossils, a basic application is estimating
an average shape and describing shape variability.1 Now consider a simple example:
a case of triangles in Fig. 9.2. In shape space, these triangles are equivalent under
rotation and translation. In reflection shape analysis, the equivalence class also
includes reflected triangles. For $m$ fixed landmarks and $d$ dimensions, let $X$ be the $(m \times d)$ configuration matrix with $m$ rows as landmarks $x_i$, $i = 1, \dots, m$. The standard model for rigid transformation is (see for example [159], p. 88)
$$X_i = (\mu + E_i)\,R_i + 1_k\,\tau_i^T$$
1
See p. 15 in [159] for other applications of shape analysis in biology.
parameters are removed a priori. For similarity shape, one needs some modification
to allow for scaling. For Bayesian methods in labeled shape analysis see Chap. 8
and [712]. In machine vision, active shape models play a key role [119, 120]. In
these models, $\mu$ is allowed to dynamically deform within constraints learned from
a training set of previous configurations.
In unlabeled shape analysis, the labeling is unknown. For example, the triangles in Fig. 9.3 have the same form even after allowing for the six label permutations of the vertices. The matching solution is the set of pairs $(1, 2')$, $(3, 1')$ and $(2, 3')$. The pairwise matching can be represented by the matching or permutation matrix $M$:
$$M = \begin{pmatrix} 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{pmatrix}.$$
Fig. 9.3 Triangles with the same form but arbitrary labels
Fig. 9.4 Alignment of two configurations of points schematically representing atomic positions
in two proteins
represent atomic positions in two proteins. Only a few points match. That is, the
corresponding points and the number of matches in addition to the transformation
are unknown beforehand. Figure 9.5 shows configurations after a superposition that
involves rotating the second frame clockwise.
The aim of unlabeled shape analysis is somewhat different from labeled shape
analysis. The aim of unlabeled shape analysis is to estimate the matching matrix
M, while in labeled shape analysis the aim is to estimate $\mu$. Note that in practice,
specific coordinate representations of labeled shape have been useful; they include
Bookstein coordinates [67] and Procrustes tangent coordinates [366]. For a set of
configurations close to $\mu$, specific coordinate representations could be useful even
for unlabeled shape analysis.
A basic problem in unlabeled shape analysis is to find matching points. Suppose
we have two configurations X and Y, for example representing the atomic positions
in two proteins. Matching can be represented by a matching matrix M:
$$M_{jk} = \begin{cases} 1 & \text{if the } j\text{th point in } X \text{ corresponds to the } k\text{th point in } Y,\\ 0 & \text{otherwise.} \end{cases}$$
model is given in Sects. 9.5–9.8. Practical approaches with illustrations using the
model are given in [256].
where $p(R)$, $p(\tau)$ and $p(\sigma)$ denote prior distributions for $R$, $\tau$ and $\sigma$; $|R|$ is the Jacobian of the transformation from the space of $X$ into the space of $Y$ and $\phi(\cdot)$ is the standard normal probability density function. $\rho$ measures the tendency a priori for points to be matched and can depend on attributes of $x_j$ and $y_k$; for example, the atom types when matching protein structures [254].
For two-dimensional configurations such as protein gels, $d = 2$ and for three-dimensional configurations such as active sites, $d = 3$. We describe the motivation
of the model and the mathematical derivation in the next section.
region does not affect the occurrence of any other point in its neighborhood.
However, the probability of points overlapping is zero. The number of points
in a given disjoint subregion is random and independently follows the Poisson
distribution with rate $\lambda v$, where $v$ is the volume of the region. The intensity of points
from a homogeneous spatial Poisson process does not vary over the region and
points are expected to be uniformly located [128].
We suppose that $N$ locations spanning a super-population over a region of volume $v$ are realized. Each point may belong to $X$, $Y$, neither, or both with probabilities $p_x$, $p_y$, $1 - p_x - p_y - \rho\,p_x p_y$, and $\rho\,p_x p_y$ respectively, where $\rho$ is the tendency of configuration points to match a priori [135, 254]. Under this model, the number of locations which belong to $X$, $Y$, neither and both are independent Poisson random variables with counts $m - L$, $n - L$, $N - m - n + L$ and $L$, where $L = \sum_{j,k} M_{jk}$. Thus, both point sets $X$ and $Y$ are regarded as noisy observations on subsets of a set of true locations $\mu = \{\mu_i\}$, where we do not know the mappings from $j$ and $k$ to $i$. There may be a geometrical transformation between the coordinate systems of $X$ and $Y$, which may also be unknown. The objective is to make model-based inference about these mappings, and in particular to make probability statements about matching: which pairs $(j, k)$ correspond to the same true location?
Conditional on $m$ and $n$, $L$ follows a hypergeometric distribution, which belongs to the exponential family:
$$p(L \mid m, n, D) = \frac{K\,D^L}{(m - L)!\,(n - L)!\,L!} \qquad (9.2)$$
Thus we have
$$x_j \sim N_d(\mu_j,\, \sigma^2 I_d) \qquad\text{and}\qquad R\,y_k + \tau \sim N_d(\mu_k,\, \sigma^2 I_d).$$
Integrating over $\mu$ and assuming that $v$ is large enough relative to the supports of the densities for $x_j - \mu_j$ and $Ry_k + \tau - \mu_k$, the likelihood is
$$p(X, Y \mid M, R, \tau, \sigma) \propto |R|^n \prod_{j,k}\left\{\frac{\phi\!\left(r_1(x_j, k)/(\sigma\sqrt{2})\right)}{(\sigma\sqrt{2})^d}\right\}^{M_{jk}} \qquad (9.5)$$
where $|R|$ is the Jacobian of the transformation from the coordinate system of $Y$ into the coordinate system of $X$.
Although the model was motivated with a spatial homogeneous Poisson process
with points expected to be uniformly located and no inhibition distance between
points, none of the parameters for the process are in the joint model of interest; N ,
$v$ and $\lambda$ are integrated out. This allows the model to be applicable to many patterns
of configurations, including biological macromolecules with some given minimum
distances between points due to steric constraints.
where $F_{d\times d}(F_0)$ is the matrix Fisher distribution on which more details are given in Sect. 9.5.4. $G(\alpha, \beta)$ is a gamma distribution. $M$ is assumed uniform a priori, and conditional on $L$
$$p(M \mid L) \propto \rho^L.$$
The matrix Fisher distribution plays an important role in the ALIBI model. This distribution is perhaps the simplest non-uniform distribution for rotation matrices [157, 254, 474]. The distribution takes an exponential form, i.e. for a matrix $R$:
$$p(R) \propto \exp\!\left\{\operatorname{tr}(F_0^T R)\right\},$$
while the uniform distribution on rotations corresponds to
$$p(R) \propto 1.$$
The first two matrices represent longitude and co-latitude, respectively, while the third matrix represents the rotation of the whole frame; see for example [254]. For $i < j$, let $R_{ij}(\theta_{ij})$ be the $3\times 3$ matrix with $a_{ii} = a_{jj} = \cos\theta_{ij}$, $a_{ij} = -a_{ji} = \sin\theta_{ij}$, $a_{rr} = 1$ for $r \ne i, j$ and other entries 0.
The full posterior distribution is given by Eq. 9.1. Inference is done by sampling
from conditional posterior distributions for transformation parameters and variance
($R$, $\tau$ and $\sigma$) and updating the matching matrix $M$ using the Metropolis-Hastings
algorithm. As discussed in Chap. 2, the Metropolis-Hastings algorithm is a Markov
218 K.V. Mardia and V.B. Nyirongo
chain Monte Carlo method used for obtaining random samples from a distribution
where direct sampling is difficult. Details on updating M are given in Sect. 9.6.2.
With $p(R) \propto \exp(\operatorname{tr}(F_0^T R))$ for some matrix $F_0$, the posterior has the same form with $F_0$ replaced by
$$F = F_0 + S \qquad (9.8)$$
where
$$S = \frac{1}{2\sigma^2}\sum_{j,k} M_{jk}\,(x_j - \tau)\,y_k^T. \qquad (9.9)$$
In the three-dimensional case the joint full conditional density for the Euler angles is
$$\propto \exp\!\left[\operatorname{tr}\{F^T R\}\right]\cos\theta_{13}$$
for $\theta_{12}, \theta_{23} \in (-\pi, \pi)$ and $\theta_{13} \in (-\pi/2, \pi/2)$. The cosine term arises since the natural dominating measure corresponding to the uniform distribution of rotations has volume element $\cos\theta_{13}\,d\theta_{12}\,d\theta_{13}\,d\theta_{23}$ in these coordinates.
From the representation in Eq. 9.7, the exponent $\operatorname{tr}\{F^T R\}$, viewed as a function of one Euler angle $\theta_{ij}$ with the other two angles held fixed, can be written in the form
$$a_{ij}\cos\theta_{ij} + b_{ij}\sin\theta_{ij} + c_{ij},$$
where the coefficients $a_{ij}$ and $b_{ij}$ are linear combinations of the entries of $F$ with weights given by sines and cosines of the two remaining angles.
The $c_{ij}$ are ignored as they are combined into the normalizing constants. Thus, the full conditionals for $\theta_{12}$ and $\theta_{23}$ are von Mises distributions. The posterior for $\theta_{13}$ is proportional to
$$\exp\!\left[a_{13}\cos\theta_{13} + b_{13}\sin\theta_{13}\right]\cos\theta_{13}.$$
In the two-dimensional case the full conditional distribution for $\theta$ is of the same von Mises form (Eq. 9.6), with $\kappa\cos\nu$ updated to $(\kappa\cos\nu + S_{11} + S_{22})$, and $\kappa\sin\nu$ to $(\kappa\sin\nu - S_{12} + S_{21})$, where $\nu$ and $\kappa$ denote the location and concentration of the von Mises prior. The von Mises distribution is the conjugate prior for rotations for the spherical Gaussian error distributions (Model 9.1).
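For the two-dimensional case the conjugate update can be written out in a few lines. The sketch below (with assumed symbol names for the von Mises prior parameters; not code from the chapter) forms the posterior von Mises parameters from the prior and the sufficient statistic $S$ of Eq. 9.9, and samples the rotation angle with NumPy.

```python
import numpy as np

def sample_rotation_angle_2d(X, Y, M, tau, sigma, kappa0, nu0, rng):
    """Gibbs update of the 2-D rotation angle from its von Mises full conditional."""
    # sufficient statistic S = (1 / (2 sigma^2)) sum_jk M_jk (x_j - tau) y_k^T   (Eq. 9.9)
    S = np.zeros((2, 2))
    for j, k in zip(*np.nonzero(M)):
        S += np.outer(X[j] - tau, Y[k])
    S /= 2.0 * sigma**2
    # posterior is proportional to exp{a cos(theta) + b sin(theta)}
    a = kappa0 * np.cos(nu0) + S[0, 0] + S[1, 1]
    b = kappa0 * np.sin(nu0) - S[0, 1] + S[1, 0]
    kappa_post = np.hypot(a, b)
    nu_post = np.arctan2(b, a)
    return rng.vonmises(nu_post, kappa_post)

# usage: rng = np.random.default_rng(); theta = sample_rotation_angle_2d(X, Y, M, tau, 0.5, 1.0, 0.0, rng)
```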
For updating $M$ conditionally, a different approach is needed. The full posterior for the matrix $M$ is not required, and is in fact complex. The matching matrix $M$ is updated, respecting detailed balance, using Metropolis-Hastings moves that only propose changes to a few entries: the number of matches $L = \sum_{j,k} M_{jk}$ can only increase or decrease by one at a time, or stay the same. The possible changes are as follows:
(a) Adding a match, which changes one entry $M_{jk}$ from 0 to 1;
(b) Deleting a match, which changes one entry $M_{jk}$ from 1 to 0;
(c) Switching a match, which simultaneously changes one entry from 0 to 1 and
another in the same row or column from 1 to 0.
The proposal proceeds as follows. First a uniform random choice is made from
all the m C n data points x1 ; x2 ; : : : ; xm ; y1 ; y2 ; : : : ; yn . Suppose without loss of
generality, by the symmetry of the set-up, that an X is chosen, say xj . There
are two possibilities: either xj is currently matched, in that there is some k such
that $M_{jk} = 1$, or not, in that there is no such $k$. If $x_j$ is matched to $y_k$, with probability $p$ we propose to delete the match, and with probability $1 - p$ we propose switching it from $y_k$ to $y_{k'}$, where $k'$ is drawn uniformly at random from
the currently unmatched Y points. On the other hand, if xj is not currently matched,
we propose to add a match between xj and a yk , where again k is drawn uniformly at
random from the currently unmatched Y points. We describe the MCMC implementation in Sect. 9.6.3.
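The proposal mechanism is compact enough to write down directly. The sketch below (illustrative; the acceptance step, which would follow Expression 9.10, is omitted) generates one add/delete/switch proposal for the matching matrix M.

```python
import numpy as np

def propose_match_update(M, p_delete, rng):
    """Propose one add/delete/switch move on the matching matrix M (m x n)."""
    m, n = M.shape
    idx = rng.integers(m + n)                      # uniform choice among all m + n data points
    if idx >= m:
        # a Y point was chosen; by symmetry, work on the transposed matrix
        return propose_for_point(M.T, idx - m, p_delete, rng).T
    return propose_for_point(M, idx, p_delete, rng)

def propose_for_point(M, j, p_delete, rng):
    proposal = M.copy()
    matched = np.nonzero(M[j])[0]                  # current partner of point j, if any
    unmatched = np.nonzero(M.sum(axis=0) == 0)[0]  # currently unmatched points on the other side
    if matched.size:
        k = matched[0]
        if rng.random() < p_delete or unmatched.size == 0:
            proposal[j, k] = 0                     # delete the match
        else:
            proposal[j, k] = 0                     # switch: move the match to an unmatched partner
            proposal[j, rng.choice(unmatched)] = 1
    elif unmatched.size:
        proposal[j, rng.choice(unmatched)] = 1     # add a match to an unmatched partner
    return proposal
```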
Attributes of points in the configurations can be taken into account. For example,
atom type, charge or binding affinity are some of the attributes which can be used.
The model in Eq. 9.1 can be extended for matching configurations with bonded
points and also for matching multiple configurations simultaneously. Furthermore,
the model can be used for refining a matching solution found by heuristic but fast
methods, such as approaches based on graph theory [15,235,477,635]. We consider
extensions to matching bonded points and multiple configurations in Sects. 9.8.1
and 9.8.2 below.
The model in Eq. 9.1 can be extended for matching bonded points in a configuration [477, 549]. Bonded points in a configuration are dependent. This is motivated by the requirement in bioinformatics to prefer matching amino acids with similar orientation. Thus, we can take into account the relative orientation of side chains by using Cα and Cβ atoms in matching amino acids. The positions of these two covalently bonded atoms in the same amino acid are strongly dependent.
We use the example of superimposing a potential active site, called the query site, on a true active site, called the functional site. Let $y_{1k}$ and $x_{1j}$ denote coordinates for Cα atoms in the query and the functional site. We denote Cβ coordinates for the query and functional site by $y_{2k}$ and $x_{2j}$ respectively. Thus $x_{1j}$ and $x_{2j}$ are dependent. Similarly, $y_{1k}$ and $y_{2k}$ are dependent. We take into account the position of $y_{2k}$ by using the conditional distribution given the position of $y_{1k}$. Given $x_{1j}$, $y_{1k}$, it is plausible to assume that
$$p(x_{1j}, x_{2j}, y_{1k}, y_{2k}) = p(x_{1j}, y_{1k})\,p(x_{2j}, y_{2k} \mid x_{1j}, y_{1k}),$$
$$x_{2j} \mid x_{1j} \sim N(x_{1j},\, \sigma^2 I_3), \qquad R\,y_{2k} \mid y_{1k} \sim N(R\,y_{1k},\, \sigma^2 I_3).$$
We assume for “symmetry” that $p(x_{2j} - Ry_{2k} \mid x_{1j}, y_{1k})$ depends only on the displacement as in the likelihood in Eq. 9.1. Let $r_2(x_{2j}, k) = x_{2j} - x_{1j} - R(y_{2k} - y_{1k})$. Thus $\phi(\cdot)$ in Eq. 9.1 is replaced by $\phi(\cdot)\,\phi\!\left(r_2(x_{2j}, k)/(\sigma\sqrt{2})\right)$ for the new full likelihood. Now the likelihood becomes
$$L(\vartheta) \propto |R|^n \prod_{j,k}\left\{\frac{\phi\!\left(r_1(x_j, k)/(\sigma\sqrt{2})\right)\,\phi\!\left(r_2(x_{2j}, k)/(\sigma\sqrt{2})\right)}{(\sigma\sqrt{2})^d}\right\}^{M_{jk}} \qquad (9.14)$$
Some probability mass for the distribution of $x_{2j} \mid x_{1j}$ and $Ry_{2k} \mid y_{1k}$ is unaccounted for because there are distance constraints between $x_{2j}$ and $x_{1j}$ and also between $y_{2k}$ and $y_{1k}$. Thus $x_{2j} - x_{1j}$ is not isotropic. In theory, a truncated distribution is required. However, using the Gaussian distribution is not expected to affect the performance of the algorithm because the relative matching probabilities for $x_{2j} - x_{1j}$ vectors are unaffected.
The additional term in the new full likelihood does not involve $\tau$. Hence, the posterior and updating of $\tau$ is unchanged. The full conditional distribution of $R$ is
$$p(R \mid M, \tau, \sigma, X_1, Y_1, X_2, Y_2) \propto |R|^{2n}\,p(R)\prod_{j,k}\left\{\phi\!\left(\frac{r_1(x_j, k)}{\sigma\sqrt{2}}\right)\phi\!\left(\frac{r_2(x_{2j}, k)}{\sigma\sqrt{2}}\right)\right\}^{M_{jk}}. \qquad (9.15)$$
Thus
$$p(R \mid M, \tau, \sigma, X_1, Y_1, X_2, Y_2) \propto p(R)\exp\!\left\{\operatorname{tr}\!\left((B + B^*)\,R\right)\right\}$$
where
$$B = \frac{1}{2\sigma^2}\sum_{j,k} M_{jk}\,y_{1k}\,(x_{1j} - \tau)^T$$
and
$$B^* = \frac{1}{2\sigma^2}\sum_{j,k} M_{jk}\,(y_{2k} - y_{1k})(x_{2j} - x_{1j})^T.$$
Similar to Eq. 9.8, with $p(R) \propto \exp(\operatorname{tr}(F_0^T R))$ for some matrix $F_0$, the full conditional distribution of $R$ – given data and values for all other parameters – has the same form with $F_0$ replaced by
$$F = F_0 + S + S^* \qquad (9.16)$$
9.8.1.2 Updating M
Similar to Expression 9.10, the acceptance probability for adding a match $(j, k)$ is
$$\min\left\{1,\; p\,n_u\,\rho\;\frac{\phi\!\left(r_1(x_j, k)/(\sigma\sqrt{2})\right)}{(\sigma\sqrt{2})^d}\;\frac{\phi\!\left(r_2(x_{2j}, k)/(\sigma\sqrt{2})\right)}{(\sigma\sqrt{2})^d}\right\},$$
where $n_u$ denotes the number of currently unmatched points available to form the proposed match.
$$R^{(c)} x_j^{(c)} + \tau^{(c)} = \mu_{\xi_j^{(c)}} + \varepsilon_j^{(c)}, \qquad j = 1, 2, \dots, n_c,\; c = 1, 2, \dots, C. \qquad (9.18)$$
The unknown transformation $R^{(c)} X^{(c)} + \tau^{(c)}$ brings the configuration $X^{(c)}$ back into the same frame as the $\mu$-points, and $\xi^{(c)}$ is a labeling array linking each point in configuration $X^{(c)}$ to its underlying $\mu$-point. As before, the elements within each labeling array are assumed to be distinct. In this context a match can be seen as a set of points $X_{j_I} = \bigl\{x_{j_1}^{(i_1)}, x_{j_2}^{(i_2)}, \dots, x_{j_k}^{(i_k)}\bigr\}$ such that $\xi_{j_1}^{(i_1)} = \xi_{j_2}^{(i_2)} = \dots = \xi_{j_k}^{(i_k)}$. Define $\rho_I$, the tendency a priori for points to be matched in the $I$th set. The posterior model
has the form
$$p(A, M \mid X) \propto \prod_{c=1}^{C}\left\{p(R^{(c)})\,p(\tau^{(c)})\,\bigl|R^{(c)}\bigr|^{n_c}\right\}\;\prod_{I}\prod_{J_I}\frac{\rho_I\,\exp\!\bigl\{-\tfrac{1}{2\sigma^2}\,A(X_{j_I})\bigr\}}{|I|^{d/2}\,(2\pi\sigma^2)^{d(|I|-1)/2}}$$
where $A = \{R^{(1)}, R^{(2)}, \dots, R^{(C)}\}$ and $M = \{M^{(1)}, M^{(2)}, \dots, M^{(C)}\}$ is the matching array and
$$A(X_{j_I}) = \sum_{k=1}^{|I|}\bigl\|R^{(i_k)} x_{j_k}^{(i_k)} + \tau^{(i_k)} - c\bigr\|^2$$
with $c = \frac{1}{|I|}\sum_{k=1}^{|I|}\bigl(R^{(i_k)} x_{j_k}^{(i_k)} + \tau^{(i_k)}\bigr)$ and $\|\cdot\|$ denotes the Euclidean norm.
Prior distributions for the $\tau^{(c)}$, $R^{(c)}$, and $\sigma^2$ are identical to those in Eq. 9.1. For $c = 1, 2, \dots, C$,
$$\tau^{(c)} \sim N_d\bigl(\mu_\tau^{(c)},\, \sigma_c^2 I_d\bigr), \qquad \sigma^{-2} \sim G(\alpha, \beta), \qquad R^{(c)} \sim F_{d\times d}(F_c).$$
That is, here the prior distribution for the matches $M$ also assumes that $\mu$-points follow a Poisson process with constant rate $\lambda$ over a region of volume $v$. Each point in the process gives rise to a certain number of observations, or none at all. Let $q_I$ be the probability that a given hidden location generates an $I$-match. Then, consider the following parameterization:
$$q_I = \rho_I \prod_{c \in I} q_{\{c\}},$$
where $\rho_I$ is the tendency a priori for points to be matched in the $I$th set and $\rho_I = 1$ if $|I| = 1$. Define $L_I$ as the number of $I$-matches contained in $M$, and assume the conditional distribution of $M$ given the $L_I$ numbers is uniform. The prior distribution for the matches can then be expressed as
$$p(M) \propto \prod_I\left(\frac{\rho_I}{v^{|I|-1}}\right)^{L_I}.$$
This is a generalization of the prior distribution for the matching matrix $M$ [616]. Similar to the role of $\rho$ in the pairwise model (Eq. 9.1), $\rho_I$ measures the matching tendency a priori for $I$-matches. If $\rho_I \gg \rho_{I'}$ then one would see more $I$-matches than $I'$-matches. For example, for $C = 3$, if $\rho_{\{1,2\}}$, $\rho_{\{1,3\}}$ and $\rho_{\{2,3\}}$ are much larger than $\rho_{\{1,2,3\}}$, then one would mostly see pairwise matches and fewer triple matches.
and $L = \sum_{j,k} M_{jk}$ denotes the number of matches. The RMSD is the focus of study in combinatorial algorithms for matching. In the Bayesian formulation, the log likelihood (with uniform priors) is proportional to
$$\text{constant} \;-\; 2\sum_{j,k} M_{jk}\ln\sigma \;+\; \sum_{j,k} M_{jk}\ln\rho \;-\; \frac{Q}{4\sigma^2}.$$
9.9.2 EM Algorithm
The problem can be recast as mixture model estimation. This might suggest
considering maximization of the posterior or likelihood using the EM algorithm
[254]. In the EM formulation, the “missing data” are the $M_{jk}$s. The EM algorithm alternates between finding $E\bigl[M_{jk} \mid X, Y\bigr]$ at current values of $R$, $\tau$ and $\sigma$ and maximizing over those parameters, but requires repeatedly re-normalizing the matching matrix $P$. Thus, for maximizing the posterior the E-step involves calculating $P_{jk} = W_{jk}/(1 + W_{jk})$, where
$$W_{jk} = \rho\,\phi\!\left(r_1(x_j, k)/(\sigma\sqrt{2})\right). \qquad (9.20)$$
The M-step involves maximizing
$$\ln\bigl\{|R|^n\,p(R)\,p(\tau)\,p(\sigma)\bigr\} + \sum_{j,k} P_{jk}\ln W_{jk}. \qquad (9.21)$$
$$P_I = \sum_{L = L_{\mathrm{obs}}}^{m} p(L \mid m, n, D) \qquad (9.22)$$
9.10 Discussion
In this chapter, we have mainly considered the ALIBI model of Green and Mardia
[254] and its extensions. However, there are also other Bayesian alignment methods.
Acknowledgements We are thankful to Professor Peter Green and Dr. Yann Ruffieux for
many useful discussions on ALIBI based approaches. Our thanks also go to Chris Fallaize and
Zhengzheng Zhang for their helpful comments.
Part V
Graphical Models for Structure Prediction
Chapter 10
Probabilistic Models of Local Biomolecular
Structure and Their Applications
10.1 Introduction
W. Boomsma
Dept. of Astronomy and Theoretical Physics, Lund University, Lund, Sweden
e-mail: [email protected]
Dept. of Biomedical Engineering, DTU Elektro, Technical University of Denmark,
Lyngby, Denmark
J. Frellsen T. Hamelryck
Bioinformatics Centre, Dept. of Biology, University of Copenhagen, Copenhagen, Denmark
[email protected]; [email protected]
The set of recurring structural motifs found by Corey and Pauling has been extended
substantially in the last decades. In addition to ˛-helices and ˇ-sheets, the list of
known motifs now includes ˇ-turns, ˇ-hairpins, ˇ-bulges and N-caps and C-caps at
the ends of helices. These motifs have been studied extensively, both experimentally
and by knowledge-based approaches, revealing their amino acid preferences and
structural properties [327].
Starting in the late 1980s, attempts were made to automate the process of
detecting local structural motifs in proteins, using the increasing amount of publicly
available structural data. Several groups introduced the concept of a structural
building block, consisting of a short fragment spanning between four and eight
residues. The blocks were found using various clustering methods on protein
fragments derived from the database of solved structures [326, 610, 732].
At the time, the low number of available solved structures severely limited
the accuracy of the local structure classification schemes. However, the sequence
databases contained much more information and grew at a faster rate. This fact moti-
vated an alternative approach to local structure classification. Instead of clustering
known structures and analyzing amino acid preferences of these structures, the idea
was to find patterns in sequence space first, and only then consider the corresponding
structural motifs [277]. This approach was later extended to a clustering approach
that simultaneously optimized both sequence and structure signals, leading to the
I-sites fragment library [93].
of local structure, and will therefore be the definition of fragment assembly used in
this chapter.
In 1994, Bowie and Eisenberg presented the first complete fragment assembly
method for ab initio protein structure prediction. For a given target sequence, all
nine-residue sequence segments were extracted and compared with all fragments
in their library, using a sequence-structure compatibility score. Longer fragments
(15–25 residues) were identified using sequence alignment with the protein data
bank (PDB) [45]. Using various fragment-insertion techniques, an initial population
of candidate structures was generated. To minimize the energy, an evolutionary
algorithm was designed to work on the angular degrees of freedom of the structures,
applying small variations (mutations) on single candidate structures, and recombi-
nations on pairs of candidate structures (cross-over). The study reported remarkably
good results for small helical proteins [74].
In 1997, two other studies made important contributions to the development
of fragment assembly based techniques. Jones presented a fragment assembly
approach based on fragments with manually selected supersecondary structure, and
demonstrated a correct prediction of a complete protein target from the second
CASP experiment [350]. The second study was by Baker and coworkers, who
presented the first version of their Rosetta protein structure prediction method,
inspired by the Bowie and Eisenberg study. This method included a knowledge-
based energy function and used multiple sequence alignments to select relevant
fragments [658]. Reporting impressive results on a range of small proteins, the
paper made a significant impact on the field, and remains heavily cited. The Rosetta
method itself has consistently remained among the top-performing participants
in CASP experiments – a testament to the great potential of incorporating local
structural information into the conformational search strategy.
Although fragment assembly has proven to be an extremely efficient tool in the
field of protein structure prediction, the technique has several shortcomings. An
obvious concern is how to design a reasonable scheme for merging fragments. Either
an overlap of fragments will occur, requiring an averaging scheme for the angular
degrees of freedom in that region, or the fragments are placed side-by-side, which
introduces an angle configuration at the boundary that is not generally present in
the fragment library. This might seem like a minor issue, and it can be remedied
by a smoothing scheme that adjusts unrealistic angles. However, in the context of a
Markov chain Monte Carlo (MCMC) simulation, the boundary issue is symptomatic
of a more fundamental problem. In principle, using fragment assembly to propose
candidate structures corresponds to an implicit proposal distribution. However, as
the boundary issue illustrates, this implicit distribution is not well-defined; when
sampling, angles that should have zero probability according to the fragment library
do occur in boundary regions. The introduction of a smoothing strategy improves
the angles in the boundary region, but the non-reversibility of this additional step
constitutes another problem for the detailed balance property (see Chap. 2). The
construction of a reversible fragment library is possible, but comes at the expense
of extending the library with artificial fragments [112].
10.2.2 Rotamers
Fig. 10.1 Dihedral angles in glutamate. Dihedral angles are the main degrees of freedom for the main chain and the side chain of an amino acid. The φ and ψ angles describe the main chain conformation, while a number of χ angles describe the side chain conformation. The number of χ angles varies between zero for alanine and glycine and four for arginine and lysine. The figure shows a ball-and-stick representation of glutamate, which has three χ angles. The light gray conformations in the background illustrate a rotation around χ1. The figure was made using PyMOL (http://www.pymol.org) (Figure adapted from Harder et al. [290])
As in the case of the main chain, the discretization of the conformational space
is also inherently problematic. The hydrophobic core of proteins typically consists of
many tightly packed side chains. It is far from trivial to handle such dense systems
using rotamer libraries. The problem is that very small differences in side chain
conformations can lead to large differences in energy. By not considering possible
conformations that fall in between rotamers, one might miss energetically favorable
conformations. In practice, these inherent shortcomings of rotamer libraries are
addressed using various heuristics [257].
of atomic coordinates in a local reference frame, which were then modeled using
normal distributions. A Bayesian scheme was used to automatically determine the
number of clusters supported by the data. For each cluster obtained in this fashion,
an analysis of secondary structure and amino acid preferences was conducted.
In addition, the Markov transition probabilities between clusters were calculated,
effectively turning the method into a hidden Markov model (HMM). The method
was designed for classification purposes, not for sampling. However, for the sake
of our discussion on optimal designs of probabilistic models, it is important to
understand why such a model is not useful in the context of simulation. The problem
lies in the choice of representation of local structure. First, since the basic unit
of the model is a structure fragment, it is faced with similar problems as the
fragment assembly technique. Second, and more importantly, the representation
of the method prevents consistent sampling of the fragments themselves, since
the sampling of atom positions according to Gaussian distributions will tend to
violate the strong stereochemical constraints proteins have on bond lengths and bond
angles.
In 1999, Camproux et al. presented an HMM of local protein structure [96].
Much along the lines of the work by Hunter and States, their model represented
a protein chain as a number of overlapping fragments. Fragments of length four
were used, where the internal structure of a fragment was captured through the
distances between the Cα atoms. The sequential dependencies along the chain were
modeled by a Markov chain, where each state in the HMM corresponded to specific
parameters for a four-dimensional Gaussian distribution. While this approach
proved quite useful for classification purposes, from a sampling perspective, their
representation suffers from the same problems as the Hunter and States model. The
representation of neighboring HMM states are overlapping, failing to satisfy the
non-redundancy requirement, and even within a single state, the representation is
not consistent, since four-valued vectors can be sampled that do not correspond to
any three-dimensional conformation of atoms. De Brevern, Etchebest and Hazout
presented a similar model in 2000, using a representation of overlapping five-
residue long fragments [139]. In this method, the eight internal dihedral angles of
each fragment were used as degrees of freedom, thereby solving the problem of
internal consistency within a fragment. However, also this model has a redundant
structural representation, and was therefore not ideal for simulation purposes.
Several variations of these types of models have been proposed, with similar
representations [35, 97, 175].
In 1996, Dowe et al. [156] proposed a model using only the $(\phi, \psi)$ angles of a single residue to represent states in structural space. The angles were modeled using the von Mises distribution, which correctly handled the inherent periodicity of the angular data. While their original method was primarily a clustering algorithm in $(\phi, \psi)$ space, the approach was extended to an HMM in a later study [168]. Since $(\phi, \psi)$ angles only contain structural information of one residue at a time, this
representation is non-redundant. Although the model does not incorporate amino
acid sequence information, thereby violating one of our requirements, this work was
10 Probabilistic Models of Local Biomolecular Structure and Their Applications 239
the first to demonstrate how angular distributions can be used to elegantly model
local protein structure.
The methods described above are all fundamentally geometrical, in that they do
not take amino acid or secondary structure information into account directly in the
design of the model. For simulation and prediction purposes, it is important that
models can be conditioned on any such available input information. The first model
to rigorously solve this problem was the HMMSTR method by Bystroff, Thorsson
and Baker, from 2004 [94]. Their model can be viewed as a probabilistic version
of the I-sites fragment library described previously. Although it was trained on
fragments, the model avoids the representation issues mentioned for some of the
earlier methods. The sequential dependency in HMMSTR is handled exclusively
by the HMM, and emitted symbols at different positions are independent given
the hidden sequence. The model was formulated as a multi-track HMM that
simultaneously models sequence, secondary structure, supersecondary structure and
dihedral angle information. Based on this flexible model architecture, the authors
identified a wide range of possible applications for the model, including gene
finding, secondary structure prediction, protein design, sequence comparison and
dihedral angle prediction, and presented impressive results for several of these
applications. Unfortunately, for the purpose of protein simulation or prediction,
HMMSTR had one significant drawback. The $(\phi, \psi)$ dihedral angle output was
discretized into a total of eleven bins, representing a significant limitation on the
structural resolution of the model.
Recently, several models have been proposed specifically with MCMC simula-
tion in mind, designed to fulfill the requirements presented in the previous section.
These models differ in the level of detail used in the representation of the protein
main chain. Two common choices are the Cα-only representation and the full-atom representation, illustrated in Fig. 10.2. The Cα-only representation is more coarse-
grained, involving fewer atoms, and can thus, in principle, lead to more efficient
simulations. The full-atom representation more closely reflects the underlying
physics, and it is often easier to formulate force fields in this representation. The
corresponding models for these representations are the FB5HMM model [275]
and the TORUSDBN model [68], respectively. The structure of the two models is
similar. They are both dynamic Bayesian networks (DBNs), consisting of a Markov
chain of hidden states emitting amino acid, secondary structure, and angle-pairs
representing the local structure. The main difference between the models is the
angular output given by the representation.
In the Cα-only scenario, each residue is associated with one pseudo-bond angle $\theta \in [0°, 180°)$ and a dihedral angle $\tau \in [-180°, 180°)$, assuming fixed bond lengths (Fig. 10.2a). Each pair of these angles corresponds to a position on the sphere.1
Fig. 10.2 Two different coarse-grained representations of the protein main chain. In the full-atom model, all atoms in the main chain are included, while the Cα representation consists only of Cα atoms and pseudo-bonds connecting them. The nature of the degrees of freedom in these representations is different, giving rise to two different angular distributions. (a) Cα-only representation and (b) heavy-atom-only main chain representation
Fig. 10.3 Examples of the angular distributions used to model the angular preferences of proteins. (a) samples from two FB5 distributions and (b) samples from two bivariate von Mises distributions
1 This follows directly from the definition of the spherical coordinate system. Note that the sphere is the two-dimensional surface of the three-dimensional ball.
The full-atom main chain representation is usually also associated with two degrees of freedom: the $\phi$ and $\psi$ angles (Fig. 10.2b). Both of these are dihedral angles, ranging from $-180°$ to $180°$. All bond lengths and bond angles are in this representation typically assumed to be fixed. The dihedral angle $\omega$ around the bond connecting the C and N atoms usually has the value $180°$ (trans state), with an occasional value of $0°$ (cis state).
The $(\phi, \psi)$ dihedral angles are well known in the biochemistry literature, where the scatter plot of $(\phi, \psi)$ is referred to as the Ramachandran plot, often used to detect errors in experimentally determined structures. The fact that both angular degrees of freedom span $360°$ has the effect that $(\phi, \psi)$ pairs should be considered as points on the torus. The bivariate Gaussian equivalent on this manifold is the bivariate von Mises distribution. Chapter 6 describes different variants of this distribution, and demonstrates the efficient parameter estimation and sampling techniques that have recently been developed for it. In the TORUSDBN model, the cosine variant was used. Figure 10.3b shows samples from two distributions of this type.
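To illustrate how angle pairs on the torus can be sampled, here is a small Gibbs sketch for a bivariate von Mises density of cosine form, $f(\phi, \psi) \propto \exp\{\kappa_1\cos(\phi - \mu) + \kappa_2\cos(\psi - \nu) - \kappa_3\cos(\phi - \mu - \psi + \nu)\}$. The full conditionals are univariate von Mises, which NumPy samples directly; the specific parameter values and this particular writing of the cosine model are assumptions for illustration, not fitted TORUSDBN parameters.

```python
import numpy as np

def gibbs_bivariate_von_mises(kappa1, kappa2, kappa3, mu, nu, n_samples, burn_in=100, seed=0):
    """Gibbs sampler for a cosine-form bivariate von Mises density on the torus."""
    rng = np.random.default_rng(seed)
    phi, psi = mu, nu
    samples = []
    for i in range(n_samples + burn_in):
        # full conditional of phi given psi is von Mises: a*cos(phi - mu) + b*sin(phi - mu)
        a = kappa1 - kappa3 * np.cos(psi - nu)
        b = -kappa3 * np.sin(psi - nu)
        phi = mu + rng.vonmises(np.arctan2(b, a), np.hypot(a, b))
        # full conditional of psi given phi, by symmetry
        a = kappa2 - kappa3 * np.cos(phi - mu)
        b = -kappa3 * np.sin(phi - mu)
        psi = nu + rng.vonmises(np.arctan2(b, a), np.hypot(a, b))
        if i >= burn_in:
            samples.append((phi, psi))
    return np.array(samples)

angles = gibbs_bivariate_von_mises(10.0, 10.0, 2.0, mu=-1.0, nu=-0.8, n_samples=1000)
```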
The FB5HMM and TORUSDBN models are single-chain DBNs2 [68, 275]. The
sequential signal in the chain is captured by a Markov chain of hidden nodes, each
node in the chain representing a residue at a specific position in a protein chain
(Fig. 10.4). The hidden node can adopt a fixed number of states. Each of these states
corresponds to a distinct emission probability distribution over angles (x), amino
acids (a) and secondary structure (s). The amino acid and secondary structure nodes
are simple discrete distributions, while the angular distribution is either the FB5
distribution or the bivariate von Mises (cosine variant). In addition, the TORUSDBN
model has a discrete trans/cis node ($c$), determining whether the $\omega$ angle is $180°$ or $0°$, respectively.
For ease of reference, we denote the set of observable nodes at position $i \in \{1, \dots, n\}$ by $o_i$, where
$$o_i = \begin{cases} \{a_i, x_i, s_i\} & \text{FB5HMM}\\ \{a_i, x_i, s_i, c_i\} & \text{TORUSDBN} \end{cases} \qquad (10.1)$$
2
The models can equally well be considered as multi-track HMMs, but the graphical formalism
for DBNs is more convenient for the TORUSDBN and FB5HMM models, since they use fully
connected transition matrices (see Chap. 1).
Fig. 10.4 The TORUSDBN model. The circular nodes represent stochastic variables, while the rectangular boxes along the arrows illustrate the nature of the conditional probability distribution between them. The lack of an arrow between two nodes denotes that they are conditionally independent. Each hidden node has 55 states and the transition probabilities are encoded in a 55 × 55 matrix. A hidden node emits angle pairs, amino acid information, secondary structure labels (H: helix, E: strand, C: coil) and cis/trans information (C: cis, T: trans). The arbitrary hidden node value 20 is highlighted to demonstrate how the hidden node value controls which mixture component is chosen (Figure adapted from Boomsma et al. [68])
Given the structure of the model, the joint probability for a sequence of observables $o = (o_1, \dots, o_n)$ can be written as a sum over all possible hidden node sequences $h = (h_1, \dots, h_n)$
$$p(o) = \sum_h p(h)\,p(o \mid h) = \sum_h p(h_1)\prod_{o \in o_1} p(o \mid h_1)\prod_{i > 1}\Bigl[p(h_i \mid h_{i-1})\prod_{o' \in o_i} p(o' \mid h_i)\Bigr]$$
where $n$ denotes the length of the protein. As the number of possible hidden node sequences grows exponentially with the protein length, calculating this sum directly is generally intractable. Instead, this is normally done using the forward algorithm [62].
The emission nodes (a, x, s, and c) can each be used either for input or output.
When input values are available, the corresponding emission nodes are fixed to
these specific values, and the node is referred to as an observed node. The emission
nodes for which no input data is available will be sampled during simulation. For
example, in the context of structure prediction simulations, the amino acid sequence
and perhaps a predicted secondary structure sequence will be available as input,
while the angle node will be repeatedly resampled to create candidate structures.
10.4.4 Sampling
Sampling from these types of models involves two steps: First, the hidden node
sequence is sampled conditioned on the observed emission nodes, after which, in
the second step, values for the unobserved emission nodes are sampled given the
sampled hidden node sequence.
In order for a model to be useful for simulation, it must be possible to resample a
subsequence of the chain. This can be done efficiently using the forward-backtrack algorithm [102] – not to be confused with the better known forward-backward
algorithm. More precisely, we wish to resample a subsequence of hidden nodes from
index s to index t while taking into account the transition at the boundaries. Given
observed values of o_s, ..., o_t, and hidden values of h_{s-1} and h_{t+1} at the boundaries, the probability distribution for the last hidden value can be written as

$$p(h_t \mid o_s, \ldots, o_t, h_{s-1}, h_{t+1}) = \frac{p(h_t, o_s, \ldots, o_t, h_{s-1}, h_{t+1})}{p(o_s, \ldots, o_t, h_{s-1}, h_{t+1})} \propto p(o_s, \ldots, o_t, h_{s-1}, h_t)\, p(h_{t+1} \mid h_t). \qquad (10.2)$$

The first factor can be efficiently calculated using the forward algorithm³ [594]. The second is simply given by the transition matrix of the model. Equation 10.2 thus represents a discrete distribution over h_t, from which a value can be sampled directly (after normalizing). The key insight is that the situation for h_{t-1} is equivalent, this time conditioned on h_t at the boundary. For h_{t-1}, the calculation will involve the factor p(o_s, ..., o_{t-1}, h_{s-1}, h_{t-1}), which is available from the same forward matrix as before. The entire sampling procedure can thus be reduced to a single forward pass from s to t, followed by a backtrack phase from index t to s, sampling values based on Eq. 10.2.
Once a new hidden node (sub)sequence has been sampled, values for the unob-
served nodes are sampled directly from the emission distributions corresponding to
the hidden state at each position. For instance, the angle pairs at a given position are sampled according to the bivariate von Mises or FB5 distribution component associated with the current hidden node value at that position.
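To make the procedure concrete, the following sketch resamples a window of hidden states in a generic discrete-state HMM with a dense transition matrix; the NumPy-based interface and variable names are illustrative assumptions, not the implementation used for TORUSDBN or FB5HMM.

import numpy as np

def forward_backtrack_sample(trans, emit_lik, h_prev, h_next, rng):
    """Resample hidden states h_s, ..., h_t conditioned on the emissions in the
    window and on the fixed boundary states h_{s-1} and h_{t+1} (cf. Eq. 10.2).

    trans    : (K, K) transition matrix, trans[a, b] = p(h_i = b | h_{i-1} = a)
    emit_lik : (L, K) emission likelihoods, emit_lik[i, k] = p(o_{s+i} | h = k)
    """
    L, K = emit_lik.shape
    alpha = np.zeros((L, K))
    # Forward pass; the boundary state h_{s-1} enters the first column.
    alpha[0] = trans[h_prev] * emit_lik[0]
    for i in range(1, L):
        alpha[i] = (alpha[i - 1] @ trans) * emit_lik[i]
    # Backtrack: sample h_t from alpha[t] * p(h_{t+1} | h_t), then h_{t-1}, ...
    h = np.zeros(L, dtype=int)
    w = alpha[-1] * trans[:, h_next]
    h[-1] = rng.choice(K, p=w / w.sum())
    for i in range(L - 2, -1, -1):
        w = alpha[i] * trans[:, h[i + 1]]
        h[i] = rng.choice(K, p=w / w.sum())
    return h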
10.4.5 Estimation
When training models of the type described above, a method is needed that can
deal with the hidden (latent) variables in the model. Without these hidden variables,
the maximum likelihood estimate of the parameters of the model would simply
³ This requires that the probability of h_{s-1} is included, by taking it into consideration when filling in the first column of the forward matrix (position s).
be the observed frequencies in the dataset. However, the data contains no direct
information on the distribution of hidden values. The expectation maximization
(EM) algorithm is a common maximum likelihood solution to this problem [144,
219]. It is an iterative procedure where in each iteration, parameters are updated
based on estimated frequencies given the parameters of the previous iteration. The
algorithm is guaranteed to produce estimates of non-decreasing likelihood, but
occasionally gets trapped in a local optimum, failing to find the global likelihood
maximum. Several variants of the EM algorithm exist. In cases where large amounts of data are available, a stochastic version of the EM algorithm is an appealing alternative, known to avoid convergence to local optima [223, 544]. It is an iterative
procedure consisting of two steps: (1) draw samples of the hidden node sequence
given the data and the parameter estimates obtained in the previous iteration, (2)
update the parameters in the model as if the model was fully observed, using the
current transition and emission frequencies. Just like standard EM, the two steps are repeated until the algorithm converges. EM algorithms with a stochastic E-step come in two flavors [62, 223]. In Monte Carlo EM (MC-EM), a large number
of samples is generated in the E-step, while in Stochastic EM (S-EM) only one
sample is generated for each hidden node [103, 223, 544]. Accordingly, the E-step
is considerably faster for S-EM than for MC-EM. Furthermore, S-EM is especially
suited for large datasets, while for small datasets MC-EM is a better choice.
The sampling in step (1) could be implemented using the forward-backtrack
algorithm described previously. However, it turns out that a less ambitious sampling
strategy may be sufficient. For the training of the models presented above, a
single iteration of Gibbs sampling was used to fill in the hidden node values.⁴
Since this approach avoids the full dynamic programming calculation, it speeds up
the individual cycles of the stochastic EM algorithm and was found to converge
consistently.
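A minimal sketch of such a stochastic EM loop for a generic discrete-emission HMM, using a single Gibbs sweep as the stochastic E-step in the spirit of footnote 4, might look as follows; the data layout, the pseudo-count smoothing and the single-sequence setting are illustrative choices rather than the actual training setup.

import numpy as np

def stochastic_em(obs, K, V, n_iter, rng):
    # obs: one observed symbol sequence (integers in 0..V-1); K hidden states.
    n = len(obs)
    trans = rng.dirichlet(np.ones(K), size=K)        # random initial parameters
    emit = rng.dirichlet(np.ones(V), size=K)
    h = rng.integers(K, size=n)                      # random initial hidden states
    for _ in range(n_iter):
        # Stochastic E-step: one Gibbs sweep, each hidden node resampled given
        # its current left/right neighbours and its emission.
        for i in rng.permutation(n):
            w = emit[:, obs[i]].copy()
            if i > 0:
                w *= trans[h[i - 1]]
            if i < n - 1:
                w *= trans[:, h[i + 1]]
            h[i] = rng.choice(K, p=w / w.sum())
        # M-step: re-estimate parameters by counting, as if fully observed.
        t_counts = np.ones((K, K))                    # +1 pseudo-counts
        e_counts = np.ones((K, V))
        for i in range(n):
            e_counts[h[i], obs[i]] += 1
            if i > 0:
                t_counts[h[i - 1], h[i]] += 1
        trans = t_counts / t_counts.sum(axis=1, keepdims=True)
        emit = e_counts / e_counts.sum(axis=1, keepdims=True)
    return trans, emit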
The model design of TORUSDBN and FB5HMM gives rise to a single hyperpa-
rameter that is not automatically updated by the described procedure: the hidden
node size, which is the number of states that the hidden node can adopt. This
parameter was optimized by estimating a range of models with varying hidden
node size, and evaluating the likelihood. The best model was selected based on the
Bayesian information criterion (BIC) [638]. The optimal values were 55 states for
TORUSDBN, and 75 for FB5HMM.
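The BIC score used for this comparison has the standard form −2 log L + k log n; the small helper below is purely illustrative.

import numpy as np

def bic(log_likelihood, n_params, n_observations):
    # BIC = -2 log L + k log n; the model with the smallest score is preferred.
    return -2.0 * log_likelihood + n_params * np.log(n_observations)

# Hypothetical usage: pick the hidden node size with the lowest score.
# best_size = min(sizes, key=lambda K: bic(loglik[K], n_params[K], n_data))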
Both the TORUSDBN and FB5HMM models have been demonstrated to success-
fully reproduce the local structural properties of proteins [68, 275]. In particular,
⁴ In random order, all hidden nodes were resampled based upon their current left and right neighboring h values and the observed emission values at that residue.
Fig. 10.5 Samples from TORUSDBN with various types of input on a hairpin fragment: (a) no
input, (b) sequence input, (c) predicted secondary structure, (d) sequence and predicted secondary
structure. The black structure represents the native structure; the cloud of light grey structures shows 100 samples
from the model. In dark grey, we highlight the centroid of this cloud as a representative structure
(the structure with lowest RMSD to all other samples in the cloud) (Figure adapted from Boomsma
et al. [68])
the marginal distribution of angle pairs – marginalized over all positions and
amino acid inputs – is indistinguishable from the distribution found in naturally
occurring proteins. This marginalized distribution conceptually corresponds to the
Ramachandran plot of TORUSDBN samples. Figure 10.5 illustrates the types of
local structure that can be expected as varying degrees of input are given to the
model. In the example, when sampling without input data, the model produces
random structures. When sequence information is available, the model narrows the
angle distributions to roughly the correct secondary structure, but fails to identify
the correct boundaries. The distributions are further sharpened when providing a
predicted secondary structure signal. Finally, when both sequence and predicted
secondary structure are provided, the hairpin region becomes confined to allow only
samples compatible with the characteristic sharp turn.
We now investigate how models such as the ones described above can be used
in MCMC simulations. In particular, we will consider the move corresponding to
the resampling of a subsequence of the chain using the forward-backtrack method,
discussed above in Sect. 10.4.4.
Let h_{s:t} and x_{s:t} denote a resampled subsequence of hidden nodes and angle pairs, respectively, while h̄_{s:t} and x̄_{s:t} refer to the hidden nodes and angle pairs that are not resampled. The probability of proposing a move from angle-pair and hidden node sequence (x, h) to a new angle-pair and hidden node sequence (x′, h′) is given by

$$p(\mathbf{x}, \mathbf{h} \rightarrow \mathbf{x}', \mathbf{h}') = p(\mathbf{h}'_{s:t} \mid \bar{\mathbf{h}}_{s:t})\, p(\mathbf{x}'_{s:t} \mid \mathbf{h}'_{s:t}),$$

while the target distribution factorizes as

$$\pi(\mathbf{x}, \mathbf{h}) = \pi(\mathbf{x})\, p(\mathbf{h} \mid \mathbf{x}),$$

where p(h|x) is the conditional distribution of the hidden node sequence given the angle sequence according to the local model. In order to obtain detailed balance in the Markov chain, the Metropolis-Hastings algorithm prescribes that the acceptance probability is given by

$$\alpha(\mathbf{x}, \mathbf{h} \rightarrow \mathbf{x}', \mathbf{h}') = \min\!\left(1,\; \frac{\pi(\mathbf{x}', \mathbf{h}')\, p(\mathbf{x}', \mathbf{h}' \rightarrow \mathbf{x}, \mathbf{h})}{\pi(\mathbf{x}, \mathbf{h})\, p(\mathbf{x}, \mathbf{h} \rightarrow \mathbf{x}', \mathbf{h}')}\right) = \min\!\left(1,\; \frac{\pi(\mathbf{x}')\, p(\mathbf{h}' \mid \mathbf{x}')\, p(\mathbf{x}_{s:t} \mid \mathbf{h}_{s:t})\, p(\mathbf{h}_{s:t} \mid \bar{\mathbf{h}}_{s:t})}{\pi(\mathbf{x})\, p(\mathbf{h} \mid \mathbf{x})\, p(\mathbf{x}'_{s:t} \mid \mathbf{h}'_{s:t})\, p(\mathbf{h}'_{s:t} \mid \bar{\mathbf{h}}_{s:t})}\right). \qquad (10.3)$$
Using the conditional independence structure encoded in the DBN, the conditional distribution p(h|x) can be rewritten in terms of the resampled subsequences. Inserting this rewritten expression into Eq. 10.3, the acceptance probability simply reduces to

$$\alpha(\mathbf{x}, \mathbf{h} \rightarrow \mathbf{x}', \mathbf{h}') = \min\!\left(1,\; \frac{\pi(\mathbf{x}')\, p(\mathbf{x})}{\pi(\mathbf{x})\, p(\mathbf{x}')}\right). \qquad (10.4)$$
This means that we have to sum over all hidden sequences in the network, and
accordingly calculate the full forward array, to find the acceptance probability for
each proposed move. Since the transition matrix in an HMM only has limited
memory, the acceptance probability can be well approximated by only calculating
the probability in a window around the changed sequence. For a window size of
w ≥ 0, the acceptance probability becomes

$$\alpha(\mathbf{x}, \mathbf{h} \rightarrow \mathbf{x}', \mathbf{h}') \approx \min\!\left(1,\; \frac{\pi(\mathbf{x}')\, p(\mathbf{x}_{(s-w):(t+w)} \mid h_{s-w-1}, h_{t+w+1})}{\pi(\mathbf{x})\, p(\mathbf{x}'_{(s-w):(t+w)} \mid h'_{s-w-1}, h'_{t+w+1})}\right). \qquad (10.5)$$
If the target distribution can be written as a product of a local and a global (nonlocal) term, the local contribution cancels from the acceptance ratio, and the acceptance probability is simply given by the ratio of the global terms for the new and the old configuration. We can thus remove the local contribution from the energy evaluation, and instead sample from it, thereby increasing the acceptance rate. Note that this is indeed the case for distributions constructed using the reference ratio method, described in Chaps. 3 and 4.
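A minimal sketch of a single Metropolis-Hastings step exploiting this cancellation might look as follows; the callables nonlocal_energy and propose_local are illustrative placeholders for a nonlocal energy term and a forward-backtrack-based proposal from the local model.

import numpy as np

def mh_step(x, nonlocal_energy, propose_local, rng, kT=1.0):
    # Propose from the local model; only the nonlocal ("global") part of the
    # target then enters the acceptance ratio (cf. Eq. 10.4).
    x_new = propose_local(x, rng)
    delta = nonlocal_energy(x_new) - nonlocal_energy(x)
    if delta <= 0 or rng.random() < np.exp(-delta / kT):
        return x_new, True    # accepted
    return x, False           # rejected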
The probabilistic models discussed so far only describe the main chain of proteins,
omitting a parameterization of the major degrees of freedom in the protein side
chain. In this section we will consider a probabilistic model of another biomolecule,
RNA, which incorporates all major degrees of freedom in a single model. This
model is called BARNACLE [197], which loosely stands for Bayesian network
model of RNA using circular distributions and maximum likelihood estimation. In
the next section, we will show that a more complete model can also be constructed
for proteins by combining TORUSDBN with an additional model, BASILISK.
The BARNACLE model is conceptually related to the protein models presented
previously. However, it is not a trivial extension of these, as the RNA main chain
contains many more relevant degrees of freedom than the protein main chain.
Furthermore, if a similar model design was to be applied to RNA, this would require
a higher dimensional multivariate von Mises distribution (see Chap. 6), which was
not available when the model was developed. Instead, a very different model design
was used.
Assuming that bond angles and bond lengths are fixed to idealized values, the geometry of an RNA molecule can be characterized by the remaining free dihedral angles. It can be shown that seven dihedral angles per residue are sufficient to describe RNA in atomic detail [197]. These are the six dihedral angles α to ζ that describe the course of the main chain and the χ angle, which describes the dihedral angle around the bond connecting the sugar ring and the base, as depicted in Fig. 10.6.
The BARNACLE model was expressed as a DBN that can capture the marginal distribution of each of the seven dihedral angles and the local dependencies between the angles (Fig. 10.7). The DBN has one slice per angle, and each slice consists of three stochastic variables: d_j, h_j and x_j. The angle identifier d_j is a bookkeeping variable that specifies which of the seven angles is described in the given slice,
Fig. 10.6 The dihedral angles in an RNA fragment. The fragment is shown in ball-and-stick
representation, and the dihedral angles are placed on the central bond of the four consecutive atoms that define each dihedral angle. For clarity, the base is only shown partially. The six dihedral angles α to ζ describe the course of the main chain, while the χ angle is the rotation of the base relative
to the sugar ring (Figure adapted from Frellsen et al. [197])
Fig. 10.7 The BARNACLE dynamic Bayesian network. Nine consecutive slices of the network
are shown, where the seven central slices describe the angles in one nucleotide. Each slice contains three variables: an angle identifier, d_j, a hidden variable, h_j, and an angular variable, x_j. The angle identifier, d_j, controls which type of angle (α to χ) is described by the given slice, while the value of the angle is represented by the variable x_j. The hidden node describes the dependencies between
all the angles along the sequence (Figure adapted from Frellsen et al. [197])
while the actual angle values are represented by the stochastic variable x_j, which takes values in the interval [0, 2π). Finally, h_j is a hidden variable that is used for
modeling the dependencies between the angles and for constructing a mixture of
the angular output. Both the angle identifier, d_j, and the hidden variable, h_j, are discrete variables, and the conditional distributions p(d_j | d_{j-1}) and p(h_j | h_{j-1}, d_j) are described by conditional probability tables (see Chap. 1). The conditional distribution of the angular variable, p(x_j | h_j), is specified by an independent univariate von Mises distribution for each value of h_j (see Chap. 6). The joint
probability distribution of n angles in an RNA molecule encoded by this model
is given by
$$p(x_1, \ldots, x_n \mid d_1, \ldots, d_n) = \sum_{\mathbf{h}} p(h_1 \mid d_1)\, p(x_1 \mid h_1) \prod_{j=2}^{n} p(h_j \mid h_{j-1}, d_j)\, p(x_j \mid h_j),$$

where the sum runs over all possible hidden node sequences h = (h_1, ..., h_n). In the model, it is assumed that the angle identifiers are always specified in the order α, β, γ, δ, ε, ζ, χ, α, β and so forth.
A full sequence of angles, a = (x_1, ..., x_n), can be sampled from the model by hierarchical sampling (see Chap. 1). First, a sequence of identifier values is specified by the user, that is

d_1 = α, d_2 = β, d_3 = γ, d_4 = δ, d_5 = ε, d_6 = ζ, d_7 = χ, d_8 = α, ..., d_n = χ.

Then a sequence of hidden states is sampled conditioned upon the values of the identifier variables, and subsequently a sequence of angles is sampled conditioned on the sampled hidden values. When the model is used in an MCMC simulation,
new structures can be proposed by resampling a subsequence of angles. This can
be done by the forward-backtrack algorithm, using an approach similar to what is
described for the protein models in Sects. 10.4.4 and 10.4.7. Note that although amino acid dependent sampling was one of the four requirements for an appropriate probabilistic model of protein structure (see Sect. 10.3), BARNACLE does not incorporate nucleotide information. In the case of RNA, the influence of the sequence
on the local structure is more subtle, and presumably better captured by adding a
model of nonlocal interactions.
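A minimal sketch of this hierarchical (ancestral) sampling scheme for a BARNACLE-like DBN might look as follows; the table layout, the mapping of the angle identifiers α to χ onto indices 0 to 6, and the per-state von Mises parameterization are illustrative assumptions.

import numpy as np

def sample_angles(d, init, trans, mu, kappa, rng):
    # d     : angle identifiers for each slice (integers 0..6 for alpha..chi)
    # init  : init[d] = p(h_1 | d_1 = d), shape (7, K)
    # trans : trans[d] = p(h_j | h_{j-1}, d_j = d), shape (7, K, K)
    # mu, kappa : von Mises mean and concentration per hidden state, shape (K,)
    n = len(d)
    K = len(mu)
    h = np.empty(n, dtype=int)
    x = np.empty(n)
    h[0] = rng.choice(K, p=init[d[0]])
    x[0] = rng.vonmises(mu[h[0]], kappa[h[0]]) % (2 * np.pi)
    for j in range(1, n):
        h[j] = rng.choice(K, p=trans[d[j]][h[j - 1]])
        x[j] = rng.vonmises(mu[h[j]], kappa[h[j]]) % (2 * np.pi)
    return x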
The parameters of the BARNACLE model were estimated using stochastic EM
[223, 544] and the optimal number of hidden states was selected using the Akaike
information criterion (AIC) [91], see Sect. 1.7.3. As training data the RNA05 dataset
from the Richardson lab [531] was used, containing 9486 nucleotides from good
quality X-ray structures. The optimal number of hidden states was twenty. The model was evaluated on how well it reproduces the marginal distributions of the individual seven dihedral angles and the joint distribution of dihedral angle
pairs. In order to have a simple continuous baseline, BARNACLE was compared
to a mixture model, where the seven angles are modeled as independent mixtures
of von Mises distributions. An example of this is illustrated in Fig. 10.8, where
the marginal distribution of the ˛-angle in BARNACLE is compared to both the
distribution observed in the data set and the distribution according to the mixture
model. This figure shows that BARNACLE is on par with the mixture model for
the individual angles. However, in contrast to the mixture model, BARNACLE
also captures the length distribution of helical regions correctly (Fig. 10.9), and the
model is consistent with a previously published rotameric description of the RNA
main chain [531]. Finally, the model has been tested in MCMC simulations, using
a simple geometric base pair potential based on secondary structure. A comparison
with the FARNA method by Das and Baker [133] shows that this approach readily
generates state-of-the-art quality decoys for short RNA molecules.
Fig. 10.8 The marginal distribution of the ˛-angle in RNA. The distribution of the training data
is shown as a histogram, and the density function for BARNACLE is shown as a black line. For
comparison, the density function according to a simple mixture model is shown as a gray line
(Figure adapted from Frellsen et al. [197])
Fig. 10.9 Histogram of the lengths of helical regions in RNA. The distribution in the training data
and in data sampled from BARNACLE are shown in white and black, respectively. For comparison,
the length distribution in data sampled from a mixture model is shown in gray (Figure adapted from
Frellsen et al. [197])
From a statistical point of view, modeling the side chain is similar to modeling the
main chain; the challenge consists in modeling a sequence of dihedral angles. Each
amino acid type has a fixed, small number of such dihedral angles, ranging from
zero for glycine and alanine, over one for serine and valine, to four for arginine
and lysine. These angles are labelled χ1 to χ4. In total, 18 amino acid types need to be modeled. As we also want to capture the dependency on the main chain, the model also includes nodes that represent the (φ, ψ) angle pair for that amino
acid. In principle, 18 different models could be formulated and trained; one for each
amino acid type excluding glycine and alanine. However, we decided to include
all 18 amino acid types in a single probabilistic model. This approach is known
as multitask or transfer learning in the field of machine learning and has several
advantages [101, 556]. As the same set of distributions is used to model all amino acids, this leads to fewer free parameters. Moreover, it makes “knowledge
transfer” possible during training between amino acids with similar conformational
properties. Finally, for rotamer libraries, one needs to determine the optimal number
of rotamers for each amino acid type separately; in our approach, only the size of
the hidden node needs to be determined.
Fig. 10.10 The BASILISK dynamic Bayesian network. The network shown represents an amino
acid with two χ angles, such as for example leucine. In the case of leucine, the DBN consists of four slices: two slices for the φ, ψ angles, followed by two slices for the χ angles. Sampling a set of angles is done as follows. First, the values of the input nodes (top row) are set to bookkeeping indices that determine both the amino acid type and dihedral angle position. For example, in the case of leucine, the first two indices denote the φ and ψ angles, followed by two indices that denote the χ1 and χ2 angles of leucine. In the next step, the hidden node values (middle row, discrete nodes) are sampled conditioned upon the observed nodes. These observed nodes always include the index nodes (top row, discrete nodes), and optionally also the φ and ψ nodes (first two nodes in
the bottom row) if the sampling is conditioned on the main chain. Finally, a set of angles is drawn
from the von Mises nodes (bottom row), whose parameters are specified by the sampled values of
the hidden nodes (Figure adapted from Harder et al. [290])
BASILISK was formulated as a DBN whose structure (see Fig. 10.10) is very
similar to the structure of BARNACLE, the model of local RNA structure that is
described in the previous section [197]. Each slice in the DBN represents a single
dihedral angle using the von Mises distribution [474] as child node. The first two
slices represent the main chain angles φ and ψ; they are included to make it possible
to sample the side chain angles conditional on the main chain conformation. The
third and subsequent slices represent the dihedral angles of the side chain itself.
As in the case of BARNACLE, bookkeeping input nodes specify which angle is
modeled at that position in the DBN. The input nodes for the first two slices indicate that these slices concern the φ and ψ main chain angles, without specifying any amino acid type. Specifying the type is superfluous, as the model is exclusively used for sampling conditional upon the φ and ψ values. It is TORUSDBN
that provides a generative model for these angles. For the subsequent slices, the
bookkeeping input nodes not only specify the angles that are represented, but also
the amino acid type. Let us consider the example of leucine, which has two dihedral angles χ1 and χ2. The first two input nodes simply indicate that the φ and ψ main chain angles are modeled. The subsequent two input nodes indicate that the two associated slices represent the χ1 and χ2 values of a leucine residue.
Fig. 10.11 Univariate histograms of the χ angles for lysine (top) and arginine (bottom). The
histograms marked “Training” represent the data set used for training BASILISK. The histograms
marked “BASILISK” represent samples generated using BASILISK. For each amino acid, all
histograms are plotted on the same scale. The X-axis indicates the value of the dihedral angles in
degrees, while the Y-axis denotes the number of occurrences (Figure adapted from Harder
et al. [290])
BASILISK was trained using data from 1,703 high quality crystal structures. The
optimal number of hidden node states was determined using the Akaike information
criterion (AIC) [91], see Sect. 1.7.3, and resulted in a model with 30 states.
10.7 Conclusions
Fragment libraries were a major step forward in the exploration of the conforma-
tional space of proteins. However, their ad hoc nature leaves much to be desired for
Acknowledgements The authors acknowledge funding by the Danish Research Council for
Technology and Production Sciences (FTP, project: Protein structure ensembles from mathematical
models, 274-09-0184) and the Danish Council for Independent Research (FNU, project: A
Bayesian approach to protein structure determination, 272-08-0315).
⁵ It should be noted that these articles erroneously state that the models described in this chapter
only capture the dependencies between neighboring residues. Obviously, the presence of a Markov
chain of hidden nodes actually enforces dependencies along the whole sequence. In practice, such
Markov chains do have a finite memory.
Part VI
Inferring Structure from Experimental
Data
Chapter 11
Prediction of Low Energy Protein Side Chain
Configurations Using Markov Random Fields
11.1 Introduction
The task of predicting energetically favorable amino acid side chain configurations,
given the three-dimensional structure of a protein main chain, is a fundamental
subproblem in computational structural biology. Specifically, it is a key component
in many protocols for de novo protein folding, homology modeling, and protein-
protein docking. In addition, fixed main chain protein design can be cast as a
generalized version of the side chain placement problem. For all these purposes,
the objective of pursuing low energy side chain configurations is equivalent to
finding the most probable assignments of a corresponding Markov random field.
Consequently, this problem can be addressed using message-passing probabilistic
inference algorithms, such as max-product belief propagation (BP) and its variants.
In this chapter, we review the inference techniques that have been successfully
applied to side chain placement, discuss their current limitations, and outline
promising directions for future improvements.
C. Yanover ()
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
e-mail: [email protected]
M. Fromer
The Hebrew University of Jerusalem, Jerusalem, Israel
e-mail: [email protected]
Fig. 11.1 Top, left: Protein main chain (colored black) and side chains (colored gray), depicted
for a sequential protein fragment of 85 amino acids (PDB id: 1AAY). Top, right: Zoom-in for a
single pair of positions, ARG 146 and ASP 148. Bottom: A few main chain-dependent rotamers
for each of these amino acid residues (Taken from the Dunbrack rotamer library [163])
Interestingly, despite the seemingly reverse nature of the two problems, the
protein design problem can be formalized as a generalization of the side chain
placement problem (Sect. 11.2.1), which itself is a subtask of protein folding.
Whereas in side chain placement each protein position is modeled with the side
chain conformers of a single amino acid, in protein design the side chain conformers
of multiple amino acids are modeled at each design position. Thus, historically, all
algorithms previously devised for side chain placement were simply transferred to
the domain of protein design. Although this makes sense on a theoretical level, there
were, however, a number of practical problems with this approach.
Foremost of these is the fact that there are many more rotamers in the per-
position rotamer set. In fact, when considering all design positions together, there
are exponentially more possible rotamer combinations than there are for side
chain placement. Thus, clearly, algorithms that managed to successfully approach
low energy solutions for the protein side chain placement domain would not
necessarily successfully scale up to the larger protein design cases. Therefore,
numerous adaptations were made to the side chain placement algorithms in order to
enable their feasibility. Some of these were guaranteed to provide optimal solutions
(albeit with no guarantee of terminating within a short, or non-exponential, time)
[247, 374, 425, 448, 578], while others were more heuristic in nature but were found
to fare reasonably in practice [298, 389, 405].
The exponential growth of the worldwide Protein Data Bank (PDB) [44] – which currently includes tens of thousands of structures – has deepened scientific
understanding of the biophysical forces that dominate in vivo protein folding.
Specifically, this large number of experimentally determined protein structures has
enabled the deduction of “knowledge-based” energy terms, which have facilitated
more accurate structural predictions [670]. A weighted sum of such statistically
derived and physically realistic energy terms is what constitutes most modern energy
functions.
In particular, the energy function used by the successful SCWRL program [98]
combines a (linear approximation of a) van der Waals repulsive term, with main
chain-dependent rotamer probabilities estimated from a large set of representative
protein structures. Other programs, such as Rosetta [405] and ORBIT [132],
include additional energy terms to consider attractive steric, hydrogen bond, and
electrostatic interactions, as well as interactions with the surrounding solvent (i.e.,
the water molecules).
$$E(\mathbf{r}) = \sum_{i} E_i(r_i) + \sum_{i,j} E_{ij}(r_i, r_j) \qquad (11.1)$$
Computationally, side chain placement attempts to find, for the input structure and sequence, the rotamer configuration of minimal energy [776, 781]:

$$\mathbf{r}^* = \operatorname*{argmin}_{\mathbf{r}} E(\mathbf{r}) \qquad (11.2)$$
For the generalization to protein design, denote by T(k) the amino acid type of rotamer k and let T(r) ≡ T(r_1, ..., r_N) = (T(r_1), ..., T(r_N)), i.e., the amino acid sequence corresponding to rotamer assignment r. Let S = (S_1, ..., S_N) denote an assignment of amino acids for all positions, i.e., an amino acid sequence. Computational protein design attempts to find the sequence S* of minimal energy [207, 780]. Specifically:

$$\mathbf{S}^* = \operatorname*{argmin}_{\mathbf{S}} E(\mathbf{S}) \qquad (11.3)$$
where

$$E(\mathbf{S}) = \min_{\mathbf{r} : T(\mathbf{r}) = \mathbf{S}} E(\mathbf{r}) \qquad (11.4)$$

is the minimal rotamer assignment energy for sequence S (as in Eq. 11.2). This double minimization problem, over the sequence space (Eq. 11.3) and over the per-sequence rotamer space (Eq. 11.4), is combined as:

$$\mathbf{S}^* = T\!\left(\operatorname*{argmin}_{\mathbf{r}} E(\mathbf{r})\right) \qquad (11.5)$$
Thus, both for fixed main chain side chain placement and for fixed main chain protein design, the goal is to find the minimal energy rotamer configuration from an exponentially large set of possible configurations.
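As a concrete illustration of Eqs. 11.1 and 11.2, the sketch below evaluates the pairwise-decomposed energy and finds the GMEC by exhaustive enumeration; the data layout is an illustrative assumption, and the brute-force search is of course only feasible for toy problems.

import itertools
import numpy as np

def total_energy(r, E_single, E_pair):
    # Eq. 11.1: E_single[i][ri] = E_i(ri); E_pair[(i, j)] is a (K_i, K_j) NumPy
    # array of pairwise energies for each interacting pair of positions i < j.
    e = sum(E_single[i][ri] for i, ri in enumerate(r))
    e += sum(E_pair[(i, j)][r[i], r[j]] for (i, j) in E_pair)
    return e

def brute_force_gmec(E_single, E_pair):
    # Exhaustive minimization (Eq. 11.2); the number of configurations grows
    # exponentially with the number of positions.
    spaces = [range(len(es)) for es in E_single]
    return min(itertools.product(*spaces),
               key=lambda r: total_energy(r, E_single, E_pair))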
This class of algorithms is the most natural approach to solving difficult discrete combinatorial optimization problems, and its members were thus the first to be applied to the prediction of low energy rotamer configurations. These algorithms are usually
intuitive, simple to implement, and provide quick and reasonable results. However,
they typically do not come with a guarantee regarding the quality of the predictions,
theoretical insight as to how the results are affected by the multiple parameters of
the respective algorithms, or knowledge of the run times required to provide “good”
results in practice.
Monte Carlo simulated annealing (MCSA): Monte Carlo (MC) methods are a
category of computational search techniques that iteratively employ randomization
in performing calculations. In the case of searching for the minimal energy side
chain configuration [405,800], each protein position is initially assigned a randomly
chosen rotamer. Then, at each iteration, a position is randomly chosen and a
“mutation” to a random rotamer side chain is made at that position. This mutation is
then accepted or rejected in a probabilistic manner, dependent on the “temperature”
of the simulation and the change in energy due to the mutation, where smaller
increases (or a decrease) in side chain energy or a higher temperature will increase
the chance that the mutation is accepted and kept for the next iteration. The
concept behind simulated annealing is to start the Monte Carlo iterations at a
high temperature and slowly “cool” the system to a stable equilibrium (low energy
solution), analogously to the physical process of annealing in the field of statistical
mechanics. In either Monte Carlo variant, the purpose of randomly allowing for
rotamer configurations with higher energy than currently observed is to escape
local minima in the rugged configuration energy landscape and attempt to find the
globally optimal low energy side chain configuration.
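A minimal sketch of such an annealed Monte Carlo search over rotamer configurations might look as follows; the geometric cooling schedule, the energy callable and the parameter names are illustrative choices rather than those of any particular program.

import numpy as np

def mcsa(n_rotamers, energy, n_steps, rng, T0=10.0, T1=0.1):
    # n_rotamers[i] is the number of allowed rotamers at position i;
    # energy(r) evaluates the full configuration (e.g. via Eq. 11.1).
    r = [int(rng.integers(k)) for k in n_rotamers]          # random initial rotamers
    e = energy(r)
    for T in np.geomspace(T0, T1, n_steps):                 # slowly "cool" the system
        i = int(rng.integers(len(r)))                        # random position ...
        old, r[i] = r[i], int(rng.integers(n_rotamers[i]))   # ... random "mutation"
        e_new = energy(r)
        # Metropolis criterion: always accept downhill moves, occasionally uphill.
        if e_new <= e or rng.random() < np.exp(-(e_new - e) / T):
            e = e_new
        else:
            r[i] = old
    return r, e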
Genetic algorithms (GAs): In a fashion similar to MCSA, genetic algorithms also
utilize randomized sampling in an iterative manner in an attempt to find the lowest
energy rotamer configuration [349]. However, as opposed to MCSA, which deals
with only a single configuration at a time, GAs maintain a population of hetero-
geneous, “mutant” rotamer configurations. And, at each iteration, the “individuals”
of the population are subjected to randomized mutations and recombinations, and
then to subsequent selection. This process rudimentarily mimics the evolutionary
process, where the survival of the “fittest” rotamer configuration is expected under
certain conditions on the population dynamics and mutation rates and with sufficient
time. As with MC methods, there is typically also an annealing component to the
algorithm (selection strength), so as to overcome energetic barriers in the energy
landscape.
Self-consistent mean field theory (SCMF): Self-consistent mean field theory
provides a method to calculate probabilities for each possible rotamer at each protein
position [95, 381, 389, 800]. Since high probabilities correspond to low energies,
the optimal rotamer configuration can then be predicted by choosing the highest
probability rotamer at each position. The per-position rotamer probabilities are
calculated in an iterative fashion. At each iteration, the multiple interactions with
a particular position are averaged into a “mean field” that is being exerted on this
position, based on the current rotamer probabilities at all other positions and their
interactions with this position. This mean field is then utilized in order to recalculate
the rotamer probabilities at the position of interest. At convergence, rotamer
probabilities are obtained and the predicted lowest energy rotamer configuration
is output.
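A minimal sketch of this self-consistent iteration might look as follows; the damping factor and the data layout (matching the earlier energy sketch) are illustrative assumptions.

import numpy as np

def scmf(E_single, E_pair, n_iter=100, T=1.0, damping=0.5):
    # Start from uniform rotamer probabilities at every position.
    p = [np.full(len(es), 1.0 / len(es)) for es in E_single]
    for _ in range(n_iter):
        for i, es in enumerate(E_single):
            field = np.asarray(es, dtype=float).copy()
            # Average the pairwise energies against the current probabilities of
            # the neighbours: the "mean field" exerted on position i.
            for (a, b), m in E_pair.items():
                if a == i:
                    field += m @ p[b]
                elif b == i:
                    field += m.T @ p[a]
            new = np.exp(-field / T)
            new /= new.sum()
            p[i] = damping * p[i] + (1.0 - damping) * new   # damped update
    # Predict the rotamer with the highest converged probability at each position.
    return [int(np.argmax(q)) for q in p]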
Fast and accurate side-chain topology and energy refinement (FASTER): The
FASTER method [148] is a heuristic iterative optimization method loosely based
on the framework of the DEE criteria (see below). An initial rotamer configuration
is chosen (possibly randomly). Then, in the simplest stage of FASTER, the current
rotamer assignment is held fixed at all but one protein position, and this remaining
position is exhaustively optimized, i.e., by considering all possible rotamers at that
position in a first-order “quenching” procedure. This process continues iteratively
for all positions until no further changes can be made, or until a predetermined
number of steps has been reached. Next, such a quenching procedure is performed
on all positions after fixing a particular position to a given rotamer choice; this is
repeated for all rotamer choices at the particular position, after which a different
position is chosen to be fixed. Despite the largely heuristic nature of FASTER,
it was demonstrated to find the optimal rotamer configuration in many real-world
problems [3].
Side-chains with a rotamer library (SCWRL): The SCWRL program [98] is
one of the longest-standing algorithms for rapid protein side chain placement. In
fact, due to its relatively high accuracy and speed, it has been incorporated into
various world-wide web servers, including 3D-PSSM [364], which uses SCWRL
to generate models of proteins from structure-derived profile alignments. SCWRL 3.0 is considerably faster than the original algorithm, since it uses simple but
elegant graph theory to make its predictions. In detail, it models each protein
position as a node in an undirected graph. Subsequently, it breaks up clusters of
interacting side chains into the biconnected components of this graph, where a
biconnected component is one that cannot be broken apart by the removal of a single
vertex. The minimal energy rotamer choices are then recursively calculated for each
component, and the approximate lowest energy configuration is recovered through
this “divide and conquer” framework. Recently, SCWRL 4.0 was released, which
achieves higher accuracy at comparable speed by using a new main chain-dependent
rotamer library, a more elaborate energy function, and a junction tree related search
algorithm (see Sect. 11.6.2) [398].
As opposed to the algorithms detailed above, the algorithms presented below are accompanied by formal guarantees of optimality (i.e., they will find the lowest
energy configuration). Nonetheless, they come with no assurance that the run time
will be reasonably short, since, after all, the pertinent computational problem is
highly difficult (see Sect. 11.3.1).
Dead-end elimination (DEE): Dead-end elimination is a rotamer pruning tech-
nique that guarantees to only remove those rotamers that do not participate in the
lowest energy rotamer configuration [147, 237]. Thus, if a sufficient number of
rotamers are removed, then DEE is guaranteed to find the optimal configuration.
Conceptually, the basic DEE criterion compares two rotamers at a given protein
position and determines if the first one is “better” than the second, in that any
low energy configuration using the second can always be made to have lower
energy by using the first rotamer in its place. In such a case, the second rotamer
can be eliminated without detriment, thus simplifying the computational problem.
This procedure is iteratively repeated, reducing the conformational space at each
step. In order to account for more difficult problems, more sophisticated DEE
criteria have been developed, which include the elimination of pairs of rotamers
[247], conditioning on neighboring rotamers [448, 578], and unification of pairs
of positions [247]. Although DEE is often successful in many practical side chain
placement and design problems, it can require extremely long run times, and it is
sometimes not at all feasible.
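A minimal sketch of the basic (Goldstein-style) elimination criterion might look as follows; the data layout is an illustrative assumption, and practical implementations add the more elaborate criteria mentioned above.

import numpy as np

def dee_prune(E_single, E_pair, max_passes=20):
    # Rotamer r at position i is eliminated if some alternative t satisfies
    #   E_i(r) - E_i(t) + sum_j min_s [E_ij(r, s) - E_ij(t, s)] > 0,
    # i.e. t can always replace r without raising the energy.
    alive = [np.ones(len(es), dtype=bool) for es in E_single]
    neighbours = {}
    for (i, j), m in E_pair.items():
        neighbours.setdefault(i, []).append((j, m))
        neighbours.setdefault(j, []).append((i, m.T))
    for _ in range(max_passes):
        changed = False
        for i, es in enumerate(E_single):
            for r in np.flatnonzero(alive[i]):
                for t in np.flatnonzero(alive[i]):
                    if r == t:
                        continue
                    bound = es[r] - es[t]
                    for j, m in neighbours.get(i, []):
                        cols = np.flatnonzero(alive[j])
                        if cols.size:
                            bound += (m[r, cols] - m[t, cols]).min()
                    if bound > 0:
                        alive[i][r] = False     # r is a "dead end"
                        changed = True
                        break
        if not changed:
            break
    return alive    # boolean mask of surviving rotamers per position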
Integer linear programming (LP/ILP): Linear programming (LP) is a general
mathematical method for the global optimization of a linear function, subject to
linear constraints. Integer linear programming (ILP) problems require that the
solution consists of integer values, but makes the computational problem more
difficult. The ILP approach to side chain placement and protein design [374] is based
on the observation that Eq. 11.1 decomposes the energy for a complete rotamer
configuration into a sum of energies for positions and pairs of positions. Thus,
by defining a variable for the rotamer choice at each position and for each pair
of positions, the energy minimization task can be written as a linear function of
these variables. Also, linear equalities are added to ensure that a unique rotamer is
consistently chosen at each position. In the ILP formulation, the rotamer variables
are constrained to be binary integers (0 or 1), where a value of 1 indicates that
the rotamer was chosen. In the LP relaxation, the variables are allowed to assume
continuous values between 0 and 1. This LP relaxation is solved with computational
efficiency, possibly yielding a solution in which all variables are integers. Otherwise,
a more exhaustive ILP solver is used, although with no guarantee of a non-
exponential run time. In practice, an LP solver provided quick solutions for many
side chain placement problems that were tested, whereas long runs of the ILP solver
were often required for the protein design cases assessed. See Sect. 11.6.3 for related
algorithms that use a Markov random field-based approach.
For a review of the MCSA, GA, SCMF, and DEE algorithms (and computational
benchmarking results), see [742]; see also [653] for a general review of search
algorithms for protein design. Table 11.1 provides a short, non-comprehensive list of some of the more notable success stories of computational protein design, with emphasis on representing the methods outlined above.
Table 11.1 A short summary of notable cases of experimentally validated protein design research

MCSA: A protein with a novel α/β fold [406]; endonuclease DNA binding and cleavage specificity [16]; biologically active retro-aldol enzyme [347] and Kemp elimination enzyme [615]
GA: Core residues of the phage 434 cro protein [146]; ubiquitin [422]; native-like three-helix bundle [393]
SCMF: 88 residues of a monomeric helical dinuclear metalloprotein, in both the apo and the holo forms [95]; four-helix bundle protein that selectively binds a non-biological cofactor [117]
FASTER: Ultrafast folding Trp-cage mutant [89]; a monoclonal antibody directed against the von Willebrand factor that inhibits its interaction with fibrillar collagen [685]
DEE: Engrailed homeodomain [643]; zinc finger domain [132]; enzyme-like protein catalysts [66]; novel sensor protein [449]
ILP: 16-member library of E. coli/B. subtilis dihydrofolate reductase hybrids [623]
by these inequalities [374, 622]. One of the drawbacks with this approach is
that, in practice, very often these inequalities require that the computationally
expensive ILP solver be used. Nonetheless, there does exist a state-of-the-art method
(STRIPES: Spanning Tree Inequalities and Partitioning for Enumerating Solutions),
which adds spanning tree-based inequalities, that has been empirically shown to
perform well without requiring an ILP solver [204]. Finally, for cases of protein
design, direct application of these inequalities for the prediction of low energy
rotamer configurations will not necessarily preclude the iteration over multiple low
energy configurations corresponding to the same amino acid sequence; note that
this same issue arises with the DEE-based methods as well. On the other hand, see
Sect. 11.6.4.1 for an example of how Markov random field-based approaches have
been readily generalized for this task.
Since each of the side chain placement and protein design tasks poses a discrete optimization problem (Sect. 11.3) and the energy function consists of a sum of
pairwise interactions, the problem can be transformed into a probabilistic graphical
model (Markov random field, MRF) with pairwise potentials [776]; see Fig. 11.2. A
random variable is defined for each position, whose values represent the rotameric
choices (including amino acid) at that position. Clearly, an assignment for all
variables is equivalent to rotamer choices for all positions (where, for the case of
protein design, the rotamer choices uniquely define an amino acid sequence).
The pre-calculated rotamer energies taken as input to the problem (see
Sect. 11.2.3) are utilized to define probabilistic potential functions, or probabilistic
factors, in the following manner. The singleton energies specify probabilistic factors
describing the self-interactions of the positions in their permitted rotamer states:
$$\psi_i(r_i) = e^{-E_i(r_i)/T} \qquad (11.6)$$
And, the pairwise energies define probabilistic factors describing the direct interac-
tions between pairs of rotamers in neighboring positions:
$$\psi_{ij}(r_i, r_j) = e^{-E_{ij}(r_i, r_j)/T} \qquad (11.7)$$
Fig. 11.2 Side chain placement formulated as a structure-based graphical model. Top: Short
segment of the protein main chain for PDB 1AAY (see Fig. 11.1) and its corresponding graph,
where an edge between two positions describes the pairwise energies between them (Eq. 11.7).
Bottom: An example pairwise potential matrix for the energetic interactions between the rotamer
side chains of ARG 146 and those of ASP 148
side chain prediction essentially ignore interactions occurring between atoms more
distant than a certain threshold, this implies that the corresponding graph will often
have a large number of missing edges (positions too distant to directly interact).
Thus, the locality of spatial interactions in the protein structure induces path
separation in the graph and conditional independence in the probability distribution
of the variables. Formally, it can be shown that, if node subset Y separates nodes X and Z, then X and Z are independent for any fixed values of the variables in Y: p(X, Z | Y) = p(X | Y) p(Z | Y), or X ⊥⊥ Z | Y.
Mathematically, the probability distribution for the rotamer assignment r = (r_1, ..., r_N) decomposes into a product of the singleton and pair probabilistic factors:

$$p(\mathbf{r}) = \frac{1}{Z} \prod_{i} \psi_i(r_i) \prod_{i,j} \psi_{ij}(r_i, r_j) \qquad (11.8)$$

$$= \frac{1}{Z}\, e^{-E(\mathbf{r})/T} \qquad (11.9)$$
where Z is the probability normalization factor (partition function), and Eq. 11.9
derives from substitution of Eqs. 11.6 and 11.7 into Eq. 11.8 and the energy
decomposition of Eq. 11.1. Thus, minimization of rotamer energy (Eqs. 11.2 and 11.5) is equivalent to the maximization of rotamer probability (a probabilistic inference task):

$$\mathbf{r}^* = \operatorname*{argmax}_{\mathbf{r}} p(\mathbf{r}) \qquad (11.10)$$
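A minimal sketch of this energy-to-factor conversion (Eqs. 11.6-11.8) might look as follows; the data layout mirrors the earlier energy sketch and is an illustrative assumption.

import numpy as np

def energies_to_factors(E_single, E_pair, T=1.0):
    # Eqs. 11.6 and 11.7: Boltzmann-type factors from singleton/pairwise energies.
    psi_i = [np.exp(-np.asarray(es, dtype=float) / T) for es in E_single]
    psi_ij = {pair: np.exp(-np.asarray(m, dtype=float) / T)
              for pair, m in E_pair.items()}
    return psi_i, psi_ij

def unnormalised_prob(r, psi_i, psi_ij):
    # Product of factors (Eq. 11.8, without 1/Z); maximizing this quantity is
    # equivalent to minimizing E(r) (Eqs. 11.9 and 11.10).
    p = np.prod([psi_i[i][ri] for i, ri in enumerate(r)])
    for (i, j), m in psi_ij.items():
        p *= m[r[i], r[j]]
    return p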
Now, we proceed to discuss how Eq. 11.10 is solved in practice using general-purpose algorithms devised for probabilistic inference in graphical models.

In max-product belief propagation (BP), each position i iteratively sends a message to each of its neighboring positions j:

$$m_{i \to j}(r_j) = \max_{r_i} \left[ \psi_i(r_i)\, \psi_{ij}(r_i, r_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(r_i) \right] \qquad (11.11)$$

where N(i) is the set of nodes neighboring variable i. Note that m_{i→j} is, in essence, a message vector of relative probabilities for all possible rotamers r_j, as determined at a specific iteration of the algorithm.
Fig. 11.3 Message-passing in the max-product belief propagation (BP) algorithm. From left to
right, messages are passed (solid, bold arrows) along the following cycle: from SER 147 to
SER 145, from SER 145 to ARG 146, and from ARG 146 to SER 147. Note that, for each
message passed, all incoming messages (dotted arrows) are integrated into the calculation, with
the exception of that from the target node; for example, SER 147 ignores the message sent by SER
145 when calculating the message to send it
Upon convergence, choosing at each position the rotamer that maximizes its max-belief yields the most probable rotamer assignment r* (as defined in Eq. 11.10).
The belief propagation algorithm was originally formulated for the case where
the graphical model is a tree graph, i.e., no “loops” exist [568]. However, since
typical side chain placement and protein design problems will have numerous
cycles (Fig. 11.2), inexact max-beliefs are thus obtained, and the predicted rotamer
configurations are not guaranteed to be optimal. Nonetheless, loopy BP has been
shown to be empirically successful in converging to optimal solutions when run
on non-tree graphs (e.g., [776]). Furthermore, loopy BP has conceptual advantages
over related (statistical) inference techniques, since it does not assume independence
between protein positions and yet largely prevents self-reinforcing feedback cycles
that may lead to illogical or trivial fixed points. Equation 11.11 attempts to prevent
the latter by exclusion of the content of what variable j most recently sent to
variable i . On the other hand, for example, self-consistent mean field algorithms
are forced to make certain positional independence assumptions and thus may fail
to perform as well [206, 742].
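A minimal sketch of loopy max-product message passing over such a pairwise MRF (cf. Eq. 11.11) might look as follows; the flooding-style update schedule, the normalization and the data layout are illustrative choices.

import numpy as np

def max_product_bp(psi_i, psi_ij, n_iter=50):
    # psi_i[i] : (K_i,) singleton factor; psi_ij[(i, j)] : (K_i, K_j) pairwise factor, i < j.
    msgs, edges = {}, {}
    for (i, j), m in psi_ij.items():
        edges.setdefault(i, []).append(j)
        edges.setdefault(j, []).append(i)
        msgs[(i, j)] = np.ones(m.shape[1])
        msgs[(j, i)] = np.ones(m.shape[0])

    def pairwise(i, j):   # pairwise factor oriented with rows indexed by r_i
        return psi_ij[(i, j)] if (i, j) in psi_ij else psi_ij[(j, i)].T

    for _ in range(n_iter):
        for (i, j) in list(msgs):
            # Collect messages incoming to i, excluding the one sent by the target j.
            incoming = psi_i[i].copy()
            for k in edges[i]:
                if k != j:
                    incoming *= msgs[(k, i)]
            new = (pairwise(i, j) * incoming[:, None]).max(axis=0)
            msgs[(i, j)] = new / new.max()          # normalize for numerical stability
    # Max-beliefs: choose the maximizing rotamer at each position.
    assignment = []
    for i, f in enumerate(psi_i):
        b = f.copy()
        for k in edges.get(i, []):
            b *= msgs[(k, i)]
        assignment.append(int(np.argmax(b)))
    return assignment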
Whereas the standard belief propagation algorithm for side chain placement
involves the passing of messages between neighboring variable nodes (Fig. 11.3),
in the generalized belief propagation (GBP) algorithm [786], messages are passed
between sets of variables (known as clusters or regions). These regions in the graph
are often chosen based on some problem-specific intuition, typically along with
methods already developed in the field of physics (e.g., the Kikuchi cluster variation
method). GBP is also an approximate inference algorithm, but it incorporates
greater complexity in capturing dependencies between a large number of nodes
in the original graph (Fig. 11.2). This is done with the intention of achieving
higher accuracy and more frequent convergence [776, 786], while still maintaining
computational feasibility. Additional research related to side chain placement using
GBP, with regions of sequential triplets of protein positions, is detailed in [358].
The junction tree (JT) algorithm [420] can be considered to be a special case of
GBP, where the regions are chosen in such a way so that message-passing between
these nodes is guaranteed to converge to the optimal solution. It is thus considered
the “gold standard” in probabilistic inference systems. However, the reason that the
JT algorithm is not actually relevant in most real-world scenarios (specifically for
side chain placement and protein design) is that the sizes of the resulting regions are
so great as to make it infeasible to store them in any computer, let alone perform
computations such as message-passing between them. Nonetheless, it has proven
useful in showing that, for small enough problems where the JT algorithm can be
applied and the exact GMEC is found, BP usually finds this low energy solution
as well [776], validating the use of BP on larger problems where the JT algorithm
cannot be utilized.
As noted above, the loopy belief propagation (BP) algorithm has no theoretical
guarantees of convergence to the correct results, although BP does, in practice, often
converge to optimal results. Nevertheless, for cases for which BP does not converge,
or when it is desired that the side chain predictions are guaranteed to be of minimal
energy, there do exist algorithms related to BP, and message-passing in general,
that are mathematically proven to yield the optimal side chain configuration upon
convergence and under certain conditions.
In general, these message passing algorithms essentially aim at solving dual
linear programming (LP) relaxations to the global minimum energy configuration
(GMEC) integer program (as defined in Sect. 11.4.2 and [374]). In particular, they
compute a lower bound to the energy functional in Eq. 11.1 and suggest a candidate
minimal energy configuration. If the energy associated with this configuration is
equal to the energy bound, then it is provably the GMEC and the bound is said
to be tight. Otherwise, either the candidate configuration is not optimal but there
exists a configuration which does attain the (tight) bound; or, the bound is not tight
and hence unattainable by proper (integral) configurations. If the latter is the case,
a tighter lower bound can be sought, e.g., by calculating additional, higher order
messages [499, 680], albeit with increasing computational complexity. Below we
list some of the more notable algorithms that follow this scheme and provide a short
description of each.
The tree-reweighted max-product (TRMP) message-passing algorithm pro-
vides an optimal dual solution to the LP relaxation of the GMEC integer program
[747]. As pointed out in Sect. 11.6.1, the belief propagation message-passing
algorithm is guaranteed to converge to the GMEC if the underlying graph is
singly connected, i.e., it is an acyclic, or tree, graph. The idea behind the tree-
reweighted max-product algorithm [748] is to utilize the efficiency and correctness
of message-passing on a tree graph to attempt to obtain optimal results on an
arbitrary graph with cycles. This is implemented by decomposing the original
probability distribution into a convex combination of tree-structured distributions,
yielding a lower bound on the minimal energy rotamer configuration in terms of
the combined minimal energies obtainable for each of the tree graphs considered. A
set of corresponding message update rules are defined within this scheme, where the
minimal energy configuration can be exactly determined if TRMP converges and the
output “pseudo”-max-marginal beliefs are uniquely optimized by a single rotamer
configuration. It was also shown that even in cases where these max-beliefs are not
uniquely maximized by a particular configuration, it is sometimes still possible to
determine the lowest energy configuration with certainty [763]. A comparison of
simple LP relaxations (Sect. 11.4.2) and TRMP was performed in [779].
Similarly, the max-product linear programming (MPLP) algorithm [229] is
derived by considering the dual linear program of a relaxed linear program (LP)
for finding the minimal energy rotamer configuration (similar to that used in [374]).
where r_i^1 ≠ r_i^2, i.e., position i differs between these two configurations. Note that the identities of position i and rotamer r_i^2 were determined using the maximal sub-optimal rotamer beliefs (Eq. 11.12). After this, an additional run of BP is
Fig. 11.4 Calculation of the four lowest energy configurations by the BMMF sub-space partition-
ing algorithm, for the side chain placement of the five positions in the protein structure modeled in
Fig. 11.2. In this toy example, each of the five positions has two possible rotamers, A and B, and
the four lowest energy conformations (and their energies) are as listed at the top. At each iteration
(marked by the vertical axis), the next rotamer configuration (denoted by a circle) is determined
using BP, and the max-marginal beliefs resulting from BP are utilized to determine which rotamer
choice at which position (marked by a gray box) should be used in subsequent partition steps to
find an additional low energy configuration (denoted by a star). At each iteration, positive and
negative constraints are as marked (Adapted from [207])
required in order to calculate the next lowest energy configuration within the current configurational sub-space, where this second run is performed while requiring that position i adopt the rotamer r_i^2. Again, the next lowest energy configuration within this sub-space (stars in Fig. 11.4) is distinguished by having the best rotamer max-belief at a particular position.
As opposed to side chain placement, for the case of protein design, it is highly
undesirable to simply apply the BMMF algorithm in a straightforward manner.
This derives from the fact that such a run would often entail the prediction of
multiple low energy rotamer configurations that are available to each low energy
amino acid sequence, without providing additional sequence design predictions.
To overcome this obstacle, the tBMMF (type-specific BMMF) algorithm [207]
exploits the formulation for protein design described in Sect. 11.5 and generalizes
the BMMF algorithm. It does so by requiring that the successive low energy rotamer
configurations sought out by the BMMF algorithm correspond to distinct amino acid
sequences (see Fig. 11.6).
Conceptually, tBMMF operates by partitioning the rotamer search space by
the amino acid identities of the low energy rotamers, in order to prevent the
same amino acid sequence from being repeated twice (with various low energy
rotamer configurations for that sequence). Formally, tBMMF replaces the positive (Eq. 11.15) and negative (Eq. 11.16) constraint operations of BMMF with amino acid level constraints, T(r_i) = T(r_i^2) and T(r_i) ≠ T(r_i^2), respectively, where T(r_i) denotes the amino acid corresponding to rotamer r_i, and, analogously to the BMMF algorithm, position i differs in amino acid identity between the r^1 and r^2 configurations. Additionally, the identities of position i and rotamer r_i^2 were determined using the maximal sub-optimal rotamer beliefs (Eq. 11.12). Thus, tBMMF proceeds through successive amino acid-based partitionings of the rotamer search space.
As described above, the primary goal of both protein side chain placement and
protein design is the prediction of the lowest energy rotamer configuration, either
for a given sequence (side chain placement) or among the rotamers for multiple
sequences (protein design). Recall that this configuration is also known as the global
minimum energy configuration (GMEC) [612].
Alas, this minimum energy side chain configuration may be considerably
different from the “correct” configuration, as captured in the static protein crystal
structure. This incompatibility is, in part, a consequence of the widely used
modeling paradigm, which, for the sake of computational feasibility, limits the
conformational search space to discrete rotamers [574]. But, in fact, even within this
restricted conformational space, there often exist more “native-like” configurations
that are assigned (much) higher energy than the GMEC. Thus, the biological quality
of a side chain placement search algorithm is measured primarily using structural
attributes, such as the root mean square deviation (RMSD) between the minimal
energy configuration and the native structure, or the fraction of side chain dihedral
angles predicted correctly (up to some threshold), and only rarely using the energy
itself. However, since we focus here on the search techniques themselves, we also
consider the comparison of the actual minimal energies, as obtained by the various
algorithms.
In practice, loopy BP does not always converge; its convergence can often be improved by passing messages between larger regions of the graph, which account for longer distance interactions [778]. Even better convergence rates might be obtained by using a different message update schedule, e.g., as suggested by [171]. Importantly, it is always known when BP has failed to converge; in these cases, one can use the best intermediate result (that is, the one with lowest energy), or run another algorithm (e.g., GBP or Monte Carlo simulated annealing).
Notwithstanding these convergence problems, BP has been shown to obtain the
global minimum side chain configuration, for the vast majority of models where
this configuration is known (using DEE, the junction tree exact algorithm, or one of
the message passing algorithms with an optimality certificate) [776,778]. Moreover,
when BP converged, it almost always obtained a lower energy solution than that of
the other state-of-the-art algorithms. This finding agrees with the observation that
the max-product BP solution is a “neighborhood optimum” and therefore guaranteed
to be better than all other assignments in a large conformational region around
it [762].
The run time of BP is typically longer than that devoted, in practice, to
exploring the conformation space by heuristic methods (Sect. 11.4.1). Message
passing algorithms with a certificate of optimality are, in general, even slower,
but they usually obtain the GMEC solution faster than other computationally exact
methods (see Sect. 11.4.2).
For cases of computational side chain placement, reasonably accurate results
have been observed using either BP [778] or the tree-reweighted max-product
message-passing variant (TRMP, see Sect. 11.6.3) [781] on a standard test set of
276 single chain proteins: approximately 72% accuracy for correct prediction of
both χ1 and χ2 angles. Furthermore, it was found that optimizing the energy
function employed (Sect. 11.2.3), also utilizing TRMP, significantly improved
the predictive accuracy, up to 83% when using an extended rotamer
library [781].
For the case of side chain placement, the BMMF algorithm (Sect. 11.6.4) [776]
utilizes the speed and observed accuracy of belief propagation (BP) to predict the
collection of lowest energy rotamer configurations for a single amino acid sequence
and structure (Fig. 11.5). In [778], computational benchmarking for this task of
finding multiple low energy rotamer configurations for side chain placement was
performed. It was found that BMMF performed as well as, and typically better
than, other state-of-the-art algorithms, including Gibbs sampling, greedy search,
generalized DEE/A*, and Monte Carlo simulated annealing (MCSA). Specifically,
BMMF was unique in its ability to quickly and feasibly find the lowest energy
rotamer configuration (GMEC) and to output successive low energy configurations,
without large increases in energy. The other algorithms often did not succeed since
they were either not computationally practical (and thus did not converge at all), or
their collection of predicted rotamer configurations contained many rotamer choices
with overly high energies. On the other hand, whenever an exact algorithm was
tractable and capable of providing the optimal ensemble of configurations, BMMF
provided these optimal predictions as well.
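The space-partitioning idea underlying BMMF can be illustrated with a short Python sketch. This is not the published SPRINT code: it uses exhaustive minimization of a tiny toy model in place of belief propagation, but it applies the same kind of positive (fix a rotamer) and negative (exclude a rotamer) constraints to enumerate successive low energy configurations exactly; in BMMF the constrained subproblems are solved with BP and, in tBMMF, the constraints act on amino acid identities rather than on individual rotamers.

import heapq
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_pos, n_rot = 4, 3
E_single = rng.normal(size=(n_pos, n_rot))
E_pair = rng.normal(size=(n_pos, n_pos, n_rot, n_rot))

def energy(conf):
    e = sum(E_single[i, conf[i]] for i in range(n_pos))
    return e + sum(E_pair[i, j, conf[i], conf[j]] for i in range(n_pos) for j in range(i + 1, n_pos))

def best_in_subspace(fixed, forbidden):
    """Exhaustive stand-in for a BP-based minimization of a constrained subspace."""
    best = None
    for conf in itertools.product(range(n_rot), repeat=n_pos):
        if any(conf[p] != r for p, r in fixed.items()):
            continue
        if any(conf[p] in banned for p, banned in forbidden.items()):
            continue
        if best is None or energy(conf) < energy(best):
            best = conf
    return best

def k_lowest(k):
    counter = itertools.count()
    root = best_in_subspace({}, {})
    heap = [(energy(root), next(counter), root, {}, {})]
    out = []
    while heap and len(out) < k:
        e, _, conf, fixed, forbidden = heapq.heappop(heap)
        out.append((e, conf))
        for p in range(n_pos):              # partition the remainder of this subspace around conf
            if p in fixed:
                continue
            new_fixed = {**fixed, **{q: conf[q] for q in range(p) if q not in fixed}}
            new_forbidden = {q: set(s) for q, s in forbidden.items()}
            new_forbidden.setdefault(p, set()).add(conf[p])
            sub = best_in_subspace(new_fixed, new_forbidden)
            if sub is not None:
                heapq.heappush(heap, (energy(sub), next(counter), sub, new_fixed, new_forbidden))
    return out

for e, conf in k_lowest(5):
    print(round(e, 3), conf)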
For protein design, the message-passing-based tBMMF algorithm was computa-
tionally benchmarked against state-of-the-art algorithms on a dataset of design cases
of various biological qualities, sizes, and difficulties [207]. It was found that tBMMF
was always capable of recovering the optimal low energy sequence ensemble (e.g.,
see Fig. 11.6) when this set could be feasibly calculated (by DEE). As in the
BMMF benchmark above, tBMMF was the fastest and most accurate algorithm: it almost
always converged to extremely low energy sequences, whereas other algorithms either
converged quickly but to high energy solutions or did not converge at all.
Furthermore, tBMMF did not suffer
from the sharp increases in predicted sequence energy that the MCSA algorithms
Fig. 11.5 Side chain placement using BMMF. Top: Three-dimensional structures of low energy
solutions for PDB 1AAY, using the SCWRL energy function [98]. Backbones are colored light
gray, as well as the side chains for the lowest energy configuration (Config. 1). For comparison,
the side chains of the 4 next lowest conformations are colored black. Bottom: Superposition of the
1,000 lowest energy conformations, where the side chains of all but the lowest energy structure
are colored black; note the large amount of side chain variation within this structural ensemble, as
depicted by black side chains (Structures were drawn using Chimera [575])
encountered. Finally, the authors of [207] report that the tBMMF predictions
made using the Rosetta energy function [405], popularly used for computational
protein design, typically included many extremely similar low energy sequences.
To overcome this phenomenon, they suggested a modified tBMMF algorithm, which
successfully bypasses amino acid sequences whose biochemical character is similar to that of sequences
already output. The BMMF and tBMMF methods described here are implemented
in the freely available Side-chain PRediction INference Toolbox (SPRINT) software
package [208].
Fig. 11.6 tBMMF design output: multiple sequence alignment of predictions for 24 core calmod-
ulin positions (marked on the bottom) designed [207] for binding smooth muscle myosin light
chain kinase (PDB 1CDL) using the Rosetta energy function [405]. tBMMF was used to predict
low energy amino acid sequences; amino acid identities for the 10 lowest energy solutions are
shown, and the sequence logo summarizes the 500 best sequences. Graphics were generated using
TeXshade [30]
Table 11.2 Summary of message-passing methods for the calculation of low energy protein side chain configurations

Task | Application | Configuration(s) obtained | Algorithm [benchmark refs]
Single low energy side chain configuration | Side chain placement, protein design | Low energy (empirical) | Belief propagation (BP) [171, 240, 776, 778]; TRMP [779, 781]
Single low energy side chain configuration | Side chain placement, protein design | Globally optimal (provable) | MPLP [229]; LP relaxations [681]
Ensemble of low energy side chain configurations | Side chain placement | Low energy rotamer ensembles (empirical) | BMMF [777, 778]
Ensemble of low energy side chain configurations | Protein design | Low energy sequence ensembles (empirical) | tBMMF [207]
Belief propagation, BMMF, and their variants have been benchmarked on large and
diverse sets of real world side chain placement and protein design problems and
shown to predict lower energy side chain configurations, compared to state-of-the-
art methods; for a summary, see Table 11.2. In fact, some of the algorithms outlined
in Sect. 11.6.3, including TRMP [781], MPLP [229], and tightened LP relaxations
[681], have provably obtained the global minimum side chain configurations for
many of the instances in the side chain placement and protein design benchmark
sets of [779].
The ability to predict lower energy configurations is certainly encouraging.
However, computational structural researchers are usually more concerned with
the “biological correctness” of the predicted models than with their corresponding
energies. Unfortunately, using contemporary energy functions, the lower energy
configurations obtained by BP and its variants are only negligibly (if at all)
more native-like than those obtained by sub-optimal algorithms [781]. The usually
slower run times of BP and other inference algorithms and the possibility of
non-convergence further reduce the attractiveness of these algorithms from the
perspective of biological researchers. Consequently, an important direction in future
research is the development of faster, better converging, and more accurate BP-
related algorithms; structural biology provides a challenging, real world test bed for
such algorithms and ideas.
The following subsections describe a few variants of the “classical” side chain
placement and protein design tasks and discuss the potential utility of message-
passing algorithms in addressing these related applications.
Throughout this chapter, we have formulated the procedures of side chain placement
and protein design as operating on a fixed main chain structure that is given as input
to the computational problem. In addition, the amino acid side chains are modeled
as being one of a number of discrete possibilities (rotamers) sampled from a large
library of protein structures. Both of these simplifying assumptions essentially
define the classical paradigm for these problems, within which researchers have
made major inroads into fundamentally understanding proteins at the atomic level,
e.g., the design of a sequence that assumes the structure of a zinc finger domain
without the need for zinc [132]. Also, see [612] for an overall review on the
successes of protein design in the past decades. Nevertheless, this paradigm of
a fixed main chain and rigid side chains artificially limits the natural degrees of
freedom available to a protein sequence folding into its stable conformation(s).
For example, in protein design, we may energetically disallow a certain amino
acid sequence due to steric clashes between side chains, whereas small main chain
perturbations would have completely removed the predicted atomic overlaps. We
now outline some of the methods that bypass the limitations inherent in making
these modeling assumptions.
The simplest technique that incorporates main chain flexibility into computa-
tional side chain placement and protein design procedures consists of the use of a
pre-calculated ensemble of fixed main chain structures similar to the target structure
[201, 418]. A more sophisticated approach entails cycles of fixed main chain design
and main chain improvements on an as-needed basis, e.g., when random Monte
Carlo steps are rejected at a high rate due to the main chain strains induced by
sequence mutation [787]. Some protein design methods take this a step further and
perform main chain improvements at regular intervals (every fixed number of iterations)
by optimizing the main chain for the current amino acid sequence using
atomic-resolution structure prediction algorithms [406, 624]. The most elegant, yet
computationally burdensome, method involves simultaneous energetic optimization
within the main chain and side chain spaces [213, 214, 674]. Such optimization
should most closely resemble the simultaneous latitude available to proteins that
have naturally evolved within the structure-sequence space.
Similarly, permitting side chain flexibility within the framework of side chain
placement and design has been implemented in a number of ways. The most
straightforward of these approaches is to still model the side chains with a discrete
number of rotamer states, but to use more rotamers by “super-sampling” from the
empirical rotamer library. This technique has been employed in many works, e.g.,
[207, 325, 624]. More refined methods for permitting side chain flexibility include
the stochastic sampling of sub-rotamer space [151], flexible side chain redesign
(iterative application of sub-rotameric minimization within an MCSA search) [151],
and exact optimization within the sub-rotamer space using computationally costlier
DEE-based methods [215]. Another possibility for the sampling of sub-rotamer
space is to perform energy calculations for a particular rotamer from the library
based on stochastically generated versions of that rotamer [500]. Alternatively,
probabilistic modeling of side chain conformational space could allow sampling
in continuous space [290].
Most importantly, recent protein design studies have shown that the relaxation of
the rigid main chain and rotamers can produce more “realistic” side chain and amino
acid variability [151, 201, 624, 674], i.e., boosting the accuracy of protein modeling
procedures in describing natural phenomena.
The preliminary use of message-passing algorithms that somewhat model protein
main chain flexibility (in predicting the free energies of association for protein-
protein interactions) was demonstrated in [359], where a probabilistic ensemble of
protein main chains was employed. Future work that utilizes belief propagation
to model molecular flexibility will strive to directly incorporate simultaneous
optimization within the main chain and side chain spaces (e.g., as noted above
for [213, 214, 674]). An intriguing possibility for this approach would be to use
continuous-valued (Gaussian) Markov random fields [682] and their corresponding
belief propagation algorithms [465, 571, 761].
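As a hint of what such an approach could look like, the following Python toy (not taken from the cited works) runs Gaussian belief propagation in information form on a small chain-structured Gaussian Markov random field and checks the resulting means against a direct linear solve; the precision matrix and potential vector are arbitrary illustrative choices.

import numpy as np

# Gaussian MRF in information form: p(x) is proportional to exp(-0.5 x^T A x + b^T x)
# Chain of 5 scalar variables; A tridiagonal and positive definite.
n = 5
A = np.diag(np.full(n, 2.0)) + np.diag(np.full(n - 1, -0.8), 1) + np.diag(np.full(n - 1, -0.8), -1)
b = np.array([1.0, -0.5, 0.3, 0.7, -1.2])

neighbors = {i: [j for j in range(n) if j != i and A[i, j] != 0.0] for i in range(n)}
J = {(i, j): 0.0 for i in range(n) for j in neighbors[i]}   # message precisions
h = {(i, j): 0.0 for i in range(n) for j in neighbors[i]}   # message potentials

for _ in range(2 * n):   # on a chain, synchronous GaBP converges after about n sweeps
    J_new, h_new = {}, {}
    for (i, j) in J:
        Jcav = A[i, i] + sum(J[(k, i)] for k in neighbors[i] if k != j)
        hcav = b[i] + sum(h[(k, i)] for k in neighbors[i] if k != j)
        J_new[(i, j)] = -A[i, j] ** 2 / Jcav
        h_new[(i, j)] = -A[i, j] * hcav / Jcav
    J, h = J_new, h_new

# Marginal means and variances from the incoming messages
means, variances = [], []
for i in range(n):
    Ji = A[i, i] + sum(J[(k, i)] for k in neighbors[i])
    hi = b[i] + sum(h[(k, i)] for k in neighbors[i])
    means.append(hi / Ji)
    variances.append(1.0 / Ji)

print("GaBP means :", np.round(means, 4))
print("Exact means:", np.round(np.linalg.solve(A, b), 4))

On tree-structured models the messages converge to the exact marginal means and variances; on loopy models the means are exact whenever the iteration converges, while the variances are in general only approximate.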
Acknowledgements We would like to thank Amir Globerson and Talya Meltzer for their
discussions on message-passing algorithms with certificates of optimality.
Chapter 12
Inferential Structure Determination
from NMR Data
Michael Habeck
12.1 Introduction
12.1.1 Overview
M. Habeck ()
Department of Protein Evolution, Max-Planck-Institute for Developmental Biology, Spemannstr.
35, Tübingen, Germany
e-mail: [email protected]
Department of Empirical Inference, Max-Planck-Institute for Intelligent Systems, Spemannstr.
38, Tübingen, Germany
12.1.2 Background
Conventionally, structures are determined by minimizing a hybrid energy function of the form

E_hybrid(x) = w_data E_data(x) + E_phys(x),   (12.1)

where E_data evaluates the goodness of fit for a specific structure x (in the case of
crystallographic data this could be the R-factor), and E_phys is a molecular force field
which guarantees the integrity of the structures in terms of covalent parameters and
van der Waals contacts. The weighting factor w_data balances the two terms of
the hybrid energy function.
Jack and Levitt [330] introduced this way of refining protein structures against
crystallographic data. In the original publication they already remark that the best
choice of the weighting factor w_data "is something of a problem". This hints at a more
general problem with minimization approaches, namely the question of how to set
nuisance parameters. Nuisance parameters are parameters that are not of primary
interest but need to be introduced in order to model the data and their errors. In the
above example, the weighting factor is such a nuisance parameter. Other nuisance
parameters are related to the theories that we use to calculate mock data from a
given protein structure. These could be crystallographic temperature factors, NMR
calibration scales, peak assignments and alignment tensors, etc. In optimization
approaches, nuisance parameters are chosen manually or set to some default value.
The problem with manual intervention in protein structure determination is that one
might impose personal beliefs and biases on the structure.
Sometimes it is also stated that NMR ensembles reflect the dynamics of the
protein in solution. This statement, too, has to be judged with some caution.
Indeed there may exist a correlation between the structural heterogeneity in
NMR ensembles and the conformational fluctuations predicted, for example, with
molecular dynamics [1]. But again, such agreement arises not because the data are interpreted
correctly but more or less accidentally: nuclear spins that are subject to stronger
dynamics produce broader peaks that eventually disappear in the noise. That is,
one tends to observe and assign fewer NMR signals in mobile than in less flexible
regions. Variations in the completeness of data along the protein chain result in
ensembles that are in some parts better defined than in others. But it could also well
be that a lack of restraints is caused simply by missing chemical shift or cross-peak
assignments rather than dynamics.
The common source of the above-mentioned issues is that we use the wrong
tools to tackle protein structure determination. An optimization approach can only
determine a single, at best globally optimal, structure but cannot indicate its quality.
Missing parameters have to be set manually or determined by techniques such as
cross-validation. Inferential structure determination (ISD) [267, 601] is a principled
alternative to the optimization approach. The reasoning behind ISD is that protein
structure determination is nothing but a complex data analysis or inference problem.
In this view, the main obstacle is in the incompleteness of the information that
is provided by experiments and in the inadequacy of optimization methods to
cope with this incompleteness. Instead of jumping the gun by formulating protein
structure calculation as an optimization problem, let us step back and contemplate
the question: What are the adequate mathematical tools to make quantitative
inferences from incomplete information?
The basic problem with inferences from incomplete information is that, contrary
to deductive logic, no unique conclusions can be drawn. Rather, one has to allow
propositions to attain truth values different from just True or False and introduce
a whole spectrum spanning these two extrema. It was Cox [127] who showed
that, rather surprisingly, the rules that govern the algebra of such a scale of truth
values are prescribed by the simple demand that they comply with the rules of
standard logic. Cox showed that the calculus of probability theory is isomorphic
to the algebra of probable inference. Cox’s work was the culmination of a long-
lasting effort to establish the principles and foundations of Bayesian inference
beginning with the work of Bayes [27] and Laplace [416]. In recent years, the
view of probability theory as an extended logic has been promoted most fiercely
by Jaynes [342].
These developments established Bayesian inference as a unique and consistent
framework to make quantitative inferences from limited information. Today, past
quarrels between Bayesians and Frequentists (see for example Jaynes’ polemics
on significance tests [338]) appear as records of some ideological stone age. Also
the advent of powerful computers helped to develop a more constructive attitude
towards Bayesianism. Advanced numerical methods and computer power make it
possible to solve problems that resisted a Bayesian treatment because they were too
complex to be tackled with pencil and paper.
ISD starts from the observation that protein structure determination is nothing but a
complex inference problem. Given experimental data (from X-ray crystallography,
NMR spectroscopy, electron microscopy) we want to determine a protein’s three-
dimensional structure. The Bayesian way to solve this problem is straightforward.
First, we recall and formalize what we already know about our particular protein
and protein structures in general. We know the amino acid sequence of the protein,
we know that amino acids have a rather rigid covalent structure, we know that each
amino acid is composed of atoms that occupy some volume in space and should not
overlap. The second ingredient is a probabilistic model for the observations that we
want to use for structure determination.
Here an advantage of a Bayesian approach comes into play. Data analysis
problems in general are inverse problems: we measure data that are related to
our parameters of interest and hope to invert this relation to gain insight into the
parameters themselves. Bayesians do not try to solve inverse problems by direct
inversion but delegate the inversion task to the Bayesian inference machinery [341].
In our context this means that we do not derive from the NMR spectra distance
bounds which will then be used to generate conformations. Rather we aim at
modeling the data as realistically as possible and apply Bayes’ theorem to do the
inversion for us. Modeling the data probabilistically comprises two steps. First, we
calculate idealized “mock” data using a forward model, i.e. a theory or physical law.
Second, we need to take account of the fact that the observations will deviate from
our predictions. There are many possible sources for discrepancies (experimental
noise, theoretical shortcomings, systematic errors because some effects such as
protein dynamics or averaging were neglected); they all need to be quantified using
an error model.
12.2.1 Formalism
Bayes' theorem converts the probability of the data into a probability over the joint
space of protein conformations and nuisance parameters. This posterior distribution
p(x, ξ|d, I), where ξ collectively denotes the nuisance parameters, is everything we need to solve a structure determination problem.
It determines the most likely protein conformations and specifies their precision
in terms of the posterior probability mass that is associated with each of them.
To arrive at the posterior distribution we simply consider the likelihood as a function
of the parameters x and ξ (instead of inverting the data directly). Furthermore, we
express our data-independent knowledge through a prior probability:

p(x, ξ|d, I) ∝ p(d|x, ξ, I) p(ξ|x, I) p(x|I),

where p(x|I) is the prior distribution of protein conformations and p(ξ|x, I) the
prior probability of the nuisance parameters given the structure. Here we will
assume that, a priori, the nuisance parameters are independent of the structure:
p(ξ|x, I) = p(ξ|I). Moreover, it is not necessary to calculate the normalization
constant p(d|I), the "evidence" [457], for the questions addressed in this chapter
(further details on the use of evidence calculations in ISD can be found in [266]).
Because the probability of the data plays such an important role we will give it
its own name, likelihood, and denote it by L. The likelihood is the probability of the
data conditioned on the molecular structure x and the nuisance parameters ξ, viewed as
a function of these parameters. That is, L(x, ξ) = p(d|x, ξ, I).
Fig. 12.1 Bayesian modeling of structural data. To set up the likelihood function for some
observable y, we first use a forward model f(x; α), a deterministic function to predict mock data
ŷ from the structure. The forward model is parameterized with additional unknowns α. The mock
data are then fed into an error model g(y|ŷ, σ) which assesses the discrepancies between
observed and mock data. g is a conditional probability distribution that is normalized over the first
argument (i.e. the observed data) and involves additional nuisance parameters σ
The likelihood can be viewed as a score for the fit between the distribution of
the data and a parameterized model. Additional prior information modulates this
score by down-weighting regions where parameter values are unlikely to fall and
up-weighting those regions that are supported by our prior knowledge. The prior
distribution of the molecular structure is the Boltzmann distribution [336]:

π(x) = Z(β)⁻¹ exp{−β E(x)},

where E(x) is a molecular force field and β an inverse temperature. Note that there is no simple way of calculating the partition function Z(β), which
is not a problem if we are only interested in parameter estimation.
In the current implementation of ISD, we use a very basic force field E(x),
commonly employed in NMR structure determination [446]. The force field main-
tains the stereochemistry of the amino acids, i.e. it restrains bond lengths and bond
angles to ideal values as defined in the Engh-Huber force field [173]. In addition
we penalize the overlap of atoms using a purely repulsive potential [301]. Van
der Waals attraction and electrostatic interactions are usually not included in NMR
structure calculations. These simplifications are valid for data with favorable data-
to-parameter ratios. For sparse and/or low quality NMR data it becomes necessary
to include additional terms [266] either derived from the data base of known protein
structures [410] or a full-blown molecular force field [111]. One of the future
developments of ISD will be to incorporate a more elaborate force field such as
the full-atom energy of Rosetta [76] and to add new potentials that restrict the main
chain dihedral angles to the allowed regions of the Ramachandran plot, or the side
chain dihedral angles to the preferred rotamers. The latter goal can be accomplished
by the combined use of the probabilistic models presented in Chap. 10 as a prior
distribution, as was recently shown [552].
The generic approach for formulating an appropriate likelihood is to first come
up with a reasonable forward model that allows the calculation of idealized data
from a given structure (see Fig. 12.1). The forward model often involves parameters
α such that an idealized measurement is given by ŷ = f(x; α). The second ingredient
is an error model g(y|ŷ, σ) that quantifies the deviations of the observations from the
mock data (Fig. 12.1). The full likelihood is then the product of the error models over
all measurements (Eq. 12.8), and the nuisance parameters comprise the theory parameters α and the parameters
of the error models σ. Using the notation introduced in this paragraph the posterior
distribution is:

p(x, α, σ|d) ∝ L(x, α, σ) π(x) π(α) π(σ).   (12.9)

Here we introduced prior distributions for the forward and error model parameters α
and σ. Again, we made the assumption that all parameters are independent a priori.
This could be relaxed if needed.
Let us illustrate the basic machinery of the ISD approach for a simple example.
We consider a structure that can be parameterized by a single conformational degree
of freedom, a dihedral angle φ (Fig. 12.2). We want to infer the angle from NMR
measurements. One way to determine dihedral angles by NMR is to measure three-
bond scalar coupling constants (also called J couplings). There is an approximate
theory developed by Karplus [361] that relates the strength of a three-bond scalar
coupling J to the dihedral angle of the intervening bond φ:

J(φ) = A cos²(φ) + B cos(φ) + C.   (12.10)

The Karplus curve (Eq. 12.10) is basically a Fourier expansion involving three
expansion coefficients (Karplus coefficients) A, B, and C that we assume to be
known for now. The Karplus curve is our forward model. To analyze n measured
scalar couplings J_i we assume a Gaussian error model with standard deviation σ, such that the likelihood reads

L(φ, σ) = Z(σ)⁻¹ exp{−χ²(φ) / (2σ²)},   (12.11)
where Z(σ) = (2πσ²)^{n/2} is the nth power of the normalization constant of a
Gaussian distribution. The exponent involves the χ² residual:

χ²(φ) = Σ_i [J_i − J(φ)]² = n [J̄ − J(φ)]² + n var(J),   (12.12)

where J̄ is the sample average and var(J) the sample variance over all measured
scalar couplings.
As we saw in Sect. 12.2.1, the likelihood function can be interpreted in terms of
the cross-entropy between the empirical distribution of the data and a parametric
model. In this example, the model has two free parameters, the conformational
degree of freedom φ and the error of the couplings σ. By varying these parameters
we can control how much the model and the distribution of the couplings overlap.
Let us assume that we know the error σ and focus on the angle φ. Changes in
the angle shift the center of the Gaussian model in a non-linear fashion. The
optimal choice for the model maximizes the overlap and is obtained for angles φ
such that J(φ) = J̄. Because the Karplus curve is a non-linear forward model
the posterior distribution of the angle is not a Gaussian density (see Fig. 12.3).
Moreover, the Karplus curve is not one-to-one; we therefore obtain a multi-modal
posterior distribution.
Without taking into account prior structural knowledge on the dihedral angle φ,
the posterior distribution is multi-modal. However, we know that in proteins, for all
amino acids except glycine, the right half of the Ramachandran plot is almost
unpopulated. This fact is encoded in the force field E(φ) used in ISD. Figure 12.4
shows the prior distribution, the likelihood and the resulting posterior distribution
for the alanine dipeptide example. The posterior distribution is the product of the
prior and likelihood. Hence, regions in conformational space that are unlikely to
be populated are masked out. The resulting posterior distribution has only a single
mode, in contrast to the multi-modal likelihood function.
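A grid-based version of this toy calculation is easy to write down. In the Python sketch below, the Karplus coefficients, the error σ, and the simple prior favouring negative φ are assumed, illustrative values rather than the ones used in the chapter; the point is only to show the likelihood being multiplied by the prior and the multimodality disappearing.

import numpy as np

A, B, C = 9.5, -1.6, 1.8                    # assumed Karplus coefficients (illustrative only)
sigma = 0.5                                 # assumed coupling error

def karplus(phi):
    return A * np.cos(phi) ** 2 + B * np.cos(phi) + C

rng = np.random.default_rng(2)
phi_true = np.deg2rad(-65.0)
J_obs = karplus(phi_true) + sigma * rng.normal(size=5)      # n = 5 synthetic couplings

phi = np.linspace(-np.pi, np.pi, 2000)
chi2 = ((J_obs[:, None] - karplus(phi)[None, :]) ** 2).sum(axis=0)
log_like = -chi2 / (2 * sigma ** 2)                         # Gaussian likelihood (Eq. 12.11)
log_prior = 4.0 * np.cos(phi - np.deg2rad(-60.0))           # crude stand-in for -E(phi), favours phi < 0

def normalize(logp):
    p = np.exp(logp - logp.max())
    return p / (p.sum() * (phi[1] - phi[0]))

# The likelihood alone is multi-modal because the Karplus curve is many-to-one;
# multiplying by the prior removes the spurious mode.
posterior = normalize(log_like + log_prior)
print("true phi       : %.1f deg" % np.degrees(phi_true))
print("posterior mode : %.1f deg" % np.degrees(phi[np.argmax(posterior)]))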
Similar arguments hold for σ. Assuming that we know the correct angle, the
posterior distribution of the inverse variance (precision) κ = σ⁻² is the Gamma
distribution:

p(κ|φ) ∝ κ^{n/2−1} exp{−κ χ²(φ)/2}.   (12.13)
Fig. 12.3 Posterior distribution of the φ angle in the alanine dipeptide example. Regions with
high posterior probability p(φ) correspond to φ values that maximize the overlap between
the empirical distribution of the data (grey) and the model (dashed); only the φ angle is
considered a free parameter and the error σ is fixed. Low posterior probability regions correspond
to angular values that minimize the fit between the distribution of the data and the model. Because
the Karplus curve is a non-linear, many-to-one forward model the posterior density is non-Gaussian
and multi-modal
Fig. 12.4 Prior probability, likelihood, and posterior density of the alanine dipeptide example.
A priori the φ angle is mostly restricted to negative values (left panel). This helps to disambiguate
the likelihood function (middle panel). Multiplication of the prior with the likelihood results in a
posterior distribution (right panel) that exhibits a single mode whose width encodes how accurately
φ is determined by the data
The posterior mean and variance of the precision are

⟨κ|φ⟩ = n / χ²(φ),   ⟨(Δκ)²|φ⟩ = 2n / [χ²(φ)]².   (12.14)
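Because this conditional posterior has a standard form it can be handled without any sampling at all. The short Python check below assumes a Jeffreys-type 1/κ prior (an assumption of this sketch, chosen because it reproduces the moments of Eq. 12.14) and compares the analytic mean and variance with Monte Carlo draws from the Gamma distribution.

import numpy as np

rng = np.random.default_rng(3)
residuals = rng.normal(scale=0.4, size=20)       # synthetic J_i - J(phi) values
n, chi2 = residuals.size, float(np.sum(residuals ** 2))

# Gamma posterior for the precision kappa = 1/sigma^2: shape n/2, rate chi2/2
# (Gaussian likelihood combined with a Jeffreys prior p(kappa) proportional to 1/kappa)
shape, rate = n / 2.0, chi2 / 2.0
samples = rng.gamma(shape, scale=1.0 / rate, size=200_000)

print("analytic mean     :", n / chi2, "   sampled:", samples.mean())
print("analytic variance :", 2 * n / chi2 ** 2, "   sampled:", samples.var())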
This example illustrates that the conditional posterior distributions of the
nuisance parameters are often of a standard form, which allows us to estimate them
analytically. The estimation of σ shows that it is indispensable to use a probabilistic
framework to estimate nuisance parameters. As discussed before, the likelihood
function evaluates the overlap between the empirical distribution of the data and
a parametric model. If the model were not normalized, the cross-entropy score
would be meaningless, and the overlap could be maximized by simply increasing the
amplitude or width of the model. However, in normalized models the amplitude
and width of the model are directly related via the normalization condition. When
working in a hybrid energy framework it becomes very difficult, if not impossible, to
derive normalization constants and thereby the additional terms that would allow for a correct
extension of the hybrid energy function.
The negative logarithm of the posterior distribution (Eq. 12.9) reads

−log p(x, α, σ) = −log L(x, α, σ) + βE(x) − log π(α) − log π(σ),   (12.15)

where constant terms have been dropped on the right hand side. Considering only
terms that depend on the conformational degrees of freedom we obtain a pseudo-
energy

E_pseudo(x) = −log L(x, α, σ) + βE(x)

that has the same functional form as the hybrid energy (Eq. 12.1) if we identify
−log L(x, α, σ) with w_data E_data(x) and βE(x) with E_phys(x). Assuming our generic
model for NMR data (Eq. 12.8) we obtain the correspondence:

w_data E_data(x) = −Σ_i log g(y_i | f(x; α), σ).   (12.16)

Therefore, −log g(y|f(x; α), σ), viewed as a function of the structure, implies a
restraint potential resulting from measurement y.
Bayesian theory can be used to set up the hybrid energy and clarifies some of the
shortcomings of the optimization approach to biomolecular structure calculation.
There are two major problems with the hybrid energy approach: First, it fails to
treat nuisance parameters as unknowns. The nuisance parameters are treated as fixed
quantities that need to be set somehow, and the additional terms in Eq. 12.15 that originate
from the prior and from the normalization constant of the likelihood are neglected. Second, the
minimization approach fails to acknowledge the probabilistic origin of the hybrid
energy. The hybrid energy is treated merely as a target function for optimization,
its interpretation as a conditional posterior distribution over protein conformational
space is not seen. Moreover, hybrid energy minimization only locates the maxima
of the posterior distribution but does not explore the size of the maxima. Often it
is argued that minimization by MDSA “samples” the hybrid energy function. We will
show later that “sampling” by multi-start MDSA from randomized
initial conformations and velocities has no statistical meaning.
functions and the canonical ensemble. To sample protein structures properly from
the canonical ensemble is already a formidable, highly non-trivial task [288]. This is
so because in a protein structure the conformational degrees of freedom are strongly
correlated. Changes in one atom position can influence the allowed spatial region of
atoms that are very far away in sequence. A random walk Metropolis scheme will be
hopelessly inefficient because even small changes in dihedral angles can add up to
cause major structural changes in the following part of the polypeptide chain. It is
therefore important to take these correlations into account when proposing a new
structure.
We use the Hamiltonian Monte Carlo (HMC) method [161, 536] to sample
dihedral angles from their conditional posterior probability. The trick of HMC is
to first blow up the problem by the introduction of angular momenta that follow a
Gaussian distribution. The negative logarithm of the product of the distributions
of the momenta and dihedral angles defines a Hamiltonian. The proposal of a
new conformation proceeds as follows. A molecular dynamics (MD) calculation in
dihedral angle space is started from the current angles and from random momenta
generated from a Gaussian distribution. The angles and momenta are updated using
the leapfrog method [5]. Finally, the move is accepted according to Metropolis’
criterion. The acceptance probability is determined by the difference in Hamiltonian
before and after running the MD calculation.
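A bare-bones rendering of this scheme in Python is given below. The one-dimensional double-well target, the step size, and the trajectory length are arbitrary choices made for the illustration; in ISD the same accept/reject logic wraps a torsion-angle molecular dynamics run on the full conditional posterior of the dihedral angles.

import numpy as np

def U(q):                                   # negative log-probability of a 1D double-well target
    return (q ** 2 - 1.0) ** 2

def grad_U(q):
    return 4.0 * q * (q ** 2 - 1.0)

def hmc_step(q, rng, eps=0.1, n_leapfrog=25):
    p = rng.normal()                        # random momentum drawn from a Gaussian
    q_new, p_new = q, p
    p_new -= 0.5 * eps * grad_U(q_new)      # leapfrog integration of Hamilton's equations
    for step in range(n_leapfrog):
        q_new += eps * p_new
        if step != n_leapfrog - 1:
            p_new -= eps * grad_U(q_new)
    p_new -= 0.5 * eps * grad_U(q_new)
    dH = (U(q_new) + 0.5 * p_new ** 2) - (U(q) + 0.5 * p ** 2)
    if np.log(rng.uniform()) < -dH:         # Metropolis criterion on the Hamiltonian difference
        return q_new, True
    return q, False

rng = np.random.default_rng(4)
q, chain, accepted = 0.0, [], 0
for _ in range(5000):
    q, ok = hmc_step(q, rng)
    accepted += ok
    chain.append(q)
chain = np.array(chain)
print("acceptance rate            :", accepted / len(chain))
print("fraction in right/left well:", np.mean(chain > 0), np.mean(chain < 0))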
The Gibbs sampler has various shortcomings, among which the most serious is non-
ergodicity. The Markov chain gets trapped in a single mode and fails to sample the
entire posterior distribution. Posterior samples that have been generated with a non-
ergodic sampler will yield biased estimates. The remedy is to apply an idea from
physics. A system that is trapped in a meta-stable state can reach the equilibrium
after heating and subsequent annealing (as opposed to quenching). This idea is used,
for example, in simulated annealing (SA). A problem here is that SA can still get
trapped if the cooling schedule is not appropriate (e.g. if the system is not cooled
slowly enough). Moreover, SA is an optimization method, not a sampling method. SA does
not aim at generating properly weighted samples from a probability distribution. An
inherent problem is that SA generates a single trajectory in state-space and thus can
still end up in a suboptimal state.
The replica-exchange Monte Carlo (RMC) or parallel tempering method [704]
fixes some of the flaws of simulated annealing. Similar to SA, RMC uses a
temperature-like parameter to flatten the probability density from which one seeks
to generate samples. However, in contrast to SA, RMC does not keep only a single
heat-bath whose temperature is lowered but maintains multiple copies of the system,
so-called replicas, at different temperatures. These systems do not interact and
are sampled independently but they are allowed to exchange sampled states if the
probability for such an exchange satisfies the Metropolis criterion. Hence RMC is a
hierarchical sampling algorithm. The individual replicas are simulated using some
standard sampling method (referred to as “local sampler”) such as random walk MC,
molecular dynamics, or Gibbs sampling. The local samplers are combined using
a meta-Metropolis sampler that generates “super-transitions” of the entire replica
system.
Obviously, the choice of the temperatures influences the sampling properties
of the replica-exchange method. Several issues need consideration. First, the
temperature of the “hottest” replica has to be chosen such that the local sampler can
operate ergodically. Second, the temperature spacing between neighboring replicas
has to be such that the exchange rate is high enough (i.e. the energy distributions
need to overlap considerably). The advantages of RMC over SA are manifold. First,
RMC is a proper sampling method that, after convergence, yields correctly weighted
samples, whereas SA does not. Second, RMC maintains
heat-baths at all temperatures throughout the entire simulation. Therefore states
that are trapped in local modes can still escape the modes when continuing the
simulation. Third, several indicators monitor the performance of RMC sampling.
Low exchange rates indicate that mixing of the replica chain is probably not
achieved within short simulation times. Trace plots of the replica energies and the
total energy can be used to check for non-convergence.
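The hierarchical structure of the method is easy to see in a toy Python implementation such as the one below (illustrative only): each replica is advanced by a random walk Metropolis local sampler at its own inverse temperature, and neighbouring replicas periodically attempt to exchange their states with the Metropolis criterion. The temperatures, step size, and rugged one-dimensional energy are arbitrary.

import numpy as np

def energy(x):                                   # rugged 1D landscape with several minima
    return 0.1 * x ** 2 + 4.0 * np.sin(3.0 * x) ** 2

rng = np.random.default_rng(5)
betas = np.array([1.0, 0.6, 0.35, 0.2, 0.1])     # inverse temperatures, "coldest" replica first
x = rng.normal(size=betas.size)                  # one state per replica
target_samples = []

for sweep in range(20000):
    for k, beta in enumerate(betas):             # local sampler: random walk Metropolis
        prop = x[k] + rng.normal(scale=0.5)
        if np.log(rng.uniform()) < -beta * (energy(prop) - energy(x[k])):
            x[k] = prop
    if sweep % 10 == 0:                          # "super-transition": swap neighbouring replicas
        for k in range(betas.size - 1):
            log_acc = (betas[k] - betas[k + 1]) * (energy(x[k]) - energy(x[k + 1]))
            if np.log(rng.uniform()) < log_acc:
                x[k], x[k + 1] = x[k + 1], x[k]
    target_samples.append(x[0])                  # keep the beta = 1 replica

target_samples = np.array(target_samples[2000:])
print("mean energy of the beta = 1 chain:", energy(target_samples).mean())
print("distinct basins visited          :", np.unique(np.round(target_samples * 3 / np.pi)).size)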
In ISD, we use two temperature-like parameters that separately control the
influence of the data and force field [268]. The data are weighted by scaling
the likelihood function L . The force field is modified by using the Tsallis ensemble
instead of the Boltzmann ensemble. In the Tsallis ensemble, a new parameter q is
introduced that controls the degree of non-linearity in the mapping of energies. For
q > 1, energies are transformed using a logarithmic mapping:
E_q(x) = q / [β(q − 1)] · log{1 + β(q − 1)(E(x) − E_min)} + E_min   (12.17)

where E_min ≤ E(x) must hold for all configurations x. In the low energy regime,
β(q − 1)(E(x) − E_min) ≪ 1, the Tsallis ensemble reduces to the Boltzmann
ensemble. In particular it holds that E_{q=1}(x) = E(x). The logarithmic mapping
of energies facilitates conformational changes over high-energy barriers.
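A direct transcription of Eq. 12.17 makes this behaviour easy to verify numerically; in the small Python snippet below (the value of q is an arbitrary example) the mapped energies approach the original ones as q approaches 1 and grow only logarithmically with E when q > 1.

import numpy as np

def tsallis_energy(E, beta=1.0, q=1.06, E_min=0.0):
    """Logarithmic energy mapping of Eq. 12.17; requires E >= E_min and q > 1."""
    return q / (beta * (q - 1.0)) * np.log1p(beta * (q - 1.0) * (E - E_min)) + E_min

E = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])
print("E              :", E)
print("E_q, q = 1.06  :", np.round(tsallis_energy(E, q=1.06), 2))
print("E_q, q -> 1    :", np.round(tsallis_energy(E, q=1.0001), 2))   # essentially the Boltzmann case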
space using a recursive algorithm. The stepsize used during HMC is updated using
the following heuristic. Acceptance of a new sample increases the stepsize by
multiplication with a factor slightly larger than one. Rejection of a new sample
reduces the stepsize by multiplication with a factor slightly smaller than one. After
a user-defined exploration phase, the stepsizes are fixed to their median values.
The library provides support for running RMC posterior simulations on computer
clusters.
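The stepsize heuristic itself takes only a few lines. In the sketch below (Python), the multiplicative factors 1.02 and 0.98 and the length of the exploration phase are assumptions; the chapter only specifies factors slightly larger and slightly smaller than one and a user-defined exploration phase.

import statistics

class StepsizeAdapter:
    """Grow the stepsize on acceptance, shrink it on rejection,
    and freeze it at the median after an exploration phase."""

    def __init__(self, stepsize=0.1, up=1.02, down=0.98, exploration=1000):
        self.stepsize, self.up, self.down = stepsize, up, down
        self.exploration, self.history, self.frozen = exploration, [], False

    def update(self, accepted):
        if not self.frozen:
            self.stepsize *= self.up if accepted else self.down
            self.history.append(self.stepsize)
            if len(self.history) >= self.exploration:
                self.stepsize = statistics.median(self.history)
                self.frozen = True
        return self.stepsize

# inside an HMC loop:  eps = adapter.update(move_was_accepted)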
Fig. 12.6 Comparison of RMC with multi-start MDSA. The left panel shows the Karplus curve
used to generate noise-free mock data indicated by the dashed horizontal line. The middle and
right panels show the posterior distribution of the dihedral angle as a solid black line. Histograms
have been compiled (shown in gray) from sampled dihedral angles using the RMC method (middle
panel) and multi-start MDSA (right panel)
global maxima. If the MDSA protocol were perfect the same structure would be
found irrespective of the starting positions and velocities, and the ensemble would
collapse to a single structure. That is, “sampling” by multi-start MDSA produces
ensembles that are not statistically meaningful. MDSA ensembles may be too sharp
if the protocol is very efficient, but they can also be too heterogeneous
if the protocol is not adapted to the quality of the data. This is often
the case for sparse data [601].
12.4 Applications
Nuclear Overhauser effect (NOE) data are by far the most important and informative
measurements for protein structure determination [772]. The NOE is a second order
relaxation effect. Excited nuclear spins relax back to thermal equilibrium; the speed
of the relaxation process is modulated by the environment of the spins. The isolated
spin-pair approximation (ISPA) [677] neglects that the relaxation involves the entire
network of nuclear spins and is governed by spin diffusion. Still the ISPA is the most
widespread theory to model NOE intensities and volumes. The ISPA simplifies the
exchange of magnetization to a pair of isolated, spatially close spins of distance r.
A more formal derivation builds on relaxation matrix theory [460] and considers
short mixing times (initial-rate regime). According to the ISPA the intensity I of an
NOE is proportional to the inverse sixth power of the distance such that our forward
model for NOE data is

I(x) = γ [r(x)]⁻⁶   (12.18)

and involves the calibration factor γ > 0 which will be treated as a nuisance
parameter.
What is an appropriate error model for NOEs? The absolute scale of a set of
NOEs has no physical meaning. Therefore deviations between scaled and unscaled
intensities should have the same likelihood. Mathematically this means that the error
model has to show a scale invariance:

g(I | I(x)) = λ g(λI | λI(x)).

This relation must hold for any choice of the scale λ. For λ = 1/I(x) we have:

g(I | I(x)) = I(x)⁻¹ h(I / I(x)),

where h(·) = g(· | 1) is a univariate density defined on the positive axis. We are free
in the choice of h. A maximum entropy argument yields the lognormal distribution
as the least biasing distribution if we assume that the average log-error of the ratio
I / I(x) is zero and its variance to be σ². Our error model for NOE intensities is

g(I | I(x), σ) = 1 / (√(2π) σ I) · exp{ −(1 / 2σ²) log²(I / I(x)) }.

If we combine this error model with the forward model (ISPA, Eq. 12.18) we obtain:

p(I | x, γ, σ) = 1 / (√(2π) σ I) · exp{ −(1 / 2σ²) log²(I / (γ [r(x)]⁻⁶)) }.   (12.19)
One neat feature about the lognormal model (Eq. 12.19) is that it does not distin-
guish between restraints involving intensities or distances. A problem with classical
relaxation matrix calculations that try to fit NOE intensities in a more quantitative
way than the ISPA is that a harmonic restraint potential over-emphasizes large
intensities corresponding to short, non-informative distances
between spins of the same residue or sequentially adjacent residues. On the other
hand, a harmonic potential defined on intensities under-emphasizes small intensities
that correspond to tertiary contacts defining the protein topology. The functional
form of the log-normal model is invariant under scaling transformations of the
intensities. Therefore, calculations based on intensities or distances lead to the same
results, only the range of the error is different.
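The restraint implied by Eq. 12.19 is straightforward to evaluate in code. The short Python sketch below (with a made-up calibration factor, error, and set of distances) computes the negative log-likelihood and verifies numerically that the same χ²-like term is obtained whether it is written on intensities with error σ or on calibrated distances with error σ/6.

import numpy as np

def neg_log_lognormal(I_obs, r_calc, gamma, sigma):
    """Negative log of p(I | x, gamma, sigma), Eq. 12.19, for a set of NOE intensities."""
    I_calc = gamma * r_calc ** -6
    return np.sum(np.log(np.sqrt(2 * np.pi) * sigma * I_obs)
                  + np.log(I_obs / I_calc) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(6)
r_true = rng.uniform(2.5, 5.5, size=30)                    # "true" spin-spin distances (Angstrom)
gamma, sigma = 3.0e5, 0.3                                  # assumed calibration factor and error
I_obs = gamma * r_true ** -6 * np.exp(sigma * rng.normal(size=r_true.size))
r_model = r_true * np.exp(0.02 * rng.normal(size=r_true.size))   # slightly wrong model distances

d_obs = (I_obs / gamma) ** (-1.0 / 6.0)                    # the same data expressed as distances
chi2_intensities = np.sum(np.log(I_obs / (gamma * r_model ** -6)) ** 2) / (2 * sigma ** 2)
chi2_distances = np.sum(np.log(d_obs / r_model) ** 2) / (2 * (sigma / 6.0) ** 2)

print("negative log-likelihood:", neg_log_lognormal(I_obs, r_model, gamma, sigma))
print("chi2 on intensities    :", chi2_intensities)
print("chi2 on distances      :", chi2_distances, "(identical up to rounding)")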
[Fig. 12.7: distributions p(RMSD) of Cα-RMSD values; one panel: Ubiquitin]
We were able to show that use of the lognormal model results in more accurate
structures than the standard approach involving distance bounds [602]. Figure 12.7
shows the distribution of Cα-RMSD values between X-ray and NMR structures
for two structure calculations from NOE data. Both data sets were analyzed using a
lognormal model and a lower-upper-bound potential (FBHW, flat-bottom harmonic-
wall) as implemented in CNS. The RMSD histograms are systematically shifted
towards smaller values when using a lognormal model, indicating that this model
is able to extract more information from the data, which results in more accurate
structures. The lognormal model is also useful in conventional structure calculation
where it implies a log-harmonic restraint potential, which was proven to be superior
to distance bounds [547].
The basic model for scalar coupling constants was already introduced in the
illustrative example (Sect. 12.2.2). In a fully Bayesian approach, also the Karplus
coefficients need to be considered as free nuisance parameters [269]. The posterior
distribution of the three Karplus coefficients is a three-dimensional Gaussian
because they enter linearly into the forward model and because the error model is
also Gaussian. Figure 12.8 shows the estimated Karplus curves from data measured
on Ubiquitin. The sampled Karplus curves correspond well to those fitted to the
crystal structure and reported in the restraint file 1D3Z.
d = rᵀ S r / r⁵   (12.20)

where r is the three-dimensional bond vector that is involved in the dipolar coupling
and r is the length of the bond vector. Because of the constraints on S, we use a
parameterization involving five independent tensor elements s₁, …, s₅:

    S = ( s₁ − s₂    s₃         s₄
          s₃        −s₁ − s₂    s₅
          s₄         s₅         2s₂ ).   (12.21)
[Fig. 12.8 panels: ³J(C′−C′), ³J(HN−Cβ), ³J(C′−Hα), ³J(HN−C′), ³J(C′−Cβ), ³J(HN−Hα); each panel plots J(φ) against φ from −180° to 180°]
Fig. 12.8 Estimated Karplus curves for six different scalar couplings involving the dihedral
angle φ. The couplings were measured for ubiquitin. All parameters (structures, Karplus coeffi-
cients, and errors) were estimated simultaneously using the replica algorithm. The gray curves are
sampled Karplus curves, the black line indicates the Karplus curve given in the literature (used in
the determination of 1D3Z). The coupling type is indicated in the title of the panels
d = sᵀ a(r)   (12.22)

where a(r) collects the corresponding quadratic monomials of the bond vector components
divided by r⁵, and r = (x, y, z)ᵀ. The elements of the alignment tensor thus enter the forward
model linearly. If we use a Gaussian error model the conditional posterior dis-
tribution of the alignment tensor will be a five dimensional Gaussian. We can
therefore estimate the alignment tensor simultaneously with the structure [271]
in very much the same way as we estimate the parameters of the Karplus curve.
This has advantages over standard approaches that estimate the eigenvalues of the
alignment tensor before the structure calculation and optimize only the relative
orientation. Figure 12.9 shows the posterior histograms of the five tensor elements
for two different alignments of Ubiquitin (data sets from PDB entry 1D3Z).
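Since the tensor elements enter the forward model linearly, their conditional posterior is available in closed form. The Python sketch below performs the corresponding Bayesian linear regression on synthetic data; the particular ordering of the quadratic monomials in a(r), the prior width, and the noise level are assumptions of this illustration and may differ in detail from the parameterization used in the chapter.

import numpy as np

def a_of_r(r):
    """Design-matrix row for one bond vector (one possible ordering of the monomials)."""
    x, y, z = r
    rn = np.linalg.norm(r)
    return np.array([x**2 - y**2, 3*z**2 - rn**2, 2*x*y, 2*x*z, 2*y*z]) / rn**5

rng = np.random.default_rng(7)
s_true = np.array([3.0, -1.5, 0.8, -0.4, 1.1]) * 1e-4       # "true" tensor elements
bonds = rng.normal(size=(60, 3))
bonds /= np.linalg.norm(bonds, axis=1, keepdims=True)        # unit bond vectors
A = np.array([a_of_r(r) for r in bonds])

sigma, tau = 2e-5, 1e-3                                      # assumed data error and prior width
d_obs = A @ s_true + sigma * rng.normal(size=A.shape[0])

precision = A.T @ A / sigma**2 + np.eye(5) / tau**2          # Gaussian posterior for s
cov = np.linalg.inv(precision)
mean = cov @ (A.T @ d_obs) / sigma**2

print("true s         :", s_true)
print("posterior mean :", np.round(mean, 6))
print("posterior s.d. :", np.round(np.sqrt(np.diag(cov)), 6))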
Fig. 12.9 Estimated alignment tensors for two different alignments of ubiquitin. The top row
shows the five tensor elements for the first alignment, the bottom row for the second alignment
Fig. 12.10 [axes: weight w_data (1–100) versus R factor and Cα-RMSD in Å] … the posterior distribution of the weights in a Bayesian analysis. The red curve shows the accuracy of the structures as monitored by the Cα-RMSD to the crystal structure
the crystal structure becomes minimal for these weights. That is, cross-validation
and Bayesian weighting provide very similar results and yield the most accurate
structures. The advantage of the Bayesian approach over cross-validation is that
it uses all the data, whereas cross-validation must set aside some test data. If the data are
intrinsically coupled the question arises whether to choose the test data randomly
or in a more sophisticated manner that takes the correlations into account [82].
Moreover, in case of sparse data the removal of test data may significantly affect and
eventually spoil the convergence of the optimization protocol. A Bayesian analysis
is extensible to multiple data sets without increasing the computational burden
significantly, whereas multi-dimensional cross-validation would become very time-
consuming.
The estimated error not only balances the various sources of structural information
but also provides a useful figure of merit. This is intuitively clear because the error
evaluates the quality of a data set, and high quality data provide more reliable
structures than low quality data. In the same way cross-validated “free” R-values
[79] or RMS values [82] are used to validate crystal or NMR structures. To illustrate
the use of σ for validation, let us discuss the structure determination of the Josephin
domain. At the time of analysis, two NMR structures [466, 542] of the Josephin
domain were available. The two structures agree in the structural core but differ
significantly in the conformation of a long helical hairpin that is potentially involved
in functional interactions. Basically, one structure showed an “open” hairpin (PDB
code 1YZB) whereas the alternative structure has a “closed” hairpin (PDB code
2AGA). A recalculation of the two controversial structures using ISD showed a
clear difference in their estimated errors [543]. Figure 12.11 shows the distribution
Fig. 12.11 Estimated error as figure of merit. Shown are posterior histograms of the estimated
error σ. The light gray histograms are results from ISD calculations on reference data sets (A:
Ubiquitin, B: Fyn-SH3 domain [601], C: BPTI [53], D: Tudor domain, E: HRDC domain). The
dark gray histograms are the distributions for the two NMR structures of the Josephin domain
of σ values for 1YZB and 2AGA in comparison with other NMR data sets. Several
conclusions can be drawn from this figure. First, the 2AGA structure with the closed
hairpin has a larger error than the reference data sets. This indicates that the 2AGA
data are of a lower quality than the other data sets. Moreover, the errors of the
1YZB data sets lie well within the region that is expected from the reference data.
The left peak of the 1YZB data set corresponds to unambiguously assigned NOE
data, whereas the right peak is the error distribution of the ambiguous distance
data. This is reasonable because the unambiguous data are more precise and have a
higher information content than the ambiguous data. These findings suggest that the
1YZB structure (“open”) is more reliable than the 2AGA structure (“closed”). This
is confirmed by complementary data that were not used in the structure calculation
[543]: Additional RDC measurements are better fitted by 1YZB, and small-angle
scattering curves are also more compatible with an open structure.
Bayesian inference has some distinct advantages over conventional structure deter-
mination approaches based on non-linear optimization. A probabilistic model can
be used to motivate and interpret the hybrid energy function. A probabilistic hybrid
energy (the negative log-posterior probability) comprises additional terms that
determine nuisance parameters that otherwise need to be set heuristically or by
cross-validation. A Bayesian approach requires that we generate random samples
from the joint posterior distribution of all unknown parameters. This can take
significantly more time (one or two orders of magnitude) than a structure calculation
by minimization, which is the major drawback of a Bayesian approach. Table 12.1 summarizes the comparison between the two approaches.
Table 12.1 Comparison of conventional NMR structure determination by hybrid energy mini-
mization and inferential structure determination
Acknowledgements ISD has been developed in close collaboration with Wolfgang Rieping and
Michael Nilges (Institut Pasteur, Paris). This work has been supported by Deutsche Forschungsge-
meinschaft (DFG) grant HA 5918/1-1 and the Max Planck Society.
Chapter 13
Bayesian Methods in SAXS and SANS
Structure Determination
Steen Hansen
13.1 Introduction
S. Hansen
Department of Basic Sciences and Environment, University of Copenhagen, Faculty of Life
Sciences, Thorvaldsensvej 40, DK-1871 FRB C, Copenhagen, Denmark
e-mail: [email protected]
Fig. 13.2 SAXS and SANS scattering lengths relevant for biological samples
flux available for X-rays, SAXS is more frequently used. Using high-flux SAXS
from e.g. a synchrotron it may be possible to follow processes involving structural
changes in real time.
The information content of experimental SAS data is usually relatively low. For
SAXS it is not unusual to be able to determine fewer than ten parameters from the
data, while SANS, which suffers from experimental smearing and lower counting statistics,
may offer only half this number. Therefore small angle scattering is frequently
used in combination with other experimental techniques for solving structural
problems.
The data analysis applied to small-angle scattering data has undergone significant
changes since the introduction of the technique. In the early days of small-angle
scattering only a few simple parameters characterizing the scatterer were deduced
from the experimental data, like e.g. the scattering-length-weighted radius of
gyration or the volume of the molecule. The computational advances of the 1970s
and 1980s made it possible to make simple trial and error models of the scatterer
using a few basic elements (spheres, cylinders etc.). Also, it allowed the estimation
of a one dimensional real space representation (a “pair distance distribution
function” [225]) of the scattering pattern by indirect Fourier transformation. The
real space representation of the scattering data facilitated the interpretation of the
measurements. In combination these direct and indirect procedures made it possible
to extract much more information about the scatterer than previously.
During the last decade estimation of three-dimensional structures from the one-
dimensional scattering pattern [699, 701] has proven to be a new powerful tool for
analysis of small-angle scattering data – especially for biological macromolecules.
However the difficulties associated with the assessment of the reliability of the
three dimensional estimates still complicate this approach. Much prior information
(or assumptions) has to be included in the estimation to obtain real space structures
which do not differ too much when applying the method repeatedly. Furthermore
the three dimensional estimation is frequently rather time consuming compared to
the more traditional methods of analysis.
A direct Fourier transform of the scattering data to obtain a real space representation
of the scattering data would be of limited use due to noise, smearing and truncation
of the data. An indirect Fourier transformation (IFT) also preserves the full
information content of the experimental data, but an IFT is an underdetermined
problem where several (and in this case often quite different) solutions may fit
the data adequately. Consequently some (regularization-)principle for choosing one
of the many possible solutions must be used if a single representation is wanted.
Various approaches to IFT in SAS have been suggested [278,282,520,698,700], but
the most frequently used principle is that of Glatter [225–227, 524], who imposed a
smoothness criterion upon the distribution to be estimated giving higher preference
to smoother solutions. This is in good agreement with the prior knowledge that
most SAS experiments have very low resolution. Consequently this method has
demonstrated its usefulness for analysis of SAS data for more than three decades.
For application of the original method of Glatter it was necessary to choose (1) a
number of basis functions (usually on the order of 20–40), (2) the overall noise level
as well as (3) the maximum dimension of the scatterer. Various ad hoc guidelines
for how to make these choices were provided in the original articles.
Due to improved computing facilities with regard to both hardware and software,
the restriction on the number of basis functions mentioned above is now redundant.
However, the choice of the overall noise level as well as of the maximum dimension
of the scatterer still hampers the applicability of Glatter's method: following the
guidelines on how to choose these hyperparameters, one is
often left with a relatively wide range of choices. Other methods for IFT in SAS
face similar problems with the determination of hyperparameters. As the method of
Glatter is by far the most frequently used for IFT in SAS the applicability of the
Bayesian method for selection of hyperparameters is demonstrated in the following
using Glatter’s method as an example.
Estimation of the noise level of the experiment may not be of great interest on its
own and consequently this parameter may often be integrated out using Bayesian
methods. However the maximum diameter of the scatterer is clearly an important
structural parameter, which has been estimated by various ad hoc methods (for
example, [526]).
Using a Bayesian approach a two-dimensional probability distribution for the
hyperparameter associated with the noise level and for the maximum diameter may
be calculated by probability theory. From this distribution it is possible to make
unique choices for these two hyperparameters. Also reliable error estimates for the
real space distribution to be estimated as well as for the hyperparameters may be
calculated from the probability distribution.
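To indicate what such a calculation could look like in practice, the Python sketch below treats the indirect Fourier transform as a linear Gaussian model with a smoothness prior and evaluates the log evidence on a small grid of the two hyperparameters: the prior weight α (playing the role of the noise-level hyperparameter) and the maximum diameter d_max. It is not Glatter's implementation, and the spherical test scatterer, grids, and noise level are all assumptions made for the illustration.

import numpy as np

rng = np.random.default_rng(8)

# Synthetic data: a homogeneous sphere of diameter 100 Angstrom
D_true = 100.0
q = np.linspace(0.01, 0.25, 60)                            # scattering vector (1/Angstrom)

def p_sphere(r, D):
    x = np.clip(r / D, 0.0, 1.0)
    return np.where(r <= D, r**2 * (1 - 1.5 * x + 0.5 * x**3), 0.0)

def kernel(q, r):
    dr = r[1] - r[0]
    return 4 * np.pi * np.sinc(np.outer(q, r) / np.pi) * dr   # sin(qr)/(qr) weights

r_fine = np.linspace(0.0, D_true, 101)
I_clean = kernel(q, r_fine) @ p_sphere(r_fine, D_true)
noise = 0.01 * I_clean.max()
I_obs = I_clean + noise * rng.normal(size=q.size)

def log_evidence(d_max, log_alpha, n_r=60):
    """Marginal likelihood of a smoothness-regularized linear model for given hyperparameters."""
    r = np.linspace(0.0, d_max, n_r)
    K = kernel(q, r)
    L = np.diff(np.eye(n_r), n=2, axis=0)                   # second-difference (smoothness) operator
    prior_prec = np.exp(log_alpha) * (L.T @ L) + 1e-8 * np.eye(n_r)
    C = noise**2 * np.eye(q.size) + K @ np.linalg.solve(prior_prec, K.T)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (I_obs @ np.linalg.solve(C, I_obs) + logdet + q.size * np.log(2 * np.pi))

d_grid = np.arange(60.0, 165.0, 10.0)
a_grid = np.linspace(-5.0, 10.0, 16)
Z = np.array([[log_evidence(d, a) for a in a_grid] for d in d_grid])
i, j = np.unravel_index(np.argmax(Z), Z.shape)
print("most probable d_max     :", d_grid[i], "Angstrom (simulated truth: 100)")
print("most probable log(alpha):", a_grid[j])

In the same spirit the noise level could be scanned or integrated out, and evidence values could be compared across different regularization methods, as described above.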
Using the Bayesian method the parameters and hyperparameters are all deter-
mined uniquely by probability theory and in principle the only choice left is which
method of regularization should be used. However the most likely regularization
method may also be found by integration over all the hyperparameters yielding the
posterior probability for each method.
The original method of Glatter treated measurements which were done at low
concentrations where inter particle effects could be neglected and where the
measured data only referred to intra (single) particle contributions to the scattering
intensity. This is useful as it is frequently possible to make measurements using very
diluted samples of just a few mg/ml. However this may not always be the case. For
many problems the assumption of a dilute solution does not hold. In some cases
the structure of interest is only to be found in non-dilute solutions. This means
that experiments have to be done with relatively concentrated samples and that
the subsequent analysis of the measured data has to take inter particle effects into
account.
For such non-dilute systems the “Generalized Indirect Fourier Transformation”
(GIFT) extension of IFT was introduced by Brunner-Popela and Glatter [84]. In
GIFT inter particle effects are taken into account by including a structure factor in
the calculations. The inclusion of the structure factor leads to a non linear set of
equations which has to be solved either iteratively or by Monte Carlo methods.
Using GIFT the interaction between the scatterers has to be specified by the user,
who has to choose a specific structure factor. On one hand this requires some extra
a priori information about the scattering system, but on the other hand the choice of
a structure factor allows estimation of relevant parameters describing the interaction
such as the charge of scatterers and their interaction radius. Further input parameters
may be needed such as temperature and dielectric constant of the solvent. The
estimation of parameters from the model may be useful (provided of course that the
chosen model is correct), but correlations between the parameters may also reduce
the advantages of the approach [203].
Using a Bayesian approach and expressing the function describing the (real
space) structure of the scatterer as a combination of an intra particle contribution and
an inter particle contribution with appropriate constraints, it is possible to separate
the contributions in real space leading to the form factor and the structure factor
in reciprocal space (assuming that the scatterers are not very elongated). In this
situation it is not necessary to specify a structure factor to be used for the indirect
transformation. Only a rough estimate of the shape of the scatterer is necessary and
this estimate may be made from the scattering profile itself. The downside of this
approach is that less detailed information may be deduced from the data. However
this is also the case for the original IFT-method which nonetheless has proved to
be a most useful supplement to direct model fitting for analysis of small-angle
scattering data.
13.2.1 Overview
Fig. 13.4 (a) Scattering intensities for homogeneous scatterers of different shapes. (b) Corre-
sponding distance distribution functions
where n is the (average) number density of the particles and V is the volume of one
particle.
The distance distribution function is related to the density–density correlation γ(r) of the scattering length density ρ(r) by

p(r) = r^2 \gamma(r) = r^2 \Big\langle \int_V \Delta\rho(\mathbf{r}')\,\Delta\rho(\mathbf{r} + \mathbf{r}')\, d\mathbf{r}' \Big\rangle \qquad (13.2)

where Δρ(r) is the scattering contrast, given by the difference in scattering density between the scatterer ρ_sc(r) and the solvent ρ_so, i.e. Δρ(r) = ρ_sc(r) − ρ_so, and ⟨ ⟩ denotes averaging over all orientations of the molecule.
Examples of scattering intensities for a few simple geometrical shapes and their corresponding distance distribution functions are shown in Fig. 13.4.
For uniform scattering density of the molecule the distance distribution function
is proportional to the probability distribution for the distance between two arbitrary
scattering points within the molecule.
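This proportionality is easy to check numerically. The following sketch (added here purely as an illustration; the function name, point count and sphere radius are arbitrary choices, not part of the original analysis) estimates p(r) for a homogeneous sphere by histogramming the distances between points sampled uniformly inside the sphere.

```python
import numpy as np

def pair_distance_histogram(radius=50.0, n_points=1000, n_bins=100, seed=0):
    """Monte Carlo estimate of p(r) for a homogeneous sphere: histogram of the
    distances between points drawn uniformly from the sphere interior."""
    rng = np.random.default_rng(seed)
    pts = np.empty((0, 3))
    while len(pts) < n_points:                              # rejection sampling in a cube
        cand = rng.uniform(-radius, radius, size=(2 * n_points, 3))
        pts = np.vstack([pts, cand[(cand**2).sum(axis=1) <= radius**2]])
    pts = pts[:n_points]
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :])**2).sum(axis=-1))
    d = d[np.triu_indices(n_points, k=1)]                   # unique pair distances
    hist, edges = np.histogram(d, bins=n_bins, range=(0.0, 2 * radius), density=True)
    return 0.5 * (edges[:-1] + edges[1:]), hist             # r grid, unnormalized p(r)

r, p_estimate = pair_distance_histogram()
```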
If the distance distribution is known, the Guinier radius Rg may be calculated from p(r) according to the formula [228]:

R_g^2 = \frac{\int p(r)\, r^2\, dr}{2 \int p(r)\, dr} \qquad (13.3)
It is seen that the Guinier radius is the scattering length density weighted radius of
gyration for the scatterer.
Also, from Eq. 13.1 the forward scattering I(0) is related to p(r) through

I(0) = 4\pi n V \int_0^{d} p(r)\, dr \qquad (13.4)

from which the volume V of the scatterer may be calculated when the contrast and the concentration of the sample are known.
If p(r) is not known, it may be possible to estimate Rg and I(0) from a plot of ln(I(q)) against q² using the Guinier approximation ln I(q) ≈ ln I(0) − q²Rg²/3. This approximation is generally valid for q ≲ 1/Rg, which is often a very small part of the scattering data (measurements close to zero angle are prevented by the beam stop, the purpose of which is to shield the sensitive detector from the direct beam).
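Equations 13.3 and 13.4 translate directly into numerical quadrature once p(r) has been tabulated. The short sketch below is illustrative only: the constant prefactor of Eq. 13.4 is lumped into a single scale factor, and the analytical p(r) of a homogeneous sphere is used as a test case.

```python
import numpy as np

def guinier_radius(r, p):
    """Eq. 13.3: Rg^2 = int p(r) r^2 dr / (2 * int p(r) dr)."""
    return np.sqrt(np.trapz(p * r**2, r) / (2.0 * np.trapz(p, r)))

def forward_scattering(r, p, scale=1.0):
    """Eq. 13.4 up to the constant prefactor 4*pi*n*V (lumped into 'scale')."""
    return scale * np.trapz(p, r)

# test case: p(r) of a homogeneous sphere of radius R (maximum diameter 2R)
R = 50.0
r = np.linspace(0.0, 2 * R, 500)
x = r / (2 * R)
p = r**2 * (1.0 - 1.5 * x + 0.5 * x**3)
print(guinier_radius(r, p))   # ~ sqrt(3/5)*R = 38.7 for a homogeneous sphere
```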
For non-uniform scattering density the distance distribution may have negative
regions (if the scattering density of some region of the scatterer is less than the
scattering density of the solvent).
Fig. 13.5 (a) Full line: γ1(r) for a sphere of diameter 100 Å. Dotted line: Corresponding γexcl(r). Dashed-dotted line: γstruct(r). (b) Full line: p1(r) for a sphere of diameter 100 Å. Dotted line: pexcl(r) for spheres of diameter 100 Å and volume fraction η = 0.1. Dashed-dotted line: pstruct(r) for spheres of diameter 100 Å and η = 0.1. Dashed line: Total distance distribution function p(r) according to Eq. 13.6 [280]
For a monodisperse solution pexcl(r) is due to the perturbation of the distribution of distances caused by the fact that the centers of two molecules cannot come closer than the minimum dimension of the molecules. At distances larger than twice the maximum dimension pexcl(r) = 0. The introduction of inter particle effects increases the integration limit of Eq. 13.1 from the maximum dimension d of the single molecule to the maximum length of the interaction (which may in principle be infinite). The first term on the right hand side of Eq. 13.6 determines the form factor P(q) when Fourier transformed according to Eq. 13.1, and the last two terms determine the structure factor S(q) specified below. Correspondingly the intensity in Eq. 13.1 can be divided into a part which is due to intra particle effects – the form factor P(q) – and a part which is due to the remaining inter particle effects – the structure factor S(q).
I(q) \propto S(q)\, P(q) \qquad (13.7)
For dilute solutions S(q) = 1 and the measured intensity is given by the form factor P(q). Equation 13.7 is valid for spherical monodisperse particles, but frequently it is assumed to hold true also for slightly elongated particles with some degree of polydispersity [202].
The structure factor can be written

S(q) = 1 + 4\pi n \int_0^{\infty} h(r)\, \frac{\sin(qr)}{qr}\, r^2\, dr \qquad (13.8)
where h(r) is the total correlation function [554], which is related to the radial distribution (or pair correlation) function g(r) [790] for the particles by

h(r) = g(r) - 1 \qquad (13.9)

For hard spheres g(r) = 0 for r < d, where d is the diameter of the sphere. For a system of monodisperse hard spheres the functional form of S(q) is known [572]. Hayter and Penfold [299] have given the analytical solution for S(q) for a system of particles interacting through a screened Coulomb potential.
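Equation 13.8, together with h(r) = g(r) − 1, can be evaluated by straightforward quadrature. The sketch below is only an illustration: it uses the crude dilute-limit approximation g(r) = 0 for r < d and g(r) = 1 otherwise, not the analytical hard-sphere or screened-Coulomb solutions cited above, and all numerical values are arbitrary.

```python
import numpy as np

def structure_factor(q, r, h, number_density):
    """Eq. 13.8: S(q) = 1 + 4*pi*n * int h(r) sin(qr)/(qr) r^2 dr."""
    kernel = np.sinc(np.outer(np.atleast_1d(q), r) / np.pi)   # sin(qr)/(qr)
    integrand = h * r**2 * kernel
    return 1.0 + 4.0 * np.pi * number_density * np.trapz(integrand, r, axis=1)

# dilute hard spheres of diameter d: h(r) = g(r) - 1 = -1 for r < d, 0 otherwise
d = 100.0                       # Angstrom
n = 2e-7                        # particles per cubic Angstrom (volume fraction ~0.1)
r = np.linspace(0.0, 5 * d, 2000)
h = np.where(r < d, -1.0, 0.0)
q = np.linspace(1e-3, 0.25, 200)
S = structure_factor(q, r, h, n)
```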
From Fig. 13.5b it is also seen that for spheres p1(r) and pstruct(r) have their support mainly in different regions of space. This means that if pexcl(r) is given, it may be possible to estimate p1(r) and pstruct(r) separately from experimental data. As the form of pexcl(r) depends only upon the geometry of the particle, it requires less information for an IFT than a complete determination of a structure factor S(q), which requires specification of the interaction between the particles.
From Eq. 13.1 for dilute scattering the distance distribution function p(r) may be approximated by p = (p_1, ..., p_N) and the measured intensity at a given q_i written as

I(q_i) = \sum_{j=1}^{N} T_{ij}\, p_j + e_i \qquad (13.10)
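Equation 13.10 is the discretized Fourier transform of Eq. 13.1 with added noise. A minimal sketch of how the matrix T and a noisy data vector might be set up is given below; the constant prefactor of Eq. 13.1 is lumped into T, and the noise model, grids and test distribution are illustrative assumptions.

```python
import numpy as np

def transform_matrix(q, r):
    """T[i, j] ~ 4*pi*dr*sin(q_i*r_j)/(q_i*r_j); overall prefactors lumped into T."""
    dr = r[1] - r[0]
    return 4.0 * np.pi * dr * np.sinc(np.outer(q, r) / np.pi)

rng = np.random.default_rng(1)
d_max = 100.0
r = np.linspace(0.0, d_max, 101)
q = np.linspace(0.005, 0.25, 80)
T = transform_matrix(q, r)

x = r / d_max                                            # test case: sphere of diameter d_max
p_true = r**2 * np.clip(1.0 - 1.5 * x + 0.5 * x**3, 0.0, None)
sigma = np.full(q.size, 0.01 * (T @ p_true).max())       # constant noise level
I_obs = T @ p_true + rng.normal(0.0, sigma)              # one realization of Eq. 13.10
```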
13.3.1 Regularization
The estimation of p from the noisy scattering data is an underdetermined and ill-posed problem. To select a unique solution among the many which may fit the data adequately, regularization by the method of Tikhonov and Arsenin [717] may be used. Tikhonov and Arsenin estimated a distribution p = (p_1, ..., p_N) by minimizing a new functional written as a weighted sum of the chi-square χ² and a regularization functional K:

\alpha K(\mathbf{p}, \mathbf{m}, \mathbf{p}') + \chi^2 \qquad (13.12)

where the prime indicates a derivative (first and/or higher) of p. The first term minimizes the deviation of p from the prior estimate m = (m_1, ..., m_N) with respect to a given norm and the second term imposes a smoothness constraint on the distribution to be estimated.
The χ² is defined in the conventional manner, i.e.

\chi^2 = \sum_{i=1}^{M} \frac{\big(I_m(q_i) - I(q_i)\big)^2}{\sigma_i^2} \qquad (13.14)

and the smoothness constraint K may be written

K = \sum_{j=2}^{N-1} \left( \frac{p_{j-1} + p_{j+1}}{2} - p_j \right)^2 + \frac{1}{2} p_1^2 + \frac{1}{2} p_N^2 \qquad (13.15)
plateau does exist, the point on the plateau which should be selected is not uniquely
determined as the plateaus may be relatively wide.
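For the quadratic smoothness constraint of Eq. 13.15 and the Gaussian χ² of Eq. 13.14, minimizing Eq. 13.12 at fixed α reduces to a linear system. The sketch below illustrates this; it omits the deviation-from-prior part of Eq. 13.12 and is not the implementation behind the results shown later in the chapter.

```python
import numpy as np

def smoothness_operator(N):
    """Matrix C with ||C p||^2 equal to the smoothness constraint K of Eq. 13.15."""
    C = np.zeros((N, N))
    for j in range(1, N - 1):                    # rows: 0.5*p[j-1] - p[j] + 0.5*p[j+1]
        C[j, j - 1], C[j, j], C[j, j + 1] = 0.5, -1.0, 0.5
    C[0, 0] = C[-1, -1] = 1.0 / np.sqrt(2.0)     # end-point terms p_1^2/2 and p_N^2/2
    return C

def regularized_solution(T, I_obs, sigma, alpha):
    """Minimize alpha*K(p) + chi^2 (Eq. 13.12) for fixed alpha.
    Both terms are quadratic in p, so the minimum solves a linear system."""
    W = np.diag(1.0 / sigma**2)
    C = smoothness_operator(T.shape[1])
    lhs = alpha * (C.T @ C) + T.T @ W @ T
    rhs = T.T @ W @ I_obs
    return np.linalg.solve(lhs, rhs)

# usage, with T, I_obs and sigma set up as in the previous sketch:
# p_est = regularized_solution(T, I_obs, sigma, alpha=1e-3)
```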
Using only the first term in Eq. 13.12 will give a regularization similar to that of the maximum entropy method. The norm is then to be replaced by the Shannon entropy, which measures the distance between two distributions f and m [407, 646].
For regularization by the maximum entropy method [668] the constraint

K = \int \big[ p(r) \ln\big(p(r)/m(r)\big) - p(r) + m(r) \big]\, dr \qquad (13.16)

is used, which in discretized form reads

K = \sum_{j=1}^{N} \big[ p_j \ln(p_j/m_j) - p_j + m_j \big] \qquad (13.17)
Expanding the entropy to second order around the prior m gives

K \approx \sum_{j=1}^{N} \frac{(p_j - m_j)^2}{2 m_j} \qquad (13.18)
From this equation it can be seen that using a prior m_j = (p_{j+1} + p_{j-1})/2, the maximum entropy constraint corresponds to the smoothness constraint of Eq. 13.15 in a new metric defined by the denominator 2m_j in Eq. 13.18. Using this metric will combine the positivity constraint of Eq. 13.17 with the smoothness constraint of Eq. 13.15.
For non-dilute solutions p in Eq. 13.1 may be replaced by the sum of p1, pexcl and pstruct as given by Eq. 13.6. In the examples shown in Sect. 13.4.4, p1 was regularized using Eq. 13.18 with m_j = (p_{j+1} + p_{j-1})/2 as mentioned above, while the conventional constraint of Eq. 13.15 was used for pstruct.
For the shape of pexcl an ellipsoid of revolution may be used. This only requires one extra parameter – the axial ratio of the ellipsoid – because the maximum diameter is known from p1. The axial ratio may be estimated from the shape of p1, or alternatively it can enter the estimation as an additional hyperparameter. Furthermore, for non-dilute solutions it can be assumed that pstruct ≈ 0 for r < 0.5d (Fig. 13.5), which holds true for scatterers which do not have large axial ratios.
Also, as S(q) → 1 for q → ∞, Eq. 13.7 gives:

\mathrm{FT}[p_1(r) + p_{excl}(r) + p_{struct}(r)] \rightarrow \mathrm{FT}[p_1(r)] \quad \text{for } q \rightarrow \infty \qquad (13.20)

where FT denotes the Fourier transform of Eq. 13.1. Consequently it must hold that the inter particle contributions FT[pexcl(r) + pstruct(r)] vanish for large q.
For selection among different models (hypotheses) H_i, Bayes' theorem gives the posterior probability p(H_i|D) = p(D|H_i) p(H_i)/p(D), where p(D|H_i) is the probability of the data D assuming that the model H_i is correct, p(H_i) denotes the prior probability for the model H_i, which is assumed constant for "reasonable" hypotheses (i.e. different hypotheses should not be ascribed different prior probabilities), and p(D) is the probability for measuring the data, which amounts to a renormalization constant after the data have been measured.
The evidence p(D|H_i) for a hypothesis can be found by integrating over the parameters p = (p_1, ..., p_N) in the model H_i, which can be written:

p(H_i|D) \propto p(D|H_i) = \int p(D, \mathbf{p}|H_i)\, d^N p \qquad (13.23)

where N is the number of parameters in the model, and d is the maximum diameter of the sample. Again using Bayes' theorem

p(D|H_i) = \int p(D|H_i, \mathbf{p})\, p(\mathbf{p}|H_i)\, d^N p \qquad (13.24)
The likelihood of the data may be written

p(D|H_i, \mathbf{p}) = \exp(-L)/Z_L \qquad (13.25)

with M being the number of data points. For the usual case of Gaussian errors L = χ²/2 and Z_L = \prod_{i=1}^{M} (2\pi\sigma_i^2)^{1/2}, where σ_i is the standard deviation of the Gaussian noise at data point i.
Correspondingly it is now assumed that the prior probability for the distribution p can be expressed through some functional K (to be chosen) and written

p(\mathbf{p}|H_i) = \exp(-\alpha K)/Z_K \qquad (13.27)

By this expression the model H_i is the hypothesis that the prior probability for the distribution of interest p can be written as above with some functional K and a parameter α which determines the "strength" of the prior (through K) relative to the data (through χ²). Both the functional form of K and the value of the parameter α are then part of the hypothesis and are subsequently to be determined.
Inserting Eqs. 13.25 and 13.27 in Eq. 13.24 and writing Q = −αK − χ²/2, the evidence is given by

p(D|H_i) = \frac{\int \exp(-\alpha K)\, \exp(-\chi^2/2)\, d^N p}{Z_K Z_L} = \frac{\int \exp(Q)\, d^N p}{Z_K Z_L} \qquad (13.29)
Using Gaussian approximations for the integrals and expanding Q around its maximum in p, with A = ∇∇K and B = ∇∇χ²/2 evaluated at the maximum value Q(p₀) = Q₀ where ∇Q = 0, the evidence can be evaluated analytically. Here K_max denotes the maximum of the functional K. Usually K_max = 0, which will be assumed in the following – otherwise only a renormalizing constant is left. Furthermore, the term ∏_i (2πσ_i²)^{1/2} from the experimental error is redundant for comparison of different hypotheses and is left out in the following. The probability for different hypotheses, each being equally probable a priori, can then be calculated from the expression
\log p(D|H_i) = \frac{1}{2}\log\det(A) - \alpha K_0 - \chi_0^2/2 - \frac{1}{2}\log\det(A + \alpha^{-1}B) \qquad (13.32)
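Once the optimal p for a given α has been found, Eq. 13.32 only requires the two Hessians and the functional values at that optimum. A minimal sketch (assuming K_max = 0 as in the text; A, B, K₀ and χ₀² are supplied by the caller):

```python
import numpy as np

def log_evidence(alpha, A, B, K0, chi2_0):
    """Gaussian-approximation log evidence of Eq. 13.32.
    A = Hessian of K, B = Hessian of chi^2/2, evaluated at the optimal p;
    K0 and chi2_0 are the values of K and chi^2 at that optimum."""
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_AB = np.linalg.slogdet(A + B / alpha)
    return 0.5 * logdet_A - alpha * K0 - 0.5 * chi2_0 - 0.5 * logdet_AB

# for the quadratic smoothness constraint K = ||C p||^2 and Gaussian noise:
#   A = 2 * C.T @ C                        (Hessian of K)
#   B = T.T @ np.diag(1/sigma**2) @ T      (Hessian of chi^2/2)
```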
In the previous, the hyperparameters were implicitly included in the models or hypotheses. Now writing H_i for the model without the hyperparameters α and d, we again obtain from Bayes' theorem that the posterior probability is determined by

p(\alpha, d\,|\,D, H_i) \propto p(D\,|\,H_i, \alpha, d)\, p(\alpha)\, p(d) \qquad (13.33)

where p(α) is the prior probability for α and p(d) is the prior probability for d. Assuming α to be a scale parameter [340] gives p(α) = α⁻¹, which should be used for the prior probability of α. For a parameter d with relatively narrow a priori limits, the prior probability should be uniform within the allowed interval.
The Lagrange multiplier α, the maximum diameter d of the scatterer and – for the case of non-dilute solutions – the volume fraction η are all hyperparameters which can be estimated from their posterior probability for a set (α, d, η) after data have been measured. This probability is calculated using Gaussian approximations around the optimal estimate p_opt for a given set of hyperparameters and integrating over all solutions p for this particular set of hyperparameters [261, 456]. Using the regularization from Eq. 13.15, writing A = ∇∇K and B = ∇∇χ²/2, the probability of a set of hyperparameters (α, d, η) may be written [278]:

p(\alpha, d, \eta) \propto \frac{\exp(-\alpha K - \chi^2/2)}{\det^{1/2}(A + \alpha^{-1} B)} \qquad (13.34)

In Eq. 13.34 both matrices as well as (−αK − χ²/2) have to be evaluated at the point p where exp(−αK − χ²/2) takes its maximum value.
Using Eq. 13.34 the most likely value for each hyperparameter can be found from the optimum of the probability distribution, and an error estimate for the hyperparameters can be provided from the width of the distribution.
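In practice the posterior of Eq. 13.34 can be mapped by scanning a grid of hyperparameter values, solving for the optimal p at each grid point and evaluating the (unnormalized) probability there. The sketch below does this for (α, d) in the dilute case with the quadratic smoothness constraint (η omitted); the grid ranges, prior handling and discretization are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def log_hyper_posterior(alpha, d, q, I_obs, sigma, n_r=60):
    """Log of Eq. 13.34 (up to an additive constant) for one (alpha, d) pair;
    dilute case with the quadratic smoothness constraint of Eq. 13.15."""
    r = np.linspace(0.0, d, n_r)
    T = 4.0 * np.pi * (r[1] - r[0]) * np.sinc(np.outer(q, r) / np.pi)
    C = 0.5 * np.eye(n_r, k=-1) - np.eye(n_r) + 0.5 * np.eye(n_r, k=1)
    C[0, :] = C[-1, :] = 0.0
    C[0, 0] = C[-1, -1] = 1.0 / np.sqrt(2.0)                 # reproduces Eq. 13.15
    W = np.diag(1.0 / sigma**2)
    A = 2.0 * (C.T @ C)                                      # Hessian of K
    B = T.T @ W @ T                                          # Hessian of chi^2/2
    p = np.linalg.solve(alpha * A + B, T.T @ W @ I_obs)      # maximizes exp(-alpha*K - chi2/2)
    K = float(p @ (C.T @ C) @ p)
    chi2 = float(np.sum(((I_obs - T @ p) / sigma) ** 2))
    _, logdet = np.linalg.slogdet(A + B / alpha)
    return -alpha * K - 0.5 * chi2 - 0.5 * logdet

# grid scan; the scale-invariant prior p(alpha) ~ 1/alpha enters as -log(alpha):
# logw = [[log_hyper_posterior(a, d, q, I_obs, sigma) - np.log(a)
#          for d in np.linspace(60.0, 110.0, 26)] for a in np.logspace(-6, -3, 25)]
```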
As the Bayesian framework ascribes a probability to each calculated solution p, an error estimate for the (average) distribution of interest is provided by the spread of the solutions weighted by their posterior probabilities.
The first simulated example shown in Fig. 13.6a was taken from May and
Nowotny [281, 282, 490]. The original distance distribution function for the
simulated scatterer is shown in Fig. 13.6b.
In Fig. 13.6c is shown the conventional plot used to find the “point of inflexion”,
which is used to determine the noise level of the experiment using Glatter’s method.
Fig. 13.6 (a) Simulated data and fit to data. Insert shows the structure of the scatterer (eight spheres). (b) Distance distribution functions corresponding to the data in (a). Full line: Original distribution. Long dashes: Calculation using maximum entropy. Dotted: Calculation using smoothness regularization. (c) The χ² (dashed) and K (full line), rescaled for clarity, calculated as a function of the Lagrange multiplier α using the correct maximum diameter d = 78 nm. (d) The forward scattering I(0) calculated as a function of the maximum length d of the scatterer using the correct noise level [278]
The curve showing χ² as a function of α has a very large plateau for decreasing α. The correct value is χ² = 1.0. The curve showing K has a negative slope for all α's, leaving a range of possible α's to be chosen for the "best" solution.
In Fig. 13.6d is shown the "forward scattering" I(0) plotted as a function of the maximum diameter, calculated assuming the correct noise level to be known a priori. This curve is to be used for estimation of the maximum diameter d. As for the choice of overall noise level, it is not clear which diameter should be chosen.
Using the Bayesian method for selection of α and d, the posterior probability p(D, α, d|H_i) shown in Fig. 13.7 displays a clear maximum, making it simple to select the most likely set of hyperparameters (α, d).
Associated with each point on the two-dimensional surface p(α, d) in Fig. 13.7 is the distribution p which has the largest posterior probability for the given (α, d). From these distributions with their associated probabilities an average value for p as well as a standard deviation for each p_i can be calculated.
In spite of the relatively large variation in diameters and noise levels used for calculation of the average p, the posterior probabilities ensure that unlikely values for the hyperparameters are ascribed relatively small weights, making the "total" or average estimate of the distribution p – or representation of p(r) – well behaved.
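The averaging described above amounts to a probability-weighted mean and standard deviation over the family of solutions. A minimal sketch (assuming the individual solutions have already been interpolated onto a common r grid, with zeros beyond their respective maximum diameters):

```python
import numpy as np

def weighted_average_distribution(solutions, weights):
    """Posterior-weighted mean and per-point standard deviation of p(r).
    solutions: array (n_hyper, n_r), the optimal p for each hyperparameter set,
               all interpolated onto a common r grid.
    weights:   the corresponding (possibly unnormalized) posterior probabilities."""
    P = np.asarray(solutions, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = w @ P
    std = np.sqrt(w @ (P - mean) ** 2)
    return mean, std
```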
In connection with the determination of the hyperparameters from the posterior probability distribution it should be noted that the exact determination of the maximum diameter of the scatterer may be a rather difficult task. The distance distribution function p(r) usually has a smooth transition to zero around the maximum diameter of the particle. This means that the maximum diameter may have to be estimated from a relatively small proportion of the total p(r). E.g. for a prolate ellipsoid of revolution with semi-axes (45, 55) Å, less than 1% of the total area of p1 is found in the interval [90; 110] Å and less than 0.1% in the interval [100; 110] Å. This makes a reliable estimate of d very difficult in these cases, as a truncation of the tail of p(r) around d will give a better value for the regularization constraint without significant implications for the Fourier transformation of p(r)
Fig. 13.7 (a) Evidence for α and d calculated from the simulated data shown in Fig. 13.6 using smoothness regularization. (b) Evidence for α and d calculated from the simulated data shown in Fig. 13.6 using maximum entropy with a spherical prior [278]
and the corresponding quality of the fit of the data. To improve the estimates in such cases an additional smoothness constraint on p(r) could be implemented in the region around the maximum diameter. Alternatively the maximum diameter may be estimated from the chord length distribution of the scatterer, which may also be estimated by IFT [279]. The chord length distribution does not suffer from a large contribution of inner point distances in the scatterer, which makes it easier to estimate the maximum dimension.
Experimental SANS data from measurements on casein micelles and the estimated distance distribution function are shown in Fig. 13.8a. The posterior distribution for α and d in Fig. 13.8b was calculated using a smoothness constraint and has a clear and well defined maximum. The average distance distribution in Fig. 13.8a appears to be free from artifacts although the data are fitted quite closely. The maximum dimension calculated from Fig. 13.8b is in good agreement with that calculated from a model fit. The error bars on p(r) have been calculated from the posterior probability distributions as described above.
By comparison of the different methods for selection of hyperparameters it becomes evident that the Bayesian method provides clearer and more reliable answers. E.g. using Glatter's original ad hoc method for the determination of the noise level and the maximum diameter of the scatterer will leave the user with a subjective choice within a large interval of values.
Furthermore it is not necessary to restrict the analysis to using only the most likely distance distribution, but all solutions may be used with their associated probabilities. This means that deduced parameters of interest like
Fig. 13.8 Bayesian estimation of hyperparameters for experimental data from SANS on casein. (a) Estimated distance distribution function. Insert shows data and fit to data. (b) Evidence for α and d [278]
the maximum diameter or the Guinier radius (radius of gyration) can be found by
integration over all distributions weighted by their probabilities.
For testing the sensitivity of the Bayesian estimation of d, scattering data from mixtures of two different scatterers were simulated.
For comparison a simulation of scattering data from a monodisperse sample
of spheres of diameter 20 nm is shown in Fig. 13.9 with the result of a Bayesian
estimation of the distance distribution function and associated hyperparameters.
In Fig. 13.10a is shown the simulated scattering data from various mixtures of two spheres. The diameters of the spheres were 18 and 23 nm respectively, and the ratio I2(0)/I1(0) – index 2 referring to the larger sphere – was varied in the interval [0.01; 0.20]. For I2(0)/I1(0) = 0.20 the estimated distance distribution function p(r) is shown in Fig. 13.10b. For a lower fraction I2(0)/I1(0) = 0.04 the evidence p(α, d) is shown in Fig. 13.10c. Finally, in Fig. 13.10d the evidence for the maximum diameter d is shown for all six different ratios of I2(0)/I1(0) as indicated in the figure.
Fig. 13.9 Estimation of p(r) from simulated scattering data from a sphere of diameter 20 nm. (a) Error bars: Simulated scattering intensity. Full line: Fit. (b) Estimated distance distribution function with error bars. (c) Evidence p(α, d) scaled to 1. (d) p(d) corresponding to (c) [740]
Fig. 13.10 Estimation of p(r) from simulated scattering data from spheres of diameter 18 and 23 nm in different ratios as described in the text. (a) Error bars: Simulated scattering intensity. Full line: Fit. (b) Estimated distance distribution function with error bars from data having 20% large spheres. (c) Evidence p(α, d) (scaled to 1) from data having 4% large spheres. (d) p(d) from scattering data of spheres in different ratios (percentage of larger sphere indicated near the respective peaks) [740]
Fig. 13.11 Estimation of p(r) from simulated scattering data from ellipsoids of diameters (10,10,18) and (13,13,23) nm in different ratios as described in the text. (a) Error bars: Simulated scattering intensity from mixture 2:1. Full line: Fit. (b) Estimated distance distribution function from data shown in (a). (c) Evidence p(d) from data having qmin = 0.10 nm⁻¹. Full line: Ratio 1.5:1. Dashed line: Ratio 1.7:1. Dotted line: Ratio 1.9:1. (d) Evidence p(d) from data having qmin = 0.15 nm⁻¹. Full line: Ratio 2:1. Dashed line: Ratio 2.5:1. Dotted line: Ratio 3:1 [740]
Figure 13.12 shows experimental SAXS data [740] from protein concentrations
between 2 and 8 mg/ml. One protein sample consists of molecules with an expected
maximum diameter d of 18 nm, the other consists of a mixture of molecules with a
d of 18 and 23 nm respectively.
The influence of the measured q-range is similar to the case for the simulated
data. For the experimental data the two species present in the solution have different
overall shapes, thus representing a more complex mixture than two different bodies
of similar geometry but with varying radii (Fig. 13.12c). Again it is apparent from
Fig. 13.12d that the relative heights of the two main peaks in p(d) are influenced by the truncation at qmin. Therefore, unless the geometry of the scatterers is known well a priori, the presence of two peaks in the evidence p(d) should only be used as an indication of polydispersity or degradation of the sample.
Fig. 13.12 Estimation of p(r) for experimental scattering data. (a) Points: Merged scattering data resulting from measurements at 2, 4 and 6 mg/ml. Upper curve: Mixture of large and small scatterer respectively. Lower curve: Sample of small scatterer. Full lines: Corresponding fits. (b) Estimated distance distribution functions from data shown in (a). Upper and lower curve as in (a). (c) Evidence p(d) from data having qmin = 0.15 nm⁻¹. Full line: Corresponding to lower curve in (a). Dotted line: Corresponding to upper curve in (a). (d) Evidence p(d) from mixture 1:1 of data sets shown in (a). Full line: Estimate using q-interval [0.15; 2] nm⁻¹. Dotted line: Estimate using q-interval [0.2; 0.3] nm⁻¹ [740]
From the simulated and experimental examples given here it is evident that
the probability distribution for the maximum diameter d of the scatterer which is
provided by the Bayesian approach enables additional information to be extracted
from the scattering data compared to the conventional ad hoc procedures for
selecting d .
deduced from the corresponding ratio for homogeneous spheres (where the excluded volume is 2³ = 8 times that of the sphere).
Fig. 13.14 (a) Error bars: Simulated data points from spheres of radius 50 Å, volume fraction 0.2 and noise 1%. Full line: Fit of data. Dotted line: Estimated P(q). Insert shows the intensity on a linear scale for q < 0.1 Å⁻¹. (b) Error bars: p1(r); the original p1(r) is not discernible from the estimate. Dotted line: pexcl(r). Dashed-dotted line: pstruct(r) [280]
Fig. 13.15 Posterior probabilities for hyperparameters from Fig. 13.14. (a) Probability for the Lagrange multiplier α. (b) Probability for the maximum diameter d. (c) Probability for the volume fraction η [280]
The results of SANS experiments using SDS at three different ionic strengths are shown in Fig. 13.16 [14]. The corresponding estimates of p1, pexcl and pstruct, using the resolution function for the specific experimental setup [570], are also shown in Fig. 13.16.
Fig. 13.16 SDS in 20 mM NaCl. (a) Error bars: Experimental data. Full line: Fit of data. Dotted line: FT[p1(r)]. (b) Full line with error bars: p1(r). Dotted line with error bars: pexcl(r). Dashed-dotted line with error bars: pstruct(r). SDS in 50 mM NaCl. (c) Error bars: Experimental data. Full line: Fit of data. Dotted line: FT[p1(r)]. (d) Full line with error bars: p1(r). Dotted line with error bars: pexcl(r). Dashed-dotted line with error bars: pstruct(r). SDS in 250 mM NaCl. (e) Error bars: Experimental data. Full line: Fit of data. Dotted line: FT[p1(r)]. (f) Full line with error bars: p1(r). Dotted line with error bars: pexcl(r). Dashed-dotted line with error bars: pstruct(r) [280]
The structure factors calculated from the estimates of p1, pexcl and pstruct are shown in Fig. 13.17 for all three examples.
The distance distribution functions p1 in Fig. 13.16 all show spherical structures with a diameter of about 5 nm. This is to be expected, as the experiments were all done well above the critical micelle concentration for SDS. The small tails around the maximum diameter may indicate a small amount of polydispersity in the solutions, which is also expected. The estimated volume fractions are consistent with the initial concentration of SDS and the presence of water molecules in the micelles (more water molecules are expected to be associated with SDS at low ionic strengths). Due to the low volume fractions the corresponding error estimates become relatively large. With decreasing ionic strength a negative region in pstruct grows, indicating the reduced density of micelles at this distance. The reduced density is caused by the repulsion between the charged head groups of the SDS molecules at the surface of the micelles.
Figure 13.16 indicates an additional advantage of the free-form estimation. Interpretation of data in reciprocal space is usually more difficult than the corresponding representation in real space, which is one of the reasons that p(r) is usually preferred to I(q). In the approach suggested here S(q) is represented by the real space distributions pexcl(r) and pstruct(r), which may allow interaction effects to be interpreted directly from the shape of pstruct(r).
The analysis of the simulated and the experimental data shows that by applying Bayesian methods it is possible to obtain simultaneous estimates of p1, pexcl and pstruct, as well as estimates for the posterior probability distributions of the hyperparameters α, d and η.
Fig. 13.17 Calculated structure factors S(q) for the SDS experiments shown in Fig. 13.16. Full line: 20 mM NaCl. Dashed-dotted line: 50 mM NaCl. Dotted line: 250 mM NaCl [280]
For many purposes it may be useful to be able to quantify the information content of a given set of experimental data.
The sampling theorem [646] states that a continuous scattering curve I(q) from an object of maximum diameter d is fully represented by its values in a set of points (Shannon channels) at q_n = nπ/d, where n = 1, 2, ....
The number of Shannon channels Ns necessary to represent the intensity I(q) in the interval [qmin; qmax] is therefore given by

N_s = (q_{max} - q_{min})\, d/\pi \qquad (13.37)
As a more realistic measure of the information content in the data, the "number of good parameters" Ng has been suggested, using regularization by the maximum entropy method [261, 525]:

N_g = \sum_{j=1}^{N} \frac{\lambda_j}{\alpha + \lambda_j} \qquad (13.38)
Here λ_j are the eigenvalues of B and α is the Lagrange multiplier of Eq. 13.12. By this equation Ng "counts" the number of eigenvalues which are large compared to the Lagrange multiplier α, balancing the information in the data (eigenvalues of B) against the weight of the regularizing functional or prior (eigenvalues of αA). For entropy regularization A = I, where I is the identity matrix. Hence Eq. 13.38 gives the number of directions in parameter space which are determined well for the given noise level. Expressing the information content of the experimental data through Eq. 13.38 removes ambiguities due to the choice of qmax, as very noisy data do not contribute to Ng.
Overfitting the data will reduce the Lagrange multiplier α and shift the eigenvalues of αA + B towards those of B, increasing Ng above the number of Shannon channels Ns.
Underfitting the data to a higher chi-square will increase the Lagrange multiplier α. This leads to a lower value of Ng calculated from Eq. 13.38 (consistent with a stronger correlation of the Shannon channels and a reduction of the information in the experimental data).
For the general case the denominator of Eq. 13.38 has to be replaced by the eigenvalues of the matrix αA + B (see e.g. [456]).
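Both measures are simple to evaluate once the Hessian B is available. The sketch below covers the entropy-regularization case A = I (the general case would use the eigenvalues mentioned above); the function names are illustrative.

```python
import numpy as np

def shannon_channels(q_min, q_max, d):
    """Eq. 13.37: number of Shannon channels in [q_min, q_max], channel spacing pi/d."""
    return (q_max - q_min) * d / np.pi

def number_of_good_parameters(B, alpha):
    """Eq. 13.38 with A = I: Ng = sum_j lambda_j / (alpha + lambda_j),
    where lambda_j are the eigenvalues of B (the Hessian of chi^2/2)."""
    lam = np.clip(np.linalg.eigvalsh(B), 0.0, None)   # guard against round-off
    return float(np.sum(lam / (alpha + lam)))
```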
Using the eigenvalues of B for experimental design and analysis of reflectivity
data has been suggested by Sivia and Webster [667].
13.5.3 Estimation of Ng
Deducing Ng good parameters from the data corresponds to reducing the number of degrees of freedom for the χ² by Ng, which is similar to the conventional reduction of the number of degrees of freedom for the χ² by the number of fitting parameters. Fitting the "true" information in the data invariably leads to fitting of some of the noise as well [456]. Writing the reduced chi-square χ²_r for M data points leads to

\chi^2_r = \frac{M}{M + N_g} \qquad (13.39)
In addition to fitting some of the noise, the estimate of p(r) may extend beyond the true maximum dimension d of the scatterer (providing information about a larger region of direct space), which may also make Ng exceed Ns. For the simulated examples below these effects are seen for the lowest noise levels.
For comparing Ns and Ng calculated by Eqs. 13.37 and 13.38 respectively, simulated scattering data from a sphere of diameter 20 nm were used (Fig. 13.9). The data were simulated in the q-interval [0; 1.5] nm⁻¹ using M = 100 points. Noise with σ_i = aI(0) + bI(q_i) was added, using various values for a and b (the absolute and relative noise levels, respectively). Furthermore, the data were truncated at various values of qmin and qmax.
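The noise model used in these simulations mixes an absolute and a relative contribution. A short sketch of how such noise might be added to a simulated curve (the default values of a and b are arbitrary):

```python
import numpy as np

def add_noise(I, I0, a=0.001, b=0.01, seed=0):
    """Add Gaussian noise with sigma_i = a*I(0) + b*I(q_i); 'a' sets the absolute
    and 'b' the relative noise level."""
    rng = np.random.default_rng(seed)
    sigma = a * I0 + b * np.asarray(I, dtype=float)
    return I + rng.normal(0.0, sigma), sigma
```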
13.6 Conclusion
It has been demonstrated that a Bayesian approach to IFT in SAS offers several
advantages compared to the conventional methods:
1. It is possible to estimate the hyperparameters (e.g. α, d, η) relevant for IFT from
the basic axioms of probability theory instead of using ad hoc criteria.
2. Error estimates are provided for the hyperparameters.
3. The posterior probability distribution for d may indicate two different sizes when
this is relevant.
4. It is possible to separate intra particle and inter particle effects for non-dilute
solution scattering experiments.
5. The information content of the experimental data may be quantified.
For future applications, Bayesian methods may be able to improve upon the
recently developed methods for three-dimensional structure estimation from SAS
data (as mentioned in the introduction) by offering a transparent and consistent way
of handling the many new constraints as well as their interrelations.
References
1. Abseher, R., Horstink, L., Hilbers, C.W., Nilges, M.: Essential spaces defined by NMR
structure ensembles and molecular dynamics simulation show significant overlap. Proteins
31, 370–382 (1998)
2. Akaike, H.: New look at statistical-model identification. IEEE Trans. Automat. Contr. AC19,
716–723 (1974)
3. Allen, B.D., Mayo, S.L.: Dramatic performance enhancements for the FASTER optimization
algorithm. J. Comp. Chem. 27, 1071–1075 (2006)
4. Allen, B.D., Mayo, S.L.: An efficient algorithm for multistate protein design based on
FASTER. J. Comput. Chem. 31(5), 904–916 (2009)
5. Allen, M., Tildesley, D.: Computer Simulation of Liquids. Clarendon Press, New York (1999)
6. Amatori, A., Tiana, G., Sutto, L., Ferkinghoff-Borg, J., Trovato, A., Broglia, R.A.: Design of
amino acid sequences to fold into C˛ -model proteins. J. Chem. Phys. 123, 054904 (2005)
7. Amatori, A., Ferkinghoff-Borg, J., Tiana, G., Broglia, R.A.: Thermodynamic features charac-
terizing good and bad folding sequences obtained using a simplified off-lattice protein model.
Phys. Rev. E 73, 061905 (2006)
8. Amatori, A., Tiana, G., Ferkinghoff-Borg, J., Broglia, R.A.: Denatured state is critical
in determining the properties of model proteins designed on different folds. Proteins 70,
1047–1055 (2007)
9. Ambroggio, X.I., Kuhlman, B.: Computational design of a single amino acid sequence that
can switch between two distinct protein folds. J. Am. Chem. Soc. 128, 1154–1161 (2006)
10. Ambroggio, X.I., Kuhlman, B.: Design of protein conformational switches. Curr. Opin.
Struct. Biol. 16, 525–530 (2006)
11. Anfinsen, C.: Principles that govern the folding of protein chains. Science 181, 223–230
(1973)
12. Anfinsen, C.B., Haber, E., Sela, M., White, F.H.: The kinetics of formation of native
ribonuclease during oxidation of the reduced polypeptide chain. Proc. Natl. Acad. Sci. U.S.A.
47, 1309–1314 (1961)
13. Appignanesi, G.A., Fernández, A.: Cooperativity along kinetic pathways in RNA folding.
J. Phys. A 29, 6265–6280 (1996)
14. Arleth, L.: Unpublished results. (2004)
15. Artymiuk, P., Poirrette, A., Grindley, H., Rice, D., Willett, P.: A graph-theoretic approach to
the identification of three-dimensional patterns of amino acid side-chains in protein structures.
J. Mol. Biol. 243, 327–344 (1994)
16. Ashworth, J., Havranek, J.J., Duarte, C.M., Sussman, D., Monnat, R.J., Stoddard, B.L.,
Baker, D.: Computational redesign of endonuclease DNA binding and cleavage specificity.
Nature 441, 656–659 (2006)
17. Bahar, I., Jernigan, R.L.: Coordination geometry of nonbonded residues in globular proteins.
Fold. Des. 1, 357–370 (1996)
18. Baker, E.N., Hubbard, R.E.: Hydrogen bonding in globular proteins. Prog. Biophys. Mol.
Biol. 44, 97–179 (1984)
19. Baldwin, R.L.: In search of the energetic role of peptide hydrogen bonds. J. Biol. Chem. 278,
17581–17588 (2003)
20. Baldwin, R.L.: Energetics of protein folding. J. Mol. Biol. 371, 283–301 (2007)
21. Baldwin, R.L., Rose, G.D.: Is protein folding hierarchic? I. Local structure and peptide
folding. Trends Biochem. Sci. 24, 26–33 (1999)
22. Baldwin, R.L., Rose, G.D.: Is protein folding hierarchic? II. Folding intermediates and
transition states. Trends Biochem. Sci. 24, 77–83 (1999)
23. Bartels, C., Karplus, M.: Multidimensional adaptive umbrella sampling: applications to main
chain and side chain peptide conformations. J. Comp. Chem. 18, 1450–1462 (1997)
24. Bastolla, U., Porto, M., Roman, H., Vendruscolo, M.: Principal eigenvector of contact
matrices and hydrophobicity profiles in proteins. Proteins 58, 22–30 (2005)
25. Bastolla, U., Porto, M., Roman, H., Vendruscolo, M.: Structure, stability and evolution of
proteins: principal eigenvectors of contact matrices and hydrophobicity profiles. Gene 347,
219–230 (2005)
26. Bax, A., Kontaxis, G., Tjandra, N.: Dipolar couplings in macromolecular structure determi-
nation. Methods Enzymol. 339, 127–174 (2001)
27. Bayes, T.: An essay towards solving a problem in the doctrine of chances. Phil. Trans. R. Soc.
53, 370–418 (1763)
28. Bayrhuber, M., Meins, T., Habeck, M., Becker, S., Giller, K., Villinger, S., Vonrhein,
C., Griesinger, C., Zweckstetter, M., Zeth, K.: Structure of the human voltage-dependent
anion channel. Proc. Natl. Acad. Sci. U.S.A. 105, 15370–15375 (2008)
29. Beaumont, M.A., Zhang, W., Balding, D.J.: Approximate Bayesian computation in population
genetics. Genetics 162, 2025–2035 (2002)
30. Beitz, E.: TeXshade: shading and labeling of multiple sequence alignments using LaTeX2e.
Bioinformatics 16, 135–139 (2000)
31. Bellhouse, D.: The Reverend Thomas Bayes, FRS: a biography to celebrate the tercentenary
of his birth. Statist. Sci. 19, 3–43 (2004)
32. Ben-Naim, A.: Statistical potentials extracted from protein structures: are these meaningful
potentials? J. Chem. Phys. 107, 3698–3706 (1997)
33. Ben-Naim, A.: A Farewell to Entropy. World Scientific Publishing Company, Hackensack
(2008)
34. Bennett, C.: Efficient estimation of free energy differences from Monte Carlo data. J. Comp.
Phys. 22, 245–268 (1976)
35. Benros, C., de Brevern, A., Etchebest, C., Hazout, S.: Assessing a novel approach for
predicting local 3D protein structures from sequence. Proteins 62, 865–880 (2006)
36. Berg, B.: Locating global minima in optimization problems by a random-cost approach.
Nature 361, 708–710 (1993)
37. Berg, B.A.: Multicanonical recursions. J. Stat. Phys. 82, 323–342 (1996)
38. Berg, B.A.: Algorithmic aspects of multicanonical simulations. Nuclear Phys. B 63, 982–984
(1998)
39. Berg, B.: Workshop on Monte Carlo methods, fields institute communications. In: Introduc-
tion to Multicanonical Monte Carlo Simulations, vol. 26, pp. 1–23. American Mathematical
Society, Providence (2000)
40. Berg, B., Celik, T.: New approach to spin-glass simulations. Phys. Rev. Lett. 69, 2292–2295
(1992)
41. Berg, B., Neuhaus, T.: Multicanonical algorithms for first order phase transitions. Phys. Lett.
B 267, 249–253 (1991)
42. Berg, B., Neuhaus, T.: Multicanonical ensemble: a new approach to simulate first-order phase
transitions. Phys. Rev. Lett. 68, 9 (1992)
43. Berger, J., Bernardo, J., Sun, D.: The formal definition of reference priors. Ann. Stat. 37,
905–938 (2009)
44. Berman, H.M.: The protein data bank: a historical perspective. Acta Crystallogr. A 64, 88–95
(2008)
45. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov,
I.N., Bourne, P.E.: The protein data bank. Nucleic Acids Res. 28, 235–42 (2000)
46. Berman, H., Henrick, K., Nakamura, H.: Announcing the worldwide protein data bank. Nat.
Struct. Biol. 10, 980 (2003)
47. Bernard, B., Samudrala, R.: A generalized knowledge-based discriminatory function for
biomolecular interactions. Proteins 76, 115–128 (2009)
48. Bernado, J., Smith, A.: Bayesian Theory. Wiley, New York (1994)
49. Bernardo, J.: The concept of exchangeability and its applications. Far East J. Math. Sci. 4,
111–122 (1996)
50. Bernardo, J.: Reference analysis. In: Dey, D.K., Rao, C.R. (eds.) Handbook of Statistics,
vol. 25, pp. 17–90. Elsevier, Burlington (2005)
51. Bernardo, J.: A Bayesian mathematical statistics primer. In: Proceedings of the Seventh
International Conference on Teaching Statistics. International Association for Statistical
Education, Salvador (Bahia), Brazil (2006)
52. Bernardo, J.: Modern Bayesian inference: foundations and objective methods. In: Philosophy
of Statistics. Elsevier, Amsterdam (2009)
53. Berndt, K., Güntert, P., Orbons, L., Wütrich, K.: Determination of a high-quality nuclear mag-
netic resonance solution structure of the bovine pancreatic trypsin inhibitor and comparison
with the crystal structures. J. Mol. Biol. 227, 757–775 (1993)
54. Besag, J.: Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. B
36, 192–236 (1974)
55. Besag, J.: Statistical analysis of non-lattice data. Statistician 24, 179–195 (1975)
56. Best, D.J., Fisher, N.I.: Efficient simulation method of the von Mises distribution. Appl.
Statist. 28, 152–157 (1979)
57. Betancourt, M.: Efficient Monte Carlo trial moves for polypeptide simulations. J. Chem. Phys.
123, 174905 (2005)
58. Betancourt, M.R.: Another look at the conditions for the extraction of protein knowledge-
based potentials. Proteins 76, 72–85 (2009)
59. Betancourt, M.R., Thirumalai, D.: Pair potentials for protein folding: choice of reference
states and sensitivity of predicted native states to variations in the interaction schemes. Protein
Sci. 8, 361–369 (1999)
60. Bethe, H.: Statistical theory of superlattices. Proc. R. Soc. London A 150, 552–575 (1935)
61. Binder, K.: Monte Carlo and Molecular Dynamics Simulations in Polymer Science. Oxford
University Press, New York (1995)
62. Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)
63. Bjornstad, J.F.: On the generalization of the likelihood function and likelihood principle.
J. Am. Stat. Assoc. 91, 791–806 (1996)
64. Blundell, T.L., Sibanda, B.L., Sternberg, M.J., Thornton, J.M.: Knowledge-based prediction
of protein structures and the design of novel molecules. Nature 326, 347–52 (1987)
65. Boas, F.E., Harbury, P.B.: Potential energy functions for protein design. Curr. Opin. Struct.
Biol. 17, 199–204 (2007)
66. Bolon, D.N., Mayo, S.L.: Enzyme-like proteins by computational design. Proc. Natl. Acad.
Sci. U.S.A. 98, 14274–14279 (2001)
67. Bookstein, F.: Size and shape spaces for landmark data in two dimensions. Statist. Sci. 1,
181–222 (1986)
68. Boomsma, W., Mardia, K., Taylor, C., Ferkinghoff-Borg, J., Krogh, A., Hamelryck, T.: A
generative, probabilistic model of local protein structure. Proc. Natl. Acad. Sci. U.S.A. 105,
8932–8937 (2008)
69. Boomsma, W., Borg, M., Frellsen, J., Harder, T., Stovgaard, K., Ferkinghoff-Borg, J.,
Krogh, A., Mardia, K.V., Hamelryck, T.: PHAISTOS: protein structure prediction using a
probabilistic model of local structure. In: Proceedings of CASP8, pp. 82–83. Cagliari (2009)
70. Borg, M., Mardia, K.: A probabilistic approach to protein structure prediction: PHAISTOS
in CASP9. In: Gusnanto, A., Mardia, K., Fallaize, C. (eds.) LASR2009 – Statistical Tools for
Challenges in Bioinformatics, pp. 65–70. Leeds University Press, Leeds, UK (2009)
71. Bortz, A., Kalos, M., Lebowitz, J.: A new algorithm for Monte Carlo simulation of Ising spin
systems. J. Comp. Phys. 17, 10–18 (1975)
72. Bottaro, S., Boomsma, W., Johansson, K.E., Andreetta, C., Hamelryck, T.,
Ferkinghoff-Borg, J.: Subtle Monte Carlo updates in dense molecular systems. J. Chem.
Theory Comput., Accepted and published online (2012)
73. Bourne, P.E., Weissig, H.: Structural bioinformatics. In: Methods of Biochemical Analysis,
vol. 44. Wiley-Liss, Hoboken (2003)
74. Bowie, J., Eisenberg, D.: An evolutionary approach to folding small alpha-helical proteins
that uses sequence information and an empirical guiding fitness function. Proc. Natl. Acad.
Sci. U.S.A. 91, 4436–4440 (1994)
75. Bowie, J., Luthy, R., Eisenberg, D.: A method to identify protein sequences that fold into a
known three-dimensional structure. Science 253, 164–170 (1991)
76. Bradley, P., Misura, K., Baker, D.: Toward high-resolution de novo structure prediction for
small proteins. Science 309, 1868–1871 (2005)
77. Brémaud, P.: Markov Chains. Springer, New York (1999)
78. Brenner, S.E., Koehl, P., Levitt, M.: The ASTRAL compendium for protein structure and
sequence analysis. Nucleic Acids Res. 28, 254–256 (2000)
79. Brünger, A.T.: The free R value: a novel statistical quantity for assessing the accuracy of
crystal structures. Nature 355, 472–474 (1992)
80. Brünger, A.T.: X-PLOR, Version 3.1: A System for X-ray Crystallography and NMR. Yale
University Press, New Haven (1992)
81. Brünger, A.T., Nilges, M.: Computational challenges for macromolecular structure deter-
mination by X-ray crystallography and solution NMR spectroscopy. Q. Rev. Biophys. 26,
49–125 (1993)
82. Brünger, A.T., Clore, G.M., Gronenborn, A.M., Saffrich, R., Nilges, M.: Assessing the quality
of solution nuclear magnetic resonance structures by complete cross-validation. Science 261,
328–331 (1993)
83. Brünger, A.T., Adams, P.D., Clore, G.M., DeLano, W.L., Gros, P., Grosse-Kunstleve, R.W.,
Jiang, J.S., Kuszewski, J., Nilges, M., Pannu, N.S., Read, R.J., Rice, L.M., Simonson,
T., Warren, G.L.: Crystallography and NMR system (CNS): a new software suite for
macromolecular structure determination. Acta Crystallogr. D 54, 905–921 (1998)
84. Brunner-Popela, J., Glatter, O.: Small-angle scattering of interacting particles. I. Basic
principles of a global evaluation technique. J. Appl. Cryst. 30, 431–442 (1997)
85. Bryant, S.H., Lawrence, C.E.: The frequency of ion-pair substructures in proteins is quantita-
tively related to electrostatic potential: a statistical model for nonbonded interactions. Proteins
9, 108–119 (1991)
86. Bryngelson, J.D., Wolynes, P.G.: Spin glasses and the statistical mechanics of protein folding.
Proc. Natl. Acad. Sci. U.S.A. 84, 7524–7528 (1987)
87. Buchete, N.V., Straub, J.E., Thirumalai, D.: Continuous anisotropic representation of coarse-
grained potentials for proteins by spherical harmonics synthesis. J. Mol. Graph. Model. 22,
441–450 (2004)
88. Buchete, N.V., Straub, J.E., Thirumalai, D.: Development of novel statistical potentials for
protein fold recognition. Curr. Opin. Struct. Biol. 14, 225–32 (2004)
89. Bunagan, M., Yang, X., Saven, J., Gai, F.: Ultrafast folding of a computationally designed
Trp-cage mutant: Trp2 -cage. J. Phys. Chem. B 110, 3759–3763 (2006)
90. Burge, J., Lane, T.: Shrinkage estimator for Bayesian network parameters. Lect. Notes
Comput. Sci. 4701, 67–78 (2007)
91. Burnham, K.P., Anderson, D.R.: Model Selection and Multimodel Inference – A Practical
Information-Theoretic Approach, 2nd edn. Springer, New York (2002)
92. Butterfoss, G.L., Hermans, J.: Boltzmann-type distribution of side-chain conformation in
proteins. Protein Sci. 12, 2719–2731 (2003)
93. Bystroff, C., Baker, D.: Prediction of local structure in proteins using a library of sequence-
structure motifs. J. Mol. Biol. 281, 565–577 (1998)
94. Bystroff, C., Thorsson, V., Baker, D.: HMMSTR: a hidden Markov model for local sequence-
structure correlations in proteins. J. Mol. Biol. 301, 173–190 (2000)
95. Calhoun, J.R., Kono, H., Lahr, S., Wang, W., DeGrado, W.F., Saven, J.G.: Computational
design and characterization of a monomeric helical dinuclear metalloprotein. J. Mol. Biol.
334, 1101–1115 (2003)
96. Camproux, A., Tuffery, P., Chevrolat, J., Boisvieux, J., Hazout, S.: Hidden Markov model
approach for identifying the modular framework of the protein backbone. Protein Eng. 12,
1063–1073 (1999)
97. Camproux, A., Gautier, R., Tufféry, P.: A hidden Markov model derived structural alphabet
for proteins. J. Mol. Biol. 339, 591–605 (2004)
98. Canutescu, A.A., Shelenkov, A.A., Dunbrack R. L. Jr.: A graph-theory algorithm for rapid
protein side-chain prediction. Protein Sci. 12, 2001–2014 (2003)
99. Carraghan, R., Pardalos, P.: Exact algorithm for the maximum clique problem. Oper. Res.
Lett. 9, 375–382 (1990)
100. Carreira-Perpinan, M., Hinton, G.: On contrastive divergence learning. In: Proceedings of
the 10th International Workshop on Artificial Intelligence and Statistics, Barbados, p. 217
(2005)
101. Caruana, R.: Multitask learning. Mach. Learn. 28, 41–75 (1997)
102. Cawley, S., Pachter, L.: HMM sampling and applications to gene finding and alternative
splicing. Bioinformatics 19(Suppl 2), II36–II41 (2003)
103. Celeux, G., Diebolt, J.: The SEM algorithm: a probabilistic teacher algorithm derived from
the EM algorithm for the mixture problem. Comp. Stat. Quart. 2, 73–92 (1985)
104. Cerf, N., Martin, O.: Finite population-size effects in projection Monte Carlo methods. Phys.
Rev. E 51, 3679 (1995)
105. Chalaoux, F.R., O’Donoghue, S.I., Nilges, M.: Molecular dynamics and accuracy of NMR
structures: effects of error bounds and data removal. Proteins 34, 453–463 (1999)
106. Chan, H.S., Dill, K.A.: Solvation: how to obtain microscopic energies from partitioning and
solvation experiments. Annu. Rev. Biophys. Biomol. Struct. 26, 425–459 (1997)
107. Chandler, D.: Introduction to Modern Statistical Mechanics. Oxford University Press, Oxford
(1987)
108. Chandrasekaran, R., Ramachandran, G.: Studies on the conformation of amino acids. XI.
Analysis of the observed side group conformation in proteins. Int. J. Protein Res. 2, 223–233
(1970)
109. Cheeseman, P.: Oral presentation at MaxEnt 2004, Munich (2004)
110. Chen, W.W., Shakhnovich, E.I.: Lessons from the design of a novel atomic potential for
protein folding. Protein Sci. 14, 1741–1752 (2005)
111. Chen, J., Wonpil, I., Brooks, C.L.: Application of torsion angle molecular dynamics for
efficient sampling of protein conformations. J. Comp. Chem. 26, 1565–1578 (2005)
112. Chikenji, G., Fujitsuka, Y., Takada, S.: A reversible fragment assembly method for de novo
protein structure prediction. J. Chem. Phys. 119, 6895–6903 (2003)
113. Chu, W., Ghahramani, Z., Podtelezhnikov, A., Wild, D.L.: Bayesian segmental models
with multiple sequence alignment profiles for protein secondary structure and contact map
prediction. IEEE Trans. Comput. Biol. Bioinform. 3, 98–113 (2006)
114. Cieplak, M., Holter, N.S., Maritan, A., Banavar, J.R.: Amino acid classes and the protein
folding problem. J. Chem. Phys. 114, 1420–1423 (2001)
115. Clark, L.A., van Vlijmen, H.W.: A knowledge-based forcefield for protein-protein interface
design. Proteins 70, 1540–1550 (2008)
116. Clore, G.M., Schwieters, C.D.: Concordance of residual dipolar couplings, backbone order
parameters and crystallographic B-factors for a small alpha/beta protein: a unified picture of
high probability, fast atomic motions in proteins. J. Mol. Biol. 355, 879–886 (2006)
117. Cochran, F., Wu, S., Wang, W., Nanda, V., Saven, J., Therien, M., DeGrado, W.: Computa-
tional de novo design and characterization of a four-helix bundle protein that selectively binds
a nonbiological cofactor. J. Am. Chem. Soc. 127, 1346–1347 (2005)
118. Cooper, N., Eckhardt, R., Shera, N.: From Cardinals to Chaos: Reflections on the Life and
Legacy of Stanislaw Ulam. Cambridge University Press, New York (1989)
119. Cootes, T., Taylor, C., Cooper, D., Graham, J.: Training models of shape from sets of
examples. In: Hogg, D., Boyle, R. (eds.) British Machine Vision Conference, pp. 9–18.
Springer, Berlin (1992)
120. Cootes, T., Taylor, C., Cooper, D., Graham, J.: Image search using flexible shape models
generated from sets of examples. In: Mardia, K. (ed.) Statistics and Images, vol. 2. Carfax,
Oxford (1994)
121. Corey, R., Pauling, L.: Fundamental dimensions of polypeptide chains. Proc. R. Soc. London
B 141, 10–20 (1953)
122. Cossio, P., Marinelli, F., Laio, A., Pietrucci, F.: Optimizing the performance of bias-exchange
metadynamics: folding a 48-residue LysM domain using a coarse-grained model. J. Phys.
Chem. B 114, 3259–3265 (2010)
123. Cossio, P., Trovato, A., Pietrucci, F., Seno, F., Maritan, A., Laio, A.: Exploring the universe
of protein structures beyond the protein data bank. PLoS Comput. Biol. 6, e1000957 (2010)
124. Costa, M., Lopes, J., Santos, J.: Analytical study of tunneling times in flat histogram Monte
Carlo. Europhys. Lett. 72, 802 (2005)
125. Cowles, M., Carlin, B.: Markov chain Monte Carlo convergence diagnostics: a comparative
review. J. Am. Stat. Assoc. 91, 883–904 (1996)
126. Cowell, R., Dawid, A., Lauritzen, S., Spiegelhalter, D.: Probabilistic Networks and Expert
Systems. Springer, New York (1999)
127. Cox, R.T.: The Algebra of Probable Inference. John Hopkins University Press, Baltimore
(1961)
128. Cox, D., Isham, V.: Point Processes. Chapman and Hall, New York (1980)
129. Crespo, Y., Laio, A., Santoro, G., Tosatti, E.: Calculating thermodynamics properties of
quantum systems by a non-Markovian Monte Carlo procedure. Phys. Rev. E 80, 015702
(2009)
130. Czogiel, I., Dryden, I., Brignell, C.: Bayesian alignment of continuous molecular shapes using
random fields. In: Barber, S., Baxter, P., Gusnanto, A., Mardia, K. (eds.) The Art and Science
of Statistical Bioinformatics. Leeds University Press, Leeds (2008)
131. Dagpunar, J.: Principles of Random Variate Generation. Clarendon Press, Oxford (1988)
132. Dahiyat, B.I., Mayo, S.L.: De novo protein design: fully automated sequence selection.
Science 278, 82–87 (1997)
133. Das, R., Baker, D.: Automated de novo prediction of native-like RNA tertiary structures. Proc.
Natl. Acad. Sci. U.S.A. 104, 14664–14669 (2007)
134. Das, R., Baker, D.: Macromolecular modeling with ROSETTA. Annu. Rev. Biochem. 77,
363–382 (2008)
135. Davies, J., Jackson, R., Mardia, K., Taylor, C.: The Poisson index: a new probabilistic model
for protein-ligand binding site similarity. Bioinformatics 23, 3001–3008 (2007)
136. Dawid, A.P.: Some matrix-variate distribution theory: notational considerations and a
Bayesian application. Biometrika 68, 265–274 (1981)
137. Day, R., Paschek, D., Garcia, A.E.: Microsecond simulations of the folding/unfolding
thermodynamics of the Trp-cage miniprotein. Proteins 78, 1889–1899 (2010)
138. Dayal, P., Trebst, S., Wessel, S., Wuertz, D., Troyer, M., Sabhapandit, S., Coppersmith, S.:
Performance limitations of flat-histogram methods. Phys. Rev. Lett. 92, 97201 (2004)
139. de Brevern, A., Etchebest, C., Hazout, S.: Bayesian probabilistic approach for predicting
backbone structures in terms of protein blocks. Proteins 41, 271–287 (2000)
140. de Oliveira, P., Penna, T., Herrmann, H.: Broad histogram method. Brazil. J. Phys. 26, 677
(1996)
141. Decanniere, K., Transue, T.R., Desmyter, A., Maes, D., Muyldermans, S., Wyns, L.:
Degenerate interfaces in antigen-antibody complexes. J. Mol. Biol. 313, 473–478 (2001)
142. Dehouck, Y., Gilis, D., Rooman, M.: A new generation of statistical potentials for proteins.
Biophys. J. 90, 4010–4017 (2006)
143. DeLano, W.: The PyMOL molecular graphics system. http://www.pymol.org (2002)
144. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the
EM algorithm. J. R. Statist. Soc. B. 39, 1–38 (1977)
145. Derrida, B.: Random-energy model: an exactly solvable model of disordered systems. Phys.
Rev. B 24, 2613–2626 (1981)
146. Desjarlais, J.R., Handel, T.M.: De novo design of the hydrophobic cores of proteins. Protein
Sci. 4, 2006–2018 (1995)
147. Desmet, J., De Maeyer, M., Hazes, B., Lasters, I.: The dead-end elimination theorem and its
use in protein side-chain positioning. Nature 356, 539–542 (1992)
148. Desmet, J., Spriet, J., Lasters, I.: Fast and accurate side-chain topology and energy refinement
(FASTER) as a new method for protein structure optimization. Proteins 48, 31–43 (2002)
149. Diamond, R.: On the multiple simultaneous superposition of molecular-structures by rigid
body transformations. Protein Sci. 1, 1279–1287 (1992)
150. Dill, K., Bromberg, S.: Molecular Driving Forces. Garland Science, New York (2002)
151. Ding, F., Dokholyan, N.V.: Emergence of protein fold families through rational design. PLoS
Comput. Biol. 2, e85 (2006)
152. Djuranovic, S., Hartmann, M.D., Habeck, M., Ursinus, A., Zwickl, P., Martin, J., Lupas,
A.N., Zeth, K.: Structure and activity of the N-terminal substrate recognition domains in
proteasomal ATPases. Mol. Cell 34, 1–11 (2009)
153. Dodd, L.R., Boone, T.D., Theodorou, D.N.: A concerted rotation algorithm for atomistic
Monte Carlo simulation of polymer melts and glasses. Mol. Phys. 78, 961–996 (1993)
154. Dose, V.: Bayesian inference in physics: case studies. Rep. Prog. Phys. 66, 1421–1461 (2003)
155. Doucet, A., Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice.
Springer, New York (2001)
156. Dowe, D., Allison, L., Dix, T., Hunter, L., Wallace, C., Edgoose, T.: Circular clustering
of protein dihedral angles by minimum message length. In: Hunter, L., Klein, T.E. (eds.) Pac.
Symp. Biocomput., pp. 242–255. World Scientific, Singapore (1996)
157. Downs, T.: Orientation statistics. Biometrika 59, 665–676 (1972)
158. Dryden, I.: Discussion to Schmidler. In: Bernardo, J., Bayarri, M., Berger, J., Dawid, A.,
Heckerman, D. Smith, A., West, M. (eds.) Bayesian Statistics 8, pp. 471–490. Oxford
University Press, Oxford (2007)
159. Dryden, I.L., Mardia, K.V.: Statistical Shape Analysis. Wiley Series in Probability and
Statistics, Probability and Statistics. Wiley, Chichester (1998)
160. Dryden, I., Hirst, J., Melville, J.: Statistical analysis of unlabeled point sets: comparing
molecules in chemoinformatics. Biometrics 63, 237–251 (2007)
161. Duane, S., Kennedy, A.D., Pendleton, B., Roweth, D.: Hybrid Monte Carlo. Phys. Lett. B
195, 216–222 (1987)
162. Dunbrack, R.L.: Rotamer libraries in the 21st century. Curr. Opin. Struct. Biol. 12, 431–440
(2002)
163. Dunbrack, R.L., Karplus, M.: Backbone-dependent rotamer library for proteins application to
side-chain prediction. J. Mol. Biol. 230, 543–574 (1993)
164. Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge
University Press, New York (1998)
165. Dutilleul, P.: The MLE algorithm for the matrix normal distribution. J. Stat. Comput. Sim. 64,
105–123 (1999)
166. Earl, D., Deem, M.: Parallel tempering: theory, applications, and new perspectives. Phys.
Chem. Chem. Phys. 7, 3910–3916 (2005)
167. Eckart, C.: Some studies concerning rotating axes and polyatomic molecules. Phys. Rev. 47,
552–558 (1935)
168. Edgoose, T., Allison, L., Dowe, D.: An MML classification of protein structure that knows
about angles and sequence. Pac. Symp. Biocomput. 3, 585–596 (1998)
169. Efron, B., Morris, C.: Limiting the risk of Bayes and empirical Bayes estimators–part II: the
empirical Bayes case. J. Am. Stat. Assoc. 67, 130–139 (1972)
170. Efron, B., Morris, C.: Stein’s estimation rule and its competitors–an empirical Bayes
approach. J. Am. Stat. Assoc. 68, 117–130 (1973)
171. Elidan, G., McGraw, I., Koller, D.: Residual belief propagation: informed scheduling
for asynchronous message passing. In: Proceedings of the Twenty-second Conference on
Uncertainty in AI (UAI), Boston (2006)
172. Elofsson, A., Le Grand, S.M., Eisenberg, D.: Local moves: an efficient algorithm for
simulation of protein folding. Proteins 23, 73–82 (1995)
173. Engh, R.A., Huber, R.: Accurate bond and angle parameters for X-ray structure refinement.
Acta Crystallogr. A A47, 392–400 (1991)
174. Engh, R.A., Huber, R.: Structure quality and target parameters. In: Rossman, M.G., Arnold, E.
(eds.) International Tables for Crystallography, Vol. F: Crystallography of Biological Macro-
molecules, 1st edn., pp. 382–392. Kluwer Academic Publishers for the International Union
of Crystallography, Dordrecht/Boston/London (2001)
175. Etchebest, C., Benros, C., Hazout, S., de Brevern, A.: A structural alphabet for local protein
structures: improved prediction methods. Proteins 59, 810–827 (2005)
176. Favrin, G., Irbäck, A., Sjunnesson, F.: Monte Carlo update for chain molecules: biased
Gaussian steps in torsional space. J. Chem. Phys. 114, 8154 (2001)
177. Feigin, L., Svergun, D., Taylor, G.: Structure Analysis by Small-Angle X-ray and Neutron
Scattering. Plenum Press, New York (1987)
178. Feng, Y., Kloczkowski, A., Jernigan, R.L.: Four-body contact potentials derived from two
protein datasets to discriminate native structures from decoys. Proteins 68, 57–66 (2007)
179. Ferkinghoff-Borg, J.: Monte Carlo methods in complex systems. Ph.D. thesis, Graduate
School of Biophysics, Niels Bohr Institute (2002)
180. Ferkinghoff-Borg, J.: Optimized Monte Carlo analysis for generalized ensembles. Eur. Phys.
J. B 29, 481–484 (2002)
181. Feroz, F., Hobson, M.: Multimodal nested sampling: an efficient and robust alternative to
Markov chain Monte Carlo methods for astronomical data analyses. Mon. Not. R. Astron.
Soc. 384, 449–463 (2008)
182. Fersht, A.: Nucleation mechanisms in protein folding. Curr. Opin. Struct. Biol. 7, 3–9 (1997)
183. Fienberg, S.: When did Bayesian inference become Bayesian. Bayesian Anal. 1, 1–40 (2006)
184. Finkelstein, A.V., Badretdinov, A.Y., Gutin, A.M.: Why do protein architectures have
Boltzmann-like statistics? Proteins 23, 142–150 (1995)
185. Fisher, N.I., Lewis, T., Embleton, B.J.J.: Statistical Analysis of Spherical Data. Cambridge
University Press, New York (1987)
186. Fitzgerald, J.E., Jha, A.K., Colubri, A., Sosnick, T.R., Freed, K.F.: Reduced Cβ statisti-
cal potentials can outperform all-atom potentials in decoy identification. Protein Sci. 16,
2123–2139 (2007)
187. Fixman, M.: Classical statistical mechanics of constraints: a theorem and application to
polymers. Proc. Natl. Acad. Sci. U.S.A. 71, 3050–3053 (1974)
188. Fixman, M.: Simulation of polymer dynamics. I. general theory. J. Chem. Phys. 69,
1527–1537 (1978)
189. Fleming, P.J., Rose, G.D.: Do all backbone polar groups in proteins form hydrogen bonds?
Protein Sci. 14, 1911–1917 (2005)
190. Fleury, P., Chen, H., Murray, A.F.: On-chip contrastive divergence learning in analogue VLSI.
IEEE Int. Conf. Neur. Netw. 3, 1723–1728 (2004)
191. Flower, D.R.: Rotational superposition: a review of methods. J. Mol. Graph. Model. 17, 238–
244 (1999)
192. Fogolari, F., Esposito, G., Viglino, P., Cattarinussi, S.: Modeling of polypeptide chains as Cα
chains, Cα chains with Cβ, and Cα chains with ellipsoidal lateral chains. Biophys. J. 70,
1183–1197 (1996)
193. Fraenkel, A.: Protein folding, spin glass and computational complexity. In: DNA Based
Computers III: DIMACS Workshop, June 23–25, 1997, p. 101. American Mathematical
Society, Providence (1999)
194. Frauenkron, H., Bastolla, U., Gerstner, E., Grassberger, P., Nadler, W.: New Monte Carlo
algorithm for protein folding. Phys. Rev. Lett. 80, 3149–3152 (1998)
195. Frellsen, J.: Probabilistic methods in macromolecular structure prediction. Ph.D. thesis,
Bioinformatics Centre, University of Copenhagen (2011)
196. Frellsen, J., Ferkinghoff-Borg, J.: Muninn – an automated method for Monte Carlo simula-
tions in generalized ensembles. In preparation
197. Frellsen, J., Moltke, I., Thiim, M., Mardia, K.V., Ferkinghoff-Borg, J., Hamelryck, T.: A
probabilistic model of RNA conformational space. PLoS Comput. Biol. 5, e1000406 (2009)
198. Frenkel, D., Smit, B.: Understanding Molecular Simulation: From Algorithms to Applica-
tions, vol. 1. Academic, San Diego (2002)
199. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315,
972–976 (2007)
200. Frickenhaus, S., Kannan, S., Zacharias, M.: Efficient evaluation of sampling quality of
molecular dynamics simulations by clustering of dihedral torsion angles and Sammon
mapping. J. Comp. Chem. 30, 479–492 (2009)
201. Friedland, G.D., Linares, A.J., Smith, C.A., Kortemme, T.: A simple model of backbone
flexibility improves modeling of side-chain conformational variability. J. Mol. Biol. 380,
757–774 (2008)
202. Fritz, G., Glatter, O.: Structure and interaction in dense colloidal systems: evaluation of
scattering data by the generalized indirect Fourier transformation method. J. Phys. Condens.
Matter 18, S2403–S2419 (2006)
203. Fritz, G., Bergmann, A., Glatter, O.: Evaluation of small-angle scattering data of charged
particles using the generalized indirect Fourier transformation technique. J. Chem. Phys. 113,
9733–9740 (2000)
204. Fromer, M., Globerson, A.: An LP view of the M-best MAP problem. In: Bengio, Y.,
Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural
Information Processing Systems (NIPS) 22, pp. 567–575. MIT Press, Cambridge, MA (2009)
205. Fromer, M., Shifman, J.M.: Tradeoff between stability and multispecificity in the design of
promiscuous proteins. PLoS Comput. Biol. 5, e1000627 (2009)
206. Fromer, M., Yanover, C.: A computational framework to empower probabilistic protein
design. Bioinformatics 24, i214–i222 (2008)
207. Fromer, M., Yanover, C.: Accurate prediction for atomic-level protein design and its
application in diversifying the near-optimal sequence space. Proteins 75, 682–705 (2009)
208. Fromer, M., Yanover, C., Harel, A., Shachar, O., Weiss, Y., Linial, M.: SPRINT: side-chain
prediction inference toolbox for multistate protein design. Bioinformatics 26, 2466–2467
(2010)
209. Fromer, M., Yanover, C., Linial, M.: Design of multispecific protein sequences using
probabilistic graphical modeling. Proteins 78, 530–547 (2010)
210. Garel, T., Orland, H.: Guided replication of random chain: a new Monte Carlo method.
J. Phys. A 23, L621 (1990)
211. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis. Chapman & Hall/CRC
press, Boca Raton (2004)
212. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restora-
tion of images. IEEE Trans. PAMI 6, 721–741 (1984)
213. Georgiev, I., Donald, B.R.: Dead-end elimination with backbone flexibility. Bioinformatics
23, i185–i194 (2007)
214. Georgiev, I., Keedy, D., Richardson, J.S., Richardson, D.C., Donald, B.R.: Algorithm for
backrub motions in protein design. Bioinformatics 24, i196–i204 (2008)
215. Georgiev, I., Lilien, R.H., Donald, B.R.: The minimized dead-end elimination criterion and
its application to protein redesign in a hybrid scoring and search algorithm for computing
partition functions over molecular ensembles. J. Comp. Chem. 29, 1527–1542 (2008)
216. Gerber, P.R., Müller, K.: Superimposing several sets of atomic coordinates. Acta Crystallogr.
A 43, 426–428 (1987)
217. Gervais, C., Wüst, T., Landau, D., Xu, Y.: Application of the Wang–Landau algorithm to the
dimerization of glycophorin A. J. Chem. Phys. 130, 215106 (2009)
218. Geyer, C., Thompson, E.: Annealing Markov chain Monte Carlo with applications to ancestral
inference. J. Am. Stat. Assoc. 90, 909–920 (1995)
219. Ghahramani, Z.: Learning dynamic Bayesian networks. Lect. Notes Comput. Sci. 1387,
168–197 (1997)
220. Ghahramani, Z., Jordan, M.I.: Factorial hidden Markov models. Mach. Learn. 29, 245–273
(1997)
221. Ghose, A., Crippen, G.: Geometrically feasible binding modes of a flexible ligand molecule
at the receptor site. J. Comp. Chem. 6, 350–359 (1985)
222. Gibrat, J., Madej, T., Bryant, S.: Surprising similarities in structure comparison. Curr. Opin.
Struct. Biol. 6, 377–385 (1996)
223. Gilks, W.R., Richardson, S., Spiegelhalter, D.J.: Markov Chain Monte Carlo in Practice.
Chapman and Hall/CRC, Boca Raton (1998)
224. Glasbey, C., Horgan, G., Gibson, G., Hitchcock, D.: Fish shape analysis using landmarks.
Biometrical J. 37, 481–495 (1995)
225. Glatter, O.: A new method for the evaluation of small-angle scattering data. J. Appl. Cryst.
10, 415–421 (1977)
226. Glatter, O.: Determination of particle-size distribution functions from small-angle scattering
data by means of the indirect transformation method. J. Appl. Cryst. 13, 7–11 (1980)
227. Glatter, O.: Evaluation of small-angle scattering data from lamellar and cylindrical particles
by the indirect transformation method. J. Appl. Cryst. 13, 577–584 (1980)
228. Glatter, O., Kratky, O.: Small Angle X-ray Scattering. Academic, London (1982)
229. Globerson, A., Jaakkola, T.: Fixing max-product: convergent message passing algorithms for
MAP LP-relaxations. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) Advances in Neural
Information Processing Systems 21. MIT, Cambridge (2007)
230. Gō, N.: Theoretical studies of protein folding. Annu. Rev. Biophys. Bioeng. 12, 183–210
(1983)
231. Gō, N., Scheraga, H.: Ring closure and local conformational deformations of chain molecules.
Macromolecules 3, 178–187 (1970)
232. Godzik, A.: Knowledge-based potentials for protein folding: what can we learn from known
protein structures? Structure 4, 363–366 (1996)
233. Godzik, A., Kolinski, A., Skolnick, J.: Are proteins ideal mixtures of amino acids? Analysis
of energy parameter sets. Protein Sci. 4, 2107–2117 (1995)
234. Gohlke, H., Klebe, G.: Statistical potentials and scoring functions applied to protein-ligand
binding. Curr. Opin. Struct. Biol. 11, 231–235 (2001)
235. Gold, N.: Computational approaches to similarity searching in a functional site database
for protein function prediction. Ph.D. thesis, Leeds University, School of Biochemistry and
Microbiology (2003)
236. Gold, N., Jackson, R.: Fold independent structural comparisons of protein-ligand binding sites
for exploring functional relationships. J. Mol. Biol. 355, 1112–1124 (2006)
237. Goldstein, R.F.: Efficient rotamer elimination applied to protein side-chains and related spin
glasses. Biophys. J. 66, 1335–1340 (1994)
238. Goldstein, R.A., Luthey-Schulten, Z.A., Wolynes, P.G.: Optimal protein-folding codes from
spin-glass theory. Proc. Natl. Acad. Sci. U.S.A. 89, 4918–4922 (1992)
239. Golender, V., Rozenblit, A.: Logical and Combinatorial Algorithms for Drug Design.
Research Studies Press, Letchworth (1983)
240. Gonzalez, J., Low, Y., Guestrin, C.: Residual splash for optimally parallelizing belief
propagation. In: Artificial Intelligence and Statistics (AISTATS). Clearwater Beach (2009)
241. Goodall, C.R.: Procrustes methods in the statistical analysis of shape. J. R. Statist. Soc. B. 53,
285–321 (1991)
242. Goodall, C.R.: Procrustes methods in the statistical analysis of shape: rejoinder to discussion.
J. R. Statist. Soc. B. 53, 334–339 (1991)
243. Goodall, C.R.: Procrustes methods in the statistical analysis of shape revisited. In: Mardia,
K.V., Gill, C.A. (eds.) Proceedings in Current Issues in Statistical Shape Analysis, pp. 18–33.
Leeds University Press, Leeds (1995)
244. Goodall, C.R., Bose, A.: Models and Procrustes methods for the analysis of shape differences.
In: Proceedings of the 19th Symposium on the Interface between Computer Science and
Statistics, Lexington, KY (1987)
245. Goodall, C.R., Mardia, K.V.: Multivariate aspects of shape theory. Ann. Stat. 21, 848–866
(1993)
246. Gordon, D.B., Marshall, S.A., Mayo, S.L.: Energy functions for protein design. Curr. Opin.
Struct. Biol. 9, 509–513 (1999)
247. Gordon, D.B., Hom, G.K., Mayo, S.L., Pierce, N.A.: Exact rotamer optimization for protein
design. J. Comp. Chem. 24, 232–243 (2003)
248. Gower, J.C.: Generalized Procrustes analysis. Psychometrika 40, 33–51 (1975)
249. Grantcharova, V., Riddle, D., Santiago, J., Baker, D.: Important role of hydrogen bonds in
the structurally polarized transition state for folding of the src SH3 domain. Nat. Struct. Mol.
Biol. 5, 714–720 (1998)
250. Grassberger, P.: Pruned-enriched Rosenbluth method: simulations of polymers of chain
length up to 1 000 000. Phys. Rev. E 56, 3682 (1997)
251. Grassberger, P., Frauenkron, H., Nadler, W.: Monte Carlo Approach to Biopolymers and
Protein Folding, p. 301. World Scientific, Singapore (1998)
252. Green, B.: The orthogonal approximation of an oblique structure in factor analysis. Psy-
chometrika 17, 429–440 (1952)
253. Green, P.: Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination. Biometrika 82, 711–732 (1995)
254. Green, P., Mardia, K.: Bayesian alignment using hierarchical models, with applications in
protein bioinformatics. Biometrika 93, 235–254 (2006)
255. Green, P.J., Silverman, B.W.: Nonparametric Regression and Generalized Linear Models:
A Roughness Penalty Approach, vol. 58, 1st edn. Chapman and Hall, London (1994)
256. Green, P., Mardia, K., Nyirongo, V., Ruffieux, Y.: Bayesian modeling for matching and
alignment of biomolecules. In: O’Hagan, A., West, M. (eds.) The Oxford Handbook of
Applied Bayesian Analysis, pp. 27–50. Oxford University Press, Oxford (2010)
257. Grigoryan, G., Ochoa, A., Keating, A.E.: Computing van der Waals energies in the context of
the rotamer approximation. Proteins 68, 863–878 (2007)
258. Grigoryan, G., Reinke, A.W., Keating, A.E.: Design of protein-interaction specificity gives
selective bZIP-binding peptides. Nature 458, 859–864 (2009)
259. Gu, J., Li, H., Jiang, H., Wang, X.: A simple Cα-SC potential with higher accuracy for protein
fold recognition. Biochem. Biophys. Res. Commun. 379, 610–615 (2009)
260. Guggenheim, E.A.: The statistical mechanics of co-operative assemblies. Proc. R. Soc.
London A 169, 134–148 (1938)
261. Gull, S.: Maximum Entropy and Bayesian Methods, pp. 53–71. Kluwer, Dordrecht (1989)
262. Güntert, P.: Structure calculation of biological macromolecules from NMR data. Q. Rev.
Biophys. 31, 145–237 (1998)
263. Güntert, P., Mumenthaler, C., Wüthrich, K.: Torsion angle dynamics for NMR structure
calculation with the new program DYANA. J. Mol. Biol. 273, 283–298 (1997)
264. Guo, Z., Thirumalai, D.: Kinetics of protein folding: nucleation mechanism, time scales, and
pathways. Biopolymers 36, 83–102 (1995)
265. Guttorp, P., Lockhart, R.A.: Finding the location of a signal: a Bayesian analysis. J. Am. Stat.
Assoc. 83, 322–330 (1988)
266. Habeck, M.: Statistical mechanics analysis of sparse data. J. Struct. Biol. 173, 541–548 (2011)
267. Habeck, M., Nilges, M., Rieping, W.: Bayesian inference applied to macromolecular structure
determination. Phys. Rev. E 72, 031912 (2005)
268. Habeck, M., Nilges, M., Rieping, W.: Replica-Exchange Monte Carlo scheme for Bayesian
data analysis. Phys. Rev. Lett. 94, 0181051–0181054 (2005)
269. Habeck, M., Rieping, W., Nilges, M.: Bayesian estimation of Karplus parameters and torsion
angles from three-bond scalar coupling constants. J. Magn. Reson. 177, 160–165 (2005)
270. Habeck, M., Rieping, W., Nilges, M.: Weighting of experimental evidence in macromolecular
structure determination. Proc. Natl. Acad. Sci. U.S.A. 103, 1756–1761 (2006)
271. Habeck, M., Nilges, M., Rieping, W.: A unifying probabilistic framework for analyzing
residual dipolar couplings. J. Biomol. NMR 40, 135–144 (2008)
272. O'Hagan, A.: Monte Carlo is fundamentally unsound. Statistician 36, 247–249 (1987)
273. Halperin, I., Ma, B., Wolfson, H., Nussinov, R.: Principles of docking: an overview of search
algorithms and a guide to scoring functions. Proteins 47, 409–443 (2002)
274. Hamelryck, T.: Probabilistic models and machine learning in structural bioinformatics. Stat.
Methods Med. Res. 18, 505–526 (2009)
275. Hamelryck, T., Kent, J., Krogh, A.: Sampling realistic protein conformations using local
structural bias. PLoS Comput. Biol. 2, e131 (2006)
276. Hamelryck, T., Borg, M., Paluszewski, M., Paulsen, J., Frellsen, J., Andreetta, C.,
Boomsma, W., Bottaro, S., Ferkinghoff-Borg, J.: Potentials of mean force for protein structure
prediction vindicated, formalized and generalized. PLoS ONE 5, e13714 (2010)
277. Han, K., Baker, D.: Recurring local sequence motifs in proteins. J. Mol. Biol. 251, 176–187
(1995)
278. Hansen, S.: Bayesian estimation of hyperparameters for indirect Fourier transformation in
small-angle scattering. J. Appl. Cryst. 33, 1415–1421 (2000)
279. Hansen, S.: Estimation of chord length distributions from small-angle scattering using indirect
Fourier transformation. J. Appl. Cryst. 36, 1190–1196 (2003)
280. Hansen, S.: Simultaneous estimation of the form factor and structure factor for globular
particles in small-angle scattering. J. Appl. Cryst. 41, 436–445 (2008)
281. Hansen, S., Müller, J.: Maximum-Entropy and Bayesian Methods, Cambridge, England,
1994, pp. 69–78. Kluwer, Dordrecht (1996)
282. Hansen, S., Pedersen, J.: A comparison of three different methods for analysing small-angle
scattering data. J. Appl. Cryst. 24, 541–548 (1991)
283. Hansmann, U.H.E.: Sampling protein energy landscapes the quest for efficient algorithms.
In: Kolinski, A. (ed.) Multiscale Approaches to Protein Modeling, pp. 209–230. Springer,
New York (2011)
284. Hansmann, U., Okamoto, Y.: Prediction of peptide conformation by multicanonical algo-
rithm: new approach to the multiple-minima problem. J. Comp. Chem. 14, 1333–1338 (1993)
285. Hansmann, U., Okamoto, Y.: Generalized-ensemble Monte Carlo method for systems with
rough energy landscape. Phys. Rev. E 56, 2228 (1997)
286. Hansmann, U., Okamoto, Y.: Erratum: "Finite-size scaling of helix–coil transitions in poly-
alanine studied by multicanonical simulations" [J. Chem. Phys. 110, 1267 (1999)]. J. Chem.
Phys. 111, 1339 (1999)
287. Hansmann, U., Okamoto, Y.: Finite-size scaling of helix–coil transitions in poly-alanine
studied by multicanonical simulations. J. Chem. Phys. 110, 1267 (1999)
288. Hansmann, U.H.E., Okamoto, Y.: New Monte Carlo algorithms for protein folding. Curr.
Opin. Struct. Biol. 9, 177–183 (1999)
289. Hao, M.H., Scheraga, H.A.: How optimization of potential functions affects protein folding.
Proc. Natl. Acad. Sci. U.S.A. 93, 4984–4989 (1996)
290. Harder, T., Boomsma, W., Paluszewski, M., Frellsen, J., Johansson, K.E., Hamelryck, T.:
Beyond rotamers: a generative, probabilistic model of side chains in proteins. BMC
Bioinformatics 11, 306 (2010)
291. Hartmann, C., Schütte, C.: Comment on two distinct notions of free energy. Physica D 228,
59–63 (2007)
292. Hartmann, C., Antes, I., Lengauer, T.: IRECS: a new algorithm for the selection of
most probable ensembles of side-chain conformations in protein models. Protein Sci. 16,
1294–1307 (2007)
293. Hastings, W.: Monte Carlo sampling methods using Markov chains and their applications.
Biometrika 57, 97 (1970)
294. Hatano, N., Gubernatis, J.: Evidence for the double degeneracy of the ground state in the
three-dimensional ±J spin glass. Phys. Rev. B 66, 054437 (2002)
295. Havel, T.F.: The sampling properties of some distance geometry algorithms applied to
unconstrained polypeptide chains: a study of 1830 independently computed conformations.
Biopolymers 29, 1565–1585 (1990)
296. Havel, T.F., Kuntz, I.D., Crippen, G.M.: The theory and practice of distance geometry. Bull.
Math. Biol. 45, 665–720 (1983)
297. Havranek, J.J., Harbury, P.B.: Automated design of specificity in molecular recognition. Nat.
Struct. Mol. Biol. 10, 45–52 (2003)
298. Hayes, R.J., Bentzien, J., Ary, M.L., Hwang, M.Y., Jacinto, J.M., Vielmetter, J., Kundu, A.,
Dahiyat, B.I.: Combining computational and experimental screening for rapid optimization
of protein properties. Proc. Natl. Acad. Sci. U.S.A. 99, 15926–15931 (2002)
299. Hayter, J., Penfold, J.: An analytic structure factor for macro-ion solutions. Mol. Phys. 42,
109–118 (1981)
300. Hazan, T., Shashua, A.: Convergent message-passing algorithms for inference over general
graphs with convex free energies. In: McAllester, D.A., Myllymäki, P. (eds.) Proceedings
of the 24th Conference on Uncertainty in Artificial Intelligence, pp. 264–273. AUAI Press
(2008)
301. Hendrickson, W.A.: Stereochemically restrained refinement of macromolecular structures.
Methods Enzymol. 115, 252–270 (1985)
302. Hess, B.: Convergence of sampling in protein simulations. Phys. Rev. E 65, 031910 (2002)
303. Hesselbo, B., Stinchcombe, R.: Monte Carlo simulation and global optimization without
parameters. Phys. Rev. Lett. 74, 2151–2155 (1995)
304. Higuchi, T.: Monte Carlo filter using the genetic algorithm operators. J. Stat. Comput. Sim.
59, 1–24 (1997)
305. Hinds, D.A., Levitt, M.: A lattice model for protein structure prediction at low resolution.
Proc. Natl. Acad. Sci. U.S.A. 89, 2536–2540 (1992)
306. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural
Comput. 14, 1771–1800 (2002)
307. Hinton, G., Sejnowski, T.: Optimal perceptual inference. In: IEEE Conference on Computer
Vision and Pattern Recognition, pp. 448–453. Washington, DC (1983)
308. Ho, B.K., Coutsias, E.A., Seok, C., Dill, K.A.: The flexibility in the proline ring couples to
the protein backbone. Protein Sci. 14, 1011–1018 (2005)
309. Hoffmann, D., Knapp, E.W.: Polypeptide folding with off-lattice Monte Carlo dynamics: the
method. Eur. Biophys. J. 24, 387–403 (1996)
310. Hoffmann, D., Knapp, E.W.: Protein dynamics with off-lattice Monte Carlo moves. Phys.
Rev. E 53, 4221–4224 (1996)
311. Holm, L., Sander, C.: Protein structure comparison by alignment of distance matrices. J. Mol.
Biol. 233, 123–138 (1993)
312. Hong, E.J., Lippow, S.M., Tidor, B., Lozano-Perez, T.: Rotamer optimization for protein
design through MAP estimation and problem-size reduction. J. Comp. Chem. 30, 1923–1945
(2009)
313. Hooft, R., van Eijck, B., Kroon, J.: An adaptive umbrella sampling procedure in conforma-
tional analysis using molecular dynamics and its application to glycol. J. Chem. Phys. 97,
6690 (1992)
314. Hopfinger, A.J.: Conformational Properties of Macromolecules. Academic, New York (1973)
315. Horn, B.K.P.: Closed-form solution of absolute orientation using unit quaternions. J. Opt. Soc.
Am. A 4, 629–642 (1987)
316. Hu, X., Kuhlman, B.: Protein design simulations suggest that side-chain conformational
entropy is not a strong determinant of amino acid environmental preferences. Proteins 62,
739–748 (2006)
317. Hu, C., Li, X., Liang, J.: Developing optimal non-linear scoring function for protein design.
Bioinformatics 20, 3080–3098 (2004)
318. Huang, K.: Statistical Mechanics. Wiley, New York (1987)
319. Huang, S.Y., Zou, X.: An iterative knowledge-based scoring function to predict protein-ligand
interactions: I. Derivation of interaction potentials. J. Comput. Chem. 27, 1866–1875 (2006)
320. Huang, S.Y., Zou, X.: An iterative knowledge-based scoring function to predict protein-ligand
interactions: II. Validation of the scoring function. J. Comput. Chem. 27, 1876–1882 (2006)
321. Huelsenbeck, J.P., Crandall, K.A.: Phylogeny estimation and hypothesis testing using maxi-
mum likelihood. Annu. Rev. Ecol. Systemat. 28, 437–466 (1997)
322. Hukushima, K.: Domain-wall free energy of spin-glass models: numerical method and
boundary conditions. Phys. Rev. E 60, 3606 (1999)
323. Hukushima, K., Nemoto, K.: Exchange Monte Carlo method and application to spin glass
simulations. J. Phys. Soc. Jap. 65, 1604–1608 (1996)
324. Hukushima, K., Takayama, H., Yoshino, H.: Exchange Monte Carlo dynamics in the SK
model. J. Phys. Soc. Jap. 67, 12–15 (1998)
325. Humphris, E.L., Kortemme, T.: Design of multi-specificity in protein interfaces. PLoS
Comput. Biol. 3, e164 (2007)
326. Hunter, L., States, D.J.: Bayesian classification of protein structure. IEEE Expert: Intel. Sys.
App. 7, 67–75 (1992)
327. Hutchinson, E., Thornton, J.: PROMOTIF–A program to identify and analyze structural
motifs in proteins. Protein Sci. 5, 212–220 (1996)
328. Iba, Y.: Extended ensemble Monte Carlo. Int. J. Mod. Phys. C12, 623–656 (2001)
329. Irbäck, A., Potthast, F.: Studies of an off-lattice model for protein folding: Sequence
dependence and improved sampling at finite temperature. J. Chem. Phys. 103, 10298 (1995)
330. Jack, A., Levitt, M.: Refinement of large structures by simultaneous minimization of energy
and R factor. Acta Crystallogr. A 34, 931–935 (1978)
331. Jain, T., de Pablo, J.: A biased Monte Carlo technique for calculation of the density of states
of polymer films. J. Chem. Phys. 116, 7238 (2002)
332. Janke, W., Sugita, Y., Mitsutake, A., Okamoto, Y.: Generalized-ensemble algorithms for
protein folding simulations. In: Rugged Free Energy Landscapes. Lecture Notes in Physics,
vol. 736, pp. 369–407. Springer, Berlin/Heidelberg (2008)
333. Jardine, N.: The observational and theoretical components of homology: a study based on the
morphology of the derma-roofs of rhipidistan fishes. Biol. J. Linn. Soc. 1, 327–361 (1969)
334. Jaynes, E.: How does the brain do plausible reasoning? Technical report, Microwave
Laboratory and Department of Physics, Stanford University, California, USA (1957)
335. Jaynes, E.: Information Theory and Statistical Mechanics. II. Phys. Rev. 108, 171–190 (1957)
336. Jaynes, E.T.: Information theory and statistical mechanics. Phys. Rev. 106, 620–630
(1957)
337. Jaynes, E.: Prior probabilities. IEEE Trans. Syst. Sci. Cybernetics 4, 227–241 (1968)
338. Jaynes, E.T.: Confidence intervals vs Bayesian intervals (with discussion). In: Harper,
W.L., Hooker, C.A. (eds.) Foundations of Probability Theory, Statistical Inference, and
Statistical Theories of Science, pp. 175–257. D. Reidel, Dordrecht/Holland (1976)
339. Jaynes, E.: Where do we stand on maximum entropy? In: Levine, R.D, Tribus, M. (eds.) The
Maximum Entropy Formalism Conference, pp. 15–118. MIT, Cambridge (1978)
340. Jaynes, E.: In: Rosenkrantz, R. (ed.) Papers on Probability, Statistics and Statistical Physics.
Kluwer, Dordrecht (1983)
341. Jaynes, E.: Prior information and ambiguity in inverse problems. In: SIAM-AMS Proceed-
ings, vol. 14, pp. 151–166. American Mathematical Society (1984)
342. Jaynes, E.: Probability Theory: The Logic of Science. Cambridge University Press,
Cambridge (2003)
343. Jeffreys, H.: Theory of Probability. Oxford Classics Series. Oxford University Press, Oxford
(1998)
344. Jensen, J.L.W.V.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta
Mathematica 30, 175–193 (1906)
345. Jensen, F.: Introduction to Bayesian Networks. Springer, New York (1996)
346. Jiang, L., Gao, Y., Mao, F., Liu, Z., Lai, L.: Potential of mean force for protein-protein
interaction studies. Proteins 46, 190–196 (2002)
347. Jiang, L., Althoff, E.A., Clemente, F.R., Doyle, L., Rothlisberger, D., Zanghellini, A.,
Gallaher, J.L., Betker, J.L., Tanaka, F., Barbas Carlos F. III, Hilvert, D., Houk, K.N.,
Stoddard, B.L., Baker, D.: De novo computational design of retro-aldol enzymes. Science
319, 1387–1391 (2008)
348. Jin, W., Kambara, O., Sasakawa, H., Tamura, A., Takada, S.: De novo design of foldable
proteins with smooth folding funnel: automated negative design and experimental verification.
Structure 11, 581–590 (2003)
349. Jones, D.T.: De novo protein design using pairwise potentials and a genetic algorithm. Protein
Sci. 3, 567–574 (1994)
350. Jones, D.: Successful ab initio prediction of the tertiary structure of NK-lysin using multiple
sequences and recognized supersecondary structural motifs. Proteins 1, 185–191 (1997)
351. Jones, T.A., Thirup, S.: Using known substructures in protein model building and crystallog-
raphy. EMBO J. 5, 819–822 (1986)
352. Jones, D.T., Taylor, W.R., Thornton, J.M.: A new approach to protein fold recognition. Nature
358, 86–89 (1992)
353. Jordan, M. (ed.): Learning in Graphical Models. MIT, Cambridge (1998)
354. Jorgensen, W.L., Maxwell, D.S., Tirado-Rives, J.: Development and testing of the OPLS
all-atom force field on conformational energetics and properties of organic liquids. J. Am.
Chem. Soc. 118, 11225–11236 (1996)
355. Juraszek, J., Bolhuis, P.G.: Sampling the multiple folding mechanisms of Trp-cage in explicit
solvent. Proc. Natl. Acad. Sci. U.S.A. 103, 15859–15864 (2006)
356. Kabsch, W.: A solution for the best rotation to relate two sets of vectors. Acta Crystallogr.
A 32, 922–923 (1976)
357. Kabsch, W.: A discussion of the solution for the best rotation to relate two sets of vectors.
Acta Crystallogr. A A34, 827–828 (1978)
358. Kamisetty, H., Xing, E.P., Langmead, C.J.: Free energy estimates of all-atom protein
structures using generalized belief propagation. In: Speed, T., Huang, H. (eds.) RECOMB,
pp. 366–380. Springer, Heidelberg (2007)
359. Kamisetty, H., Bailey-Kellogg, C., Langmead, C.J.: A graphical model approach for predict-
ing free energies of association for protein-protein interactions under backbone and side-chain
flexibility. Technical Report, School of Computer Science, Carnegie Mellon University,
Pittsburgh, PA (2008)
360. Karlin, S., Zhu, Z.Y.: Characterizations of diverse residue clusters in protein three-
dimensional structures. Proc. Natl. Acad. Sci. U.S.A. 93, 8344–8349 (1996)
361. Karplus, M.: Vicinal proton coupling in nuclear magnetic resonance. J. Am. Chem. Soc. 85,
2870–2871 (1963)
362. Katzgraber, H., Trebst, S., Huse, D., Troyer, M.: Feedback-optimized parallel tempering
Monte Carlo. J. Stat. Mech. 2006, P03018 (2006)
363. Kearsley, S.K.: An algorithm for the simultaneous superposition of a structural series.
J. Comput. Chem. 11, 1187–1192 (1990)
364. Kelley, L.A., MacCallum, R.M., Sternberg, M.J.: Enhanced genome annotation using struc-
tural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 501–522 (2000)
365. Kent, J.T.: The Fisher-Bingham distribution on the sphere. J. R. Stat. Soc. B 44, 71–80 (1982)
366. Kent, J.: The complex Bingham distribution and shape analysis. J. R. Statist. Soc. B 56,
285–299 (1994)
367. Kent, J.T., Hamelryck, T.: Using the Fisher-Bingham distribution in stochastic models for
protein structure. In: Barber, S., Baxter, P.D., Mardia, K.V., Walls, R.E. (eds.) Proceedings
in Quantitative Biology, Shape Analysis and Wavelets, pp. 57–60. Leeds University Press
(2005)
368. Kent, J., Mardia, K., Taylor, C.: Matching problems for unlabelled configurations. In:
Aykroyd, R., Barber, S., Mardia, K. (eds.) Bioinformatics, Images, and Wavelets, pp. 33–36.
Leeds University Press, Leeds (2004)
369. Kent, J.T., Mardia, K.V., Taylor, C.C.: Modelling strategies for bivariate circular data. In:
Barber, S., Baxter, P.D., Gusnanto, A., Mardia, K.V. (eds.) The Art and Science of Statistical
Bioinformatics, pp. 70–74. Leeds University Press, Leeds (2008)
370. Kent, J., Mardia, K., Taylor, C.: Matching Unlabelled Configurations and Protein Bioinfor-
matics. Research note, University of Leeds, Leeds (2010)
371. Kerler, W., Rehberg, P.: Simulated-tempering procedure for spin-glass simulations. Phys. Rev.
E 50, 4220–4225 (1994)
372. Kikuchi, R.: A theory of cooperative phenomena. Phys. Rev. 81, 988 (1951)
373. Kimura, K., Taki, K.: Time-homogeneous parallel annealing algorithm. In: Vichnevetsky, R.,
Miller, J. (eds.) Proceedings of the 13th IMACS World Congress on Computation and Applied
Mathematics, vol. 2, p. 827. Criterion Press, Dublin (1991)
374. Kingsford, C.L., Chazelle, B., Singh, M.: Solving and analyzing side-chain positioning
problems using linear and integer programming. Bioinformatics 21, 1028–1039 (2005)
375. Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Science 220,
671–680 (1983)
376. Kirkwood, J.G.: Statistical mechanics of fluid mixtures. J. Chem. Phys. 3, 300–313 (1935)
377. Kirkwood, J.G., Maun, E.K., Alder, B.J.: Radial distribution functions and the equation of
state of a fluid composed of rigid spherical molecules. J. Chem. Phys. 18, 1040–1047 (1950)
378. Kleinman, C.L., Rodrigue, N., Bonnard, C., Philippe, H., Lartillot, N.: A maximum likelihood
framework for protein design. BMC Bioinformatics 7, 326 (2006)
379. Kleywegt, G.: Recognition of spatial motifs in protein structures. J. Mol. Biol. 285, 1887–
1897 (1999)
380. Kloppmann, E., Ullmann, G.M., Becker, T.: An extended dead-end elimination algorithm to
determine gap-free lists of low energy states. J. Comp. Chem. 28, 2325–2335 (2007)
381. Koehl, P., Delarue, M.: Application of a self-consistent mean field theory to predict protein
side-chains conformation and estimate their conformational entropy. J. Mol. Biol. 239, 249–
275 (1994)
382. Kojetin, D.J., McLaughlin, P.D., Thompson, R.J., Dubnau, D., Prepiak, P., Rance, M.,
Cavanagh, J.: Structural and motional contributions of the Bacillus subtilis ClpC N-domain
to adaptor protein interactions. J. Mol. Biol. 387, 639–652 (2009)
383. Kolinski, A.: Protein modeling and structure prediction with a reduced representation. Acta
Biochim. Pol. 51, 349–371 (2004)
384. Koliński, A., Skolnick, J.: High coordination lattice models of protein structure, dynamics
and thermodynamics. Acta Biochim. Pol. 44, 389–422 (1997)
385. Kolinski, A., Skolnick, J.: Reduced models of proteins and their applications. Polymer 45,
511–524 (2004)
386. Kollman, P.: Free energy calculations: applications to chemical and biochemical phenomena.
Chem. Rev. 93, 2395–2417 (1993)
387. Komodakis, N., Paragios, N.: Beyond loose LP-relaxations: optimizing MRFs by repairing
cycles. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV, pp. 806–820. Springer,
Heidelberg (2008)
388. Konig, G., Boresch, S.: Non-Boltzmann sampling and Bennett’s acceptance ratio method:
how to profit from bending the rules. J. Comput. Chem. 32, 1082–1090 (2011)
389. Kono, H., Saven, J.G.: Statistical theory for protein combinatorial libraries. Packing interac-
tions, backbone flexibility, and sequence variability of main-chain structure. J. Mol. Biol. 306,
607–628 (2001)
390. Koppensteiner, W.A., Sippl, M.J.: Knowledge-based potentials–back to the roots. Biochem-
istry Mosc. 63, 247–252 (1998)
391. Körding, K., Wolpert, D.: Bayesian integration in sensorimotor learning. Nature 427, 244–247
(2004)
392. Kortemme, T., Morozov, A.V., Baker, D.: An orientation-dependent hydrogen bonding
potential improves prediction of specificity and structure for proteins and protein-protein
complexes. J. Mol. Biol. 326, 1239–1259 (2003)
393. Kraemer-Pecore, C.M., Lecomte, J.T.J., Desjarlais, J.R.: A de novo redesign of the WW
domain. Protein Sci. 12, 2194–2205 (2003)
394. Kraulis, P.J.: MOLSCRIPT: A program to produce both detailed and schematic plots of
protein structures. J. Appl. Cryst. 24, 946–950 (1991)
395. Krishna, S.S., Majumdar, I., Grishin, N.V.: Structural classification of zinc fingers: survey and
summary. Nucleic Acids Res. 31, 532–550 (2003)
396. Krissinel, E., Henrick, K.: Secondary-structure matching (SSM), a new tool for fast protein
structure alignment in three dimensions. Acta Crystallogr. D 60, 2256–2268 (2004)
397. Kristof, W., Wingersky, B.: Generalization of the orthogonal Procrustes rotation procedure
for more than two matrices. In: Proceedings of the 79th Annual Convention of the American
Psychological Association, vol. 6, pp. 89–90. American Psychological Association, Washing-
ton, DC (1971)
398. Krivov, G.G., Shapovalov, M.V., Dunbrack, R.L. Jr.: Improved prediction of protein side-
chain conformations with SCWRL4. Proteins 77, 778–795 (2009)
399. Kruglov, T.: Correlation function of the excluded volume. J. Appl. Cryst. 38, 716–720 (2005)
400. Kryshtafovych, A., Venclovas, C., Fidelis, K., Moult, J.: Progress over the first decade of
CASP experiments. Proteins 61, 225–236 (2005)
401. Kryshtafovych, A., Fidelis, K., Moult, J.: Progress from CASP6 to CASP7. Proteins 69, 194–
207 (2007)
402. Kschischang, F., Frey, B., Loeliger, H.: Factor graphs and the sum-product algorithm. IEEE
Trans. Inf. Theory 47, 498–519 (2001)
403. Kuczera, K.: One- and multidimensional conformational free energy simulations. J. Comp.
Chem. 17, 1726–1749 (1996)
404. Kudin, K., Dymarsky, A.: Eckart axis conditions and the minimization of the root-mean-
square deviation: two closely related problems. J. Chem. Phys. 122, 224105 (2005)
405. Kuhlman, B., Baker, D.: Native protein sequences are close to optimal for their structures.
Proc. Natl. Acad. Sci. U.S.A. 97, 10383–10388 (2000)
406. Kuhlman, B., Dantas, G., Ireton, G.C., Varani, G., Stoddard, B.L., Baker, D.: Design of a
novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003)
407. Kullback, S.: Information Theory and Statistics. Wiley, New York (1959)
408. Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951)
409. Kumar, S., Rosenberg, J., Bouzida, D., Swendsen, R., Kollman, P.: The weighted histogram
analysis method for free-energy calculations on biomolecules. I. The method. J. Comp. Chem.
13, 1011–1021 (1992)
410. Kuszewski, J., Gronenborn, A.M., Clore, G.M.: Improving the quality of NMR and crystal-
lographic protein structures by means of a conformational database potential derived from
structure databases. Protein Sci. 5, 1067–1080 (1996)
411. Laio, A., Parrinello, M.: Escaping free-energy minima. Proc. Natl. Acad. Sci. U.S.A. 99,
12562–12566 (2002)
412. Laio, A., Rodriguez-Fortea, A., Gervasio, F., Ceccarelli, M., Parrinello, M.: Assessing the
accuracy of metadynamics. J. Phys. Chem. B 109, 6714–6721 (2005)
413. Landau, D., Binder, K.: A Guide to Monte Carlo Simulations in Statistical Physics. Cam-
bridge University Press, Cambridge (2009)
414. Landau, L.D., Lifshitz, E.M.: Statistical Physics. Course of Theoretical Physics, vol. 5.
Pergamon Press, Oxford (1980)
415. Laplace, P.: Mémoire sur la probabilité des causes par les événements. Mémoires de
Mathématique et de Physique 6, 621–656 (1774)
416. Laplace, P.S.: Théorie analytique des probabilités. Gauthier-Villars, Paris (1820)
417. Larson, S.M., England, J.L., Desjarlais, J.R., Pande, V.S.: Thoroughly sampling sequence
space: large-scale protein design of structural ensembles. Protein Sci. 11, 2804–2813 (2002)
418. Larson, S.M., Garg, A., Desjarlais, J.R., Pande, V.S.: Increased detection of structural
templates using alignments of designed sequences. Proteins 51, 390–396 (2003)
419. Launay, G., Mendez, R., Wodak, S., Simonson, T.: Recognizing protein-protein interfaces
with empirical potentials and reduced amino acid alphabets. BMC Bioinformatics 8, 270
(2007)
420. Lauritzen, S.: Graphical Models. Oxford University Press, Oxford (1996)
421. Law, A.M., Kelton, W.D.: Simulation Modeling and Analysis. McGraw-Hill, New York
(1991)
422. Lazar, G.A., Desjarlais, J.R., Handel, T.M.: De novo design of the hydrophobic core of
ubiquitin. Protein Sci. 6, 1167–1178 (1997)
423. Lazaridis, T., Karplus, M.: Effective energy functions for protein structure prediction. Curr.
Opin. Struct. Biol. 10, 139–145 (2000)
424. Leach, A.R.: Molecular Modelling: Principles and Applications, 2nd edn. Prentice Hall,
Harlow (2001)
425. Leach, A.R., Lemon, A.P.: Exploring the conformational space of protein side chains using
dead-end elimination and the A* algorithm. Proteins 33, 227–239 (1998)
426. Leaver-Fay, A., Tyka, M., Lewis, S., Lange, O., Thompson, J., Jacak, R., Kaufman, K.,
Renfrew, P., Smith, C., Sheffler, W., et al.: ROSETTA3: an object-oriented software suite
for the simulation and design of macromolecules. Methods Enzymol. 487, 545–574 (2011)
427. Lee, J.: New Monte Carlo algorithm: entropic sampling. Phys. Rev. Lett. 71, 211–214 (1993)
428. Lee, P.M.: Bayesian Statistics: An Introduction, 3rd edn. Wiley, New York (2004)
429. Lee, Y., Nelder, J.: Hierarchical generalized linear models. J. R. Stat. Soc. B 58, 619–678
(1996)
430. Lele, S.: Euclidean distance matrix analysis (EDMA) – estimation of mean form and mean
form difference. Math. Geol. 25, 573–602 (1993)
431. Lele, S., Richtsmeier, J.T.: Statistical models in morphometrics: are they realistic? Syst. Zool.
39, 60–69 (1990)
432. Lele, S., Richtsmeier, J.T.: An Invariant Approach to Statistical Analysis of Shapes. Interdis-
ciplinary Statistics. Chapman and Hall/CRC, Boca Raton (2001)
433. Lemm, J.: Bayesian Field Theory. Johns Hopkins University Press, Baltimore (2003)
434. Lennox, K.P.: A Dirichlet process mixture of hidden Markov models for protein structure
prediction. Ann. Appl. Stat. 4, 916–942 (2010)
435. Lennox, K.P., Dahl, D.B., Vannucci, M., Tsai, J.W.: Density estimation for protein confor-
mation angles using a bivariate von Mises distribution and Bayesian nonparametrics. J. Am.
Stat. Assoc. 104, 586–596 (2009)
436. Levitt, M.: Growth of novel protein structural data. Proc. Natl. Acad. Sci. U.S.A. 104, 3183–
3188 (2007)
437. Levitt, M., Warshel, A.: Computer simulation of protein folding. Nature 253, 694–698 (1975)
438. Li, X., Liang, J.: Knowledge-based energy functions for computational studies of proteins. In:
Xu, Y., Xu, D., Liang, J. (eds.) Computational Methods for Protein Structure Prediction and
Modeling. Biological and Medical Physics, Biomedical Engineering, vol. 1, p. 71. Springer,
New York (2007)
439. Li, H., Tang, C., Wingreen, N.S.: Nature of driving force for protein folding: a result from
analyzing the statistical potential. Phys. Rev. Lett. 79, 765–768 (1997)
440. Lifshitz, E.M., Landau, L.D.: Statistical physics. In: Course of Theoretical Physics, vol. 5,
3rd edn. Butterworth-Heinemann, Oxford (1980)
441. Lifson, S., Levitt, M.: On obtaining energy parameters from crystal structure data. Comput.
Chem. 3, 49–50 (1979)
442. Lilien, H., Stevens, W., Anderson, C., Donald, R.: A novel ensemble-based scoring and
search algorithm for protein redesign and its application to modify the substrate specificity of
the gramicidin synthetase a phenylalanine adenylation enzyme. J. Comp. Biol. 12, 740–761
(2005)
443. Lima, A., de Oliveira, P., Penna, T.: A comparison between broad histogram and multicanon-
ical methods. J. Stat. Phys. 99, 691–705 (2000)
444. Lindley, D.: Bayesian Statistics, A Review. Society for Industrial and Applied Mathematics,
Philadelphia (1972)
445. Lindorff-Larsen, K., Best, R.B., Depristo, M.A., Dobson, C.M., Vendruscolo, M.: Simultane-
ous determination of protein structure and dynamics. Nature 433, 128–132 (2005)
446. Linge, J.P., Nilges, M.: Influence of non-bonded parameters on the quality of NMR structures:
a new force-field for NMR structure calculation. J. Biomol. NMR 13, 51–59 (1999)
447. Liu, Z., Macias, M.J., Bottomley, M.J., Stier, G., Linge, J.P., Nilges, M., Bork, P., Sattler, M.:
The three-dimensional structure of the HRDC domain and implications for the Werner and
Bloom syndrome proteins. Fold. Des. 7, 1557–1566 (1999)
448. Looger, L.L., Hellinga, H.W.: Generalized dead-end elimination algorithms make large-scale
protein side-chain structure prediction tractable: implications for protein design and structural
genomics. J. Mol. Biol. 307, 429–445 (2001)
449. Looger, L.L., Dwyer, M.A., Smith, J.J., Hellinga, H.W.: Computational design of receptor
and sensor proteins with novel functions. Nature 423, 185–190 (2003)
450. Lou, F., Clote, P.: Thermodynamics of RNA structures by Wang–Landau sampling. Bioinfor-
matics 26, i278–i286 (2010)
451. Lovell, S.C., Word, J.M., Richardson, J.S., Richardson, D.C.: The penultimate rotamer library.
Proteins 40, 389–408 (2000)
452. Lu, H., Skolnick, J.: A distance-dependent atomic knowledge-based potential for improved
protein structure selection. Proteins 44, 223–232 (2001)
453. Lu, M., Dousis, A.D., Ma, J.: OPUS-PSP: an orientation-dependent statistical all-atom
potential derived from side-chain packing. J. Mol. Biol. 376, 288–301 (2008)
454. Lu, M., Dousis, A.D., Ma, J.: OPUS-Rota: a fast and accurate method for side-chain
modeling. Protein Sci. 17, 1576–1585 (2008)
455. Lyubartsev, A., Martsinovski, A., Shevkunov, S., Vorontsov-Velyaminov, P.: New approach to
Monte Carlo calculation of the free energy: method of expanded ensembles. J. Chem. Phys.
96, 1776 (1992)
456. MacKay, D.: Maximum Entropy and Bayesian Methods: Proceedings of the Eleventh
International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis,
Seattle, 1991, pp. 39–66. Kluwer, Dordrecht (1992)
457. MacKay, D.: Information Theory, Inference, and Learning Algorithms. Cambridge University
Press, Cambridge (2003)
458. MacKerell, A.D. Jr., Brooks, C. III, Nilsson, L., Roux, B., Won, Y., Karplus, M.: CHARMM:
the energy function and its parameterization with an overview of the program. In: Schleyer,
P.v.R., et al. (eds.) The Encyclopedia of Computational Chemistry, vol. 1, pp. 271–277. Wiley,
Chichester (1998)
459. Mackerell, A.D.: Empirical force fields for biological macromolecules: overview and issues.
J. Comput. Chem. 25, 1584–1604 (2004)
460. Macura, S., Ernst, R.R.: Elucidation of cross relaxation in liquids by two-dimensional NMR
spectroscopy. Mol. Phys. 41, 95–117 (1980)
461. Maddox, J.: What Remains to be Discovered. Macmillan, London (1998)
462. Madras, N., Sokal, A.: The pivot algorithm: a highly efficient Monte Carlo method for the
self-avoiding walk. J. Stat. Phys. 50, 109–186 (1988)
463. Maiorov, V.N., Crippen, G.M.: Contact potential that recognizes the correct folding of
globular proteins. J. Mol. Biol. 227, 876–888 (1992)
464. Májek, P., Elber, R.: A coarse-grained potential for fold recognition and molecular dynamics
simulations of proteins. Proteins 76, 822–836 (2009)
465. Malioutov, D.M., Johnson, J.K., Willsky, A.S.: Walk-sums and belief propagation in Gaussian
graphical models. J. Mach. Learn. Res. 7, 2031–2064 (2006)
466. Mao, Y., Senic-Matuglia, F., Di Fiore, P.P., Polo, S., Hodsdon, M.E., De Camilli, P.:
Deubiquitinating function of ataxin-3: insights from the solution structure of the Josephin
domain. Proc. Natl. Acad. Sci. U.S.A. 102, 12700–12705 (2005)
467. Mardia, K.V.: Characterization of directional distributions. In: Patil, G.P., Kotz, S., Ord, J.K.
(eds.) Statistical Distributions in Scientific Work, pp. 365–386. D. Reidel Publishing Com-
pany, Dordrecht/Holland (1975)
468. Mardia, K.V.: Statistics of directional data (with discussion). J. R. Statist. Soc. B. 37, 349–393
(1975)
469. Mardia, K.: Discussion to Schmidler. In: Bernardo, J., Bayarri, M., Berger, J., Dawid, A.,
Heckerman, D., Smith, A., West, M. (eds.) Bayesian Statistics 8, pp. 471–490. Oxford
University Press, Oxford (2007)
470. Mardia, K.: On some recent advancements in applied shape analysis and directional statistics.
In: Barber, S., Baxter, P., Mardia, K. (eds.) Systems Biology & Statistical Bioinformatics,
pp. 9–17. Leeds University Press, Leeds (2007)
471. Mardia, K.V.: Statistical complexity in protein bioinformatics. In: Gusnanto, A., Mardia, K.V.,
Fallaize, C.J. (eds.) Statistical Tools for Challenges in Bioinformatics, pp. 9–20. Leeds
University Press, Leeds (2009)
472. Mardia, K.V.: Bayesian analysis for bivariate von Mises distributions. J. Appl. Stat. 37, 515–
528 (2010)
473. Mardia, K.V., El-Atoum, S.A.M.: Bayesian inference for the von Mises-Fisher distribution.
Biometrika 63, 203–206 (1976)
474. Mardia, K.V., Jupp, P.: Directional Statistics, 2nd edn. Wiley, Chichester (2000)
475. Mardia, K.V., Patrangenaru, V.: Directions and projective shapes. Ann. Stat. 33, 1666–1699
(2005)
476. Mardia, K.V., Hughes, G., Taylor, C.C.: Efficiency of the pseudolikelihood for multivariate
normal and von Mises distributions. Technical report, Department of Statistics, University of
Leeds (2007)
477. Mardia, K., Nyirongo, V., Green, P., Gold, N., Westhead, D.: Bayesian refinement of protein
functional site matching. BMC Bioinformatics 8, 257 (2007)
478. Mardia, K.V., Taylor, C.C., Subramaniam, G.K.: Protein bioinformatics and mixtures of
bivariate von Mises distributions for angular data. Biometrics 63, 505–512 (2007)
479. Mardia, K.V., Hughes, G., Taylor, C.C., Singh, H.: A multivariate von Mises distribution with
applications to bioinformatics. Can. J. Stat. 36, 99–109 (2008)
480. Mardia, K.V., Kent, J.T., Hughes, G., Taylor, C.C.: Maximum likelihood estimation using
composite likelihoods for closed exponential families. Biometrika 96, 975–982 (2009)
481. Mardia, K.V., Fallaize, C.J., Barber, S., Jackson, R.M., Theobald, D.L.: Bayesian alignment
of similarity shapes and halfnormal-gamma distributions. Ann. Appl. Stat. Submitted (2011)
482. Marin, J., Nieto, C.: Spatial matching of multiple configurations of points with a bioinformat-
ics application. Commun. Stat. Theor. Meth. 37, 1977–1995 (2008)
483. Marinari, E.: Optimized Monte Carlo methods. In: Kertész, J., Kondor, I. (eds.) Advances in
Computer Simulation. Lecture Notes in Physics, vol. 501, pp. 50–81. Springer, Berlin (1998)
484. Marinari, E., Parisi, G.: Simulated tempering: a new Monte Carlo scheme. Europhys. Lett.
19, 451–458 (1992)
485. Marinari, E., Parisi, G., Ricci-Tersenghi, F., Zuliani, F.: The use of optimized Monte Carlo
methods for studying spin glasses. J. Phys. A 34, 383 (2001)
486. Marjoram, P., Molitor, J., Plagnol, V., Tavaré, S.: Markov chain Monte Carlo without
likelihoods. Proc. Natl. Acad. Sci. U.S.A. 100, 15324–15328 (2003)
487. Markley, J.L., Bax, A., Arata, Y., Hilbers, C.W., Kaptein, R., Sykes, B.D., Wright, P.E.,
Wüthrich, K.: Recommendations for the presentation of NMR structures of proteins and
nucleic acids. J. Mol. Biol. 280, 933–952 (1998)
488. Martinez, J., Pisabarro, M., Serrano, L.: Obligatory steps in protein folding and the confor-
mational diversity of the transition state. Nat. Struct. Mol. Biol. 5, 721–729 (1998)
489. Martoňák, R., Laio, A., Parrinello, M.: Predicting crystal structures: the Parrinello-Rahman
method revisited. Phys. Rev. Lett. 90, 75503 (2003)
490. May, R., Nowotny, V.: Distance information derived from neutron low-Q scattering. J. Appl.
Cryst. 22, 231–237 (1989)
491. Mayewski, S.: A multibody, whole-residue potential for protein structures, with testing by
Monte Carlo simulated annealing. Proteins 59, 152–169 (2005)
492. McCammon, J.A., Harvey, S.C.: Dynamics of Proteins and Nucleic Acids. Cambridge
University Press, Cambridge (1987)
493. McDonald, I.K., Thornton, J.M.: Satisfying hydrogen bonding potential in proteins. J. Mol.
Biol. 238, 777–793 (1994)
494. McGrayne, S.: The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma
Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of
Controversy. Yale University Press, New Haven (2011)
495. McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions. Wiley Series in Probabil-
ity and Statistics. Applied Probability and Statistics. Wiley, New York (1997)
496. McQuarrie, D.: Statistical Mechanics. University Science Books, Sausalito (2000)
497. Mechelke, M., Habeck, M.: Robust probabilistic superposition and comparison of protein
structures. BMC Bioinformatics 11, 363 (2010)
498. Melo, F., Feytmans, E.: Novel knowledge-based mean force potential at atomic level. J. Mol.
Biol. 267, 207–222 (1997)
499. Meltzer, T., Globerson, A., Weiss, Y.: Convergent message passing algorithms – a unifying
view. In: Proceedings of the Twenty-fifth Conference on Uncertainty in AI (UAI), Montreal,
Canada (2009)
500. Mendes, J., Baptista, A.M., Carrondo, M.A., Soares, C.M.: Improved modeling of side-chains
in proteins with rotamer-based methods: a flexible rotamer model. Proteins 37, 530–543
(1999)
501. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of
state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092 (1953)
502. Metzler, W.J., Hare, D.R., Pardi, A.: Limited sampling of conformational space by the
distance geometry algorithm: implications for structures generated from NMR data. Biochem-
istry 28, 7045–7052 (1989)
503. Mezard, M., Parisi, G., Virasoro, M.: Spin Glasses and Beyond. World Scientific, New York
(1988)
504. Mezei, M.: Adaptive umbrella sampling: self-consistent determination of the non-Boltzmann
bias. J. Comp. Phys. 68, 237–248 (1987)
505. Mezei, M.: Efficient Monte Carlo sampling for long molecular chains using local moves,
tested on a solvated lipid bilayer. J. Chem. Phys. 118, 3874 (2003)
506. Micheletti, C., Seno, F., Banavar, J.R., Maritan, A.: Learning effective amino acid interactions
through iterative stochastic techniques. Proteins 42, 422–431 (2001)
507. Micheletti, C., Laio, A., Parrinello, M.: Reconstructing the density of states by history-
dependent metadynamics. Phys. Rev. Lett. 92, 170601 (2004)
508. Miller, R.: The jackknife – a review. Biometrika 61, 1–15 (1974)
509. Minka, T.: Pathologies of orthodox statistics. Technical report, Microsoft Research,
Cambridge, UK (2001)
510. Minka, T.: Estimating a Dirichlet distribution. Technical report, Microsoft Research,
Cambridge (2003)
511. Mintseris, J., Wiehe, K., Pierce, B., Anderson, R., Chen, R., Janin, J., Weng, Z.: Protein-
protein docking benchmark 2.0: an update. Proteins 60, 214 (2005)
512. Mirny, L.A., Shakhnovich, E.I.: How to derive a protein folding potential? A new approach
to an old problem. J. Mol. Biol. 264, 1164–1179 (1996)
513. Mitsutake, A., Sugita, Y., Okamoto, Y.: Generalized-ensemble algorithms for molecular
simulations of biopolymers. Biopolymers 60, 96–123 (2001)
514. Mitsutake, A., Sugita, Y., Okamoto, Y.: Replica-exchange multicanonical and multicanonical
replica-exchange Monte Carlo simulations of peptides. I. formulation and benchmark test.
J. Chem. Phys. 118, 6664 (2003)
515. Mitsutake, A., Sugita, Y., Okamoto, Y.: Replica-exchange multicanonical and multicanonical
replica-exchange Monte Carlo simulations of peptides. II. application to a more complex
system. J. Chem. Phys. 118, 6676 (2003)
364 References
516. Miyazawa, S., Jernigan, R.: Estimation of effective interresidue contact energies from protein
crystal structures: quasi-chemical approximation. Macromolecules 18, 534–552 (1985)
517. Miyazawa, S., Jernigan, R.L.: Residue-residue potentials with a favorable contact pair term
and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256,
623–644 (1996)
518. Miyazawa, S., Jernigan, R.L.: How effective for fold recognition is a potential of mean force
that includes relative orientations between contacting residues in proteins? J. Chem. Phys.
122, 024901 (2005)
519. Moont, G., Gabb, H.A., Sternberg, M.J.: Use of pair potentials across protein interfaces in
screening predicted docked complexes. Proteins 35, 364–373 (1999)
520. Moore, P.: Small-angle scattering. Information content and error analysis. J. Appl. Cryst. 13,
168–175 (1980)
521. Moore, G.L., Maranas, C.D.: Identifying residue-residue clashes in protein hybrids by using
a second-order mean-field approach. Proc. Natl. Acad. Sci. U.S.A. 100, 5091–5096 (2003)
522. Mourik, J.v., Clementi, C., Maritan, A., Seno, F., Banavar, J.R.: Determination of interaction
potentials of amino acids from native protein structures: tests on simple lattice models.
J. Chem. Phys. 110, 10123 (1999)
523. Mouritsen, O.G.: Computer Studies of Phase Transitions and Critical Phenomena. Springer,
New York (1984)
524. Müller, K., Glatter, O.: Practical aspects to the use of indirect Fourier transformation methods.
Makromol. Chem. 183, 465–479 (1982)
525. Müller, J., Hansen, S., Pürschel, H.: The use of small-angle scattering and the maximum-
entropy method for shape-model determination from distance-distribution functions. J. Appl.
Cryst. 29, 547–554 (1996)
526. Müller, J., Schmidt, P., Damaschun, G., Walter, G.: Determination of the largest particle
dimension by direct Fourier cosine transformation of experimental small-angle X-ray scat-
tering data. J. Appl. Cryst. 13, 280–283 (1980)
527. Mullinax, J.W., Noid, W.G.: Extended ensemble approach for deriving transferable coarse-
grained potentials. J. Chem. Phys. 131, 104110 (2009)
528. Mullinax, J.W., Noid, W.G.: Recovering physical potentials from a model protein databank.
Proc. Natl. Acad. Sci. U.S.A. 107, 19867–19872 (2010)
529. Muñoz, V., Eaton, W.A.: A simple model for calculating the kinetics of protein folding from
three-dimensional structures. Proc. Natl. Acad. Sci. U.S.A. 96, 11311–11316 (1999)
530. Murray, I.: Advances in Markov chain Monte Carlo methods. Ph.D. thesis, Gatsby Computa-
tional Neuroscience Unit, University College London (2007)
531. Murray, L.J.W., Arendall, W.B., Richardson, D.C., Richardson, J.S.: RNA backbone is
rotameric. Proc. Natl. Acad. Sci. U.S.A. 100, 13904–13909 (2003)
532. Murshudov, G.N., Vagin, A.A., Dodson, E.J.: Refinement of macromolecular structures by
the maximum-likelihood method. Acta Crystallogr. D 53, 240–255 (1997)
533. Myers, J.K., Pace, C.N.: Hydrogen bonding stabilizes globular proteins. Biophys. J. 71, 2033–
2039 (1996)
534. Nadler, W., Hansmann, U.: Generalized ensemble and tempering simulations: a unified view.
Phys. Rev. E 75, 026109 (2007)
535. Neal, R.: Probabilistic inference using Markov chain Monte Carlo methods. Technical Report
CRG-TR-93-1, Department of Computer Science, University of Toronto (1993)
536. Neal, R.M.: Bayesian Learning for Neural Networks. Lecture Notes in Statistics, 1st edn.
Springer, New York (1996)
537. Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse, and
other variants. In: Jordan, M.I. (ed.) Learning in Graphical Models, pp. 355–368. MIT press,
Cambridge (1998)
538. Neapolitan, R.: Learning Bayesian Networks. Prentice Hall, Harlow (2003)
539. Neidigh, J.W., Fesinmeyer, R.M., Andersen, N.H.: Designing a 20-residue protein. Nat.
Struct. Biol. 9, 425–430 (2002)
540. Nelder, J.A., Mead, R.: A simplex method for function minimization. Comput. J. 7, 308–313
(1965)
541. Ngan, S.C., Hung, L.H., Liu, T., Samudrala, R.: Scoring functions for de novo protein structure
prediction revisited. Methods Mol. Biol. 413, 243–281 (2008)
542. Nicastro, G., Menon, R.P., Masino, L., Knowles, P.P., McDonald, N.Q., Pastore, A.: The
solution structure of the Josephin domain of ataxin-3: structural determinants for molecular
recognition. Proc. Natl. Acad. Sci. U.S.A. 102, 10493–10498 (2005)
543. Nicastro, G., Habeck, M., Masino, L., Svergun, D.I., Pastore, A.: Structure validation of the
Josephin domain of ataxin-3: conclusive evidence for an open conformation. J. Biomol. NMR
36, 267–277 (2006)
544. Nielsen, S.: The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6,
457–489 (2000)
545. Nightingale, M., Blöte, H.: Monte Carlo calculation of free energy, critical point, and surface
critical behavior of three-dimensional Heisenberg ferromagnets. Phys. Rev. Lett. 60, 1562–
1565 (1988)
546. Nilges, M., Clore, G.M., Gronenborn, A.M.: Determination of three-dimensional structures
of proteins from interproton distance data by dynamical simulated annealing from a random
array of atoms: avoiding problems associated with folding. FEBS Lett. 239, 129–136 (1988)
547. Nilges, M., Bernard, A., Bardiaux, B., Malliavin, T., Habeck, M., Rieping, W.: Accurate
NMR structures through minimization of an extended hybrid energy. Structure 16, 1305–
1312 (2008)
548. Novotny, M.: A new approach to an old algorithm for the simulation of Ising-like systems.
Comput. Phys. 9, 46–52 (1995)
549. Nyirongo, V.: Statistical approaches to protein matching in bioinformatics. Ph.D thesis, Leeds
University, Department of Statistics (2006)
550. Nymeyer, H., García, A., Onuchic, J.: Folding funnels and frustration in off-lattice minimalist
protein landscapes. Proc. Natl. Acad. Sci. U.S.A. 95, 5921 (1998)
551. Oliveira, P.: Broad histogram: tests for a simple and efficient microcanonical simulator. Brazil.
J. Phys. 30, 766–771 (2000)
552. Olsson, S., Boomsma, W., Frellsen, J., Bottaro, S., Harder, T., Ferkinghoff-Borg, J.,
Hamelryck, T.: Generative probabilistic models extend the scope of inferential structure
determination. J. Magn. Reson. 213, 182–186 (2011)
553. Orland, H.: Monte Carlo Approach to Biopolymers and Protein Folding, p. 90. World
Scientific, Singapore (1998)
554. Ornstein, L., Zernike, F.: Integral equation in liquid state theory. In: Proceedings of the Section
of Sciences Koninklijke Nederlandse Akademie Van Wetenschappen, vol. 17, pp. 793–806.
North-Holland, Amsterdam (1914)
555. Ortiz, A.R., Kolinski, A., Skolnick, J.: Nativelike topology assembly of small proteins using
predicted restraints in Monte Carlo folding simulations. Proc. Natl. Acad. Sci. U.S.A. 95,
1020–1025 (1998)
556. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data. Eng. 22, 1345–
1359 (2010)
557. Pande, V., Grosberg, A., Tanaka, T.: Heteropolymer freezing and design: towards physical
models of protein folding. Rev. Mod. Phys. 72, 259 (2000)
558. Pappu, R.V., Srinivasan, R., Rose, G.D.: The Flory isolated-pair hypothesis is not valid for
polypeptide chains: implications for protein folding. Proc. Natl. Acad. Sci. U.S.A. 97, 12565–
12570 (2000)
559. Park, B.H., Levitt, M.: The complexity and accuracy of discrete state models of protein
structure. J. Mol. Biol. 249, 493–507 (1995)
560. Park, S., Kono, H., Wang, W., Boder, E.T., Saven, J.G.: Progress in the development and
application of computational methods for probabilistic protein design. Comput. Chem. Eng.
29, 407–421 (2005)
561. Parsons, D., Williams, D.: Globule transitions of a single homopolymer: a Wang-Landau
Monte Carlo study. Phys. Rev. E 74, 041804 (2006)
562. Pártay, L., Bartók, A., Csányi, G.: Efficient sampling of atomic configurational spaces.
J. Phys. Chem. B 114, 10502–10512 (2010)
563. Paschek, D., Nymeyer, H., García, A.E.: Replica exchange simulation of reversible folding/
unfolding of the Trp-cage miniprotein in explicit solvent: on the structure and possible role of
internal water. J. Struct. Biol. 157, 524–533 (2007)
564. Paschek, D., Hempel, S., García, A.E.: Computing the stability diagram of the Trp-cage
miniprotein. Proc. Natl. Acad. Sci. U.S.A. 105, 17754–17759 (2008)
565. Pauling, L., Corey, R.B.: The pleated sheet, a new layer configuration of polypeptide chains.
Proc. Natl. Acad. Sci. U.S.A. 37, 251–256 (1951)
566. Pauling, L., Corey, R.B., Branson, H.R.: The structure of proteins; two hydrogen-bonded
helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. U.S.A. 37, 205–211
(1951)
567. Pawitan, Y.: In All Likelihood: Statistical Modeling and Inference Using Likelihood. Oxford
Science Publications. Clarendon Press, Oxford (2001)
568. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufman Publishers, San Mateo (1988)
569. Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press,
Cambridge (2000)
570. Pedersen, J., Posselt, D., Mortensen, K.: Analytical treatment of the resolution function for
small-angle scattering. J. Appl. Cryst. 23, 321–333 (1990)
571. Peng, J., Hazan, T., McAllester, D., Urtasun, R.: Convex max-product algorithms for
continuous MRFs with applications to protein folding. In: International Conference of
Machine Learning (ICML). Bellevue, Washington (2011)
572. Percus, J., Yevick, G.: Analysis of classical statistical mechanics by means of collective
coordinates. Phys. Rev. 110, 1–13 (1958)
573. Peskun, P.: Optimum Monte-Carlo sampling using Markov chains. Biometrika 60, 607–612
(1973)
574. Peterson, R.W., Dutton, P.L., Wand, A.J.: Improved side-chain prediction accuracy using an
ab initio potential energy function and a very large rotamer library. Protein Sci. 13, 735–751
(2004)
575. Pettersen, E.F., Goddard, T.D., Huang, C.C., Couch, G.S., Greenblatt, D.M., Meng, E.C.,
Ferrin, T.E.: UCSF Chimera – A visualization system for exploratory research and analysis.
J. Comput. Chem. 25, 1605–1612 (2004)
576. Piana, S., Laio, A.: A bias-exchange approach to protein folding. J. Phys. Chem. B 111,
4553–4559 (2007)
577. Pierce, N.A., Winfree, E.: Protein design is NP-hard. Protein Eng. 15, 779–782 (2002)
578. Pierce, N.A., Spriet, J.A., Desmet, J., Mayo, S.L.: Conformational splitting: a more powerful
criterion for dead-end elimination. J. Comp. Chem. 21, 999–1009 (2000)
579. Pietrucci, F., Marinelli, F., Carloni, P., Laio, A.: Substrate binding mechanism of HIV-1
protease from explicit-solvent atomistic simulations. J. Am. Chem. Soc. 131, 11811–11818
(2009)
580. Pitera, J.W., Swope, W.: Understanding folding and design: replica-exchange simulations of
Trp-cage miniproteins. Proc. Natl. Acad. Sci. U.S.A. 100, 7587–7592 (2003)
581. Plaxco, K., Simons, K., Baker, D.: Contact order, transition state placement and the refolding
rates of single domain proteins. J. Mol. Biol. 277, 985–994 (1998)
582. Podtelezhnikov, A.A., Wild, D.L.: Comment on “Efficient Monte Carlo trial moves for
polypeptide simulations” [J. Chem. Phys. 123, 174905 (2005)]. J. Chem. Phys. 129, 027103
(2008)
583. Podtelezhnikov, A.A., Wild, D.L.: CRANKITE: a fast polypeptide backbone conformation
sampler. Source Code Biol. Med. 3, 12 (2008)
584. Podtelezhnikov, A.A., Wild, D.L.: Reconstruction and stability of secondary structure
elements in the context of protein structure prediction. Biophys. J. 96, 4399–4408 (2009)
585. Podtelezhnikov, A.A., Wild, D.L.: Exhaustive Metropolis Monte Carlo sampling and analysis
of polyalanine conformations adopted under the influence of hydrogen bonds. Proteins 61,
94–104 (2005)
586. Podtelezhnikov, A.A., Ghahramani, Z., Wild, D.L.: Learning about protein hydrogen bonding
by minimizing contrastive divergence. Proteins 66, 588–599 (2007)
587. Pohl, F.M.: Empirical protein energy maps. Nat. New Biol. 234, 277–279 (1971)
588. Ponder, J.W., Richards, F.M.: Tertiary templates for proteins. Use of packing criteria in the
enumeration of allowed sequences for different structural classes. J. Mol. Biol. 193, 775–791
(1987)
589. Poole, A.M., Ranganathan, R.: Knowledge-based potentials in protein design. Curr. Opin.
Struct. Biol. 16, 508–513 (2006)
590. Poulain, P., Calvo, F., Antoine, R., Broyer, M., Dugourd, P.: Performances of Wang-Landau
algorithms for continuous systems. Phys. Rev. E 73, 056704 (2006)
591. Pritchard, J.K., Seielstad, M.T., Perez-Lezaun, A., Feldman, M.W.: Population growth of
human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16,
1791–1798 (1999)
592. Privalov, P.L.: Stability of proteins: small globular proteins. Adv. Protein Chem. 33, 167–241
(1979)
593. Privalov, P.L., Potekhin, S.A.: Scanning microcalorimetry in studying temperature-induced
changes in proteins. Methods Enzymol. 131, 4–51 (1986)
594. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech
recognition. Proc. IEEE 77, 257–286 (1989)
595. Ramachandran, G.N., Ramakrishnan, C., Sasisekharan, V.: Stereochemistry of polypeptide
chain configurations. J. Mol. Biol. 7, 95–99 (1963)
596. Rampf, F., Paul, W., Binder, K.: On the first-order collapse transition of a three-dimensional,
flexible homopolymer chain model. Europhys. Lett. 70, 628 (2005)
597. Rashin, A.A., Iofin, M., Honig, B.: Internal cavities and buried waters in globular proteins.
Biochemistry 25, 3619–3625 (1986)
598. Rasmussen, C., Ghahramani, Z.: Bayesian Monte Carlo. NIPS 15, 505–512 (2003)
599. Rathore, N., Knotts, T. IV, de Pablo, J.: Density of states simulations of proteins. J. Chem.
Phys. 118, 4285 (2003)
600. Reith, D., Pütz, M., Müller-Plathe, F.: Deriving effective mesoscale potentials from atomistic
simulations. J. Comput. Chem. 24, 1624–1636 (2003)
601. Rieping, W., Habeck, M., Nilges, M.: Inferential structure determination. Science 309,
303–306 (2005)
602. Rieping, W., Habeck, M., Nilges, M.: Modeling errors in NOE data with a lognormal
distribution improves the quality of NMR structures. J. Am. Chem. Soc. 127, 16026–16027
(2005)
603. Rieping, W., Nilges, M., Habeck, M.: ISD: a software package for Bayesian NMR structure
calculation. Bioinformatics 24, 1104–1105 (2008)
604. Ritchie, D.: Recent progress and future directions in protein-protein docking. Curr. Protein
Pept. Sci. 9, 1–15 (2008)
605. Rivest, L.: A distribution for dependent unit vectors. Comm. Stat. Theor. Meth. 17, 461–483
(1988)
606. Robert, C.: The Bayesian Choice: From Decision-Theoretic Foundations to Computational
Implementation. Springer, New York (2007)
607. Rodriguez, A., Schmidler, S.: Bayesian protein structure alignment. Submitted (2010)
608. Rohl, C.A., Strauss, C.E., Misura, K.M., Baker, D.: Protein structure prediction using Rosetta.
Methods Enzymol. 383, 66–93 (2004)
609. Rojnuckarin, A., Subramaniam, S.: Knowledge-based interaction potentials for proteins.
Proteins 36, 54–67 (1999)
610. Rooman, M., Rodriguez, J., Wodak, S.: Automatic definition of recurrent local structure
motifs in proteins. J. Mol. Biol. 213, 327–336 (1990)
611. Rosen, J.B., Phillips, A.T., Oh, S.Y., Dill, K.A.: A method for parameter optimization in
computational biology. Biophys. J. 79, 2818–2824 (2000)
612. Rosenberg, M., Goldblum, A.: Computational protein design: a novel path to future protein
drugs. Curr. Pharm. Des. 12, 3973–3997 (2006)
613. Rosenblatt, M.: Remarks on a multivariate transformation. Ann. Math. Stat. 23, 470–472
(1952)
614. Roth, S., Black, M.J.: On the spatial statistics of optical flow. Int. J. Comput. Vis. 74, 33–50
(2007)
615. Rothlisberger, D., Khersonsky, O., Wollacott, A.M., Jiang, L., DeChancie, J., Betker, J.,
Gallaher, J.L., Althoff, E.A., Zanghellini, A., Dym, O., Albeck, S., Houk, K.N., Tawfik, D.S.,
Baker, D.: Kemp elimination catalysts by computational enzyme design. Nature 453, 190–195
(2008)
616. Ruffieux, Y., Green, P.: Alignment of multiple configurations using hierarchical models.
J. Comp. Graph. Stat. 18, 756–773 (2009)
617. Russ, W.P., Ranganathan, R.: Knowledge-based potential functions in protein design. Curr.
Opin. Struct. Biol. 12, 447–452 (2002)
618. Rykunov, D., Fiser, A.: Effects of amino acid composition, finite size of proteins, and sparse
statistics on distance-dependent statistical pair potentials. Proteins 67, 559–568 (2007)
619. Sali, A., Blundell, T.L.: Comparative protein modelling by satisfaction of spatial restraints.
J. Mol. Biol. 234, 779–815 (1993)
620. Sali, A., Overington, J.P., Johnson, M.S., Blundell, T.L.: From comparisons of protein
sequences and structures to protein modelling and design. Trends Biochem. Sci. 15, 235–240
(1990)
621. Samudrala, R., Moult, J.: An all-atom distance-dependent conditional probability discrimina-
tory function for protein structure prediction. J. Mol. Biol. 275, 895–916 (1998)
622. Santos, E. Jr.: On the generation of alternative explanations with implications for belief
revision. In: In Uncertainty in Artificial Intelligence, UAI-91, pp. 339–347. Morgan Kaufman
Publishers, San Mateo (1991)
623. Saraf, M.C., Moore, G.L., Goodey, N.M., Cao, V.Y., Benkovic, S.J., Maranas, C.D.: IPRO: An
iterative computational protein library redesign and optimization procedure. Biophys. J. 90,
4167–4180 (2006)
624. Saunders, C.T., Baker, D.: Recapitulation of protein family divergence using flexible back-
bone protein design. J. Mol. Biol. 346, 631–644 (2005)
625. Saupe, A., Englert, G.: High-resolution nuclear magnetic resonance spectra of orientated
molecules. Phys. Rev. Lett. 11, 462–464 (1963)
626. Savage, H.J., Elliott, C.J., Freeman, C.M., Finney, J.M.: Lost hydrogen bonds and buried
surface area: rationalising stability in globular proteins. J. Chem. Soc. Faraday Trans. 89,
2609–2617 (1993)
627. Sayle, R.A., Milner-White, E.J.: RASMOL: biomolecular graphics for all. Trends Biochem.
Sci. 20, 374 (1995)
628. Schäfer, J., Strimmer, K.: A shrinkage approach to large-scale covariance matrix estimation
and implications for functional genomics. Stat. App. Gen. Mol. Biol. 4, Art. 32 (2005)
629. Scheek, R.M., van Gunsteren, W.F., Kaptein, R.: Molecular dynamics simulations techniques
for determination of molecular structures from nuclear magnetic resonance data. Methods
Enzymol. 177, 204–218 (1989)
630. Schellman, J.A.: The stability of hydrogen-bonded peptide structures in aqueous solution.
C. R. Trav. Lab. Carlsberg 29, 230–259 (1955)
631. Scheraga, H.A., Khalili, M., Liwo, A.: Protein-folding dynamics: overview of molecular
simulation techniques. Annu. Rev. Phys. Chem. 58, 57–83 (2007)
632. Schmidler, S.: Fast Bayesian shape matching using geometric algorithms. In: Bernardo,
J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., West, M. (eds.) Bayesian
Statistics 8, pp. 471–490. Oxford University Press, Oxford (2007)
633. Schmidler, S.C., Liu, J.S., Brutlag, D.L.: Bayesian segmentation of protein secondary
structure. J. Comput. Biol. 7, 233–248 (2000)
634. Schmidt, K.E., Ceperley, D.: The Monte Carlo Method in Condensed Matter Physics. Topics in
Applied Physics, vol. 71, p. 205. Springer, Berlin (1995)
635. Schmitt, S., Kuhn, D., Klebe, G.: A new method to detect related function among proteins
independent of sequence and fold homology. J. Mol. Biol. 323, 387–406 (2002)
636. Schonemann, P.: A generalized solution of the orthogonal Procrustes problem. Psychometrika
31, 1–10 (1966)
637. Schutz, C.N., Warshel, A.: What are the dielectric “constants” of proteins and how to validate
electrostatic models? Proteins 44, 400–417 (2001)
638. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
639. Sciretti, D., Bruscolini, P., Pelizzola, A., Pretti, M., Jaramillo, A.: Computational protein
design with side-chain conformational entropy. Proteins 74, 176–191 (2009)
640. Seaton, D., Mitchell, S., Landau, D.: Developments in Wang-Landau simulations of a simple
continuous homopolymer. Brazil. J. Phys. 38, 48–53 (2008)
641. Seber, G.A.F., Wild, C.J.: Nonlinear Regression. Wiley Series in Probability and Mathemati-
cal Statistics, Probability and Mathematical Statistics. Wiley, New York (1989)
642. Selenko, P., Sprangers, R., Stier, G., Buehler, D., Fischer, U., Sattler, M.: SMN Tudor domain
structure and its interaction with the Sm proteins. Nat. Struct. Biol. 8, 27–31 (2001)
643. Shah, P.S., Hom, G.K., Ross, S.A., Lassila, J.K., Crowhurst, K.A., Mayo, S.L.: Full-sequence
computational design and solution structure of a thermostable protein variant. J. Mol. Biol.
372, 1–6 (2007)
644. Shakhnovich, E., Gutin, A.: Implications of thermodynamics of protein folding for evolution
of primary sequences. Nature 346, 773–775 (1990)
645. Shannon, C.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423,
623–656 (1948)
646. Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University Illinois
Press, Urbana (1949)
647. Shapiro, A., Botha, J.D., Pastore, A., Lesk, A.M.: A method for multiple superposition of
structures. Acta Crystallogr. A 48, 11–14 (1992)
648. Shaw, L., Henley, C.: A transfer-matrix Monte Carlo study of random Penrose tilings. J. Phys.
A 24, 4129 (1991)
649. Shaw, D., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R., Eastwood, M., Bank, J.,
Jumper, J., Salmon, J., Shan, Y., et al.: Atomic-level characterization of the structural
dynamics of proteins. Science 330, 341 (2010)
650. Shea, J.E., Brooks, C.L.: From folding theories to folding proteins: a review and assessment
of simulation studies of protein folding and unfolding. Annu. Rev. Phys. Chem. 52, 499–535
(2001)
651. Shen, M.Y., Sali, A.: Statistical potential for assessment and prediction of protein structures.
Protein Sci. 15, 2507–2524 (2006)
652. Shepherd, S.J., Beggs, C.B., Jones, S.: Amino acid partitioning using a Fiedler vector model.
Eur. Biophys. J. 37, 105–109 (2007)
653. Shifman, J.M., Fromer, M.: Search algorithms. In: Park, S.J., Cochran, J.R. (eds.) Protein
Engineering and Design. CRC Press, Boca Raton (2009)
654. Shifman, J.M., Mayo, S.L.: Exploring the origins of binding specificity through the computa-
tional redesign of calmodulin. Proc. Natl. Acad. Sci. U.S.A. 100, 13274–13279 (2003)
655. Shindyalov, I., Bourne, P.: Protein structure alignment by incremental combinatorial extension
(CE) of the optimal path. Protein Eng. 11, 739–747 (1998)
656. Shortle, D.: Propensities, probabilities, and the Boltzmann hypothesis. Protein Sci. 12,
1298–1302 (2003)
657. Simon, I., Glasser, L., Scheraga, H.: Calculation of protein conformation as an assembly of
stable overlapping segments: application to bovine pancreatic trypsin inhibitor. Proc. Natl.
Acad. Sci. U.S.A. 88, 3661–3665 (1991)
658. Simons, K.T., Kooperberg, C., Huang, E., Baker, D.: Assembly of protein tertiary structures
from fragments with similar local sequences using simulated annealing and Bayesian scoring
functions. J. Mol. Biol. 268, 209–225 (1997)
659. Simons, K.T., Ruczinski, I., Kooperberg, C., Fox, B.A., Bystroff, C., Baker, D.: Improved
recognition of native-like protein structures using a combination of sequence-dependent and
sequence-independent features of proteins. Proteins 34, 82–95 (1999)
660. Singh, H., Hnizdo, V., Demchuk, E.: Probabilistic model for two dependent circular variables.
Biometrika 89, 719–723 (2002)
661. Sippl, M.J.: Calculation of conformational ensembles from potentials of mean force: an
approach to the knowledge-based prediction of local structures in globular proteins. J. Mol.
Biol. 213, 859–883 (1990)
662. Sippl, M.J.: Boltzmann’s principle, knowledge-based mean fields and protein folding. An
approach to the computational determination of protein structures. J. Comput. Aided Mol.
Des. 7, 473–501 (1993)
663. Sippl, M.J.: Recognition of errors in three-dimensional structures of proteins. Proteins 17,
355–362 (1993)
664. Sippl, M., Hendlich, M., Lackner, P.: Assembly of polypeptide and protein backbone
conformations from low energy ensembles of short fragments: development of strategies and
construction of models for myoglobin, lysozyme, and thymosin β4. Protein Sci. 1, 625–640
(1992)
665. Sippl, M.J., Ortner, M., Jaritz, M., Lackner, P., Flockner, H.: Helmholtz free energies of atom
pair interactions in proteins. Fold. Des. 1, 289–298 (1996)
666. Sivia, D., Skilling, J.: Data Analysis: A Bayesian Tutorial. Oxford University Press, New York
(2006)
667. Sivia, D., Webster, J.: The Bayesian approach to reflectivity data. Physica B 248, 327–337
(1998)
668. Skilling, J.: Maximum Entropy and Bayesian Methods in Science and Engineering,
pp. 173–187. Kluwer, Dordrecht (1988)
669. Skilling, J.: Nested sampling. In: Bayesian Inference and Maximum Entropy Methods in
Science and Engineering, vol. 735, pp. 395–405. American Institute of Physics, Melville
(2004)
670. Skolnick, J.: In quest of an empirical potential for protein structure prediction. Curr. Opin.
Struct. Biol. 16, 166–171 (2006)
671. Skolnick, J., Jaroszewski, L., Kolinski, A., Godzik, A.: Derivation and testing of pair
potentials for protein folding. When is the quasichemical approximation correct? Protein Sci.
6, 676–688 (1997)
672. Smith, G., Bruce, A.: A study of the multi-canonical Monte Carlo method. J. Phys. A 28,
6623 (1995)
673. Smith, G., Bruce, A.: Multicanonical Monte Carlo study of a structural phase transition.
Europhys. Lett. 34, 91 (1996)
674. Smith, C.A., Kortemme, T.: Backrub-like backbone simulation recapitulates natural protein
conformational variability and improves mutant side-chain prediction. J. Mol. Biol. 380,
742–756 (2008)
675. Smith, G.R., Sternberg, M.J.E.: Prediction of protein-protein interactions by docking meth-
ods. Curr. Opin. Struct. Biol. 12, 28–35 (2002)
676. Solis, A.D., Rackovsky, S.R.: Information-theoretic analysis of the reference state in contact
potentials used for protein structure prediction. Proteins 78, 1382–1397 (2009)
677. Solomon, I.: Relaxation processes in a system of two spins. Phys. Rev. 99, 559–565 (1955)
678. Soman, K.V., Braun, W.: Determining the three-dimensional fold of a protein from approxi-
mate constraints: a simulation study. Cell. Biochem. Biophys. 34, 283–304 (2001)
679. Son, W., Jang, S., Shin, S.: A simple method of estimating sampling consistency based on
free energy map distance. J. Mol. Graph. Model. 27, 321–325 (2008)
680. Sontag, D., Jaakkola, T.: Tree block coordinate descent for MAP in graphical models. In:
Proceedings of the 12th International Conference on Artificial Intelligence and Statistics
(AISTATS), vol. 5, pp. 544–551. Clearwater Beach, Florida (2009)
681. Sontag, D., Meltzer, T., Globerson, A., Jaakkola, T., Weiss, Y.: Tightening LP relaxations for
MAP using message passing. In: McAllester, D.A., Myllymäki, P. (eds.) Proceedings of the
24th Conference on Uncertainty in Artificial Intelligence, pp. 503–510. AUAI Press, Helsinki
(2008)
682. Speed, T.P., Kiiveri, H.T.: Gaussian Markov distributions over finite graphs. Ann. Stat. 14,
138–150 (1986)
683. Spiegelhalter, D., Best, N., Carlin, B., Van Der Linde, A.: Bayesian measures of model
complexity and fit. J. R. Statist. Soc. B 64, 583–639 (2002)
684. Spronk, C.A.E.M., Nabuurs, S.B., Bonvin, A.M.J.J., Krieger, E., Vuister, G.W., Vriend, G.: The
precision of NMR structure ensembles revisited. J. Biomol. NMR 25, 225–234 (2003)
685. Staelens, S., Desmet, J., Ngo, T.H., Vauterin, S., Pareyn, I., Barbeaux, P., Van Rompaey, I.,
Stassen, J.M., Deckmyn, H., Vanhoorelbeke, K.: Humanization by variable domain resurfac-
ing and grafting on a human IgG4, using a new approach for determination of non-human like
surface accessible framework residues based on homology modelling of variable domains.
Mol. Immunol. 43, 1243–1257 (2006)
686. Stark, A., Sunyaev, S., Russell, R.: A model for statistical significance of local similarities in
structure. J. Mol. Biol. 326, 1307–1316 (2003)
687. Steenstrup, S., Hansen, S.: The maximum-entropy method without the positivity constraint-
applications to the determination of the distance-distribution function in small-angle scatter-
ing. J. Appl. Cryst. 27, 574–580 (1994)
688. Stein, C.: Inadmissibility of the usual estimator for the mean of a multivariate normal
distribution. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics
and Probability, vol. 1, pp. 197–206. University of California Press, Berkeley (1955)
689. Stein, C., James, W.: Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics and Probability, vol. 1, pp. 361–379. University of
California Press, Berkeley (1961)
690. Stickle, D.F., Presta, L.G., Dill, K.A., Rose, G.D.: Hydrogen bonding in globular proteins.
J. Mol. Biol. 226, 1143–1159 (1992)
691. Stigler, S.: Who discovered Bayes’s theorem? Am. Stat. 37, 290–296 (1983)
692. Stigler, S.: The History of Statistics: The Measurement of Uncertainty Before 1900. Belknap
Press, Cambridge (1986)
693. Stovgaard, K., Andreetta, C., Ferkinghoff-Borg, J., Hamelryck, T.: Calculation of accurate
small angle X-ray scattering curves from coarse-grained protein models. BMC Bioinformatics
11, 429 (2010)
694. Sugita, Y., Okamoto, Y.: Replica-exchange molecular dynamics method for protein folding.
Chem. Phys. Lett. 314, 141–151 (1999)
695. Sutcliffe, M.J.: Representing an ensemble of NMR-derived protein structures by a single
structure. Protein Sci. 2, 936–944 (1993)
696. Sutton, C., McCallum, A.: Piecewise pseudolikelihood for efficient training of conditional
random fields. In: Proceedings of the 24th International Conference on Machine Learning,
pp. 863–870. ACM, New York (2007)
697. Svensen, M., Bishop, C.: Robust Bayesian mixture modelling. Neurocomputing 64, 235–252
(2005)
698. Svergun, D.: Determination of the regularization parameter in indirect-transform methods
using perceptual criteria. J. Appl. Cryst. 25, 495–503 (1992)
699. Svergun, D.: Restoring low resolution structure of biological macromolecules from solution
scattering using simulated annealing. Biophys. J. 76, 2879–2886 (1999)
700. Svergun, D., Semenyuk, A., Feigin, L.: Small-angle-scattering-data treatment by the regular-
ization method. Acta Crystallogr. A 44, 244–250 (1988)
701. Svergun, D., Petoukhov, M., Koch, M.: Determination of domain structure of proteins from
X-ray solution scattering. Biophys. J. 80, 2946–2953 (2001)
702. Swendsen, R., Ferrenberg, A.: New Monte Carlo technique for studying phase transitions.
Phys. Rev. Lett. 61, 2635 (1988)
703. Swendsen, R., Ferrenberg, A.: Optimized Monte Carlo data analysis. Phys. Rev. Lett. 63, 1195
(1989)
704. Swendsen, R.H., Wang, J.S.: Replica Monte Carlo simulation of spin glasses. Phys. Rev. Lett.
57, 2607–2609 (1986)
705. Talbott, W.: Bayesian epistemology. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of
Philosophy, summer 2011 edn. The Metaphysics Research Lab, Center for the Study of Language
and Information, Stanford University, Stanford, CA (2011)
706. Tanaka, S., Scheraga, H.A.: Medium- and long-range interaction parameters between amino
acids for predicting three-dimensional structures of proteins. Macromolecules 9, 945–950
(1976)
707. Ten Berge, J.M.F.: Orthogonal Procrustes rotation for two or more matrices. Psychometrika
42, 267–276 (1977)
708. Tesi, M., Rensburg, E., Orlandini, E., Whittington, S.: Monte Carlo study of the interacting
self-avoiding walk model in three dimensions. J. Stat. Phys. 82, 155–181 (1996)
709. The UniProt Consortium: The universal protein resource (UniProt). Nucleic Acids Res. 36,
D190–D195 (2008)
710. Theobald, D.L.: A nonisotropic Bayesian approach to superpositioning multiple macro-
molecules. In: Barber, S., Baxter, P., Gusnanto, A., Mardia, K. (eds.) Statistical Tools for
Challenges in Bioinformatics, Proceedings of the 28th Leeds Annual Statistical Research
(LASR) Workshop. Department of Statistics, University of Leeds, Leeds LS2 9JT (2009)
711. Theobald, D.L., Mardia, K.V.: Full Bayesian analysis of the generalized non-isotropic
Procrustes problem with scaling. In: Gusnanto, A., Mardia, K.V., Fallaize, C.J. (eds.) Next
Generation Statistics in Biosciences, Proceedings of the 30th Leeds Annual Statistical
Research (LASR) Workshop. Department of Statistics, University of Leeds, Leeds (2011)
712. Theobald, D.L., Wuttke, D.S.: Empirical Bayes hierarchical models for regularizing maxi-
mum likelihood estimation in the matrix Gaussian Procrustes problem. Proc. Natl. Acad. Sci.
U.S.A. 103, 18521–18527 (2006)
713. Theobald, D.L., Wuttke, D.S.: THESEUS: maximum likelihood superpositioning and analysis
of macromolecular structures. Bioinformatics 22, 2171–2172 (2006)
714. Theobald, D.L., Wuttke, D.S.: Accurate structural correlations from maximum likelihood
superpositions. PLoS Comput. Biol. 4, e43 (2008)
715. Thomas, P., Dill, K.: Statistical potentials extracted from protein structures: how accurate are
they? J. Mol. Biol. 257, 457–469 (1996)
716. Thomas, P.D., Dill, K.A.: An iterative method for extracting energy-like quantities from
protein structures. Proc. Natl. Acad. Sci. U.S.A. 93, 11628–11633 (1996)
717. Tikhonov, A., Arsenin, V., John, F.: Solutions of Ill-Posed Problems. Halsted Press, Winston
(1977)
718. Tjandra, N., Bax, A.: Direct measurement of distances and angles in biomolecules by NMR
in a dilute liquid crystalline medium. Science 278, 1111–1114 (1997)
719. Tobi, D., Elber, R.: Distance-dependent, pair potential for protein folding: results from linear
optimization. Proteins 41, 40–46 (2000)
720. Tobi, D., Shafran, G., Linial, N., Elber, R.: On the design and analysis of protein folding
potentials. Proteins 40, 71–85 (2000)
721. Tolman, J.R., Flanagan, J.M., Kennedy, M.A., Prestegard, J.H.: Nuclear magnetic dipole
interactions in field-oriented proteins: information for structure determination in solution.
Proc. Natl. Acad. Sci. U.S.A. 92, 9279–9283 (1995)
722. Torda, A.E., Brunne, R.M., Huber, T., Kessler, H., van Gunsteren, W.F.: Structure refinement
using time-averaged J coupling constant restraints. J. Biomol. NMR 3, 55–66 (1993)
723. Torrance, J., Bartlett, G., Porter, C., Thornton, J.: Using a library of structural templates to
recognise catalytic sites and explore their evolution in homologous families. J. Mol. Biol. 347,
565–581 (2005)
724. Trebst, S., Huse, D., Troyer, M.: Optimizing the ensemble for equilibration in broad-
histogram Monte Carlo simulations. Phys. Rev. E 70, 046701 (2004)
725. Trebst, S., Gull, E., Troyer, M.: Optimized ensemble Monte Carlo simulations of dense
Lennard-Jones fluids. J. Chem. Phys. 123, 204501 (2005)
726. Trovato, A., Ferkinghoff-Borg, J., Jensen, M.H.: Compact phases of polymers with hydrogen
bonding. Phys. Rev. E 67, 021805 (2003)
727. Troyer, M., Wessel, S., Alet, F.: Flat histogram methods for quantum systems: algorithms
to overcome tunneling problems and calculate the free energy. Phys. Rev. Lett. 90, 120201
(2003)
728. Trzesniak, D., Kunz, A.P., van Gunsteren, W.F.: A comparison of methods to compute the
potential of mean force. ChemPhysChem 8, 162–169 (2007)
729. Tuinstra, R.L., Peterson, F.C., Kutlesa, S., Elgin, E.S., Kron, M.A., Volkman, B.F.: Intercon-
version between two unrelated protein folds in the lymphotactin native state. Proc. Natl. Acad.
Sci. U.S.A. 105, 5057–5062 (2008)
730. Ulmschneider, J., Jorgensen, W.: Monte Carlo backbone sampling for polypeptides with
variable bond angles and dihedral angles using concerted rotations and a Gaussian bias.
J. Chem. Phys. 118, 4261 (2003)
731. Umeyama, S.: Least-squares estimation of transformation parameters between two point
patterns. IEEE Trans. Pat. Anal. Mach. Intel. 13, 376–380 (1991)
732. Unger, R., Harel, D., Wherland, S., Sussman, J.: A 3D building blocks approach to analyzing
and predicting structure of proteins. Proteins 5, 355–373 (1989)
733. Upton, G., Cook, I.: Oxford Dictionary of Statistics, 2nd rev. edn. pp. 246–247. Oxford
University Press, Oxford (2008)
734. Vanderzande, C.: Lattice Models of Polymers. Cambridge University Press, Cambridge
(1998)
735. Vasquez, M., Scheraga, H.: Calculation of protein conformation by the build-up procedure.
Application to bovine pancreatic trypsin inhibitor using limited simulated nuclear magnetic
resonance data. J. Biomol. Struct. Dyn. 5, 705–755 (1988)
736. Vendruscolo, M., Domany, E.: Pairwise contact potentials are unsuitable for protein folding.
J. Chem. Phys. 109, 11101–11108 (1998)
737. Vendruscolo, M., Kussell, E., Domany, E.: Recovery of protein structure from contact maps.
Fold. Des. 2, 295–306 (1997)
738. Vendruscolo, M., Najmanovich, R., Domany, E.: Can a pairwise contact potential stabilize
native protein folds against decoys obtained by threading? Proteins 38, 134–148 (2000)
739. Verdier, P.H., Stockmayer, W.H.: Monte Carlo calculations on the dynamics of polymers in
dilute solution. J. Chem. Phys. 36, 227–235 (1962)
740. Vestergaard, B., Hansen, S.: Application of Bayesian analysis to indirect Fourier transforma-
tion in small-angle scattering. J. Appl. Cryst. 39, 797–804 (2006)
741. Vitalis, A., Pappu, R.: Methods for Monte Carlo simulations of biomacromolecules. Annu.
Rep. Comput. Chem. 5, 49–76 (2009)
742. Voigt, C.A., Gordon, D.B., Mayo, S.L.: Trading accuracy for speed: a quantitative comparison
of search algorithms in protein sequence design. J. Mol. Biol. 299, 789–803 (2000)
743. von Mises, R.: Über die “Ganzzahligkeit” der Atomgewichte und verwandte Fragen. Physikal.
Z. 19, 490–500 (1918)
744. von Neumann, J.: Some matrix-inequalities and metrization of matric-space. Tomsk U. Rev.
1, 286–300 (1937)
745. Vorontsov-Velyaminov, P., Broukhno, A., Kuznetsova, T., Lyubartsev, A.: Free energy
calculations by expanded ensemble method for lattice and continuous polymers. J. Phys.
Chem. 100, 1153–1158 (1996)
746. Vorontsov-Velyaminov, P., Volkov, N., Yurchenko, A.: Entropic sampling of simple polymer
models within Wang-Landau algorithm. J. Phys. A 37, 1573 (2004)
747. Wainwright, M., Jordan, M.: Graphical models, exponential families, and variational infer-
ence. Mach. Learn. 1(1–2), 1–305 (2008)
748. Wainwright, M., Jaakkola, T., Willsky, A.: MAP estimation via agreement on trees: message-
passing and linear programming. IEEE Trans. Inf. Theory 51, 3697–3717 (2005)
749. Wallace, A., Borkakoti, N., Thornton, J.: TESS: a geometric hashing algorithm for deriving
3D coordinate templates for searching structural databases. Protein Sci. 6, 2308–2323 (1997)
750. Wallner, B., Elofsson, A.: Can correct protein models be identified? Protein Sci. 12, 1073–1086
(2003)
751. Wang, G., Dunbrack, R.L.: PISCES: a protein sequence culling server. Bioinformatics 19,
1589–1591 (2003)
752. Wang, G., Dunbrack, R.L.: PISCES: recent improvements to a PDB sequence culling server.
Nucleic Acids Res. 33, W94–W98 (2005)
753. Wang, F., Landau, D.: Determining the density of states for classical statistical models: A
random walk algorithm to produce a flat histogram. Phys. Rev. E 64, 056101 (2001)
754. Wang, F., Landau, D.: Efficient, multiple-range random walk algorithm to calculate the
density of states. Phys. Rev. Lett. 86, 2050–2053 (2001)
755. Wang, J., Swendsen, R.: Transition matrix Monte Carlo method. J. Stat. Phys. 106, 245–285
(2002)
756. Wang, J., Wang, W.: A computational approach to simplifying the protein folding alphabet.
Nat. Struct. Biol. 6, 1033–1038 (1999)
757. Wang, Z., Xu, J.: A conditional random fields method for RNA sequence–structure relation-
ship modeling and conformation sampling. Bioinformatics 27, i102 (2011)
758. Warme, P.K., Morgan, R.S.: A survey of amino acid side-chain interactions in 21 proteins.
J. Mol. Biol. 118, 289–304 (1978)
759. Warme, P.K., Morgan, R.S.: A survey of atomic interactions in 21 proteins. J. Mol. Biol. 118,
273–287 (1978)
760. Wedemeyer, W.J., Scheraga, H.A.: Exact analytical loop closure in proteins using polynomial
equations. J. Comp. Chem. 20, 819–844 (1999)
761. Weiss, Y., Freeman, T.: Correctness of belief propagation in Gaussian graphical models of
arbitrary topology. Neural Comput. 13, 2173–2200 (2001)
762. Weiss, Y., Freeman, W.: On the optimality of solutions of the max-product belief-propagation
algorithm in arbitrary graphs. IEEE Trans. Inf. Theory 47, 736–744 (2001)
763. Weiss, Y., Yanover, C., Meltzer, T.: MAP estimation, linear programming and belief
propagation with convex free energies. In: The 23rd Conference on Uncertainty in Artificial
Intelligence. Vancouver (2007)
764. Wilkinson, D.: Bayesian methods in bioinformatics and computational systems biology. Brief.
Bioinformatics 8, 109–116 (2007)
765. Wilkinson, D.: Discussion to Schmidler. In: Bernardo, J., Bayarri, M., Berger, J., Dawid, A.,
Heckerman, D., Smith A., West, M. (eds.) Bayesian Statistics 8, pp. 471–490. Oxford
University Press, Oxford (2007)
766. Willett, P.: Three-Dimensional Chemical Structure Handling. Wiley, New York (1991)
767. Willett, P., Wintermann, V.: Implementation of nearest-neighbor searching in an online
chemical structure search system. J. Chem. Inf. Comput. Sci. 26, 36–41 (1986)
768. Winther, O., Krogh, A.: Teaching computers to fold proteins. Phys. Rev. E 70, 030903 (2004)
769. Wolff, U.: Collective Monte Carlo updating for spin systems. Phys. Rev. Lett. 62, 361 (1989)
770. Won, K., Hamelryck, T., Prügel-Bennett, A., Krogh, A.: An evolutionary method for learning
HMM structure: prediction of protein secondary structure. BMC Bioinformatics 8, 357 (2007)
771. Word, J.M., Lovell, S.C., LaBean, T.H., Taylor, H.C., Zalis, M.E., Presley, B.K.,
Richardson, J.S., Richardson, D.C.: Visualizing and quantifying molecular goodness-of-fit:
small-probe contact dots with explicit hydrogen atoms. J. Mol. Biol. 285, 1711–1733 (1999)
772. Wüthrich, K.: NMR studies of structure and function of biological macromolecules (Nobel
lecture). Angew. Chem. Int. Ed. Engl. 42, 3340–3363 (2003)
773. Xiang, Z., Honig, B.: Extending the accuracy limits of prediction for side-chain conforma-
tions. J. Mol. Biol. 311, 421–430 (2001)
774. Yan, Q., Faller, R., De Pablo, J.: Density-of-states Monte Carlo method for simulation of
fluids. J. Chem. Phys. 116, 8745 (2002)
775. Yang, Y., Zhou, Y.: Ab initio folding of terminal segments with secondary structures reveals
the fine difference between two closely related all-atom statistical energy functions. Protein
Sci. 17, 1212–1219 (2008)
776. Yanover, C., Weiss, Y.: Approximate inference and protein-folding. In: Becker, S.,
Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems 15,
pp. 1457–1464. MIT, Cambridge (2003)
777. Yanover, C., Weiss, Y.: Finding the M most probable configurations using loopy belief
propagation. In: Advances in Neural Information Processing Systems 16. MIT, Cambridge
(2004)
778. Yanover, C., Weiss, Y.: Approximate inference and side-chain prediction. Technical report,
Leibniz Center for Research in Computer Science, The Hebrew University of Jerusalem,
Jerusalem (2007)
779. Yanover, C., Meltzer, T., Weiss, Y.: Linear programming relaxations and belief propagation –
An empirical study. J. Mach. Learn. Res. 7, 1887–1907 (2006)
780. Yanover, C., Fromer, M., Shifman, J.M.: Dead-end elimination for multistate protein design.
J. Comput. Chem. 28, 2122–2129 (2007)
781. Yanover, C., Schueler-Furman, O., Weiss, Y.: Minimizing and learning energy functions for
side-chain prediction. J. Comp. Biol. 15, 899–911 (2008)
782. Yedidia, J.: An idiosyncratic journey beyond mean field theory. Technical report, Mitsubishi
Electric Research Laboratories, TR-2000-27 (2000)
783. Yedidia, J., Freeman, W., Weiss, Y.: Generalized belief propagation. Technical report,
Mitsubishi Electric Research Laboratories, TR-2000-26 (2000)
784. Yedidia, J., Freeman, W., Weiss, Y.: Bethe free energy, Kikuchi approximations and belief
propagation algorithms. Technical report, Mitsubishi Electric Research Laboratories, TR-
2001-16 (2001)
785. Yedidia, J., Freeman, W., Weiss, Y.: Understanding belief propagation and its generalizations.
Technical report, Mitsubishi Electric Research Laboratories, TR-2001-22 (2002)
786. Yedidia, J., Freeman, W., Weiss, Y.: Constructing free-energy approximations and generalized
belief propagation algorithms. IEEE Trans. Inf. Theory 51, 2282–2312 (2005)
787. Yin, S., Ding, F., Dokholyan, N.V.: Modeling backbone flexibility improves protein stability
estimation. Structure 15, 1567–1576 (2007)
788. Yu, H., Rosen, M.K., Shin, T.B., Seidel-Dugan, C., Brugge, J.S., Schreiber, S.L.: Solution
structure of the SH3 domain of Src and identification of its ligand-binding site. Science 258,
1665–1668 (1992)
789. Yuste, S.B., Santos, A.: Radial distribution function for hard spheres. Phys. Rev. A 43,
5418–5423 (1991)
790. Zernike, F., Prins, J.: X-ray diffraction from liquids. Zeits. f. Physik 41, 184–194 (1927)
791. Zhang, Y., Voth, G.: Combined metadynamics and umbrella sampling method for the
calculation of ion permeation free energy profiles. J. Chem. Theor. Comp. 7, 2277–2283
(2011)
792. Zhang, Y., Kolinski, A., Skolnick, J.: TOUCHSTONE II: a new approach to ab initio protein
structure prediction. Biophys. J. 85, 1145–1164 (2003)
793. Zhang, C., Liu, S., Zhou, H., Zhou, Y.: An accurate, residue-level, pair potential of mean force
for folding and binding based on the distance-scaled, ideal-gas reference state. Protein Sci.
13, 400–411 (2004)
794. Zhao, F., Li, S., Sterner, B., Xu, J.: Discriminative learning for protein conformation sampling.
Proteins 73, 228–240 (2008)
795. Zhao, F., Peng, J., Debartolo, J., Freed, K., Sosnick, T., Xu, J.: A probabilistic and continuous
model of protein conformational space for template-free modeling. J. Comp. Biol. 17,
783–798 (2010)
796. Zhou, C., Bhatt, R.N.: Understanding and improving the Wang-Landau algorithm. Phys.
Rev. E 72, 025701 (2005)
797. Zhou, H., Zhou, Y.: Distance-scaled, finite ideal-gas reference state improves structure-
derived potentials of mean force for structure selection and stability prediction. Protein Sci.
11, 2714–2726 (2002)
798. Zhou, Y., Zhou, H., Zhang, C., Liu, S.: What is a desirable statistical energy function for
proteins and how can it be obtained? Cell. Biochem. Biophys. 46, 165–174 (2006)
799. Zoltowski, B.D., Schwerdtfeger, C., Widom, J., Loros, J.J., Bilwes, A.M., Dunlap, J.C.,
Crane, B.R.: Conformational switching in the fungal light sensor vivid. Science 316, 1054–
1057 (2007)
800. Zou, J., Saven, J.G.: Using self-consistent fields to bias Monte Carlo methods with applica-
tions to designing and sampling protein sequences. J. Chem. Phys. 118, 3843–3854 (2003)
801. Zuckerman, D.: Statistical Physics of Biomolecules: An Introduction. CRC Press, Boca Raton
(2010)
802. Zwanzig, R., Ailawadi, N.K.: Statistical error due to finite time averaging in computer
experiments. Phys. Rev. 182, 280–283 (1969)
803. Mardia, K.V., Frellsen, J., Borg, M., Ferkinghoff-Borg, J., Hamelryck, T.: A statistical view
on the reference ratio method. In: Mardia, K.V., Gusnanto, A., Riley, A.D., Voss, J. (eds.)
High-throughput Sequencing, Proteins and Statistics, pp. 55–61. Leeds University Press,
Leeds, UK (2011)
Index
BIC. See Bayesian information criterion
Binding affinity, in structure alignment, 220
Binning, 63, 81, 83, 84, 88, 90
Binomial distribution, 9, 25
    Bayesian parameter estimation, 9
Bivariate normal distribution, 181, 183, 185
Bivariate von Mises distributions, 162, 241, 243
    Bayesian estimation, 176
    bimodality, 166
    conjugated priors, 176
    cosine models, 164, 241
        conditional distribution, 164
        marginal distribution, 164
        with negative interaction, 165
        with positive interaction, 164
    estimation, 169
    full bivariate model, 162
    Gibbs sampling, 174
    hybrid model, 165
    maximum likelihood estimation, 169
    maximum pseudolikelihood estimation, 172
    moment estimation, 171
    normal approximation, 168
    6-parameter model, 163
    priors (see Prior distribution)
    rejection sampling, 174
    simulation, 173
    sine model, 163
        conditional distribution, 164
        marginal distribution, 164
Blanket sampling, 46
Blocked paths, in a graph, 36
BMMF. See Best max-marginal first algorithm
BN. See Bayesian network
Boas, Franz, 193
Boltzmann distribution, 30, 31, 99, 101, 103, 110, 117, 118, 144, 294, 302
Boltzmann equation, 23
Boltzmann hypothesis, 118, 119, 137
Boltzmann inversion, 103, 107, 116, 118
Boltzmann learning, 116, 142
Boltzmann weights, 51
Bond angle, 239
Bonded atoms, in structure alignment, 210, 222
Bookmaker, 19
Bookstein coordinates, 213
Brandeis dice problem, 22, 31
Broad histogram method, 81
β-Bulge, 234
Canonical distribution, 51, 69. See also Boltzmann distribution
Canonical sampling, 116, 117, 119, 123
CASP, xi, 235
Categorical distribution. See Discrete distribution
Cα trace, 179, 239
Centering matrix, 197
Child node. See Bayesian network
Circle, xiii, 17, 180
Circular data, 159
Circular mean, 160
Clique, in a graph, 41, 43
Cluster-flip algorithm, 69
Cluster variation method. See Kikuchi approximation
Coarse-grained variable, 103, 108, 110–112, 114, 117, 121, 125, 126, 132, 134
Colatitude, 180, 183, 217
Collective variable, 54, 67
Combinatorial algorithms, 226
Completed dataset, 47
Concentration parameter, 182
Concerted rotations algorithm, 61, 69
Conditional independence, 267
Conditional probability table, 36, 38
Conditional random field, 254
Conformational sampling, 236
Conformational space, 254
Conjugate prior. See Prior distribution
Conjugate variables, 104
Contact map, 108, 112, 136, 139, 140, 145, 146, 150–153
Contact potential, 112, 114
Contrastive divergence, 117, 136
Convergence, 66, 67, 69, 78, 80, 89
Cooperative transition. See Phase transition
Correlation, 60, 66, 68, 79
Correlation function, 101, 102, 109
Covariance matrix, 184, 195, 198
Cox axioms, 18, 290
Cox, Richard T., 18, 290
CPT. See Conditional probability table
Cross-entropy, 293, 296
Cross-validation, 308
Cycle, in a graph, 35
Database, 109, 110, 114, 115, 117, 118
    BARNACLE, 249
DBN. See Dynamic Bayesian network
Dead-end elimination, 264
Decision theory, 6, 13
Decoys, 109, 114
van der Waals radius, in structure alignment, 221
Variational free energy, 32
Viterbi algorithm. See Hidden Markov model
X-ray crystallography, 192, 289
Zustandssumme, 31