Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models
Stein RR, Marks DS, Sander C (2015) PLoS Comput Biol 11(7): e1004182. doi:10.1371/journal.pcbi.1004182
Abstract
Maximum entropy-based inference methods have been successfully used to infer direct interactions from biological datasets such as gene expression data or sequence ensembles. Here, we review undirected pairwise maximum-entropy probability models in two categories of data types, those with continuous and categorical random variables. As a concrete example, we present recently developed inference methods from the field of protein contact prediction and show that a basic set of assumptions leads to similar solution strategies for inferring the model parameters in both variable types. These parameters reflect interactive couplings between observables, which can be used to predict global properties of the biological system. Such methods are applicable to the important problems of protein 3-D structure prediction and association of gene–gene networks, and they enable potential applications to the analysis of gene alteration patterns and to protein design.
Introduction

Modern high-throughput techniques allow for the quantitative analysis of various components of the cell. This ability opens the door to analyzing and understanding complex interaction patterns of cellular regulation, organization, and evolution. In the last few years, undirected pairwise maximum-entropy probability models have been introduced to analyze biological data and have performed well, disentangling direct interactions from artifacts introduced by intermediates or spurious coupling effects. Their performance has been studied for diverse problems, such as gene network inference [1,2], analysis of neural populations [3,4], protein contact prediction [5–8], analysis of a text corpus [9], modeling of animal flocks [10], and prediction of multidrug effects [11]. Statistical inference methods using partial correlations in the context of graphical Gaussian models (GGMs) have led to similar results and provide a more intuitive understanding of direct versus indirect interactions by employing the concept of conditional independence [12,13].

Our goal here is to derive a unified framework for pairwise maximum-entropy probability models for continuous and categorical variables and to discuss some of the recent inference approaches presented in the field of protein contact prediction. The structure of the manuscript is as follows: (1) introduction and statement of the problem, (2) deriving the probabilistic
defined as,
$$ r_{ij} := \frac{\hat{C}_{ij}}{\sqrt{\hat{C}_{ii}\,\hat{C}_{jj}}}, $$
where $\hat{C}_{ij} := \frac{1}{M}\sum_{m=1}^{M}\big(x_i^m - \bar{x}_i\big)\big(x_j^m - \bar{x}_j\big)$ denotes the (i, j)-element of the empirical covariance matrix $\hat{C} = (\hat{C}_{ij})_{i,j=1,\dots,L}$. The sample mean operator provides the empirical mean of the measured data and is defined as $\bar{x}_i := \frac{1}{M}\sum_{m=1}^{M} x_i^m$. A simple way to characterize dependencies
in data is to classify two variables as being dependent if the absolute value of their correlation coefficient is above a certain threshold (and independent otherwise) and then use those pairs to draw a so-called relevance network [14]. However, the Pearson correlation is a misleading measure for direct dependence as it only reflects the association between two variables while ignoring the influence of the remaining ones. Therefore, the relevance network approach is not suitable to deduce direct interactions from a dataset [15–18]. The partial correlation between two variables removes the variational effect due to the influence of the remaining variables (Cramér [19], p. 306). To illustrate this, let us take a simplified example with three random variables $x_A$, $x_B$, $x_C$. Without loss of generality, we can scale each of these variables to zero mean and unit standard deviation by $x_i \mapsto (x_i - \bar{x}_i)/\sqrt{\hat{C}_{ii}}$, which simplifies the correlation coefficient to $r_{ij} \equiv \overline{x_i x_j}$. The sample partial correlation coefficient of a three-variable system between $x_A$ and $x_B$ given $x_C$ is then defined as [19,20]
$$ r_{AB\cdot C} := \frac{r_{AB} - r_{AC}\, r_{BC}}{\sqrt{\big(1 - r_{AC}^2\big)\big(1 - r_{BC}^2\big)}} = -\frac{(\hat{C}^{-1})_{AB}}{\sqrt{(\hat{C}^{-1})_{AA}\,(\hat{C}^{-1})_{BB}}}. $$
The latter equivalence by Cramér's rule holds if the empirical covariance matrix, $\hat{C} = (\hat{C}_{ij})_{i,j\in\{A,B,C\}}$, is invertible. Krumsiek et al. [21] studied Pearson correlations and partial correlations on simulated reaction networks. A comparison of the Pearson correlations, $r_{AB}$, $r_{AC}$, $r_{BC}$, with the corresponding partial correlations, $r_{AB\cdot C}$, $r_{AC\cdot B}$, $r_{BC\cdot A}$, shows that variables A and C appear to be correlated when using Pearson's correlation as a dependency measure since both are highly correlated with variable B, which results in a falsely inferred interaction $r_{AC}$. The strength of the incorrectly inferred interaction can be numerically large and therefore particularly misleading if there are multiple intermediate variables B [22].
The partial correlation analysis removes the effect of the mediating variable(s) B and correctly
recovers the underlying interaction structure. This is always true for variables following a mul-
tivariate Gaussian distribution, but also seems to work empirically on realistic systems as
Krumsiek et al. [21] have shown for more complex reaction structures than the example pre-
sented here.
Although results did show promise, an important improvement was made years later by using a maximum-entropy approach on the same setup [5–7,30]. In this framework, the direct information of residues i and j was introduced by replacing $f_{ij}$ in the mutual information by $P_{ij}^{dir}$,
$$ DI_{ij} = \sum_{\sigma,\omega} P_{ij}^{dir}(\sigma,\omega)\, \ln\!\left(\frac{P_{ij}^{dir}(\sigma,\omega)}{f_i(\sigma)\, f_j(\omega)}\right), \qquad (1) $$
where $P_{ij}^{dir}(\sigma,\omega) = \frac{1}{Z_{ij}}\exp\!\big(e_{ij}(\sigma,\omega) + \tilde{h}_i(\sigma) + \tilde{h}_j(\omega)\big)$, and $\tilde{h}_i(\sigma)$, $\tilde{h}_j(\omega)$, and $Z_{ij}$ are chosen such that $P_{ij}^{dir}$, which is based on a pairwise probability model of an amino acid sequence compatible with the iso-structural sequence family, is consistent with the single-site frequency counts. In an approximate solution, [6,7] determined the contact strength between the amino acids σ and ω in positions i and j, respectively, by
$$ e_{ij}(\sigma,\omega) \simeq -\big(C^{-1}(\sigma,\omega)\big)_{ij}. \qquad (2) $$
Here, $(C^{-1}(\sigma,\omega))_{ij}$ denotes the element of the inverse corresponding to $C_{ij}(\sigma,\omega) = f_{ij}(\sigma,\omega) - f_i(\sigma)\, f_j(\omega)$ for amino acids σ, ω from a subset of 20 out of the 21 different states (the so-called gauge fixing, see below). The comparison of contact prediction results based on the MI- and DI-scores for the human RAS protein, mapped onto the actual crystal structure, shows a much more accurate prediction when using the direct information instead of the mutual information (Fig 1B).
The next section lays the foundation for deriving maximum-entropy models for the two data types: continuous, as used in the first example, and categorical, as used in the second one. Subsequently, we will present inference techniques to solve for their interaction parameters.
To begin, the probability distribution P(x) is required to be normalized,
$$ \langle 1 \rangle = \int_x P(x)\, dx = 1, \qquad (3) $$
which is a natural requirement on any probability distribution. Additionally, the first moment of variable $x_i$ is supposed to match the value of the corresponding sample mean over M measurements in each i = 1, …, L,
$$ \langle x_i \rangle = \int_x P(x)\, x_i\, dx = \frac{1}{M}\sum_{m=1}^{M} x_i^m = \bar{x}_i, \qquad (4) $$
where we define the n-th moment of the random variable $x_i$ distributed by the multivariate probability distribution P as $\langle x_i^n \rangle := \int_x P(x)\, x_i^n\, dx$. Analogously, the second moment of the variables $x_i$ and $x_j$ and its corresponding empirical expectation are supposed to be equal,
$$ \langle x_i x_j \rangle = \int_x P(x)\, x_i x_j\, dx = \frac{1}{M}\sum_{m=1}^{M} x_i^m x_j^m = \overline{x_i x_j} \qquad (5) $$
for i, j = 1, …, L. Taken together, Eqs 4 and 5 constrain the distribution's covariance matrix to be consistent with the empirical covariance matrix. Finally, the probability distribution should maximize the information entropy,
$$ \text{maximize}\quad S = -\int_x P(x) \ln P(x)\, dx \qquad (6) $$
with the natural logarithm ln. A well-known analytical strategy for finding functional extrema under equality constraints is the method of Lagrange multipliers [32], which converts a constrained optimization problem into an unconstrained one by means of the Lagrangian $\mathcal{L}$. In our case, the probability distribution maximizing the entropy (Eq 6) subject to Eqs 3–5 is found as the stationary point of the Lagrangian $\mathcal{L} = \mathcal{L}(P(x), \alpha, \beta, \gamma)$ [33,34],
$$ \mathcal{L} = S + \alpha\big(\langle 1 \rangle - 1\big) + \sum_{i=1}^{L} \beta_i \big(\langle x_i \rangle - \bar{x}_i\big) + \sum_{i,j=1}^{L} \gamma_{ij} \big(\langle x_i x_j \rangle - \overline{x_i x_j}\big). \qquad (7) $$
The real-valued Lagrange multipliers $\alpha$, $\beta = (\beta_i)_{i=1,\dots,L}$, and $\gamma = (\gamma_{ij})_{i,j=1,\dots,L}$ correspond to the constraints Eqs 3, 4, and 5, respectively. The maximizing probability distribution is then found by setting the functional derivative of $\mathcal{L}$ with respect to the unknown density P(x) to zero [33,35],
$$ \frac{\delta \mathcal{L}}{\delta P(x)} = 0 \quad\Rightarrow\quad -\ln P(x) - 1 + \alpha + \sum_{i=1}^{L} \beta_i x_i + \sum_{i,j=1}^{L} \gamma_{ij}\, x_i x_j = 0. $$
Solving this equation for P(x) yields
$$ P(x; \beta, \gamma) = e^{-1+\alpha} \exp\!\Big(\sum_{i=1}^{L}\beta_i x_i + \sum_{i,j=1}^{L}\gamma_{ij}\, x_i x_j\Big) = \frac{1}{Z(\beta,\gamma)}\, e^{-\mathcal{H}(x)}, \qquad (8) $$
which is contained in the family of exponential probability distributions and assigns a non-negative probability to any system configuration $x = (x_1,\dots,x_L)^T \in \mathbb{R}^L$. For the second identity, we introduced the partition function as normalization constant,
$$ Z(\beta,\gamma) := \int_x \exp\!\Big(\sum_{i=1}^{L}\beta_i x_i + \sum_{i,j=1}^{L}\gamma_{ij}\, x_i x_j\Big)\, dx \equiv \exp(1-\alpha), $$
with the Hamiltonian $\mathcal{H}(x) := -\sum_{i=1}^{L}\beta_i x_i - \sum_{i,j=1}^{L}\gamma_{ij}\, x_i x_j$. It can be shown by means of the information inequality that Eq 8 is the unique maximum-entropy distribution satisfying the constraints Eqs 3–5 (Cover and Thomas [35], p. 410). Note that α is fully determined for given $\beta = (\beta_i)$ and $\gamma = (\gamma_{ij})$ by the normalization constraint Eq 3 and is therefore not a free parameter. The right-hand representation of Eq 8 is also referred to as the Boltzmann distribution. The matrix of Lagrange multipliers $\gamma = (\gamma_{ij})$ has to have full rank in order to ensure a unique parametrization of P(x); otherwise, one can eliminate dependent constraints [33,36]. In addition, for the integrals in Eqs 3–6 to converge with respect to the L-dimensional Lebesgue measure, we require γ to be negative definite, i.e., all of its eigenvalues to be negative or $\sum_{i,j}\gamma_{ij}\, x_i x_j = x^T \gamma\, x < 0$ for $x \neq 0$.
we use a so-called multiple sequence alignment, $\{x^1,\dots,x^M\} \in \Omega^{L\times M}$, a collection of closely homologous protein sequences that is formatted such that it allows comparison of the evolution at each residue position [44]. These alignments may stem from different hidden Markov model-derived resources, such as PFAM [45], HHblits [46], and Jackhmmer [47].
To formalize the derivation of the pairwise maximum-entropy probability distribution on categorical variables, we use the approach of [8,30,48] and replace, as depicted in Fig 2, each categorical variable $x_i$ by an indicator function of the amino acid σ ∈ Ω, $\mathbb{1}_\sigma: \Omega \to \{0,1\}^q$,
$$ x_i \mapsto x_i(\sigma):\quad \mathbb{1}_\sigma(x_i) = \begin{cases} 1 & \text{if } x_i = \sigma, \\ 0 & \text{otherwise.} \end{cases} $$
Inserting this embedding into the first and second moment constraints, corresponding to Eqs 4 and 5 in the continuous variable case, we find their embedded analogues, the single and pairwise marginal probabilities in positions i and j for amino acids σ, ω ∈ Ω,
$$ \langle x_i(\sigma) \rangle = \sum_{x(\boldsymbol{\sigma})} P(x(\boldsymbol{\sigma}))\, x_i(\sigma) = P(x_i = \sigma) = P_i(\sigma), $$
$$ \langle x_i(\sigma)\, x_j(\omega) \rangle = \sum_{x(\boldsymbol{\sigma})} P(x(\boldsymbol{\sigma}))\, x_i(\sigma)\, x_j(\omega) = P(x_i = \sigma,\, x_j = \omega) = P_{ij}(\sigma,\omega), $$
including $P_{ii}(\sigma,\omega) = P_i(\sigma)\,\mathbb{1}_\sigma(\omega)$ and with the distribution's first moment in each random variable, $\langle y_i \rangle = \sum_y P(y)\, y_i$ and $y = (y_1,\dots,y_{Lq})^T \in \mathbb{R}^{Lq}$. The analogue of the covariance matrix then becomes a symmetric Lq × Lq matrix of connected correlations whose entries $C_{ij}(\sigma,\omega) = P_{ij}(\sigma,\omega) - P_i(\sigma)\, P_j(\omega)$ characterize the dependencies between pairs of variables. In the same way, the sample means of the embedded variables yield the empirical single and pair frequency counts,
$$ \overline{x_i(\sigma)} = \frac{1}{M}\sum_{m=1}^{M} x_i^m(\sigma) = f_i(\sigma), \qquad \overline{x_i(\sigma)\, x_j(\omega)} = \frac{1}{M}\sum_{m=1}^{M} x_i^m(\sigma)\, x_j^m(\omega) = f_{ij}(\sigma,\omega). $$
Furthermore, the single and pair constraints, the analogues of Eqs 4 and 5, force the resulting probability distribution to be compatible with the measured single and pair frequency counts,
$$ P_i(\sigma) = f_i(\sigma), \qquad P_{ij}(\sigma,\omega) = f_{ij}(\sigma,\omega) \qquad (10) $$
for each i, j = 1, …, L and amino acids σ, ω ∈ Ω. As before, we require the probability distribution to maximize the information entropy,
$$ \text{maximize}\quad S = -\sum_x P(x) \ln P(x) = -\sum_{x(\boldsymbol{\sigma})} P(x(\boldsymbol{\sigma})) \ln P(x(\boldsymbol{\sigma})). \qquad (11) $$
The corresponding Lagrangian, $\mathcal{L} = \mathcal{L}(P(x(\boldsymbol{\sigma})), \alpha, \beta(\boldsymbol{\sigma}), \gamma(\boldsymbol{\sigma},\boldsymbol{\omega}))$, has the functional form
$$ \mathcal{L} = S + \alpha\big(\langle 1 \rangle - 1\big) + \sum_{i=1}^{L}\sum_{\sigma\in\Omega} \beta_i(\sigma)\big(P_i(\sigma) - f_i(\sigma)\big) + \sum_{i,j=1}^{L}\sum_{\sigma,\omega\in\Omega} \gamma_{ij}(\sigma,\omega)\big(P_{ij}(\sigma,\omega) - f_{ij}(\sigma,\omega)\big). $$
For notational convenience, the Lagrange multipliers $\beta_i(\sigma)$ and $\gamma_{ij}(\sigma,\omega)$ are grouped into the Lq-vector $\beta(\boldsymbol{\sigma}) = (\beta_i(\sigma))_{i=1,\dots,L;\,\sigma\in\Omega}$ and the Lq × Lq matrix $\gamma(\boldsymbol{\sigma},\boldsymbol{\omega}) = (\gamma_{ij}(\sigma,\omega))_{i,j=1,\dots,L;\,\sigma,\omega\in\Omega}$, respectively. The Lagrangian's stationary point, found as the solution of $\partial\mathcal{L}/\partial P(x(\boldsymbol{\sigma})) = 0$, determines the pairwise maximum-entropy probability distribution in categorical variables [30,49],
$$ P(x(\boldsymbol{\sigma}); \beta, \gamma) = \frac{1}{Z} \exp\!\Big(\sum_{i=1}^{L}\sum_{\sigma\in\Omega} \beta_i(\sigma)\, x_i(\sigma) + \sum_{i,j=1}^{L}\sum_{\sigma,\omega\in\Omega} \gamma_{ij}(\sigma,\omega)\, x_i(\sigma)\, x_j(\omega)\Big). \qquad (12) $$
Network interpretation
The derived pairwise maximum-entropy distributions (Eqs 13 or 12, and 8) specify an undirected graphical model or Markov random field [34,41]. In particular, a graphical model represents a probability distribution in terms of a graph that consists of a node set and an edge set. Edges characterize the dependence structure between nodes; a missing edge then corresponds to conditional independence given the remaining random variables. For continuous, real-valued variables, the maximum-entropy distribution with first and second moment constraints is multivariate Gaussian, as will be demonstrated in the next section. Its dependency structure is represented by a graphical Gaussian model (GGM) in which a missing edge, $\gamma_{ij} = 0$, corresponds to conditional independence between the random variables $x_i$ and $x_j$ (given the remaining ones) and is equivalently specified by a zero entry in the corresponding inverse covariance matrix, $(C^{-1})_{ij} = 0$.
In the next section, we describe how the dependency structure of the graph is inferred.
Inference of Interactions
Up to this point, the functional form of the maximum-entropy probability distribution is specified, but not its determining parameters. For categorical variables with dimension L > 1, there is typically no closed-form solution. In the following section, we present several inference approaches.

For continuous variables, the maximum-entropy distribution Eq 8 can be rewritten as
$$ P(x) = \frac{1}{Z}\exp\!\big(\beta^T x + x^T \gamma\, x\big) = \frac{1}{Z}\exp\!\big(\beta^T x - \tfrac{1}{2}\, x^T \tilde{\gamma}\, x\big), $$
where we use the replacement $\tilde{\gamma} := -2\gamma$ and require $\tilde{\gamma}$ to be positive definite (which is equivalent to γ being negative definite), i.e., $x^T \tilde{\gamma}\, x > 0$ for any $x \neq 0$, which makes its inverse $\tilde{\gamma}^{-1} = -\frac{1}{2}\gamma^{-1}$ well-defined. As already discussed, this is a sufficient condition for the integrals in Eqs 3–6 to be finite. For notational convenience, we define the shifted variable $z = (z_1,\dots,z_L)^T := x - \tilde{\gamma}^{-1}\beta$, or $x_i = z_i + \sum_{j=1}^{L}(\tilde{\gamma}^{-1})_{ij}\beta_j$, and accordingly, the maximum-entropy probability distribution becomes
$$ P(x) = \frac{1}{\tilde{Z}} \exp\!\Big(\!-\tfrac{1}{2}\big(x - \tilde{\gamma}^{-1}\beta\big)^T \tilde{\gamma}\,\big(x - \tilde{\gamma}^{-1}\beta\big)\Big) = \frac{1}{\tilde{Z}}\, e^{-\frac{1}{2} z^T \tilde{\gamma}\, z} \qquad (14) $$
with the normalization constant $\tilde{Z} = \exp\!\big(1 - \alpha - \tfrac{1}{2}\beta^T \tilde{\gamma}^{-1}\beta\big)$. The normalization condition Eq 3 in the new variable is
$$ 1 = \int_x P(x)\, dx = \frac{1}{\tilde{Z}} \int_z e^{-\frac{1}{2} z^T \tilde{\gamma}\, z}\, dz, \qquad (15) $$
and the linear shift does not affect the integral when integrated over $\mathbb{R}^L$, yielding for the normalization constant $\tilde{Z} = \int_z e^{-\frac{1}{2} z^T \tilde{\gamma}\, z}\, dz$. Furthermore, the first-moment constraint Eq 4 becomes, for each i = 1, …, L,
$$ \langle x_i \rangle = \int_x P(x)\, x_i\, dx = \frac{1}{\tilde{Z}} \int_z e^{-\frac{1}{2} z^T \tilde{\gamma}\, z} \Big(z_i + \sum_{j=1}^{L}(\tilde{\gamma}^{-1})_{ij}\beta_j\Big)\, dz = \sum_{j=1}^{L}(\tilde{\gamma}^{-1})_{ij}\beta_j, $$
where we used the point symmetry of the integrand, $\int_z e^{-\frac{1}{2} z^T \tilde{\gamma}\, z}\, z_i\, dz = 0$, in each i = 1, …, L. Analogously, we find for the second moment, determining the correlations for each index pair i, j = 1, …, L,
$$ \langle x_i x_j \rangle = \int_x P(x)\, x_i x_j\, dx = \frac{1}{\tilde{Z}} \int_z e^{-\frac{1}{2} z^T \tilde{\gamma}\, z}\, \big(z_i + \langle x_i \rangle\big)\big(z_j + \langle x_j \rangle\big)\, dz = \langle z_i z_j \rangle + \langle x_i \rangle \langle x_j \rangle, $$
where we use again the point symmetry and the result on the normalization constraint. Based on Eqs 4 and 5, the covariance of $x_i$ and $x_j$ is thus given by the connected correlation, $C_{ij} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle = \langle z_i z_j \rangle$.
Finally, the term $\langle z_i z_j \rangle$ is computed using a spectral decomposition of the symmetric, positive-definite matrix $\tilde{\gamma}$ as a sum over products of its eigenvectors $v_1,\dots,v_L$ and real-valued, positive eigenvalues $\lambda_1,\dots,\lambda_L$, $\tilde{\gamma} = \sum_{k=1}^{L} \lambda_k v_k v_k^T$. The eigenvectors form a basis of $\mathbb{R}^L$ and assign new coordinates, $y_1,\dots,y_L$, to $z = \sum_{k=1}^{L} y_k v_k$, which allows writing the exponent as $z^T \tilde{\gamma}\, z = \sum_{k=1}^{L} \lambda_k y_k^2$. The covariance between $x_i$ and $x_j$ then reads (Bishop [52], p. 83)
$$ \langle z_i z_j \rangle = \frac{1}{\tilde{Z}} \sum_{l,n=1}^{L} (v_l)_i (v_n)_j \int_y \exp\!\Big(\!-\frac{1}{2}\sum_{k=1}^{L} \lambda_k y_k^2\Big)\, y_l\, y_n\, dy = \sum_{k=1}^{L} \frac{1}{\lambda_k} (v_k)_i (v_k)_j = (\tilde{\gamma}^{-1})_{ij}, $$
and we refer to [52] for the derivation of the normalization factor. The initial requirement that $\tilde{\gamma} = -2\gamma$ be positive definite results in a positive-definite covariance matrix C, a necessary condition for the Gaussian density to be well defined. In summary, the multivariate Gaussian distribution maximizes the entropy among all probability distributions of continuous variables with specified first and second moments. The pair interaction strength is now evaluated by the already introduced partial correlation coefficient between $x_i$ and $x_j$ given the remaining variables $\{x_r\}_{r\in\{1,\dots,L\}\setminus\{i,j\}}$,
$$ r_{ij\cdot\{1,\dots,L\}\setminus\{i,j\}} = \frac{\gamma_{ij}}{\sqrt{\gamma_{ii}\,\gamma_{jj}}} = \begin{cases} -\dfrac{(C^{-1})_{ij}}{\sqrt{(C^{-1})_{ii}\,(C^{-1})_{jj}}} & \text{if } i \neq j, \\[2mm] 1 & \text{if } i = j. \end{cases} \qquad (18) $$
Data integration

In biological datasets as used to study gene association, the number of measurements, M, is typically smaller than the number of observables, L, i.e., M < L in our terminology. Consequently, the empirical covariance matrix, $\hat{C} = \frac{1}{M}\sum_{m=1}^{M}(x^m - \bar{x})(x^m - \bar{x})^T$, will in these cases always be rank-deficient (and, thus, not invertible) since its rank can exceed neither the number of variables, L, nor the number of measurements, M. Moreover, even in cases when M ≥ L, the empirical covariance matrix may become non-invertible or badly conditioned (i.e., close to singular) due to dependencies in the data. However, for variables following a multivariate Gaussian distribution, one can access the elements of its inverse by maximizing the penalized Gaussian loglikelihood, which results in the following estimate of the inverse covariance matrix,
$$ \hat{\Theta} = \arg\max_{\Theta \succ 0} \Big( \ln\det\Theta - \mathrm{tr}\big(\hat{C}\Theta\big) - \lambda \|\Theta\|_\delta^\delta \Big), $$
with penalty parameter λ ≥ 0 and $\|\Theta\|_\delta^\delta = \sum_{i,j} |\Theta_{ij}|^\delta$. If λ = 0, we obtain the maximum-likelihood estimate; for δ = 1 and λ > 0, the ℓ1-regularized (sparse) maximum-likelihood solution that selects for sparsity [53,54]; and for δ = 2 and λ > 0, the ℓ2-regularized maximum-likelihood solution that favors small absolute values in the entries of the selected inverse covariance matrix [55]. For δ = 1 and λ > 0, the method is called the LASSO; for δ = 2 and λ > 0, ridge regression. Alternatively, regularization can be directly applied to the covariance matrix, e.g., by shrinkage [17,56].
and replace the sums in the distribution and the moments $\langle\cdot\rangle$ by integrals. The extended binary maximum-entropy distribution Eq 12 is then approximated by the Lq-dimensional multivariate Gaussian with the inherited analogues of the mean, $\langle y \rangle = (f_i(\sigma_k))_{i,k} \in \mathbb{R}^{L(q-1)}$, and the empirical covariance matrix, $\hat{C}(\boldsymbol{\sigma},\boldsymbol{\omega}) = (\hat{C}_{ij}(\sigma_k,\sigma_l))_{i,j,k,l} \in \mathbb{R}^{L(q-1)\times L(q-1)}$, whose elements $\hat{C}_{ij}(\sigma,\omega) = f_{ij}(\sigma,\omega) - f_i(\sigma)\, f_j(\omega)$ characterize the pairwise dependency structure. The gauge fixing results in setting the preassigned entries referring to the last amino acid in the mean vector and the covariance matrix to zero, which reduces the model's dimension from Lq to L(q−1); otherwise, the unregularized covariance matrix would always be non-invertible. Typically, the single and pair frequency counts are reweighted and regularized by pseudocounts (see section “Sequence data preprocessing”) to additionally ensure that $\hat{C}(\boldsymbol{\sigma},\boldsymbol{\omega})$ is invertible. Final application of the closed-form solution for continuous variables Eq 16 to the extended binary variables with $C^{-1}(\boldsymbol{\sigma},\boldsymbol{\omega}) \approx \hat{C}^{-1}(\boldsymbol{\sigma},\boldsymbol{\omega})$ yields the so-called mean-field (MF) approximation [48],
$$ \gamma_{ij}^{MF}(\sigma,\omega) = -\tfrac{1}{2}\big(C^{-1}\big)_{ij}(\sigma,\omega) \;\Rightarrow\; e_{ij}^{MF}(\sigma,\omega) = -\big(C^{-1}\big)_{ij}(\sigma,\omega) \qquad (20) $$
for amino acids σ, ω ∈ Ω and with restriction to residues i < j in the latter identity. The same solution has been obtained by [6,7] using a perturbation ansatz to solve the q-state Potts model, termed (mean-field) Direct Coupling Analysis (DCA or mfDCA). In Ising models, this result is also known as the naïve mean-field approximation [57–59].
The following section is dedicated to maximum likelihood-based inference approaches, which have been presented in the field of protein contact prediction.

Maximum-Likelihood Inference

A well-known approach to estimating the parameters of a model is maximum-likelihood inference. The likelihood is a scalar measure of how likely the model parameters are, given the observed data (MacKay [34], p. 29), and the maximum-likelihood solution denotes the parameter set maximizing the likelihood function. For Markov random fields, the maximum-likelihood solution is consistent, i.e., recovers the true model parameters in the limit of infinite data (Koller and Friedman [32], p. 949). In particular, for a pairwise model with parameters $h(\boldsymbol{\sigma}) = (h_i(\sigma))_{i=1,\dots,L;\,\sigma\in\Omega}$ and $e(\boldsymbol{\sigma},\boldsymbol{\omega}) = (e_{ij}(\sigma,\omega))_{1\le i<j\le L;\,\sigma,\omega\in\Omega}$, the likelihood of the observed data is $l(h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})) = \prod_{m=1}^{M} P(x^m; h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega}))$. The estimates of the model parameters are then obtained as the maximizer of l or, using the monotonicity of the logarithm, the minimizer of −ln l,
$$ \{h^{ML}(\boldsymbol{\sigma}),\, e^{ML}(\boldsymbol{\sigma},\boldsymbol{\omega})\} = \arg\max_{h(\boldsymbol{\sigma}),\, e(\boldsymbol{\sigma},\boldsymbol{\omega})} l\big(h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})\big) \equiv \arg\min_{h(\boldsymbol{\sigma}),\, e(\boldsymbol{\sigma},\boldsymbol{\omega})} \Big(-\ln l\big(h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})\big)\Big). $$
$$ \frac{\partial}{\partial h_i(\sigma)} \ln Z\, \Big|_{\{h(\boldsymbol{\sigma}),\, e(\boldsymbol{\sigma},\boldsymbol{\omega})\}} = \frac{1}{Z} \frac{\partial Z}{\partial h_i(\sigma)}\, \Big|_{\{h(\boldsymbol{\sigma}),\, e(\boldsymbol{\sigma},\boldsymbol{\omega})\}} = P_i\big(\sigma;\, h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})\big), $$
$$ \frac{\partial}{\partial e_{ij}(\sigma,\omega)} \ln Z\, \Big|_{\{h(\boldsymbol{\sigma}),\, e(\boldsymbol{\sigma},\boldsymbol{\omega})\}} = \frac{1}{Z} \frac{\partial Z}{\partial e_{ij}(\sigma,\omega)}\, \Big|_{\{h(\boldsymbol{\sigma}),\, e(\boldsymbol{\sigma},\boldsymbol{\omega})\}} = P_{ij}\big(\sigma,\omega;\, h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})\big). $$
The maximizing parameters, $h^{ML}(\boldsymbol{\sigma}) = (h_i^{ML}(\sigma))_{i=1,\dots,L;\,\sigma\in\Omega}$ and $e^{ML}(\boldsymbol{\sigma},\boldsymbol{\omega}) = (e_{ij}^{ML}(\sigma,\omega))_{1\le i<j\le L;\,\sigma,\omega\in\Omega}$, are those matching the distribution's single and pair marginal probabilities with the empirical single and pair frequency counts,
$$ P_i\big(\sigma;\, h^{ML}(\boldsymbol{\sigma}), e^{ML}(\boldsymbol{\sigma},\boldsymbol{\omega})\big) = f_i(\sigma), \qquad P_{ij}\big(\sigma,\omega;\, h^{ML}(\boldsymbol{\sigma}), e^{ML}(\boldsymbol{\sigma},\boldsymbol{\omega})\big) = f_{ij}(\sigma,\omega) $$
in residues i = 1, …, L and i, j = 1, …, L, respectively, and for amino acids σ, ω ∈ Ω. In other words, matching the moments of the pairwise maximum-entropy probability distribution to the given data is equivalent to maximum-likelihood fitting of an exponential family [34,60]. Although the maximum-likelihood solution is globally optimal for the pairwise maximum-entropy probability model, based on the concavity of ln l, the resulting distribution is not necessarily unique, due to dependencies in the input data (Koller and Friedman [32], p. 948). To remove these equivalent optima and select a unique representation, one needs to introduce further constraints by, for example, gauge fixing or regularization.
$$ \Delta e_{ij}^{(k)}(\sigma,\omega) = \varepsilon\, \frac{\partial}{\partial e_{ij}(\sigma,\omega)} \ln l\, \Big|_{\{h^{(k)}(\boldsymbol{\sigma}),\, e^{(k)}(\boldsymbol{\sigma},\boldsymbol{\omega})\}} \propto f_{ij}(\sigma,\omega) - P_{ij}\big(\sigma,\omega;\, h^{(k)}(\boldsymbol{\sigma}), e^{(k)}(\boldsymbol{\sigma},\boldsymbol{\omega})\big), $$
$$ \{h^{ML}(\boldsymbol{\sigma}),\, e^{ML}(\boldsymbol{\sigma},\boldsymbol{\omega})\} = \lim_{k\to\infty} \{h^{(k)}(\boldsymbol{\sigma}),\, e^{(k)}(\boldsymbol{\sigma},\boldsymbol{\omega})\}, $$
or, equivalently, $\Delta h_i^{(k)}(\sigma) \to 0$ for i = 1, …, L and $\Delta e_{ij}^{(k)}(\sigma,\omega) \to 0$ for 1 ≤ i < j ≤ L and σ, ω ∈ Ω \ {σ_q}.
Pseudo-likelihood maximization
Besag [62] introduced the pseudo-likelihood as an approximation to the likelihood function in which the global partition function is replaced by computationally tractable local estimates. The pseudo-likelihood inherits the concavity of the likelihood and yields the exact maximum-likelihood parameters in the limit of infinite data for Gaussian Markov random fields [41,62], but not in general [63]. Applications of this approximation to non-continuous categorical variables have been studied, for instance, in sparse inference of Ising models [64], but may lead to results that differ from the maximum-likelihood estimate. In this approach, the probability of the m-th observation, $x^m$, is approximated by the product of the conditional probabilities of $x_r = x_r^m$ given the observations in the remaining variables $x_{\setminus r} := (x_1,\dots,x_{r-1},x_{r+1},\dots,x_L)^T \in \Omega^{L-1}$ [51],
$$ P\big(x^m;\, h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})\big) \simeq \prod_{r=1}^{L} P\big(x_r = x_r^m \mid x_{\setminus r} = x^m_{\setminus r};\, h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})\big), $$
where each conditional probability takes the closed form
$$ P\big(x_r = \sigma \mid x_{\setminus r} = x^m_{\setminus r}\big) = \frac{\exp\!\big(h_r(\sigma) + \sum_{j\neq r} e_{rj}(\sigma, x_j^m)\big)}{\sum_{\sigma'\in\Omega} \exp\!\big(h_r(\sigma') + \sum_{j\neq r} e_{rj}(\sigma', x_j^m)\big)}, $$
which only depends on the unknown parameters $(e_{rj}(\sigma,\omega))_{j\neq r}$ and $h_r(\sigma)$ and makes the computation of the pseudo-likelihood tractable. Note that we treat $e_{ij}(\sigma,\omega) = e_{ji}(\omega,\sigma)$ and $e_{ii}(\cdot,\cdot) = 0$. By this approximation, the loglikelihood Eq 21 becomes the pseudo-loglikelihood,
$$ \ln l_{PL}\big(h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})\big) := \sum_{m=1}^{M} \sum_{r=1}^{L} \ln P\big(x_r = x_r^m \mid x_{\setminus r} = x^m_{\setminus r};\, h(\boldsymbol{\sigma}), e(\boldsymbol{\sigma},\boldsymbol{\omega})\big). $$
where λh, λe > 0 adjust the complexity of the problem and are selected in a consistent manner across different protein families to avoid overfitting. This approach has been presented (with scaling of the pseudo-loglikelihood by the sequence weights $\frac{1}{M_{eff}} w_m$ to include sequence weighting, see section “Sequence data preprocessing”) by [51] under the name plmDCA (PseudoLikelihood Maximization Direct Coupling Analysis) and has shown performance improvements compared to the mean-field approximation Eq 20. Another inference method based on pseudo-likelihood maximization, but including prior knowledge in terms of secondary structure and information on pairs likely to be in contact, is Gremlin (Generative REgularized ModeLs of proteINs) [65–67].
for σ, ω ∈ Ω; in the second identity, the symmetric Lagrange multipliers $\gamma_{ij}(\sigma,\omega)$ defined for
finite sampling bias and to account for underrepresentation [5–8,44,48], resulting in zero entries in $\hat{C}(\boldsymbol{\sigma},\boldsymbol{\omega})$, for instance, if a certain amino acid pair is never observed. The use of pseudocounts is equivalent to a maximum a posteriori (MAP) estimate under a specific inverse Wishart prior on the covariance matrix [48]. Both preprocessing steps combined yield the reweighted single and pair frequency counts,
$$ f_i(\sigma) = \frac{\tilde{\lambda}}{q} + \big(1-\tilde{\lambda}\big)\frac{1}{M_{eff}} \sum_{m=1}^{M} w_m\, x_i^m(\sigma), \qquad f_{ij}(\sigma,\omega) = \frac{\tilde{\lambda}}{q^2} + \big(1-\tilde{\lambda}\big)\frac{1}{M_{eff}} \sum_{m=1}^{M} w_m\, x_i^m(\sigma)\, x_j^m(\omega) $$
in residues i, j = 1, …, L and for amino acids σ, ω ∈ Ω. Ideally, for maximum-likelihood inference the observations are assumed to be independent and identically distributed. However, this assumption is typically violated in realistic sequence data due to phylogenetic and sequencing bias, and the reweighting presented here does not necessarily solve this problem.
has been introduced [5]. In $P_{ij}^{dir}(\sigma,\omega)$, $\tilde{h}_i(\sigma)$ and $\tilde{h}_j(\omega)$ are chosen to be consistent with the single-site frequency counts.
However, this expression is not gauge-invariant [5]. In this context, the notation with $e_{ij}(\sigma,\omega)$, which refers to indices restricted to i < j, is extended and treated such that $e_{ij}(\sigma,\omega) = e_{ji}(\omega,\sigma)$ and $e_{ii}(\cdot,\cdot) = 0$; then $\|e_{ij}\|_F = \|e_{ji}\|_F$ and $\|e_{ii}\|_F = 0$. In order to correct for phylogenetic biases in the identification of co-evolved residues, Dunn et al. [27] introduced the average product correction (APC). It was originally used in combination with the mutual information but has recently been combined with the ℓ1-norm [8] and the Frobenius/ℓ2-norm [51]; it is derived from the averages over rows and columns of the corresponding norm of the matrix of the $e_{ij}$ parameters. In this formulation, the pair scoring function is
$$ APC^{FN}_{ij} = \|e_{ij}\|_F - \frac{\|e_{i\cdot}\|_F\, \|e_{\cdot j}\|_F}{\|e_{\cdot\cdot}\|_F} \qquad (24) $$
for $e_{ij}$-parameters fixed by the zero-sum gauge and with the means over the non-zero elements in row, column, and full matrix, $\|e_{i\cdot}\|_F := \frac{1}{L-1}\sum_{j=1}^{L}\|e_{ij}\|_F$, $\|e_{\cdot j}\|_F := \frac{1}{L-1}\sum_{i=1}^{L}\|e_{ij}\|_F$, and $\|e_{\cdot\cdot}\|_F := \frac{1}{L(L-1)}\sum_{i,j=1}^{L}\|e_{ij}\|_F$, respectively. Alternatively, the average product-corrected ℓ1-norm applied to the 20×20 submatrices of the estimated inverse covariance matrix, in which contributions from gaps are ignored, has been introduced by the authors of [8] as the PSICOV score.
Using the average product correction, the authors of [51] showed for interaction parameters
inferred by the mean-field approximation that scoring with the average product-corrected Fro-
benius norm increased the precision of the predicted contacts compared to scoring with the
DI-score. The practical consequence of the choice of scoring method depends on the dataset
and the parameter inference method.
as the result of extraordinary advances in sequencing technology. The quality of existing meth-
ods can be improved by careful refinement of sequence alignments in terms of cutoffs and gaps
or by attaching optimized weights to each of the data sequences. Alternatively, one could try to
improve the existing model frameworks by accounting for phylogenetic progression [27,49,72]
and finite sampling biases.
The advancement of inference methods for biological datasets could help solve many interesting biological problems, such as protein design or the analysis of multi-gene effects in relating variants to phenotypic changes as well as multi-genic traits [73,74]. The methods presented here could help reduce the parameter space of genome-wide association studies to a first approximation. In particular, we envision the following applications: (1) in the disease context, co-evolution studies of oncogenic events, for example copy number alterations, mutations, fusions, and alternative splicing, could be used to derive direct co-evolution signatures of cancer from available data, such as The Cancer Genome Atlas (TCGA); (2) de novo design of protein sequences as, for example, described in [65,75] for the WW domain, using design rules based on the evolutionary information extracted from the multiple sequence alignment; and (3) the development of quantitative models of protein fitness computed from sequence information.
In general, in a complex biological system, it is often useful for descriptive and predictive
purposes to derive the interactions that define the properties of the system. With the methods
presented here and available software (Table 1), our goal is not only to describe how to infer
these interactions but also to highlight tools for the prediction and redesign of properties of
biological systems.
Acknowledgments
We thank Theofanis Karaletsos, Sikander Hayat, Stephanie Hyland, Quaid Morris, Deb Bemis,
Linus Schumacher, John Ingraham, Arman Aksoy, Julia Vogt, Thomas Hopf, Andrea Pagnani,
and Torsten Groß for insightful discussions.
References
1. Lezon TR, Banavar JR, Cieplak M, Maritan A, Fedoroff NV. Using the principle of entropy maximization
to infer genetic interaction networks from gene expression patterns. Proceedings of the National Acad-
emy of Sciences of the United States of America. 2006; 103(50):19033–19038. PMID: 17138668
2. Locasale JW, Wolf-Yadlin A. Maximum entropy reconstructions of dynamic signaling networks from
quantitative proteomics data. PloS one. 2009; 4(8):e6522. doi: 10.1371/journal.pone.0006522 PMID:
19707567
3. Schneidman E, Berry II MJ, Segev R, Bialek W. Weak pairwise correlations imply strongly correlated
network states in a neural population. Nature. 2006; 440:1007–1012. PMID: 16625187
4. Tang A, Jackson D, Hobbs J, Chen W, Smith JL, Patel H, et al. A maximum entropy model applied to
spatial and temporal correlations from cortical networks in vitro. The Journal of Neuroscience. 2008; 28
(2):505–518. doi: 10.1523/JNEUROSCI.3359-07.2008 PMID: 18184793
5. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proceedings of the National Academy of Sciences of the United States of America. 2009; 106(1):67–72. doi: 10.1073/pnas.0805923106 PMID: 19116270
6. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, et al. Protein 3D Structure Com-
puted from Evolutionary Sequence Variation. PLoS One. 2011; 6(12):e28766. doi: 10.1371/journal.
pone.0028766 PMID: 22163331
7. Morcos F, Pagnani A, Lunt B, Bertolino A, Marks D, Sander C, et al. Direct-coupling analysis of residue
co-evolution captures native contacts across many protein families. Proceedings of the National Acad-
emy of Sciences of the United States of America. 2011; 108:E1293–E1301. doi: 10.1073/pnas.
1111471108 PMID: 22106262
8. Jones DT, Buchan DWA, Cozzetto D, Pontil M. PSICOV: precise structural contact prediction using
sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012; 28
(2):184–190. doi: 10.1093/bioinformatics/btr638 PMID: 22101153